CAPTCHA

Completely Automated Public Turing test to tell Computers and Humans Apart

May 2018

David Magalhães

@speeddragon

David Magalhães

About me

@speeddragon

Software Engineer @

Security Analyst @

Introduction

History

  • (Image Distortion) Captcha Invented in 1997
     
  • AltaVista used it on its website in 1997
     
  • PayPal started using it in 2001
     
  • The term was first used in 2003

Where can we find it?

Websites

  • Websites are common places to encounter Captchas.

CloudFlare

Apps

  • Although not common, some apps can implement Google ReCaptcha to avoid bots.

How to bypass?

Where to start?

  • Verify whether the web page implements the captcha correctly.
     
  • Try the Optical Character Recognition (OCR) software available for captchas.

Some types of attack

  • Static CAPTCHA Identifier
     
  • Fixation Attack
     
  • Re-Riding Attack
     
  • OCR Bruteforce

https://www.owasp.org/images/0/03/ASDC12-Attacking_CAPTCHAs_for_Fun_and_Profit.pdf

Image Processing

Human Workforce

  • https://anti-captcha.com/mainpage
    • $2 USD per 1000 captchas
    • 17s solve speed
       
  • https://2captcha.com/
    • $3 USD per 1000 captchas
    • 49s solve speed

Google ReCaptcha

Attacks and Responses

"I'm not a human: Breaking the Google reCAPTCHA"

March, 2016

https://www.blackhat.com/docs/asia-16/materials/asia-16-Sivakorn-Im-Not-a-Human-Breaking-the-Google-reCAPTCHA-wp.pdf

  • Plays with cookies / user agent / etc.
  • Trick website address with localhost.
  • 2500 checkbox captchas per hour.
  • Weekends had less blocking.
  • Leverages Google Reverse Image Search, along with other machine learning software.
  • Images were reused.

Voice Recognition

March, 2017

https://www.bleepingcomputer.com/news/security/researcher-breaks-recaptcha-using-googles-speech-recognition-api/

  • Usage of SpeechRecognition library from Python
    • Google Speech Recognition
    • Google Cloud Speech API
    • Houndify API
    • Microsoft Bing Voice Recognition

Bypass via HTTP parameter pollution

March, 2018

https://andresriancho.com/recaptcha-bypass-via-http-parameter-pollution/

POST /recaptcha/api/siteverify

recaptcha-response=anything%26secret%3dPUBLIC-TEST-BYPASS_TOKEN&secret=6LeYIbsSAAAAAJezaIq3Ft_hSTo0YtyeFG-JgRtu

Around ~3% of the integrations with reCAPTCHA were vulnerable.
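
A minimal sketch of why the request above works, assuming a back end that pastes the user's value into the body without encoding it. The function names and the placeholder secret are hypothetical; only the parameter names and the bypass value come from the slide.

```javascript
// Vulnerable (hypothetical helper): the user-controlled value is concatenated
// as-is, so an embedded "&secret=..." becomes an extra parameter.
function buildBodyUnsafe(userResponse, secret) {
  return "recaptcha-response=" + userResponse + "&secret=" + secret;
}

// Safer (hypothetical helper): URLSearchParams percent-encodes "&" and "="
// inside values, so the injected text stays part of the response value.
function buildBodySafe(userResponse, secret) {
  const params = new URLSearchParams();
  params.set("recaptcha-response", userResponse);
  params.set("secret", secret);
  return params.toString();
}

// What the application holds after its framework URL-decodes the attacker's input.
const attackerInput = decodeURIComponent("anything%26secret%3dPUBLIC-TEST-BYPASS_TOKEN");
const siteSecret = "SITE-SECRET-PLACEHOLDER"; // placeholder, not a real key

const unsafeSecrets = new URLSearchParams(buildBodyUnsafe(attackerInput, siteSecret)).getAll("secret");
const safeSecrets = new URLSearchParams(buildBodySafe(attackerInput, siteSecret)).getAll("secret");

console.log(unsafeSecrets); // two "secret" values, the attacker's one first
console.log(safeSecrets);   // only the site's own secret
```

With the unsafe builder the verification request carries two secret parameters, which is the pollution the linked write-up describes.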

Google Response

  • Request frequency
     
  • From normal, clear voice audio to barely perceptible audio (with distortions)
     
  • From clear images of cars, street signs, bridges, etc. to noisy, lower-resolution images.
     
  • From a fixed set of images to select, to multiple images appearing with added delay.

Incremental Difficulty

  • Raise number of digits in voice captcha.
     
  • Tweak Advanced Risk Analysis System.
    • Less tolerance for wrong answers / wrongly checked image boxes.
       
  • Avoid image repetition.

How to implement?

Defending against possible attacks.

CloudFlare

  • Use CloudFlare DNS

https://www.cloudflare.com/case-studies/troy-hunt/

Implement on the code

  • Go to Google ReCaptcha page.
     
  • Follow instructions.
     
  • Adjust security.

Verify ReCaptcha

  • Get g-recaptcha-response from User.
     
  • Verify the submitted token on the back end.
POST https://www.google.com/recaptcha/api/siteverify

secret=6LeIxAcTAAAAAGG-vFI1TnRWxMZNFuojJ4WifJWe&response=03ACgFB9smWHeHsOPEDTTb-OWMh-SgQISvttCGdp4tN4OW77W9r3bEeIHwd22EyQOmB466kdBm3SD26fMPeKByeXHJSKERi81bcH1b68ZwUU7W4m2TsAs65KzjUaE7t2uMffOR...2kMo4msFdLmj79uTeeCWaHZl2o5QqnF22qAImMSbxWMeMx5gC0O8SQINkmuPexXPHnpUmpzaqgI_WlseJI_q5VrDA

Verify ReCaptcha

{
  "success": true|false,
  "challenge_ts": timestamp,  // timestamp of the challenge load (ISO format yyyy-MM-dd'T'HH:mm:ssZZ)
  "hostname": string,         // the hostname of the site where the reCAPTCHA was solved
  "error-codes": [...]        // optional
}
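
A minimal back-end sketch of the verification step, assuming Node.js 18+ (for the built-in fetch). The endpoint, the secret and response parameters, and the success and hostname fields come from the slides; the function names and the hostname comparison are my own assumptions, not an official client.

```javascript
const VERIFY_URL = "https://www.google.com/recaptcha/api/siteverify";

// Build the form-encoded body; URLSearchParams encodes the token safely.
function buildVerifyBody(secret, token) {
  return new URLSearchParams({ secret: secret, response: token }).toString();
}

// Accept only a successful result that was solved on our own site.
function isAccepted(result, expectedHostname) {
  return result.success === true && result.hostname === expectedHostname;
}

// Send the token received from the user to the verification endpoint.
async function verifyToken(secret, token, expectedHostname) {
  const res = await fetch(VERIFY_URL, {
    method: "POST",
    headers: { "Content-Type": "application/x-www-form-urlencoded" },
    body: buildVerifyBody(secret, token),
  });
  return isAccepted(await res.json(), expectedHostname);
}
```

Checking hostname as well as success is a design choice here: it rejects tokens that were solved on a different site.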

Why ReCaptcha?

  • State of the art CAPTCHA system.
     
  • Always evolving.
     
  • Easy to implement and to use.

Breaking Captcha

The Story

Once upon a time

A website with valuable information that didn't ask for a captcha.

And 24 hours later ....

... and 100,000 requests, something weird appeared.

https://code.google.com/archive/p/kaptcha/

But something was weird

The AJAX request didn't contain the CAPTCHA response.

  • Old endpoint still enabled.
     
  • New endpoint checked captcha.

One year later, they fixed it

And for a couple of months, I didn't have a solution for this ...

... until ...

$ aptitude search ocr

Ocrad

GNU Ocrad is an OCR (Optical Character Recognition) program based on a feature extraction method. It reads a bitmap image in pbm or pgm formats and produces text in byte (8-bit) or UTF-8 formats.

 

Ocrad includes a layout analyser able to separate the columns or blocks of text normally found on printed pages.

https://savannah.gnu.org/projects/ocrad/

Prepare Image first

  • Create a cleaner image before converting the image to text

     
    • Remove background
       
    • Remove line
       
    • Reconnect the gaps left in the characters
       
    • Remove noise
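
A rough sketch of two of the steps above (background and noise removal), assuming the captcha has already been decoded into a 2D array of grayscale values and that the text is darker than the background. All function names are hypothetical, and line removal is not covered. The last helper writes plain PBM, one of the formats Ocrad reads.

```javascript
// gray: 2D array of 0..255 values (0 = black). Returns 1 for ink, 0 for background.
function binarize(gray, threshold = 128) {
  return gray.map(row => row.map(v => (v < threshold ? 1 : 0)));
}

// Drop ink pixels that have no ink pixel among their 8 neighbours.
function removeIsolatedPixels(bin) {
  const h = bin.length;
  const w = h ? bin[0].length : 0;
  return bin.map((row, y) => row.map((v, x) => {
    if (v === 0) return 0;
    for (let dy = -1; dy <= 1; dy++) {
      for (let dx = -1; dx <= 1; dx++) {
        if (dy === 0 && dx === 0) continue;
        const ny = y + dy;
        const nx = x + dx;
        if (ny >= 0 && ny < h && nx >= 0 && nx < w && bin[ny][nx] === 1) return 1;
      }
    }
    return 0; // no neighbour found: treat the pixel as noise
  }));
}

// Serialize to plain PBM (magic number P1, where 1 means black).
function toPlainPbm(bin) {
  const h = bin.length;
  const w = h ? bin[0].length : 0;
  return `P1\n${w} ${h}\n` + bin.map(row => row.join(" ")).join("\n") + "\n";
}
```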

Get various codes and choose the best

  • NoLine Corrected
     
  • NoBackground Corrected
     
  • NoBackground Corrected with _
     
  • Validating the code obtained

Threshold

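// Run ocrad once per binarization threshold (0.1 to 0.9) and keep every result.
$v = [];
for ($i = 1; $i <= 9; $i++) {
    $cmd = "ocrad --threshold=0." . $i . " " . escapeshellarg($newFile);
    $v[$i] = self::correctCaptcha(trim((string) shell_exec($cmd)));
}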

Use various thresholds to obtain a better result

Improve final solution

  • Check if size is 5.
     
  • Check if characters are lowercase.
     
  • Limited alphanumeric range.
     
  • For each "_" in the code, try to find a real character at the same position in one of the 9 threshold solutions.
     
  • Remove "blank" character.
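
The checks above could be combined roughly like this. It is a sketch, not the code used in the talk: mergeCandidates is a hypothetical name, and the [a-z0-9] range is an assumption standing in for the "limited alphanumeric range".

```javascript
// Assumed alphabet: lowercase letters and digits, exactly 5 characters.
const VALID_CODE = /^[a-z0-9]{5}$/;

// candidates: the OCR results from the different thresholds ("_" = unrecognized).
function mergeCandidates(candidates, length = 5) {
  // Remove blank characters, then keep only results of the expected size.
  const usable = candidates
    .map(c => c.replace(/\s+/g, ""))
    .filter(c => c.length === length);
  if (usable.length === 0) return null;

  // Start from the result with the fewest unknown characters.
  const unknowns = s => (s.match(/_/g) || []).length;
  const base = [...usable].sort((a, b) => unknowns(a) - unknowns(b))[0].split("");

  // Fill each "_" from another result that recognized that position.
  for (let i = 0; i < length; i++) {
    if (base[i] !== "_") continue;
    const donor = usable.find(c => c[i] !== "_");
    if (donor) base[i] = donor[i];
  }

  const merged = base.join("");
  return VALID_CODE.test(merged) ? merged : null;
}

console.log(mergeCandidates(["ab_d2", "a_cd2", "xx"]));
```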

Table mapping

Run on some captchas with known solutions ...

?  = e     %% = 2y    y  = 2     IT = n     T  = 7
W  = w     rf = d     ]  = p     L  = c     i  = x
t  = p     lt\\ = m   v  = y     z  = 2     unicode ...
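
One way to apply such a table, as a sketch. It uses only a subset of the pairs above, and applyCorrections is a hypothetical name; the PHP snippet's correctCaptcha may work differently. Everything is replaced in a single pass with longer patterns tried first, so "v = y" is not then turned into "2" by "y = 2".

```javascript
// Subset of the mapping from the slide: OCR output -> actual character(s).
const CORRECTIONS = {
  "?": "e", "%%": "2y", "y": "2", "IT": "n", "T": "7",
  "W": "w", "rf": "d", "]": "p", "v": "y", "z": "2",
};

// Escape characters that have a special meaning in regular expressions.
const escapeRegExp = s => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

// Longest keys first so that "IT" wins over "T".
const CORRECTION_PATTERN = new RegExp(
  Object.keys(CORRECTIONS)
    .sort((a, b) => b.length - a.length)
    .map(escapeRegExp)
    .join("|"),
  "g"
);

// Replace every known misread in one pass over the raw OCR output.
function applyCorrections(raw) {
  return raw.replace(CORRECTION_PATTERN, match => CORRECTIONS[match]);
}

console.log(applyCorrections("?T]"));
```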

Success Rate

  • Improved from initial 4% to 20%
  • 1 captcha solved successfully in every 5 attempts.
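
A quick back-of-the-envelope check of those rates, assuming every attempt is independent and has the same success probability (a simplification). The function names are hypothetical.

```javascript
// Average number of attempts needed for one success at probability p.
function expectedAttempts(p) {
  return 1 / p;
}

// Probability of at least one success within n attempts.
function probAtLeastOne(p, n) {
  return 1 - Math.pow(1 - p, n);
}

console.log(expectedAttempts(0.04), expectedAttempts(0.20));
console.log(probAtLeastOne(0.20, 5).toFixed(3));
```

Under that assumption, the jump from 4% to 20% cuts the average from about 25 attempts per solved captcha to about 5.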

But wait, we can do better.

What if we don't ask for a CAPTCHA?

CAPTCHA marked as solved

While the session is active, we only need to solve one captcha.

Re-Riding Attack
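
Seen from the server side, the flaw and one possible fix look roughly like this. It is a sketch with a hypothetical captchaSolved session flag, not the code of the site in the story.

```javascript
// Vulnerable: the flag is set once and never cleared, so every later
// request in the same session rides on the first solved captcha.
function allowRequestVulnerable(session) {
  return session.captchaSolved === true;
}

// Single use: the solved state is consumed by the request it protects.
function allowRequestSingleUse(session) {
  if (session.captchaSolved !== true) return false;
  session.captchaSolved = false;
  return true;
}

const session = { captchaSolved: true };
console.log(allowRequestVulnerable(session), allowRequestVulnerable(session));
console.log(allowRequestSingleUse(session), allowRequestSingleUse(session));
```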

Distributed ReCaptcha Bot

Work in progress

Idea

  • Google allows good users to just click on "I'm not a robot"
    • Automate that click!

Idea

  • Extract and use Google ReCaptcha validation token.
    • Implement recaptcha token acceptance on crawler to simulate recaptcha success behaviour.
POST http://www.example.com/get
id=323184&gRecaptchaResponse={{gCaptchaToken}}

Field Research

  • Crawlers tend to use TOR even more.
     
  • ReCaptcha is painfully slow on the TOR network (for obvious reasons 😄).
     
  • Two requests are made:
    • www.google.com (collect ReCaptcha token)
    • www.example.com (to extract information)

Field Research

  • Use a normal connection to extract the ReCaptcha token.
     
  • Use TOR to request API information with the above ReCaptcha token.

Distributed ReCaptcha Solver

  • Develop a Chrome extension to install on multiple computers.
     
  • Harvest the Google captcha token via commie.io
    • Note: The token expires after 120 seconds.
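
Because of that expiry, whatever collects the tokens has to discard stale ones. A minimal sketch with a hypothetical TokenPool class; the 120-second default comes from the note above, and the injectable clock is only there to make the behaviour easy to test.

```javascript
class TokenPool {
  // ttlMs: token lifetime in milliseconds; now: clock function.
  constructor(ttlMs = 120000, now = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
    this.items = [];
  }

  // Store a token together with the time it was received.
  add(token) {
    this.items.push({ token: token, addedAt: this.now() });
  }

  // Return the oldest token that is still valid, or null if none is left.
  take() {
    while (this.items.length > 0) {
      const item = this.items.shift();
      if (this.now() - item.addedAt < this.ttlMs) return item.token;
    }
    return null;
  }
}
```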

Inner workings

  • Try not to connect to the host website.
    • Block request and modify HTML page.
    • Not successful without editing /etc/hosts.
       
  • Block all requests except the main page request.

Block all (almost all) requests

// Cancel every request from our tab to example.com except the main page itself.
chrome.webRequest.onBeforeRequest.addListener(function(data) {
  if (data.tabId == openedTabId
      && data.url != "http://www.example.com/") {
    return {cancel: true};
  }
}, {'urls': ["*://*.example.com/*"]}, ["blocking"]);

Replace HTML

  • Replace HTML to only contain Google ReCaptcha.
var head = document.getElementsByTagName('head')[0];
var script = document.createElement('script');
script.type = 'text/javascript';
script.src = 'https://www.google.com/recaptcha/api.js?hl=pt_PT';
head.appendChild(script);

var body = document.getElementsByTagName('body')[0];
while (body.firstChild) { body.removeChild(body.firstChild); }

var div = document.createElement("div");
div.setAttribute("style", "float:left;");
div.setAttribute("class", "g-recaptcha");
div.setAttribute("data-sitekey", "1XLd32hUUA522B0Gx7htcAQmanD890ZyCCo2i5T");
body.appendChild(div);

Auto Click

if (document.querySelector(".recaptcha-checkbox") != null) {
  var delay = 3000 + Math.random() * 2000; // milliseconds
  setTimeout(function() {
    if (document.querySelector(".recaptcha-checkbox") != null) {
      document.querySelector(".recaptcha-checkbox").click();
    }
  }, delay);
}
  • Inject into the Google iframe.

Wait for success

  • Request sent when the captcha is successfully solved

https://www.google.com/recaptcha/api2/userverify?k=X8LdChUUA3AAAABgG302AQfn69kNDSnm23lbo

Future Evolution

Of Captcha Breakers

Possible solutions

  • Machine Learning
    • Keras
    • TensorFlow

https://github.com/JackonYang/captcha-tensorflow
https://medium.com/@ageitgey/how-to-break-a-captcha-system-in-15-minutes-with-machine-learning-dbebb035a710

FunCaptcha

Thank you

Questions?
