Decoding Weibo CAPTCHA in Python

Jingchao Hu

jingchaohu AT gmail DOT com

2013 Favbuy Corp.

The Problem

ALSO THIS

THEY SEEM TO BE HARD

NOT REALLY... LET'S BREAK IT

WAIT, BEFORE THAT

LET'S REVIEW SOME COMMON TECHNIQUES

↓

Steps

Removing Noise
Separating Characters
Extracting Features
Classifying Features

↓

an example

↓

After Cleaning

↓

After Splitting

↓

After Classifying

8452

Yay!

→

IT LOOKS SIMPLE

BUT IT ISN'T...

... UNTIL I EXPLAIN THE DETAILS

CLEANING NOISE

Using PIL (Pillow), it is often easier

than you might think

↓

>>> from PIL import Image

>>> im = Image.open('8452.png')

>>> # Convert to 256-level grayscale ('L') mode

>>> im = im.convert('L')

>>> # See how the colors are distributed

>>> im.getcolors()

↓

[(30, 0), (21, 14), (7, 15), (1, 17), (1, 19), (33, 22), (3, 23), (1, 25),
 (2, 26), (2, 28), (1, 29), (5, 30), (4, 32), (1, 34), (3, 36), (15, 38), (10, 39), ...,
 (1, 245), (2, 246), (5, 247), (4, 249), (4, 251), (5, 253), (2190, 255)]

>>> # 255-> White, 0-> Black

>>> # If we remove all the "whiter" colors

>>> im = im.point(lambda x: 255 if x>128 else x)

>>> # see how this policy works

>>> im.show()
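A quick way to judge a threshold before calling `im.show()` is to total the pixels on each side of it. The sketch below assumes the `(count, gray level)` pair format shown above and uses only a handful of those pairs, so the totals are illustrative; `split_at` is a hypothetical helper, not part of Pillow.

```python
# A few (count, gray_level) pairs copied from the getcolors() output above.
# The elided pairs are left out, so the totals are illustrative only.
colors = [(30, 0), (21, 14), (33, 22), (15, 38), (5, 247), (2190, 255)]


def split_at(colors, threshold):
    """Return (pixels at or below threshold, pixels above threshold)."""
    dark = sum(count for count, level in colors if level <= threshold)
    light = sum(count for count, level in colors if level > threshold)
    return dark, light


dark, light = split_at(colors, 128)
print(dark, light)  # 99 2195
```

If most pixels fall on the light side, as here, the threshold is mostly removing background and faint noise.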

↓

>>> # the new color distribution?

>>> im.getcolors()

...

>>> # new attempts

...

↓

...

AFTER A FEW ROUNDS OF GUESS AND TRY

...

↓

WE GOT THIS

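def clean(im):
    # Work in 8-bit grayscale.
    im = im.convert('L')
    # Turn light pixels (> 128) and pure-black pixels (0) white.
    im = im.point(lambda x: 255 if x > 128 or x == 0 else x)
    # Everything that is still not white becomes black.
    im = im.point(lambda x: 0 if x < 255 else 255)
    return im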

It's surprisingly simple, isn't it?
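To see what the two `point()` calls do together, it helps to remember that for an 'L' image Pillow calls the function once per possible gray level, which amounts to a 256-entry lookup table. The sketch below rebuilds that table in plain Python; `step1` and `step2` are illustrative names for the two lambdas.

```python
def step1(x):
    # Light pixels and pure black become white; the rest is unchanged.
    return 255 if x > 128 or x == 0 else x


def step2(x):
    # Anything that is not white by now becomes black.
    return 0 if x < 255 else 255


# Compose both steps for every possible gray level.
table = [step2(step1(level)) for level in range(256)]
black_levels = [level for level, out in enumerate(table) if out == 0]
print(black_levels[0], black_levels[-1], len(black_levels))  # 1 128 128
```

So only gray levels 1 to 128 survive as ink; pure black (0) and everything lighter than 128 end up white.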

→

EXTRACTING CHARS

this is often harder than you imagine

↓

for example:

luckily in weibo's case,

it's quite easy

↓

>>> # Divide by columns

>>> w, h = im.size

>>> data = im.load()

>>> jcolors = [sum(255-data[i,j] for j in range(h)) for i in range(w)]

>>> print(jcolors)

[0, 0, X, X, X, 0, 0, 0, X, X, X, 0, 0, 0, X, X, 0, 0]

↓

>>> # ...

>>> # cropping images according to "boxes"

>>> # (0, 0, 30, 50), (30, 0, 60, 50), ....

>>> # ...

>>> # then normalize the image, scale to the same size
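The step from column sums to crop boxes can be sketched as follows. This is a minimal version that assumes characters never touch, so every run of non-empty columns is one character; `find_runs`, the toy projection, and the image height are all made up for illustration.

```python
def find_runs(column_ink):
    """Return (start, end) column ranges that contain ink; end is exclusive."""
    runs = []
    start = None
    for i, ink in enumerate(column_ink):
        if ink and start is None:
            start = i                      # a character begins here
        elif not ink and start is not None:
            runs.append((start, i))        # the character ended before column i
            start = None
    if start is not None:                  # ink reaches the right edge
        runs.append((start, len(column_ink)))
    return runs


# Toy projection in the shape of the output above: 0 = empty column.
column_ink = [0, 0, 510, 765, 255, 0, 0, 0, 255, 510, 255, 0, 0, 0, 765, 255, 0, 0]
height = 50  # hypothetical image height
boxes = [(x0, 0, x1, height) for x0, x1 in find_runs(column_ink)]
print(boxes)  # [(2, 0, 5, 50), (8, 0, 11, 50), (14, 0, 16, 50)]
```

Each box is in the `(left, upper, right, lower)` order that `Image.crop()` expects.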

→

Extracting Features

What the hell are "features"

and how do we use them to classify?

↓

Imagine a message we want to check for spam:

does it contain the word "sex"?
what about "buy"?
does it have links in it?
how many words are in it?
....

↓

We get a lot of Trues and Falses and Values

↓

We get a so-called "feature vector", a point in the "feature space"

(1, 0, 2, 431, ..., 1, 0)
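As a rough sketch, the questions above can be turned into such a vector with a few lines of Python. The function name, the tokenization, and the sample message are assumptions made for illustration; a real spam filter would use many more features.

```python
import string


def message_features(text):
    """Return (has "sex", has "buy", number of links, number of words)."""
    tokens = text.split()
    links = [t for t in tokens if t.startswith(("http://", "https://"))]
    # Words are the non-link tokens, lowercased, with punctuation stripped.
    words = [t.strip(string.punctuation).lower() for t in tokens if t not in links]
    words = [w for w in words if w]
    return (int("sex" in words), int("buy" in words), len(links), len(words))


print(message_features("Buy now! Visit http://example.com today"))  # (0, 1, 1, 4)
```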

↓

Does this vector belong to the set of vectors we consider spam?

↓

We get results from classification models.

↓

So, what are the features of character images?

how many pixels in the image are black?
how many white areas are in the image?
are there curves in the image, and roughly where?
....

Some features are very hard to extract
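The first two questions are among the easier ones. Here is a minimal sketch on a tiny binary grid (1 = black, 0 = white), counting white areas with a flood fill over 4-connected neighbours. The glyphs and helper names are invented for the example.

```python
def count_black(grid):
    """Number of black (1) cells in the grid."""
    return sum(row.count(1) for row in grid)


def count_white_regions(grid):
    """Count 4-connected regions of white (0) cells using flood fill."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    regions = 0
    for r in range(h):
        for c in range(w):
            if grid[r][c] != 0 or seen[r][c]:
                continue
            regions += 1                   # found an unvisited white region
            seen[r][c] = True
            stack = [(r, c)]
            while stack:
                y, x = stack.pop()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and grid[ny][nx] == 0 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
    return regions


def parse(rows):
    """Turn rows of '#' and '.' into a grid of 1s and 0s."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


ZERO = parse([".....", ".###.", ".#.#.", ".#.#.", ".###.", "....."])
EIGHT = parse([".....", ".###.", ".#.#.", ".###.", ".#.#.", ".###.", "....."])

print(count_black(ZERO), count_white_regions(ZERO))    # 10 2
print(count_black(EIGHT), count_white_regions(EIGHT))  # 13 3
```

The background counts as one white region, so a "0" has two and an "8" has three.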

↓

Sometimes, a naive approach is enough

We just use the "pixel value array"

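    def im2array(im):
        # One entry per pixel: 1 for ink, 0 for white (255).
        # On Python 3, iterating over bytes yields integers.
        return [int(x != 255) for x in im.tobytes()]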
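The same idea can be tried without an image at hand. On Python 3, iterating over a bytes object yields integers, so each pixel is compared with 255. The raw bytes below stand in for what `tobytes()` would return for a hypothetical 3x2 grayscale image; `bytes2array` is an illustrative name.

```python
def bytes2array(raw):
    """Map each 8-bit pixel to 1 (ink) or 0 (white background)."""
    return [int(value != 255) for value in raw]


# Hypothetical 3x2 image, row by row: 255 = white, 0 = black.
raw = bytes([255, 0, 255, 255, 0, 0])
print(bytes2array(raw))  # [0, 1, 0, 0, 1, 1]
```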

→

Classifying

It's hard to explain the underlying maths,

but it is easy to implement.

↓

we have features:

a list of feature vectors
(vec1, vec2, vec3, ...)

we have target:

a list of target values
(tar1, tar2, tar3, ...)

We want to predict:

given a new feature vector
which target would it most likely be?
predict(vector)

↓

Simplest Classification Using `sklearn`

>>> # let's train an XOR operator
>>> import sklearn.svm
>>> clf = sklearn.svm.SVC()
>>> data = [(1, 1),
...         (1, 0),
...         (0, 1),
...         (0, 0)]
>>> targets = [0, 1, 1, 0]
>>> clf.fit(data, targets)
>>> # predict() expects a list of feature vectors
>>> clf.predict([(0, 0)])[0]
0

↓

CAPTCHA Classification

training:

data: integer arrays
target: integers 0-35 (representing [0-9A-Z])
clf.fit(data, target)

predicting:

array = preprocess char image into array,
code = clf.predict(array)
char = lookup code in [0-9A-Z]
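The lookup between codes 0-35 and the characters [0-9A-Z] can be sketched with the standard library alone. `ALPHABET`, `encode`, and `decode` are illustrative names; the predicted codes in the example are made up.

```python
import string

# "0123456789ABC...XYZ": index 0-9 are digits, 10-35 are letters.
ALPHABET = string.digits + string.ascii_uppercase


def encode(char):
    """Character -> integer target used for training."""
    return ALPHABET.index(char)


def decode(codes):
    """Predicted integer codes -> CAPTCHA text."""
    return "".join(ALPHABET[code] for code in codes)


print(encode("A"), decode([8, 4, 5, 2]))  # 10 8452
```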

Classification Methods in Brief

Bayes
Decision Tree
SVM
kNN
MLP(NN)
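Of these, kNN is probably the easiest to write by hand. The sketch below is a toy version with the same fit/predict shape as above, using squared Euclidean distance and a majority vote. The class name and the 3x3 "glyphs" are invented for illustration, and it is not meant as a replacement for a library implementation.

```python
from collections import Counter


class NearestNeighbors:
    """Tiny k-nearest-neighbours classifier with a fit/predict interface."""

    def __init__(self, k=1):
        self.k = k

    def fit(self, data, targets):
        # kNN has no real training step: it just remembers the examples.
        self.data = [tuple(v) for v in data]
        self.targets = list(targets)
        return self

    def predict(self, vectors):
        return [self._predict_one(v) for v in vectors]

    def _predict_one(self, vector):
        def distance(i):
            return sum((a - b) ** 2 for a, b in zip(self.data[i], vector))

        nearest = sorted(range(len(self.data)), key=distance)[:self.k]
        votes = Counter(self.targets[i] for i in nearest)
        return votes.most_common(1)[0][0]


# Hypothetical 3x3 glyphs flattened row by row (1 = ink).
data = [
    (1, 1, 1, 1, 0, 1, 1, 1, 1),  # a ring, labelled 0
    (0, 1, 0, 0, 1, 0, 0, 1, 0),  # a vertical bar, labelled 1
]
targets = [0, 1]

clf = NearestNeighbors(k=1).fit(data, targets)
noisy_bar = (0, 1, 0, 0, 1, 0, 0, 1, 1)  # the bar with one stray pixel
print(clf.predict([noisy_bar]))  # [1]
```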

→

Q&A?
