# Decoding Weibo CAPTCHA in Python

Jingchao Hu
jingchaohu AT gmail DOT com

## The Problem

## Also This

## Steps

1. Removing Noise
2. Separating Characters
3. Extracting Features
4. Classifying Features

## An Example

## After Cleaning

## After Splitting

Yay!

## It Looks Simple, but It Isn't...

## Cleaning Noise

Using PIL (Pillow), it is often easier than you might think.

```
>>> from PIL import Image
>>> im = Image.open('8452.png')

>>> # Convert to 256 gray level mode
>>> im = im.convert('L')

>>> # See how the colors are distributed
>>> # (each entry is a (count, gray level) pair)
>>> im.getcolors()
[(30, 0), (21, 14), (7, 15), (1, 17), (1, 19), (33, 22), (3, 23), (1, 25),
 (2, 26), (2, 28), (1, 29), (5, 30), (4, 32), (1, 34), (3, 36), (15, 38), (10, 39), ...,
 (1, 245), (2, 246), (5, 247), (4, 249), (4, 251), (5, 253), (2190, 255)]

>>> # 255 -> white, 0 -> black
>>> # If we remove all the "whiter" colors
>>> im = im.point(lambda x: 255 if x > 128 else x)

>>> # See how this policy works
>>> im.show()

>>> # The new color distribution?
>>> im.getcolors()
...

>>> # New attempts
...
```

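## We Got This

```
def clean(im):
    # Work on a 256-level grayscale copy
    im = im.convert('L')
    # Light pixels (> 128) and pure-black pixels (0) become white
    im = im.point(lambda x: 255 if x > 128 or x == 0 else x)
    # Everything that is still not white becomes black
    im = im.point(lambda x: 0 if x < 255 else 255)
    return im
```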

It's surprisingly simple, isn't it?
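To see what the two `point` passes in `clean` do to individual gray levels, here is a small pure-Python sketch. The helper name `clean_pixel` is hypothetical (it is not part of PIL); it only composes the two lambdas from `clean` so the mapping can be checked by hand.

```python
def clean_pixel(x):
    """Apply the two lambdas from clean() to a single gray level (0-255)."""
    # Pass 1: light pixels (> 128) and pure black (0) become white.
    x = 255 if x > 128 or x == 0 else x
    # Pass 2: anything that is still not white becomes black.
    return 0 if x < 255 else 255


# A few sample gray levels around the thresholds
samples = [0, 14, 128, 129, 255]
mapping = {x: clean_pixel(x) for x in samples}
print(mapping)
# {0: 255, 14: 0, 128: 0, 129: 255, 255: 255}
```

In other words, only gray levels 1 to 128 survive as black; the light background and the pure-black pixels both end up white.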

## Extracting Characters

This is often harder than you imagine.

Luckily, in Weibo's case, it's quite easy.

```
>>> # Divide by columns
>>> w, h = im.size
>>> data = im.load()  # pixel access object: data[x, y]
>>> jcolors = [sum(255 - data[i, j] for j in range(h)) for i in range(w)]

>>> print(jcolors)
[0, 0, X, X, X, 0, 0, 0, X, X, X, 0, 0, 0, X, X, 0, 0]

>>> # ...
>>> # cropping images according to "boxes"
>>> # (0, 0, 30, 50), (30, 0, 60, 50), ....
>>> # ...
```
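The step from column sums to "boxes" is elided above. One possible way to do it, sketched here under the assumption that characters never touch (so every character is a run of non-zero columns separated by empty ones), is shown below. The function name `find_boxes` and the value used for `X` are made up for illustration.

```python
def find_boxes(col_sums, height):
    """Return (left, upper, right, lower) boxes for runs of non-empty columns."""
    boxes = []
    start = None  # column index where the current run of inked columns began
    for i, s in enumerate(col_sums):
        if s > 0 and start is None:
            start = i                              # a character starts here
        elif s == 0 and start is not None:
            boxes.append((start, 0, i, height))    # the character ended at i - 1
            start = None
    if start is not None:                          # run reaches the right edge
        boxes.append((start, 0, len(col_sums), height))
    return boxes


# Column sums shaped like the output above (X replaced by a made-up 510)
X = 510
jcolors = [0, 0, X, X, X, 0, 0, 0, X, X, X, 0, 0, 0, X, X, 0, 0]
print(find_boxes(jcolors, 50))
# [(2, 0, 5, 50), (8, 0, 11, 50), (14, 0, 16, 50)]
```

Each box follows PIL's (left, upper, right, lower) convention, so it can be passed directly to `im.crop(box)`.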
Then normalize each cropped image by scaling it to the same size.

## Extracting Features

What the hell are "features", and how do we use them to classify?

Imagine there's a message we want to check for spam:

• Does it contain the word "sex"?
• Does it have links in it?
• How many words are in it?
• ....

We get a lot of Trues, Falses and values.

Together they form a vector in a so-called "feature space":
(1, 0, 2, 431, ..., 1, 0)

Does this vector belong to the set of vectors we consider spam?

We get results from classification models.
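To make the spam example concrete, here is a minimal sketch that turns a message into such a vector. The function name `message_features` and the exact regular expressions are illustrative assumptions, not part of any library.

```python
import re


def message_features(text):
    """Turn a message into a tuple of numbers (a feature vector)."""
    # 1 if the word "sex" appears as a whole word, else 0
    contains_sex = int(re.search(r"\bsex\b", text, re.IGNORECASE) is not None)
    # 1 if there is at least one http(s) link, else 0
    has_links = int(re.search(r"https?://\S+", text) is not None)
    # Number of whitespace-separated words
    word_count = len(text.split())
    return (contains_sex, has_links, word_count)


print(message_features("Hot sex pics at http://example.com now"))  # (1, 1, 6)
print(message_features("See you at lunch"))                        # (0, 0, 4)
```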

So, what are the features for character images?

• How many pixels in the image are black?
• How many white areas are in the image?
• Are there curves in the image, and roughly where?
• ....
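As an illustration of the first two questions, the sketch below counts black pixels and white areas on a small grid. It assumes the image has already been turned into a list of rows with 1 for black and 0 for white, and that "white area" means a 4-connected region of white pixels; both function names are hypothetical.

```python
from collections import deque


def count_black(grid):
    """Number of black pixels; grid is a list of rows, 1 = black, 0 = white."""
    return sum(sum(row) for row in grid)


def count_white_areas(grid):
    """Number of 4-connected white regions, found by breadth-first flood fill."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    areas = 0
    for r in range(h):
        for c in range(w):
            if grid[r][c] == 1 or seen[r][c]:
                continue
            areas += 1                      # found an unvisited white region
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < h and 0 <= nx < w
                    if inside and grid[ny][nx] == 0 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return areas


# A tiny hand-drawn "0": the hole is one white area, the outside is another
zero = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(count_black(zero), count_white_areas(zero))  # 8 2
```

A feature like this can separate shapes with different numbers of holes: a "0" drawn this way has two white areas, while an "8" would typically have three.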

Some features are very hard to extract.

Sometimes a naive approach is enough:
we just use the "pixel value array".

```
def im2array(im):
    # 1 for every non-white pixel, 0 for white (expects a cleaned 'L' mode image)
    return [int(x != 255) for x in im.tobytes()]
```

## Classifying

It's hard to explain the underlying maths,
but it is easy to implement.

• We have features:
  • a list of feature vectors
  • (vec1, vec2, vec3, ...)
• We have targets:
  • a list of target values
  • (tar1, tar2, tar3, ...)
• We want to predict:
  • given a new feature vector,
  • which target is it most like?
  • predict(vector)

## Simplest Classification Using `sklearn`

```
>>> # let's train an XOR operator
>>> import sklearn.svm
>>> clf = sklearn.svm.SVC()
>>> data = [(1, 1),
...         (1, 0),
...         (0, 1),
...         (0, 0)]
>>> targets = [0, 1, 1, 0]
>>> clf.fit(data, targets)
>>> clf.predict([(0, 0)])
array([0])
```

• Training:
  • data: integer arrays
  • target: arrays of 0-35 (representing [0-9A-Z])
  • clf.fit(data, target)
• Predicting:
  • array = preprocess the char image into an array
  • code = clf.predict(array)
  • char = look up code in [0-9A-Z]
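The lookup between codes 0-35 and the characters [0-9A-Z] in the steps above can be sketched as follows. All names here (`CHARS`, `char_to_code`, `code_to_char`, `decode`, `FakeClassifier`) are hypothetical; `FakeClassifier` is only a stand-in so the example runs without a trained model, and any object with a `predict` method that accepts a list of feature vectors could take its place.

```python
import string

# Lookup table for the 36 classes: codes 0-9 -> '0'-'9', 10-35 -> 'A'-'Z'
CHARS = string.digits + string.ascii_uppercase


def char_to_code(ch):
    """Target value used when training, e.g. 'A' -> 10."""
    return CHARS.index(ch)


def code_to_char(code):
    """Inverse lookup used when predicting, e.g. 10 -> 'A'."""
    return CHARS[int(code)]


def decode(clf, arrays):
    """Predict one code per character array and join the looked-up chars."""
    codes = clf.predict(arrays)  # arrays: one feature vector per character
    return "".join(code_to_char(c) for c in codes)


class FakeClassifier:
    """Stand-in for a trained model: 'predicts' the sum of each vector."""

    def predict(self, arrays):
        return [sum(a) % len(CHARS) for a in arrays]


print(decode(FakeClassifier(), [[1, 0, 1], [1] * 10, [1] * 35]))  # 2AZ
```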

## Classification Methods in Brief

• Bayes
• Decision Tree
• SVM
• kNN
• MLP(NN)
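Of the methods listed, kNN is probably the easiest to write by hand. The sketch below is a plain-Python illustration of the idea, not how `sklearn` implements it; the function name `knn_predict` is made up, and it reuses the XOR data from the earlier example.

```python
def knn_predict(data, targets, vector, k=1):
    """Classify `vector` by majority vote among its k nearest training vectors."""

    def dist2(a, b):
        # Squared Euclidean distance (the square root is not needed for ranking)
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Indices of the k training vectors closest to the query
    nearest = sorted(range(len(data)), key=lambda i: dist2(data[i], vector))[:k]
    votes = {}
    for i in nearest:
        votes[targets[i]] = votes.get(targets[i], 0) + 1
    # Highest vote count wins; ties go to the label seen first among neighbours
    return max(votes, key=votes.get)


data = [(1, 1), (1, 0), (0, 1), (0, 0)]
targets = [0, 1, 1, 0]
print(knn_predict(data, targets, (0.9, 0.1)))  # 1
```

With `k=1` the prediction is simply the target of the closest training vector, which is why the query near `(1, 0)` is classified as 1.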
