Decoding Weibo CAPTCHA in Python

Jingchao Hu
jingchaohu AT gmail DOT com
2013 Favbuy Corp.


The Problem





Also this






They seem to be hard





Not really... Let's break it



Wait, before that


Let's review some common techniques





Steps

  1. Removing Noise
  2. Separating Characters
  3. Extracting Features
  4. Classifying Features 
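
The four steps chain together naturally. A minimal sketch, assuming each step is passed in as a function (`decode` and its four parameters are hypothetical names used only for illustration):

```python
def decode(image, clean, split, featurize, classify):
    """Run the four steps in order and return the decoded text.

    All four callables are placeholders supplied by the caller:
      clean(image)     -> image with the noise removed
      split(image)     -> list of single-character images
      featurize(char)  -> feature vector for one character image
      classify(vector) -> predicted character (a one-character string)
    """
    cleaned = clean(image)                     # 1. removing noise
    chars = split(cleaned)                     # 2. separating characters
    vectors = [featurize(c) for c in chars]    # 3. extracting features
    return "".join(classify(v) for v in vectors)  # 4. classifying features
```

The rest of the talk fills in one possible version of each placeholder.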





An example






After Cleaning






After Splitting






After Classifying

8452

Yay!




It looks simple

but it isn't...

... until I explain the details


Cleaning Noise


Using PIL (Pillow), it is often easier
than you might think




    >>> from PIL import Image
    >>> im = Image.open('8452.png')

    >>> # Convert to 256-level grayscale mode
    >>> im = im.convert('L')

    >>> # See how the colors are distributed, as (count, gray level) pairs
    >>> im.getcolors()
    [(30, 0), (21, 14), (7, 15), (1, 17), (1, 19), (33, 22), (3, 23), (1, 25),
     (2, 26), (2, 28), (1, 29), (5, 30), (4, 32), (1, 34), (3, 36), (15, 38), (10, 39), ...,
     (1, 245), (2, 246), (5, 247), (4, 249), (4, 251), (5, 253), (2190, 255)]

    >>> # 255 -> white, 0 -> black
    >>> # If we remove all the "whiter" colors
    >>> im = im.point(lambda x: 255 if x > 128 else x)

    >>> # See how this policy works
    >>> im.show()

    >>> # The new color distribution?
    >>> im.getcolors()
    ...

    >>> # New attempts
    ...





...


After a few rounds of guess and try

...

 


We got this


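def clean(im):
    # Convert to 8-bit grayscale
    im = im.convert('L')
    # Turn light pixels (> 128) and pure black pixels (0) white
    im = im.point(lambda x: 255 if x > 128 or x == 0 else x)
    # Turn every remaining non-white pixel black
    im = im.point(lambda x: 0 if x < 255 else 255)
    return im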
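
To see what the two passes do to individual pixels, here is a small sketch that applies the same rule to plain nested lists, so it runs without Pillow. `clean_pixels` and the sample values are made up for illustration. The `x == 0` test presumably targets noise drawn in pure black; that reading is an assumption.

```python
def clean_pixels(rows):
    """Binarize a grayscale image given as rows of 0-255 integers.

    Same rule as the two point() passes: light pixels (> 128) and pure
    black pixels (== 0) become white, every remaining pixel becomes black.
    """
    def is_background(x):
        return x > 128 or x == 0

    return [[255 if is_background(x) else 0 for x in row] for row in rows]


sample = [
    [255, 200, 0, 60],
    [129, 128, 1, 255],
]
print(clean_pixels(sample))
# [[255, 255, 255, 0], [255, 0, 0, 255]]
```

Only the mid-gray values (1 to 128) survive as black, which is where the character strokes sit in this example.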


It's surprisingly simple, isn't it?


 


Extracting Chars


This is often harder than you imagine


 



For example:


Luckily, in Weibo's case,
it's quite easy

 


    >>> # Divide by columns: sum the "darkness" of each column
    >>> w, h = im.size
    >>> data = im.load()
    >>> jcolors = [sum(255 - data[i, j] for j in range(h)) for i in range(w)]

    >>> print(jcolors)
    [0, 0, X, X, X, 0, 0, 0, X, X, X, 0, 0, 0, X, X, 0, 0]

    >>> # ...
    >>> # crop the image according to the "boxes"
    >>> # (0, 0, 30, 50), (30, 0, 60, 50), ....
    >>> # ...
    >>> # then normalize the images: scale them to the same size

                                             
 


Extracting Features



What the hell are "features,"
and how do we use them to classify?

 



Imagine there's a message we want to check for spam:

  • Does it contain the word "sex"?
  • What about "buy"?
  • Does it have links in it?
  • How many words are in it?
  • ....

 


We get a lot of Trues, Falses, and values.
 
Together they form a vector in a so-called "feature space":
(1, 0, 2, 431, ..., 1, 0)
 
Does this vector belong to the set of vectors we have labeled as spam?

A classification model gives us the answer.
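
As a rough sketch, the questions above can be turned into a vector like this. The four features and their order are illustrative assumptions, and `message_features` is a hypothetical name, not a tested spam filter.

```python
def message_features(text):
    """Turn a message into a small numeric feature vector."""
    tokens = text.lower().split()
    # Strip common punctuation so "buy!" still counts as "buy"
    words = [t.strip(".,!?\"'") for t in tokens]
    links = [t for t in tokens if t.startswith(("http://", "https://"))]
    return [
        int("sex" in words),   # contains the word "sex"?
        int("buy" in words),   # contains the word "buy"?
        len(links),            # how many links?
        len(tokens),           # how many whitespace-separated words?
    ]


print(message_features("Buy now at http://example.com and buy more!"))
# [0, 1, 1, 7]
```

Every message, whatever its length, ends up as a vector of the same size, which is what a classifier needs.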


So, what are the features of a character image?

  • How many pixels in the image are black?
  • How many white areas are in the image?
  • Are there curves in the image, and roughly where?
  • ....

Some features are very hard to extract.
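
The first two features are among the easier ones. A minimal sketch, assuming the character is a grid of 0 (white) and 1 (black) and that a "white area" means a 4-connected region of white cells; both function names and the tiny glyphs are made up for illustration.

```python
def count_black(grid):
    """Number of black (1) cells in the grid."""
    return sum(row.count(1) for row in grid)


def count_white_areas(grid):
    """Number of 4-connected regions of white (0) cells, found by flood fill."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    areas = 0
    for y in range(h):
        for x in range(w):
            if grid[y][x] != 0 or seen[y][x]:
                continue
            areas += 1                      # found a new white region
            seen[y][x] = True
            stack = [(y, x)]
            while stack:                    # mark the whole region as seen
                cy, cx = stack.pop()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and grid[ny][nx] == 0 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
    return areas


ZERO = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
EIGHT = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(count_black(ZERO), count_white_areas(ZERO))    # 8 2
print(count_black(EIGHT), count_white_areas(EIGHT))  # 13 3
```

The background counts as one white area, so a "0" gives 2 and an "8" gives 3, which already separates the two.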




Sometimes, a naive approach is enough.
We just use the "pixel value array":

    def im2array(im):
        # 1 for every non-white pixel, 0 for white; expects a mode 'L' image
        return [int(x != 255) for x in bytearray(im.tobytes())]
    



Classifying


It's hard to explain the underlying maths,
but it is easy to implement.

 


  • We have features:
    • a list of feature vectors
    • (vec1, vec2, vec3, ...)
  • We have targets:
    • a list of target values
    • (tar1, tar2, tar3, ...)
  • We want to predict:
    • given a new feature vector,
    • which target does it most likely belong to?
    • predict(vector)
 

Simplest Classification Using `sklearn`

    >>> # let's train an XOR operator
    >>> import sklearn.svm
    >>> clf = sklearn.svm.SVC()
    >>> data = [(1, 1),
    ...         (1, 0),
    ...         (0, 1),
    ...         (0, 0)]
    >>> targets = [0, 1, 1, 0]
    >>> clf.fit(data, targets)
    >>> # predict() takes a list of vectors and returns one label per vector
    >>> clf.predict([(0, 0)])[0]
    0

 


CAPTCHA Classification

  • Training:
      • data: integer arrays
      • target: an array of codes 0-35 (representing [0-9A-Z])
      • clf.fit(data, target)
  • Predicting:
      • array = preprocess the char image into an array
      • code = clf.predict([array])[0]
      • char = look up the code in [0-9A-Z]
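
The mapping between characters and the 0-35 codes is a simple table lookup. A minimal sketch, assuming digits come first and then the uppercase letters (`ALPHABET`, `char_to_code`, and `code_to_char` are hypothetical names):

```python
import string

# "0123456789ABC...Z": the position in this string is the class code
ALPHABET = string.digits + string.ascii_uppercase


def char_to_code(ch):
    """Character -> integer target used for training."""
    return ALPHABET.index(ch.upper())


def code_to_char(code):
    """Integer predicted by the classifier -> character."""
    return ALPHABET[code]


print([char_to_code(c) for c in "8452"])   # [8, 4, 5, 2]
print(code_to_char(10), code_to_char(35))  # A Z
```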

Classification Methods in Brief

  • Bayes
  • Decision Tree
  • SVM
  • kNN
  • MLP(NN) 
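
Of these, kNN is the easiest to write out by hand. Below is a toy sketch of the idea with the same fit/predict shape as before; `TinyKNN` is a made-up class for illustration, not part of any library, and it is far slower than a real implementation.

```python
from collections import Counter


class TinyKNN:
    """Toy k-nearest-neighbours classifier: vote among the k closest training vectors."""

    def __init__(self, k=1):
        self.k = k

    def fit(self, data, targets):
        # "Training" is just remembering the examples
        self.data = [tuple(v) for v in data]
        self.targets = list(targets)
        return self

    def predict(self, vectors):
        return [self._predict_one(v) for v in vectors]

    def _predict_one(self, vector):
        # Squared Euclidean distance to every training vector
        dists = [
            (sum((a - b) ** 2 for a, b in zip(vector, train)), i)
            for i, train in enumerate(self.data)
        ]
        nearest = sorted(dists)[: self.k]
        votes = Counter(self.targets[i] for _, i in nearest)
        return votes.most_common(1)[0][0]


clf = TinyKNN(k=1).fit([(1, 1), (1, 0), (0, 1), (0, 0)], [0, 1, 1, 0])
print(clf.predict([(0, 0), (0.9, 0.1)]))
# [0, 1]
```

With the "pixel value array" as the feature vector, the same vote picks the training character whose pixels differ least from the new one.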

     



Q&A?


