Do you hate CAPTCHA? 15 minutes is what you need to hack it!

in #machinelearning, 8 years ago

Everyone hates CAPTCHA, right?

A CAPTCHA is a program that protects websites against bots by generating and grading tests that humans can pass but current computer programs cannot. - captcha.net

The author of this article on Medium breaks a WordPress plugin called "Really Simple CAPTCHA".

This is the part of the plugin's PHP code that defines which characters and fonts are used:

public function __construct() {
        /* Characters available in images */
        $this->chars = 'ABCDEFGHJKLMNPQRSTUVWXYZ23456789';

        /* Length of a word in an image */
        $this->char_length = 4;

        /* Array of fonts. Randomly picked up per character */
        $this->fonts = array(
            dirname( __FILE__ ) . '/gentium/GenBkBasR.ttf',
            dirname( __FILE__ ) . '/gentium/GenBkBasI.ttf',
            dirname( __FILE__ ) . '/gentium/GenBkBasBI.ttf',
            dirname( __FILE__ ) . '/gentium/GenBkBasB.ttf',
        );

        /* ... rest of the constructor omitted ... */
}

So it generates 4-character CAPTCHAs, picking one of four fonts at random for each character. The character set leaves out “O” and “I” (and the digits 0 and 1) to avoid user confusion. That leaves us with a total of 32 possible letters and numbers that we need to recognize.
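The count is easy to check. Here is a small Python sketch (the variable names are only illustrative) that rebuilds the character set from the PHP constructor and confirms the numbers:

```python
import string

# Character set copied from the plugin's constructor above.
CAPTCHA_CHARS = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"
CODE_LENGTH = 4

# Characters that are left out because they are easy to confuse.
all_candidates = string.ascii_uppercase + string.digits
excluded = sorted(set(all_candidates) - set(CAPTCHA_CHARS))

num_classes = len(CAPTCHA_CHARS)          # classes the recognizer must tell apart
num_codes = num_classes ** CODE_LENGTH    # distinct 4-character codes

print(excluded)
print(num_classes)
print(num_codes)
```

With 32 classes per character there are 32^4 = 1,048,576 possible codes, which is a good reason to recognize one character at a time rather than whole codes.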

Tools

To get results we need to train our machine learning system. Since we have access to the WordPress plugin, we can easily generate thousands of PNG images along with the exact answer to each CAPTCHA puzzle.

[Image: sample CAPTCHA images generated by the plugin]

We know that this CAPTCHA always contains exactly four characters. What we want now is to split each image into its individual characters, so that the system only has to recognize one letter at a time:

[Image: a CAPTCHA split into single-letter images]

So we’ll start with a raw CAPTCHA image:

[Image: raw CAPTCHA image]

And then we’ll convert the image into pure black and white (this is called thresholding) so that it will be easy to find the continuous regions:

[Image: the CAPTCHA after thresholding]

Next, we’ll use OpenCV’s findContours() function to detect the separate parts of the image that contain continuous blobs of pixels of the same color:

[Image: the regions detected in the CAPTCHA]

Then it’s just a simple matter of saving each region out as a separate image file. And since we know each image should contain four letters from left-to-right, we can use that knowledge to label the letters as we save them. As long as we save them out in that order, we should be saving each image letter with the proper letter name.
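The labeling idea can be sketched in plain Python. It assumes the correct text of each CAPTCHA is known (for example because we generated the image ourselves) and that the regions are given as `(x, y, w, h)` tuples; `label_letters` is a hypothetical helper name.

```python
def label_letters(captcha_text, boxes, expected_length=4):
    """Pair each bounding box with its character, reading left to right.

    Returns None when the number of boxes does not match the code length,
    so the caller can skip that image instead of saving mislabeled data.
    """
    if len(captcha_text) != expected_length or len(boxes) != expected_length:
        return None
    ordered = sorted(boxes, key=lambda box: box[0])  # sort by x coordinate
    return list(zip(captcha_text, ordered))

# Hypothetical example: the detector returns the boxes in arbitrary order.
boxes = [(39, 4, 10, 16), (5, 4, 10, 16), (56, 4, 10, 16), (22, 4, 10, 16)]
pairs = label_letters("2A2X", boxes)
print(pairs)
```

Skipping images that do not yield exactly four regions is a cautious design choice: a bad training example does more harm than a missing one.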

But sometimes the CAPTCHAs have overlapping letters, like this:

[Image: a CAPTCHA with overlapping letters]

In that case, two letters end up extracted as a single region, like this:

[Image: two overlapping letters detected as one region]

If we don’t handle this problem, we’ll end up creating bad training data. We need to fix this so that we don’t accidentally teach the machine to recognize those two squashed-together letters as one letter.

A simple hack here is to say that if a single contour area is a lot wider than it is tall, that means we probably have two letters squished together. In that case, we can just split the conjoined letter in half down the middle and treat it as two separate letters:

[Image: the conjoined region split down the middle into two letters]
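Here is one way this hack could look in plain Python. The text only says “a lot wider than it is tall”, so the 1.25 width-to-height ratio is our assumption, as is the helper name `split_wide_boxes`.

```python
def split_wide_boxes(boxes, max_ratio=1.25):
    """Split any (x, y, w, h) box that is much wider than tall into two halves."""
    result = []
    for (x, y, w, h) in boxes:
        if w / h > max_ratio:
            # Probably two letters squished together: cut down the middle.
            half = w // 2
            result.append((x, y, half, h))
            result.append((x + half, y, w - half, h))
        else:
            result.append((x, y, w, h))
    return result

# Three detected regions; the middle one is wide enough to hold two letters.
boxes = [(5, 4, 10, 16), (22, 4, 27, 16), (56, 4, 10, 16)]
print(split_wide_boxes(boxes))
```

The split is crude, since the two letters are rarely exactly the same width, but it gives us four regions again so the image can still be labeled.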

Now that we have a way to extract individual letters, let’s run it across all the CAPTCHA images we have. The goal is to collect different variations of each letter. We can save each letter in its own folder to keep things organized.

Here’s a picture of what the “W” folder looked like after the extraction of all the letters:

[Image: contents of the “W” folder after extraction]
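The folder-per-letter layout can be sketched with the standard library alone. The helper name `next_letter_path`, the zero-padded file names and the temporary directory are assumptions made for illustration.

```python
import os
import tempfile

def next_letter_path(output_dir, letter, counts):
    """Create the folder for `letter` if needed and return a fresh file path."""
    folder = os.path.join(output_dir, letter)
    os.makedirs(folder, exist_ok=True)
    counts[letter] = counts.get(letter, 0) + 1
    return os.path.join(folder, "{:06d}.png".format(counts[letter]))

output_dir = tempfile.mkdtemp()   # stand-in for a real dataset directory
counts = {}
paths = [next_letter_path(output_dir, letter, counts) for letter in "2A2X"]
for path in paths:
    print(os.path.relpath(path, output_dir))
```

Each returned path is where the cropped letter image would be written, for example with `cv2.imwrite(path, letter_image)`. The per-letter counter keeps a repeated letter from overwriting an earlier file.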

Create the neural network

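We use a small convolutional neural network with two convolutional layers and two fully connected layers. Defining this architecture only takes a few lines of code using Keras:

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Build the neural network!
model = Sequential()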

# First convolutional layer with max pooling
model.add(Conv2D(20, (5, 5), padding="same", input_shape=(20, 20, 1), activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))

# Second convolutional layer with max pooling
model.add(Conv2D(50, (5, 5), padding="same", activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))

# Hidden layer with 500 nodes
model.add(Flatten())
model.add(Dense(500, activation="relu"))

# Output layer with 32 nodes (one for each possible letter/number we predict)
model.add(Dense(32, activation="softmax"))

# Ask Keras to build the TensorFlow model behind the scenes
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# Train the neural network
# (X_train/X_test hold the 20x20 letter images, Y_train/Y_test the one-hot encoded labels)
model.fit(X_train, Y_train, validation_data=(X_test, Y_test), batch_size=32, epochs=10, verbose=1)

After 10 passes over the training data set, we hit nearly 100% accuracy. At this point, we should be able to automatically bypass this CAPTCHA whenever we want! We did it!
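To turn the network's output back into text, we pick the most likely of the 32 classes for each letter. The sketch below uses plain Python lists in place of real predictions. It assumes that output node i corresponds to the i-th character in sorted order, which depends on how the labels were encoded during training; `decode_prediction` and `one_hot` are hypothetical helper names.

```python
CAPTCHA_CHARS = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"
# Assumption: output node i corresponds to the i-th character in sorted order.
CLASS_LABELS = sorted(CAPTCHA_CHARS)

def decode_prediction(probabilities_per_letter):
    """Pick the most likely character for each letter and join them."""
    text = []
    for probabilities in probabilities_per_letter:
        best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
        text.append(CLASS_LABELS[best_index])
    return "".join(text)

def one_hot(index, size=32):
    """Build a fake, fully confident network output for demonstration."""
    return [1.0 if i == index else 0.0 for i in range(size)]

fake_outputs = [one_hot(CLASS_LABELS.index(ch)) for ch in "2A2X"]
print(decode_prediction(fake_outputs))
```

In real use, each probability vector would come from calling `model.predict` on one extracted 20x20 letter image.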


Read the full article here.
