Do you hate CAPTCHA? 15 minutes is what you need to hack it!

mkrvx0 (28) in #machinelearning • 8 years ago

Everyone hates CAPTCHA, right?

A CAPTCHA is a program that protects websites against bots by generating and grading tests that humans can pass but current computer programs cannot. - captcha.net

The author of this article on Medium hacks a WordPress plugin called "Really Simple CAPTCHA".

This is the piece of PHP code which generates the chars:

public function __construct() {
    /* Characters available in images */
    $this->chars = 'ABCDEFGHJKLMNPQRSTUVWXYZ23456789';

    /* Length of a word in an image */
    $this->char_length = 4;

    /* Array of fonts. Randomly picked up per character */
    $this->fonts = array(
        dirname( __FILE__ ) . '/gentium/GenBkBasR.ttf',
        dirname( __FILE__ ) . '/gentium/GenBkBasI.ttf',
        dirname( __FILE__ ) . '/gentium/GenBkBasBI.ttf',
        dirname( __FILE__ ) . '/gentium/GenBkBasB.ttf',
    );

    /* (rest of the constructor not shown in this excerpt) */
}

So it generates four-character CAPTCHAs, picking one of four fonts at random for each character. The character string never contains “O” or “I” (nor the digits 0 and 1), to avoid user confusion. That leaves us with a total of 32 possible letters and numbers that we need to recognize.
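The count is easy to check. The short sketch below copies the character string from the PHP constructor and, assuming each of the four positions is drawn independently from it, works out how many different codes are possible.

```python
import string

# Character set copied from the plugin's constructor above.
CHARS = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"
CODE_LENGTH = 4

num_classes = len(CHARS)
# Assumption: every position is drawn independently from CHARS.
num_codes = num_classes ** CODE_LENGTH

# Uppercase letters and digits that the plugin leaves out.
excluded = sorted(set(string.ascii_uppercase + string.digits) - set(CHARS))

print(num_classes)  # 32
print(num_codes)    # 1048576
print(excluded)     # ['0', '1', 'I', 'O']
```

About a million possible codes is far too many to memorize as whole images, which is why the rest of the approach recognizes one character at a time.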

Tools

To get results, we need to train our machine learning system. Since we have access to the WordPress plugin, we can easily generate thousands of PNG images, each paired with the correct answer to its CAPTCHA.

We know that this CAPTCHA always contains four characters. What we want now is to split each image into single characters, so that the system only has to recognize one letter at a time:

So we’ll start with a raw CAPTCHA image:

And then we’ll convert the image into pure black and white (this is called thresholding) so that it will be easy to find the continuous regions:
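As a rough illustration of thresholding, here is a minimal pure-Python sketch on a tiny made-up grayscale grid. It is not the article's code: in practice you would most likely use OpenCV's `cv2.threshold` on the real image. The cutoff of 128 and the assumption that letters are darker than the background are both illustrative choices.

```python
def threshold(gray, cutoff=128):
    """Return a binary image: 1 where a pixel is darker than cutoff (ink), else 0."""
    return [[1 if value < cutoff else 0 for value in row] for row in gray]

# Tiny hypothetical 3x5 grayscale image (0 = black, 255 = white).
gray = [
    [250, 40, 245, 30, 252],
    [248, 35, 200, 90, 251],
    [249, 20, 130, 127, 250],
]

binary = threshold(gray)
for row in binary:
    print(row)
# [0, 1, 0, 1, 0]
# [0, 1, 0, 1, 0]
# [0, 1, 0, 1, 0]
```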

Next, we’ll use OpenCV’s findContours() function to detect the separate parts of the image that contain continuous blobs of pixels of the same color:
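To make the idea of "continuous blobs" concrete, the sketch below finds connected regions with a plain flood fill and returns one bounding box per blob. This is a stand-in written for illustration, not how `findContours()` works internally, and the choice of 4-connectivity (only up, down, left and right neighbours count as touching) is an assumption.

```python
from collections import deque

def find_regions(binary):
    """Return (x, y, w, h) bounding boxes of 4-connected blobs of 1-pixels."""
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if binary[y0][x0] != 1 or seen[y0][x0]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            min_x = max_x = x0
            min_y = max_y = y0
            while queue:
                x, y = queue.popleft()
                min_x, max_x = min(min_x, x), max(max_x, x)
                min_y, max_y = min(min_y, y), max(max_y, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and binary[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((min_x, min_y, max_x - min_x + 1, max_y - min_y + 1))
    return boxes

# Hypothetical binary image containing two separate blobs.
binary = [
    [0, 1, 1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 1, 0],
]
print(find_regions(binary))  # [(1, 0, 2, 3), (4, 0, 2, 3)]
```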

Then it’s just a simple matter of saving each region out as a separate image file. And since we know each image should contain four letters from left-to-right, we can use that knowledge to label the letters as we save them. As long as we save them out in that order, we should be saving each image letter with the proper letter name.
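The left-to-right labeling can be sketched as follows. The function name, the box format `(x, y, w, h)` and the sample values are hypothetical, and skipping any image where the number of regions does not match the number of characters is an assumption about how to keep bad samples out of the training set.

```python
def label_regions(boxes, captcha_text):
    """Pair each bounding box with its character, ordered by x coordinate."""
    if len(boxes) != len(captcha_text):
        # Wrong number of regions found: skip this image rather than
        # save mislabeled training data.
        return None
    ordered = sorted(boxes, key=lambda box: box[0])
    return list(zip(captcha_text, ordered))

# Hypothetical boxes in the arbitrary order a region finder might return them.
boxes = [(52, 4, 14, 20), (6, 5, 15, 19), (75, 3, 13, 21), (30, 6, 12, 18)]
labeled = label_regions(boxes, "2A5Z")
print(labeled)
# [('2', (6, 5, 15, 19)), ('A', (30, 6, 12, 18)), ('5', (52, 4, 14, 20)), ('Z', (75, 3, 13, 21))]
```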

But sometimes the CAPTCHAs have overlapping letters like this:

So we extract two letters as one region like this:

If we don’t handle this problem, we’ll end up creating bad training data. We need to fix this so that we don’t accidentally teach the machine to recognize those two squashed-together letters as one letter.

A simple hack here is to say that if a single contour area is a lot wider than it is tall, that means we probably have two letters squished together. In that case, we can just split the conjoined letter in half down the middle and treat it as two separate letters:
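A minimal sketch of that splitting rule is shown below. The text only says "a lot wider than it is tall", so the ratio of 1.25 used here is an illustrative assumption, as are the function name and the sample boxes.

```python
def split_wide_regions(boxes, max_ratio=1.25):
    """Split any (x, y, w, h) box whose width/height exceeds max_ratio in half."""
    result = []
    for x, y, w, h in boxes:
        if w / h > max_ratio:
            # Probably two letters squished together: cut down the middle.
            half = w // 2
            result.append((x, y, half, h))
            result.append((x + half, y, w - half, h))
        else:
            result.append((x, y, w, h))
    return result

# Hypothetical boxes: the middle one is wide enough to hold two letters.
boxes = [(5, 4, 14, 20), (22, 3, 31, 21), (58, 5, 13, 19)]
print(split_wide_regions(boxes))
# [(5, 4, 14, 20), (22, 3, 15, 21), (37, 3, 16, 21), (58, 5, 13, 19)]
```

Cutting at the midpoint is crude, since the two letters rarely have equal width, but it keeps each training sample to roughly one character.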

Now that we have a way to extract individual letters, let’s run it across all the CAPTCHA images we have. The goal is to collect different variations of each letter. We can save each letter in its own folder to keep things organized.
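One way to organize the output is a folder per character with a running counter for file names. The sketch below only builds the paths and creates the folders; actually writing the cropped image (for example with OpenCV's `cv2.imwrite`) is left out. The helper name, the six-digit file naming and the temporary output directory are all assumptions made for illustration.

```python
import os
import tempfile
from collections import Counter

def next_letter_path(root, letter, counts):
    """Return root/<letter>/<NNNNNN>.png, creating the folder if needed."""
    folder = os.path.join(root, letter)
    os.makedirs(folder, exist_ok=True)
    counts[letter] += 1
    return os.path.join(folder, f"{counts[letter]:06d}.png")

root = tempfile.mkdtemp()  # hypothetical output directory
counts = Counter()         # how many samples of each character so far

paths = [next_letter_path(root, letter, counts) for letter in "W2WA"]
relative = [os.path.relpath(path, root) for path in paths]
print(relative)
```

The second "W" lands in the same folder as the first with the next counter value, so every folder ends up holding many variations of a single character.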

Here’s a picture of what the “W” folder looked like after the extraction of all the letters:

Create the neural network

Defining this neural network architecture only takes a few lines of code using Keras:

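# Imports (a recent Keras release is assumed; older ones used other module paths)
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

# Build the neural network!
model = Sequential()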

# First convolutional layer with max pooling
model.add(Conv2D(20, (5, 5), padding="same", input_shape=(20, 20, 1), activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))

# Second convolutional layer with max pooling
model.add(Conv2D(50, (5, 5), padding="same", activation="relu"))
model.add(MaxPooling2D(pool_size=(2, 2), strides=(2, 2)))

# Hidden layer with 500 nodes
model.add(Flatten())
model.add(Dense(500, activation="relu"))

# Output layer with 32 nodes (one for each possible letter/number we predict)
model.add(Dense(32, activation="softmax"))

# Ask Keras to build the TensorFlow model behind the scenes
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

# Train the neural network

model.fit(X_train, Y_train, validation_data=(X_test, Y_test), batch_size=32, epochs=10, verbose=1)
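The layer sizes in this model can be checked by hand. The sketch below does not use Keras at all. It just follows the shapes through the network, relying on the facts that `padding="same"` keeps the width and height unchanged and that each 2x2 pooling step halves them, and then counts the trainable parameters. The helper names are made up for this example.

```python
def conv_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a 2D convolution."""
    return kernel * kernel * in_channels * filters + filters

def dense_params(inputs, units):
    """Weights plus one bias per unit for a fully connected layer."""
    return inputs * units + units

size = 20                # input images are 20x20x1
size //= 2               # first conv keeps 20x20, first pool gives 10x10x20
size //= 2               # second conv keeps 10x10, second pool gives 5x5x50
flat = size * size * 50  # length of the Flatten output

total = (
    conv_params(5, 1, 20)      # first convolutional layer
    + conv_params(5, 20, 50)   # second convolutional layer
    + dense_params(flat, 500)  # hidden layer
    + dense_params(500, 32)    # output layer
)

print(flat)   # 1250
print(total)  # 667102
```

Nearly all of the parameters sit in the 500-node hidden layer, which is typical for small networks of this shape.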

After 10 passes over the training data set, we hit nearly 100% accuracy. At this point, we should be able to automatically bypass this CAPTCHA whenever we want! We did it!
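To turn the network's output back into text, each letter's 32 probabilities have to be mapped to a character. The sketch below shows the idea with made-up probability rows instead of real model output. It assumes that output node i corresponds to the i-th character of the plugin's character string; the real mapping depends on how the labels were encoded during training, so treat that ordering as hypothetical.

```python
# Assumption: output node i stands for CHARS[i].
CHARS = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"

def decode(probabilities, classes=CHARS):
    """Map one probability row per letter to the most likely character."""
    text = ""
    for row in probabilities:
        best = max(range(len(row)), key=lambda i: row[i])
        text += classes[best]
    return text

def fake_prediction(char, size=len(CHARS)):
    """Build a made-up softmax-like row that favours the given character."""
    row = [0.01] * size
    row[CHARS.index(char)] = 0.69
    return row

# Four rows, one per extracted letter, in left-to-right order.
rows = [fake_prediction(char) for char in "7JXH"]
print(decode(rows))  # 7JXH
```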

Read the full article here.

#deeplearning #python #google #computers
