A fun, quick and detailed step-by-step beginner's guide to scraping websites with Ruby Mechanize.

in #tutorial, 7 years ago

Scraping means extracting data from websites. This is a beginner's step-by-step guide to scraping websites with Ruby, a powerful programming language. We'll use the Mechanize gem in this tutorial. Gems are Ruby's packaged, reusable libraries.


1 - Installing Ruby and the required gems

Assuming you don't have Ruby set up on your system, we'll install it first.
Download the installation file: rubyinstaller-2.3.3-x64_2.exe.

If you use a download manager such as IDM (Internet Download Manager), it will prompt you with a download dialog. Click on Start Download and the file will be saved to the path shown in the 'Save As' field.


Now navigate to the folder where you downloaded the file and double-click it to begin the installation. Once it finishes, you should see a 'Start Command Prompt with Ruby' option when you press the Windows key. Click on it to open a command prompt.

Let's install the two gems we'll need for this tutorial.
The first is Nokogiri. Type the following command and press Enter.

 gem install nokogiri

The command prints the gem's installation progress as it runs.

Next, we'll install Mechanize, which is built on top of Nokogiri. Type the following command and press Enter.

 gem install mechanize

The command prints the gem's installation progress as it runs.
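
To confirm that both gems are available, a short check like the one below can be saved to a file and run with `ruby` from the same command prompt. This is a minimal sketch rather than part of the original tutorial: the helper name `installed_versions` is made up for this example, and it relies only on the RubyGems API that ships with Ruby.

```ruby
# Report which version(s) of each required gem RubyGems can find.
# `installed_versions` is a hypothetical helper name used only in this sketch.
def installed_versions(gem_names)
  gem_names.each_with_object({}) do |name, result|
    # find_all_by_name returns an empty array when the gem is not installed.
    specs = Gem::Specification.find_all_by_name(name)
    result[name] = specs.map { |spec| spec.version.to_s }
  end
end

installed_versions(%w[nokogiri mechanize]).each do |name, versions|
  if versions.empty?
    puts "#{name}: not installed (run: gem install #{name})"
  else
    puts "#{name}: #{versions.join(', ')}"
  end
end
```

If either gem is reported as not installed, rerun the matching `gem install` command before moving on.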


2 - Installing a Text Editor - Sublime Text

Our environment setup is now complete and we can begin writing the program. We'll need a text editor for this. I use Sublime Text. You can download a copy from [Sublime Text Build 3126 x64 Setup.exe](https://download.sublimetext.com/Sublime%20Text%20Build%203126%20x64%20Setup.exe).

3 - Naming the file and Writing the code

Now open Sublime Text and save a new file under any name you like, ending with the extension '.rb'. I named mine SteemScrap.rb.

Let us write the first line of code for our web scraper.

 require 'mechanize'

This tells Ruby to load the Mechanize gem so that we can use its classes and methods.

 agent = Mechanize.new

We created a new Mechanize object and called it 'agent'.

Now, let's fetch a page using the Mechanize object 'agent'.

 page = agent.get('http://www.google.com/')

This returns the fetched page to us as a Mechanize::Page object, which holds the page's parsed elements.

Now we can pretty-print the page as follows:

 pp page

If you've followed all the instructions so far, your file contains the four lines of code shown above, in that order.


4 - Saving and executing the ruby program from command prompt

Save your file [Ctrl+S] and go back to your command prompt. Use the 'cd' (change directory) command to navigate to the folder where you saved your Ruby file. So, if your command prompt shows C:\Users\YourName> and you saved SteemScrap.rb on the Desktop, type the following command to navigate to the Desktop.
  C:\Users\YourName>cd Desktop

Now let's run the program by executing the following command from C:\Users\YourName\Desktop> in the command prompt window.

   ruby SteemScrap.rb

When the command runs, Mechanize prints the elements of the page. Scroll down and you should see the form element, which lists the form's name and its fields and buttons.

You just retrieved the contents of a webpage without visiting it in a browser. Good job! Now we need to add the following lines to our existing code.

 google_form = page.form('f')
 google_form.q = 'steemit'
 page = agent.submit(google_form)
 pp page

5 - Firing a query and fetching the results

The final code is the four lines from step 3 followed by the four lines added at the end of step 4.

What is 'f' and what is 'q' and what did we just do?

'f' is the name of the form and 'q' is the name of the input field where you type your search query in the Google search box.

Calling page.form('f') fetches the form on the google.com page whose name attribute is 'f', and we store it in 'google_form'. Similarly, assigning a value to 'google_form.q' sets the value of the input field named 'q', which is the Google search box. In our example, I used the search string 'steemit'.


6 - Now, follow step 4, execute the code and voilà!

I will not show the returned results, just to keep the curiosity alive. :D

Image Source: Image 1
