Extracting data by using Jsoup


What Will I Learn?

Greetings! This tutorial focuses on pulling data from a web page and filtering it with Jsoup, a Java library for working with HTML.

  • You will learn how to use the Jsoup library in a Java program,
  • You will learn how to design the algorithm and write the code that processes the data,
  • You will learn how to test the code in an IDE.

Requirements

  • An IDE to test the code (preferably Eclipse IDE for Java Developers),
  • An interest in coding and applications,
  • Basic knowledge of Java.

Difficulty

Some prior knowledge of Java is highly recommended. The level of this tutorial is:

  • Intermediate

Tutorial Contents

In this tutorial we will focus on obtaining and processing data that is available online. Since websites hold a great deal of useful data, this program takes its data from Wikipedia, specifically from the table in the list of emergency telephone numbers by country.
Our aim is to produce the sample outputs shown below,


1.png


1.png

To begin coding, the very first thing we need to do is import the classes that we are going to use. The first one is java.io.IOException, the checked exception that Java throws when an input/output (I/O) operation fails, for example when the web page cannot be reached. We import it so that our main method can declare it.

import java.io.IOException;
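To make the role of IOException more concrete, here is a minimal sketch that uses only the standard library. It assumes a hypothetical file name that does not exist, so the read fails in a predictable way; a failed network call reports its problem through the same exception type. The class and method names are illustrative and not part of this tutorial's program.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Illustrative helper, not part of the tutorial's program.
class IoFailureDemo {

    // Returns true if the file could be read, false if an IOException occurred.
    static boolean canRead(String fileName) {
        try {
            Files.readAllBytes(Paths.get(fileName));
            return true;
        } catch (IOException e) {
            // IOException is checked: the compiler makes us catch or declare it.
            System.out.println("I/O failed: " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical file name, assumed not to exist.
        System.out.println(canRead("no-such-file-for-this-demo.txt"));
    }
}
```

Catching the exception like this is an alternative to adding throws IOException to the method signature; which one fits better depends on whether the program can do something useful after the failure.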

Now we need to add the Jsoup classes that we will use to connect to the desired web page, pull its HTML and pick out the parts we need. Jsoup is not part of the JDK, so its JAR file must be on the project's build path. In other words, Jsoup lets us read and navigate the HTML of the desired sites.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

The last import is Scanner, which reads the values the user types and returns them as a String, an int or another type.

import java.util.Scanner;

Now we need to declare our class,

public class wikipedia

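Next we define the main method. public means the method is visible from outside the class, static means it belongs to the class itself rather than to an object, and void means it returns no value. The throws IOException part passes any connection error on to the caller.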

public static void main(String[] args) throws IOException

Now we can create the Scanner object that we will later use to read the user's input,

Scanner keyboard =new Scanner (System.in);

Then we tell the program where the data is located,

Document doc = Jsoup.connect("http://www.wikizero.info/index.php?q=aHR0cHM6Ly9lbi53aWtpcGVkaWEub3JnL3dpa2kvTGlzdF9vZl9lbWVyZ2VuY3lfdGVsZXBob25lX251bWJlcnM").get();

This line downloads the page and stores its entire HTML in the doc object. In this example the page is wikizero's list of emergency telephone numbers. Once we pull the HTML of the site, our document becomes,

1.png

Now we need to get the table rows out of this HTML, since we only need the country and number parts of the page. To do that we pass a CSS selector that matches the table's classes and its tr rows, and store the result in an Elements object,

Elements initialtable = doc.select("table.wikitable.sortable tr");

Out of the entire page's HTML we have now kept only the table rows. The first row holds additional information rather than a country, so we delete it by writing,

initialtable.remove(0);
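The selection step can be tried without a network connection. The sketch below parses a small HTML string instead of the live page; the table content (Examplia, Testland and their numbers) is made-up sample data, and the class and method names are hypothetical. It assumes the Jsoup JAR is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Offline sketch: the HTML below is made-up sample data, not the real page.
class SelectDemo {

    static final String SAMPLE_HTML =
          "<table class=\"wikitable sortable\">"
        + "<tr><th>Country</th> <th>Police</th></tr>"
        + "<tr><td>Examplia</td> <td>111</td></tr>"
        + "<tr><td>Testland</td> <td>222</td></tr>"
        + "</table>";

    // Selects every row of the table and drops the first (header) row.
    static Elements dataRows(String html) {
        Document doc = Jsoup.parse(html);
        Elements rows = doc.select("table.wikitable.sortable tr");
        rows.remove(0); // index 0 is the row built from <th> cells
        return rows;
    }

    public static void main(String[] args) {
        for (Element row : dataRows(SAMPLE_HTML)) {
            System.out.println(row.text());
        }
    }
}
```

Note that the selector returns the rows of every matching table on the page in one list, so remove(0) only drops the header of the first table.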

Then we create a string named full, which will later hold the text of each row while we look for the desired country.

String full = " ";

Next we ask the user to enter a country. The user can also enter only the beginning of a name. The comparison is case-sensitive, so for example typing 'Uni' returns all the countries starting with Uni, namely United Arab Emirates, United Kingdom, United States of America and United States Virgin Islands.

System.out.println("Where are you from?");


String country=keyboard.next();
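One detail worth knowing here: next() reads only up to the first whitespace, so a multi-word entry such as "United Kingdom" is cut down to "United". The sketch below compares next() with nextLine(); it reads from a String instead of System.in so the behaviour is reproducible, and the class and method names are illustrative only.

```java
import java.util.Scanner;

// Illustrative comparison, not part of the tutorial's program.
class InputDemo {

    // Reads a single whitespace-delimited token, as keyboard.next() does.
    static String readToken(String typed) {
        Scanner in = new Scanner(typed);
        return in.next();
    }

    // Reads the whole line, so multi-word names survive.
    static String readLine(String typed) {
        Scanner in = new Scanner(typed);
        return in.nextLine().trim();
    }

    public static void main(String[] args) {
        System.out.println(readToken("United Kingdom")); // prints: United
        System.out.println(readLine("United Kingdom"));  // prints: United Kingdom
    }
}
```

For this tutorial's prefix search a single token is usually enough, but nextLine() may be the better choice if full country names should be accepted.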

Now we can examine the data and turn it into a user-friendly form. For this we must first go through the elements and check whether each one is needed. A for loop is used to go through the rows,

for (Element row : initialtable )

Inside the loop, the text of the current row is assigned to full, so that each country is handled as a separate line.

full = row.text();

Now we use the indexOf function to find where the entered text occurs in the row,

int index = full.indexOf(country);

This returns 0 when the row starts with the text the user entered. It returns -1 when the text does not occur in the row at all, and a positive number when it occurs later in the row. By using an if statement we keep only the rows that start with the user's input.

if (index == 0 )
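The check above is a prefix test, and it is case-sensitive. As a hedged sketch, the same test can be written with startsWith, and a case-insensitive variant can be built with regionMatches. The helper names and the sample row are made up for illustration and are not part of the tutorial's program.

```java
// Illustrative helpers, not part of the tutorial's program.
class PrefixMatch {

    // Same result as rowText.indexOf(typed) == 0, written more directly.
    static boolean startsWithExact(String rowText, String typed) {
        return rowText.startsWith(typed);
    }

    // Case-insensitive variant: compares the first typed.length() characters.
    static boolean startsWithIgnoreCase(String rowText, String typed) {
        return rowText.regionMatches(true, 0, typed, 0, typed.length());
    }

    public static void main(String[] args) {
        String row = "Examplia 111; 222"; // made-up sample row
        System.out.println(startsWithExact(row, "Exa"));      // prints: true
        System.out.println(startsWithExact(row, "exa"));      // prints: false
        System.out.println(startsWithIgnoreCase(row, "exa")); // prints: true
    }
}
```

Whether lower-case input should match is a design choice; the tutorial's code keeps the case-sensitive behaviour.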

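Now we have only the rows for the country the user entered. We can change a few things to make the text readable for the user. First, everything inside square brackets is deleted, since Wikipedia uses brackets for reference markers.
To do that the code below was implemented. The backslashes are doubled because the pattern is written as a Java string, and .*? stops at the first closing bracket so that each marker is removed separately,

full = full.replaceAll("\\[.*?\\]", "");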
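The choice between a greedy pattern (.*) and a reluctant one (.*?) matters when a row holds more than one reference marker. The following sketch shows the difference on a made-up sample row; the class and method names are illustrative only.

```java
// Illustrative comparison, not part of the tutorial's program.
class BracketDemo {

    // Greedy: ".*" runs to the LAST ']' in the text.
    static String stripGreedy(String s) {
        return s.replaceAll("\\[.*\\]", "");
    }

    // Reluctant: ".*?" stops at the FIRST ']' after each '['.
    static String stripReluctant(String s) {
        return s.replaceAll("\\[.*?\\]", "");
    }

    public static void main(String[] args) {
        String row = "Examplia 111[1]; 222[2]"; // made-up sample row
        System.out.println(stripGreedy(row));    // prints: Examplia 111
        System.out.println(stripReluctant(row)); // prints: Examplia 111; 222
    }
}
```

With the greedy pattern the second number is lost, because everything between the first opening bracket and the last closing bracket is deleted.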

Now we can add a heading that tells the user which numbers to call in case of an emergency,

System.out.println(country+ " Emergency number");


System.out.println("--------------------------");

Now we split the row text at each '; ' and print every piece on its own line, which gives a more readable, user-friendly output,

for (String retval : full.split("; ")) {
    System.out.println(retval);
}
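The split relies on every separator being exactly a semicolon followed by one space. As a sketch under that caveat, the variant below tolerates missing or extra spaces by splitting on a small regular expression and skipping empty pieces. The class name, method name and sample text are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper, not part of the tutorial's program.
class SplitDemo {

    // Splits on ';' followed by any amount of whitespace and skips empty pieces.
    static List<String> parts(String rowText) {
        List<String> result = new ArrayList<>();
        for (String piece : rowText.split(";\\s*")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample row with inconsistent spacing after the semicolons.
        System.out.println(parts("Examplia 111;222;  333")); // prints: [Examplia 111, 222, 333]
    }
}
```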

Finally, we print an empty line to separate multiple results.

System.out.println();

As a result, our entire code becomes,

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import java.util.Scanner;

public class wikipedia {
    public static void main(String[] args) throws IOException {
        Scanner keyboard = new Scanner(System.in);

        // Download the page and parse its HTML.
        Document doc = Jsoup.connect("http://www.wikizero.info/index.php?q=aHR0cHM6Ly9lbi53aWtpcGVkaWEub3JnL3dpa2kvTGlzdF9vZl9lbWVyZ2VuY3lfdGVsZXBob25lX251bWJlcnM").get();

        // Keep only the table rows, then drop the first row, which is not a country.
        Elements initialtable = doc.select("table.wikitable.sortable tr");
        initialtable.remove(0);

        String full = " ";
        System.out.println("Where are you from?");
        String country = keyboard.next();
        // System.out.println(doc);  // uncomment to inspect the raw HTML

        for (Element row : initialtable) {
            full = row.text();
            // indexOf returns 0 only when the row starts with the typed text (case-sensitive).
            int index = full.indexOf(country);
            if (index == 0) {
                // Remove reference markers such as [1]; .*? stops at the first ']'.
                full = full.replaceAll("\\[.*?\\]", "");
                System.out.println(country + " Emergency number");
                System.out.println("--------------------------");
                // Print each '; '-separated part on its own line.
                for (String retval : full.split("; ")) {
                    System.out.println(retval);
                }
                System.out.println();
            }
        }
    }
}

Now we can test the code with different inputs,

When we enter 'C', we can see the countries starting with the letter C,

1.png


1.png


When we enter 'Fr' we can see the numbers for France and French Polynesia.


1.png


When we type 'Republic' (with a capital R, since the match is case-sensitive) we get the numbers for Republic of Congo, Republic of Korea, Republic of China (Taiwan) and Republic of Macedonia.

1.png


In our next tutorial we will add country flags, number the multiple results and improve the output format.

Curriculum

To have a basic knowledge you may check my previous tutorials,



Posted on Utopian.io - Rewarding Open Source Contributors
