How To Scrape Websites and APIs In Node.js Using Osmosis (by example)


Repository

https://github.com/Vheissu/node-osmosis-scraping-examples

What Will I Learn?

  • You will learn how to install Node.js packages
  • You will learn how to build a Node.js application
  • You will learn how to use the Node Osmosis package to scrape web pages and applications
  • You will learn how to scrape pages with pagination and how to use selectors
  • You will learn how to scrape using realistic real-world examples

Requirements

  • A computer running any operating system: macOS, Windows, or Linux
  • A code editor or IDE such as Visual Studio Code where you will be writing code
  • Node.js, which can be downloaded from https://nodejs.org and provides us with the ability to run Node applications
Required Knowledge:
  • Familiarity with JavaScript and, in particular, Node.js
  • Familiarity with PowerShell or Terminal

Difficulty

Intermediate

Tutorial Contents

In this tutorial, you will learn how to create your own web scraper using Node.js. By the end of this tutorial, you will have a functional suite of scrapers that can be extended and adapted for many different use cases.

There are quite a few examples online of how to scrape using various scraping libraries and packages, but most of them give you a single example and fail to showcase some of the more powerful features of the library. In this tutorial, we'll cover multiple concepts: more than just the basics and more than your standard scraping tutorial.

The following tutorial steps assume you have installed Node.js on your machine and that it is functioning.

Disclaimer: web scraping is a legal grey area. While you're unlikely to get into trouble, please be aware that some site owners might not want their sites scraped and do their best to prevent it. There could be legal consequences if you scrape a site or API without permission. Consider this tutorial a learning experience, not encouragement or instruction to go and illegally scrape websites and APIs.

Getting Started

Somewhere on your machine (your home directory, or the root of your C: drive), create a folder where our application will live. Let's call our project "web-scraper" (I know, original name).

Once you have done this, open up a terminal window in that folder and type npm init. This will use the Node Package Manager initialiser to create a package.json file where our dependencies will live.

Follow the prompts and, for the sake of simplicity, just stick with the default options (you can keep hitting the enter key). Eventually you will get to the final step, which asks whether you want to create the application with these settings; hit enter to confirm.
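
For reference, the terminal steps so far look something like this (npm init -y is an optional shortcut that accepts all the defaults for you):

mkdir web-scraper
cd web-scraper
npm init -y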

Finally, we want to install the Osmosis package in our application. Assuming you still have your terminal window open in the project directory, run the following:

npm install osmosis --save

Let's Write Some Scrapers

In your code editor, create a file in the root of the project directory called index.js, which is where our scraper examples will live. In a couple of examples, we will be scraping steemit.com. I am aware that there are clients you can use to get this data from the Steem blockchain; the intention is to showcase scraping using a familiar example.

Before we continue, add the following require to the top of your newly created index.js file to import the Osmosis package. This is just a standard CommonJS-style import.

const osmosis = require('osmosis');

All of our examples are going to live inside functions that return promises. A promise is an object representing a value that may be available now or at some point in the future, which suits scraping because we don't know how long a request will take.
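
As a quick refresher, here is the bare promise pattern every scraper in this tutorial follows (the function name and values are placeholders):

// The bare promise pattern our scrapers follow (placeholder names and values)
function exampleScraper() {
    return new Promise((resolve, reject) => {
        // Kick off asynchronous work here, then either:
        resolve('our scraped data'); // succeed with a value
        // or fail with an error: reject(new Error('something went wrong'));
    });
}

exampleScraper().then(data => console.log(data));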

Understanding Osmosis Methods

If you are familiar with jQuery, Osmosis will feel somewhat similar. The concept of scraping a page is that you chain together different method calls to get a result.

The find method is like jQuery's $() function, in which you can use CSS selectors as well as XPath selectors to query for elements and values on the page.

One behaviour of the find method to remember is that it doesn't take on the previous context and always searches the entire page you're loading. Chaining a second find call will not mean the previous find context is honoured.

The set method is what we use to construct an object to return for our query. It allows us to build up an object of page contents, which as you will see is crucial to scraping content from a page and returning it.

The data method is perhaps the most important. It receives the data that has been queried and assembled by the set method; the callback function is passed the resulting object itself.

In all of our examples, we assign this value to an outer variable (either directly or by pushing it into an array).
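
Putting these methods together, the general shape of every scraper in this tutorial looks something like this (the URL and selectors are placeholders):

const osmosis = require('osmosis');

osmosis
    .get('https://example.com')         // load a page
    .find('.some-selector')             // query for matching elements
    .set({ title: 'h1' })               // build an object for each matched element
    .data(obj => console.log(obj))      // receive each constructed object
    .error(err => console.error(err))   // handle any errors
    .done(() => console.log('done'));   // fires once everything has finished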

There is plenty more to read in the official Node Osmosis repository on GitHub.

Scraping Meta Tags

Let's start out lightly with a basic example. If you want to get meta tag information from a webpage, such as its meta description or Open Graph tags, the function below will come in handy.

Breaking this example down: we tell Osmosis to fetch steemit.com, then we tell it to query the page for the head element. We then create a new object populated with the title and description from the Open Graph meta tags. As you can probably discern from reading the code, we are querying for the og:title and og:description tags.

// Wrap functionality in a function
function getOpenGraphMeta() {
    // Return a promise as execution of request is time-dependent
    return new Promise((resolve, reject) => {
        let response;

        osmosis
            // Tell Osmosis to load steemit.com
            .get('https://steemit.com')
            // We want to get the metatags in head, so provide the head element as a value to find
            .find('head')
            // Set creates the final object of data we will receive when calling .data.
            // The values are selector lookups: find the meta tag with this property and return its content attribute
            .set({
                title: "meta[property='og:title']@content",
                description: "meta[property='og:description']@content"
            })
            // Store a copy of the above object in our response variable
            .data(res => response = res)
            // If we encounter an error we will reject the promise
            .error(err => reject(err))
            // Resolve the promise with our response object
            .done(() => resolve(response));
    });
}

getOpenGraphMeta().then(res => {
    console.log(res);
});
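
Save the file and run it with Node; you should see an object containing the title and description logged to the console:

node index.js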

Scraping Multiple Elements

Using code similar to the above, we are now going to scrape the trending posts right off the Steemit homepage and return them as an array of objects containing the title and summary text.

As you'll see, there isn't much more work required to parse multiple elements instead of a single one; it's just CSS selectors.

In our example, we are scraping trending posts from the homepage. The posts are contained inside of an unordered list (ul) and each post is a list item (li). In our find statement, we specify that we want all immediate li elements of .PostsList__summaries. Using set, we construct an object for each item found. We then push each object into an array from within the data method.

function getHomePageTrending() {
    return new Promise((resolve, reject) => {
        let response = [];

        osmosis
            // Load steemit.com
            .get('https://steemit.com')
            // Find all posts in the .PostsList__summaries list
            .find('.PostsList__summaries > li')
            // Create an object with title and summary
            .set({
                title: 'h2',
                summary: '.PostSummary__body'
            })
            // Push post into an array
            .data(res => response.push(res))
            .error(err => reject(err))
            .done(() => resolve(response));
    });
}

getHomePageTrending().then(res => {
    console.log(res);
});

Scrape Upcoming Music Releases From Metacritic

In this example, we are going to be scraping metacritic.com's list of upcoming music releases. I don't really know of an API that gives you upcoming music releases in such an easy-to-digest format. This is where things start to get a little more advanced.

We are still using familiar concepts, but we are also going to write a little bit of code to create the structure we require, as we're parsing a flat table list and need to build a hierarchy.

In this example, you will see that we are constructing a nested structure, classifying the releases under their appropriate release dates. We hook into the process with a callback function using then, which receives the current context and, more importantly, the current data we've scraped. We then populate an object keyed by release date with its releases.

Admittedly it's quite a bit to digest, but to break it down to its simplest explanation what we are doing is:

  • Getting the first table on the page (there are two tables; the first has release dates and the second has albums that are yet to be announced, with no release dates)
  • We then query all table row elements (tr) in the table because each element lives in a tr (including the release dates)
  • We then create an object that queries for a table heading (th) and if it exists, creates a property called releaseDate and gets the value. If not, then this property doesn't exist.
  • We then query for table cells (td) with the classes artistName and albumTitle, which I am sure you can guess the contents of
  • Inside of then, we check whether we have a release date: if we do, we create a new object property keyed by the date; otherwise we have a release, which we push into the current date's array

function getUpcomingMusicReleases() {
    return new Promise((resolve, reject) => {
        let currentDate = null;
        let releasesMap = {};

        osmosis
            // Load upcoming music page
            .get('http://www.metacritic.com/browse/albums/release-date/coming-soon/date')
            // Find the first music table and all of its tr elements
            .find('.releaseCalendar .musicTable:first tr')
            // Construct an object containing release date and its relevant releases
            .set({
                // Get the release date (if relevant)
                releaseDate: 'th',
                // For every release on this date create an array
                artist: '.artistName', 
                album: '.albumTitle'
            })
            // Transform our flat data into a tree like structure by creating a nested object
            .then((context, data) => {
                // Is the current object a release date?
                if (typeof data.releaseDate !== 'undefined') {
                    // Store the current date
                    currentDate = data.releaseDate;

                    // Create an empty array where we can push releases into
                    releasesMap[currentDate] = [];
                } else {
                    // This is a release, not a date; push the object with its artist and album name
                    releasesMap[currentDate].push(data);
                }
            })
            .error(err => reject(err))
            .done(() => resolve(releasesMap));
    });
}

getUpcomingMusicReleases().then(res => {
    console.log(res);
});
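
If everything works, the resolved releasesMap should be shaped roughly like this (the dates, artists and albums below are made-up placeholders):

// Approximate shape of the resolved object (placeholder values)
{
    'Jul 13, 2018': [
        { artist: 'Some Artist', album: 'Some Album' },
        { artist: 'Another Artist', album: 'Another Album' }
    ],
    'Jul 20, 2018': [
        { artist: 'Yet Another Artist', album: 'Yet Another Album' }
    ]
}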

Scrape New York Times Global Stories

Another fun little exercise is scraping the list of global news stories from the New York Times website. This time, we are also going to configure a couple of options on the Osmosis package itself.

// Make the user agent that of a browser (Google Chrome on Windows)
osmosis.config('user_agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36');
// If a request fails, don't keep retrying (by default this is 3)
osmosis.config('tries', 1);
// Concurrent requests (by default this is 5) make this 2 so we don't hammer the site
osmosis.config('concurrency', 2);

function getNewsTitles() {
    return new Promise((resolve, reject) => {
        let stories = [];

        osmosis
            // The URL we are scraping
            .get('https://www.nytimes.com/section/world')
            // Find all news stories with the class story-body
            .find('.story-body')
            .set({
                // Get the first link href value inside of each story body
                link: 'a:first@href',
                // Get the news story title
                title: '.headline',
                // Get the news story summary
                summary: '.summary',
                // Get the image source for the story
                img: 'img@src'
            })
            .data(data => {
                // Push each news story found into an array we'll send back when we are done
                stories.push(data);
            })
            .error(err => reject(err))
            .done(() => resolve(stories));
    });
}

getNewsTitles().then(stories => {
    // Should contain all news stories found
    console.log(stories);
});

Scrape Top Free Apps On Google Play

The Google Play Store is where Android phone users go to download new apps. The store also offers a web interface for non-phone users, so we can easily scrape useful information, like the top free apps, from the store. Once more putting into practice similar concepts from previous examples, let's scrape these top apps and get some useful info back.

function topFreeApps() {
    return new Promise((resolve, reject) => {
        let list = [];

        osmosis
            // Scrape top free apps
            .get('https://play.google.com/store/apps/collection/topselling_free')
            // All apps exist inside of a div with class card-content
            .find('.card-content')
            // Create an object of data
            .set({
                link: '.card-click-target@href', // Link to the app
                title: 'a.title', // Title
                img: '.cover-image@src' // App image
            })
            .data(data => {
                // Each iteration, push the data into our array
                list.push(data);
            })
            .error(err => reject(err))
            .done(() => resolve(list));
    });
}

topFreeApps().then(list => {
    console.log(list);
});

Scraping Google Search Results (pagination)

In this example, we are going to scrape Google search results and use pagination to get more than the first page of results. In our case, we will scrape the first 3 pages.

We are also going to be conservative and careful here, because Google is good at detecting scrapers. We'll set a user agent, limit retries to just one, and set concurrency to 2 (no more than two requests made at a time).

As you'll see, the following example isn't really too unlike the previous ones. We are still using get to request the Google search results page, and then we use the paginate method, giving it a selector identifying where our pagination links are and, as the second argument, the number of times we want to paginate (we're doing 3).

Finally, you'll see the newly introduced delay method, which accepts a millisecond value. This delays each pagination call (in our case by 2000 milliseconds, or 2 seconds), which stops us from rapidly iterating through Google results and being detected.

osmosis.config('user_agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36');
osmosis.config('tries', 1);
osmosis.config('concurrency', 2);

function scrapeGoogle() {
    return new Promise((resolve, reject) => {
        let list = [];

        osmosis
            // Do Google search
            .get('https://www.google.co.in/search?q=cats')
            .paginate('#navcnt table tr > td a[href]', 3)
            .delay(2000) // delay 2 seconds between pagination calls
            // Each search result lives inside an element with class g
            .find('.g')
            // Create an object of data
            .set({
                link: 'h3 a@href', // Search result link
                title: 'h3', // Title
            })
            .data(data => {
                // Each iteration, push the data into our array
                list.push(data);
            })
            .error(err => reject(err))
            .done(() => resolve(list));
    });
}

scrapeGoogle().then(list => {
    console.log(list);
});

Conclusion

In this tutorial, we scraped real-world websites using the powerful Node Osmosis package. We covered different selector types, querying for multiple sets of data, pagination, changing the user agent, and fetching and returning different types of data.

Using what we applied in the examples above, you should now be armed with the knowledge required to create your own scrapers for different needs, from scraping real estate listings or job advertisements to social media.

Proof of Work Done

https://github.com/Vheissu/node-osmosis-scraping-examples
