[Tutorial] Building a Web Crawler With Python

in #utopian-io, 5 years ago (edited)

Repository: Python Repository Github

Software Requirements:
Visual Studio Code (or any preferred code editor)

What you will learn:
In this tutorial you will learn to build a web crawler with these functionalities:

  • Automate the boring stuff: crawl the websites you visit often
  • Identify and navigate through HTML with Python
  • Use the BeautifulSoup class from the bs4 package.

Difficulty: Beginner

Tutorial
Today, I'm stepping out of JavaScript and Ionic to do a tutorial based on Python. This tutorial explains how you can crawl websites and find content in a much easier way than the conventional one. Web crawling is the basis of all web search, from Google to DuckDuckGo. So let's hope you learn a thing or two.
The first thing you should understand about web crawling is that it isn't allowed on every site. Before you choose to crawl any website, read the terms and conditions and make sure the website allows, or at least doesn't explicitly prohibit, crawling.
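Reading the terms is the real check, but if you also want a quick programmatic hint, Python's standard library ships a robots.txt parser. Here is a minimal sketch; the URLs are just the site we use later in this tutorial, so swap in whatever site you plan to crawl.

from urllib.robotparser import RobotFileParser

# A minimal sketch: checking a site's robots.txt with the standard library
rp = RobotFileParser()
rp.set_url("http://www.netnaija.com/robots.txt")
rp.read()

# can_fetch() reports whether the given user agent may crawl the given path
print(rp.can_fetch("*", "http://www.netnaija.com/videos/movies"))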
So let's get to it.
Every website is built with HTML, which divides the page into sections based on readable markup. The idea is to use this HTML content to find the particular thing on the page that we want, then access that data or use it for a different purpose.

In this tutorial we will be using a package called bs4, which is Beautiful Soup 4, the latest version of the library. If you are running on an older system, you could use the older Beautiful Soup 3 instead.
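To make that concrete, here is a tiny, self-contained taste of what bs4 does with a made-up snippet of HTML (we will install and import it properly in the next step):

from bs4 import BeautifulSoup

# A tiny, made-up page just to illustrate the idea
html = "<html><body><a href='/videos/movies/legacies'>Legacies</a></body></html>"

soup = BeautifulSoup(html, "html.parser")  # parse the raw HTML into a navigable tree
link = soup.find("a")                      # grab the first <a> tag on the page
print(link.get("href"))                    # prints: /videos/movies/legacies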

So the first thing you want to do is install this package.
If you have Python 3.0 or later installed, head to your terminal and use the Python package installer, pip, to install it:

pip install bs4 #Wait for it to download all the dependencies.

If you're using an IDE such as PyCharm instead, head to the settings, find the package manager, type 'bs4' and install it from there. The difference is that the package will then only be available to the interpreter PyCharm set up for that project.

So start off a new project and deal with the imports.

from bs4 import BeautifulSoup  # HTML parsing module
import requests  # Module used to send HTTP requests
import re  # Module used to filter the data we collect

The uses of the modules we imported are stated in the comments beside them, and we'll put them to work properly in a bit.

The next question is what website we would like to crawl. For the purpose of this tutorial I will be creating a crawler that searches a site offering movie download links for any movie I want to download; in effect, it works like a Google search limited to the content of that one site.

You could take a peek at the site we want to crawl here.
So, getting to it, we'll start off with the basics of any code: functions.

def thecrawler(maxpages, movie):
    # I am creating a function that will let us search multiple pages.
    # For now it just calls the single-page search below.
    page = 1
    while page < maxpages:
        searchnetnaija(movie, maxpages)
        page += 1  # without this the loop would never end

The next thing is to build the searchnetnaija function.

def searchnetnaija(movie, max_pages):
    search = True  # This flag keeps the code from running without end
    while search:
        print('This works')
        url1 = "http://www.netnaija.com/videos/movies"
        sourcecode = requests.get(url1)  # Get the page from the site
        plain_text = sourcecode.text  # The response body as readable text
        soup = BeautifulSoup(plain_text, 'lxml')  # Turn it into Beautiful Soup format
        links = []
        for link in soup.find_all('a'):  # Collect every link on the page
            lin = link.get('href')
            if lin:  # skip anchors that have no href attribute
                links.append(lin)
        search = False

        for dat in links:  # Search the collected links for the movie you want
            x = re.search(r'movies', dat)
            if x:
                s = r'%s' % movie
                y = re.search(s, dat)
                if y:
                    print(dat)

If you run this, it may fail with an error or not run at all, depending on your setup. The reason is that we named lxml as the parser that builds our soup, and it isn't installed by default.
To fix this, head to your console and use the Python package installer to install the parser like this:

pip install lxml
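Alternatively, if you would rather not install an extra package, Python ships with a built-in parser; a sketch of that swap would be:

# Alternative: use Python's built-in parser, so no extra install is needed.
# It is a bit slower and less lenient than lxml, but works out of the box.
soup = BeautifulSoup(plain_text, 'html.parser')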

Now run your code in your console with a keyword and see whether the crawler is able to find the link for the desired movie within the site. Let me try it with a movie I'm sure is there.
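For reference, the search is kicked off by calling the function at the bottom of the script, something like this (the exact links printed depend on what the site is listing at the time):

# Kick off the crawler: search the movies section for 'legacies'
thecrawler(2, 'legacies')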
(Screenshot: the crawler's output listing the matching links for the 'legacies' keyword)

And so we see that the crawler has searched the page for the links we need when using the 'legacies' keyword. So this works.
Tip: If you're following along and would like to test what you have actually learnt, add a little more functionality by using the maxpages argument I added, so the crawler works through a few more pages. Just interpolate the page number into the URL and increase it after every loop, as sketched below.
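A minimal sketch of that idea might look like the following; the page-numbered URL pattern here is an assumption, so check how the site you are crawling actually formats its pagination, and you would also need to adapt searchnetnaija to accept the URL it should fetch.

def thecrawler(maxpages, movie):
    page = 1
    while page <= maxpages:
        # Assumed pagination pattern; adjust to match the site's real URL scheme
        url = "http://www.netnaija.com/videos/movies?page=" + str(page)
        searchnetnaija(movie, url)
        page += 1  # move on to the next page after each pass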

Thanks for reading.
You can head to my GitHub to see the code for this tutorial.


Thank you for your contribution @yalzeee.
After reviewing your tutorial, we suggest the following points for improvement:

  • We suggest explaining at the beginning of your tutorial what you are going to teach. For example, this tutorial could have opened with an explanation of what a crawler is.

  • Include more pictures throughout your tutorial, so it is more enjoyable to read.

Thank you for following our moderation feedback; your tutorials are getting better. Thank you for your work.

Your contribution has been evaluated according to Utopian policies and guidelines, as well as a predefined set of questions pertaining to the category.

To view those questions and the relevant answers related to your post, click here.


Need help? Write a ticket on https://support.utopian.io/.
Chat with us on Discord.
[utopian-moderator]

Thank you for your review, @portugalcoin! Keep up the good work!



