[Tutorial] Building a Web Crawler With Python
Repository: Python Repository on GitHub
Editor: Visual Studio Code (or any preferred code editor)
What you will learn:
In this tutorial you will learn to build a web crawler with these functionalities:
- Automate the boring stuff and crawl websites you visit often
- Identify and navigate through the HTML content of a page
- Use the Beautiful Soup package in the bs4 module
Today, I jumped out of Ionic to do a tutorial based on Python. This tutorial explains how you can crawl websites and find content in a much easier way than the conventional one. Web crawling is the basis of all web search, from Google to DuckDuckGo. So let's hope you learn a thing or two.
The first thing you should understand with web crawling is that it isn't allowed for every site. Before you choose to crawl any website, read the terms and conditions and make sure the website allows or doesn't explicitly prohibit crawling.
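Alongside the terms and conditions, most sites publish their crawling rules in a robots.txt file, and Python's standard library can read it. Here is a minimal sketch using urllib.robotparser; the rules and URLs below are illustrative, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body directly; against a live site you would
# call rp.set_url("https://example.com/robots.txt") and rp.read().
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "http://example.com/videos/movies"))  # allowed path
print(rp.can_fetch("*", "http://example.com/private/page"))   # disallowed path
```

If can_fetch returns False for a URL, your crawler should skip it.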
So let's get to it.
Every website template is built with HTML, which divides the page into sections based on readable code. The idea is to use this HTML content to find a particular thing on the page that we want, and to access it or use that data for a different purpose.
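To see the idea in miniature before bringing in any third-party packages, here is a sketch using only the standard library's html.parser: walk a page's HTML and collect every link's href attribute. The sample page is made up for illustration.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it encounters."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

page = '<html><body><a href="/movies/legacies">Legacies</a> <a href="/about">About</a></body></html>'
collector = LinkCollector()
collector.feed(page)
print(collector.links)  # → ['/movies/legacies', '/about']
```

Beautiful Soup does the same kind of traversal for us, with far less boilerplate.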
In this tutorial we will be using a module called bs4, the package for the latest version of Beautiful Soup; if you are running on an older system you could choose to use an earlier release of the library.
So the first thing you want to do is install this module. If you have Python 3.0 or later installed, head to your terminal and use pip, the Python package installer, to install the module:
pip install bs4  # wait for it to download all the dependencies
If you're using an IDE such as PyCharm, head to the settings, find the package manager, type the name of the 'bs4' module, and install it. The difference is that if you do this, the module will only be available within the PyCharm environment.
So start off a new project and deal with the imports.
from bs4 import BeautifulSoup  # HTML parsing module
import requests  # module used to send HTTP requests
import re  # module used to filter the data collected
The uses of the modules we imported are stated beside them, but we'll get to using them properly in a bit.
The next question is which website we would like to crawl. For the purpose of this tutorial I will be creating a crawler that searches a site offering movie download links for any movie I want to download; it will work like a Google search over the content within that site alone.
You can take a peek at the site we want to crawl here.
So getting to it, we'll start off with the basics of any code: functions.

def thecrawler(max_pages, movie):
    # A function that could help us search multiple pages;
    # for now it just searches one
    page = 1
    while page < max_pages:
        searchnetnaija(movie, max_pages)
        page += 1  # advance, otherwise the loop never ends
The next thing is to build the searchnetnaija function itself.

def searchnetnaija(movie, max_pages):
    search = True  # this is to prevent the code from running without end
    while search:
        url1 = "http://www.netnaija.com/videos/movies"
        sourcecode = requests.get(url1)  # get the code from the site
        plain_text = sourcecode.text  # change the code to readable text
        soup = BeautifulSoup(plain_text, 'lxml')  # parse it into Beautiful Soup format
        links = []
        for link in soup.find_all('a'):  # filter for all links
            lin = link.get('href')
            if lin:  # skip anchors without an href
                links.append(lin)
        search = False
        for dat in links:  # search for the movie you want to find
            x = re.search(r'movies', dat)
            if x:
                s = re.escape(movie)  # escape the keyword for safe regex use
                y = re.search(s, dat)
                if y:
                    print(dat)
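The filtering step in that function can be pulled out into a small, testable helper. This is just a sketch; filter_movie_links is a hypothetical name, and the sample hrefs below are made up.

```python
import re

def filter_movie_links(links, movie):
    """Keep hrefs that sit under a 'movies' path and mention the keyword."""
    matches = []
    for href in links:
        # Guard against None hrefs, which some anchor tags produce
        if href and re.search(r"movies", href) and re.search(re.escape(movie), href):
            matches.append(href)
    return matches

links = ["/videos/movies/legacies-s01", "/videos/music/mix", None, "/videos/movies/joker"]
print(filter_movie_links(links, "legacies"))  # → ['/videos/movies/legacies-s01']
```

Separating the filtering from the fetching means you can unit-test it without touching the network.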
If you run this, it may raise an error or not run at all depending on your setup. The reason is that we named the lxml parser, the chosen parser that compiles our HTML for Beautiful Soup, and it is not installed by default.
To fix this, head to your console and use the Python package installer to install the parser like this:
pip install lxml
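If you would rather not depend on lxml at all, one option (a sketch, not from the original tutorial) is to prefer it when available and otherwise fall back to the 'html.parser' backend that ships with Python:

```python
# Prefer lxml when installed; otherwise use the stdlib backend,
# which needs no extra install.
try:
    import lxml  # noqa: F401
    parser = "lxml"
except ImportError:
    parser = "html.parser"

print(parser)
```

You would then pass this parser variable as the second argument to BeautifulSoup.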
Now run your code in your console with a keyword and see whether the crawler can find the link for the desired movie within the site. Let me try it with a movie I'm sure is there.
And so we see that the crawler has searched the page for the links we need when using the 'legacies' keyword. So this works.
Tip: If you're following along and would like to test what you have actually learnt, add a little more functionality by using the max_pages argument I added, to make the crawler crawl through a few more pages. Just interpolate the page number into the URL and increase it after every loop.
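That tip could be sketched like this. Note the '?page=' query format is an assumption for illustration, not taken from the real site, and build_page_urls is a hypothetical helper name.

```python
def build_page_urls(base_url, max_pages):
    """Build one URL per page by interpolating the page number."""
    urls = []
    page = 1
    while page <= max_pages:
        urls.append("%s?page=%d" % (base_url, page))  # assumed query format
        page += 1
    return urls

print(build_page_urls("http://www.netnaija.com/videos/movies", 3))
```

Each generated URL would then be fetched and filtered the same way as the single page above.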
Thanks for reading
You could head to my Github to see the code for this tutorial.