Python Scrapy



Hi! 

Today I want to write an article on a Data Science theme. One of the first tasks in Data Science is ETL (Extract, Transform, Load), and I want to tell you about Scrapy. Scrapy is a Python framework for extracting data from websites. In this article I'll cover the basics of Scrapy, and in future articles we will look at how Scrapy can be valuable for Steemit users.

What are Scrapy's use cases? Imagine that you want to buy some kind of thing and need to research the offers. In this case you must obtain (extract) all items of a certain category from different online shops. If this is a one-time activity, it is no problem to do it manually. But if you need to do it many times a day, or the number of items runs into the billions, then it is a task for a web crawler like Scrapy. Another use case is extracting data from an API, like Steemit's, and I think that will be the theme of future articles. Let me know in the comments below what kind of Steemit data you need.
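To make this concrete, here is a minimal sketch of such a spider. The URL and the CSS classes (div.product, h2.title, span.price, a.next) are assumptions about a hypothetical shop's markup, not a real site:

```python
import scrapy


class PriceSpider(scrapy.Spider):
    """Collects item names and prices from a (hypothetical) shop category page."""
    name = "prices"
    # example.com is a placeholder; point this at a real category page
    start_urls = ["https://example.com/category/phones"]

    def parse(self, response):
        # the CSS classes below are assumptions about the shop's markup
        for product in response.css("div.product"):
            yield {
                "title": product.css("h2.title::text").get(),
                "price": product.css("span.price::text").get(),
            }
        # follow the "next page" link, if the page has one
        next_page = response.css("a.next::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Saved as prices.py, this can be run without a full project via `scrapy runspider prices.py -o items.json`, which writes the scraped items to a JSON file.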

Here is a selection of Scrapy's features, taken from its documentation:


Built-in support for selecting and extracting data from HTML/XML sources using extended CSS selectors and XPath expressions, with helper methods to extract using regular expressions.

An interactive shell console (IPython aware) for trying out the CSS and XPath expressions to scrape data, very useful when writing or debugging your spiders (a shell session is sketched after this list).

Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP, S3, local filesystem).

Robust encoding support and auto-detection, for dealing with foreign, non-standard and broken encoding declarations.

Strong extensibility support, allowing you to plug in your own functionality using signals and a well-defined API (middlewares, extensions, and pipelines); a minimal pipeline sketch follows this list.

Wide range of built-in extensions and middlewares for handling:

  • cookies and session handling
  • HTTP features like compression, authentication, caching
  • user-agent spoofing
  • robots.txt
  • crawl depth restriction
  • and more

A Telnet console for hooking into a Python console running inside your Scrapy process, to introspect and debug your crawler.

Plus other goodies like reusable spiders to crawl sites from Sitemaps and XML/CSV feeds, a media pipeline for automatically downloading images (or any other media) associated with the scraped items, a caching DNS resolver, and much more!
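As mentioned above, the interactive shell is a convenient place to experiment with selectors before putting them into a spider. A sketch of such a session, reusing the hypothetical shop markup from the spider above:

```
$ scrapy shell "https://example.com/category/phones"
>>> response.css("h2.title::text").getall()                    # CSS selector
>>> response.xpath("//span[@class='price']/text()").getall()   # the same data via XPath
>>> response.css("span.price::text").re(r"[\d,.]+")            # regex helper on a selector
```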
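And to show the extensibility point in action: a minimal item pipeline sketch, assuming the price items produced by the spider above (the project name myproject and the pipeline class are illustrative, not part of Scrapy itself):

```python
# pipelines.py -- cleans up each scraped item before it is exported
class PriceCleanupPipeline:
    def process_item(self, item, spider):
        # normalize a raw price string such as "$1,299.00" into a float
        raw = item.get("price") or "0"
        item["price"] = float(raw.replace("$", "").replace(",", "").strip())
        return item
```

The pipeline is switched on in the project's settings.py; the number sets the order in which pipelines run (lower runs first):

```python
ITEM_PIPELINES = {
    "myproject.pipelines.PriceCleanupPipeline": 300,
}
```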

In the next article we will install Scrapy and write our first spider.

 
