Learn Python Series (#16) - Mini Project - Developing a Web Crawler Part 3


What Will I Learn?

  • You will learn how to fetch historical pricing data from the coinmarketcap.com ("CMC") website, using web crawling techniques, since (as far as I'm aware) a historical CMC price API is absent,
  • how to filter, reformat and reorder the fetched HTML into a serializable dictionary data format,
  • how to serialize and store the historical pricing dictionary to a JSON file (for later file processing, data analysis, et cetera).

Requirements

  • A working modern computer running macOS, Windows or Ubuntu
  • An installed Python 3(.6) distribution, such as (for example) the Anaconda Distribution
  • The ambition to learn Python programming

Difficulty

Intermediate


In the previous Learn Python Series episodes, covering Developing a Web Crawler (Parts 1 & 2) and Handling JSON, we learned how to use the Requests and BeautifulSoup4 libraries to get and parse web data, and how to store data to disk, both as "regular text files" and as JSON. We also briefly looked at how to fetch JSON data from the publicly available coinmarketcap.com ("CMC") API, in order to get and store the "latest" Bitcoin, Steem and SBD price ticks.

Regarding the latter: the CMC API doesn't (currently) provide historical data, and in order to make an "educated guess" about whether prices will probably go up or down in the short to mid term, we do need historical data. In this episode, we're going to combine the knowledge gathered in the three previous Learn Python Series episodes, by using web crawler techniques to get historical price data, and JSON techniques to store that data to disk (for further processing later on).

Using the CMC historical data web interface

Even though the coinmarketcap.com API currently has no option to provide us with historical data, at least none that I'm aware of right now (if you do know of a CMC historical data API endpoint, please mention it in the comments!), CMC does show historical data via its web interface. So let's use that for this episode, applying our web crawler techniques.

Fetching historical BTC-USD price data

In order to get daily CMC historical price data on BTC shown in a web browser, please visit https://coinmarketcap.com/currencies/bitcoin/historical-data/?start=20140101&end=20180501

This url is "configurable" with a couple of parameters we can automate:

  • "bitcoin" is the name of the coin we're interested in right now. In the previous tutorial, I've already given the values "steem" and "steem-dollars" (instead of another abbreviation like "sbd" or something) we can use as well;
  • "start" is a GET parameter we can pass a start date to in the form of "20140101", meaning we're interested in historical data beginning at Jan. 01, 2014;
  • "end" is a GET parameter we can pass an end date to, so let's use a date in the near future from today (Friday April 27, 2018), May 01, 2018.

As a result, a web page loads, holding HTML data with daily values of the fields:

  • Date
  • Open
  • High
  • Low
  • Close
  • Volume
  • Market Cap

Let's work with that!

import json
import requests
from bs4 import BeautifulSoup

# configure the coin and date range we're interested in
coin = 'bitcoin'
start_date = 20140101
end_date = 20180501

# construct the precise CMC url
cmc_url = ('https://coinmarketcap.com/currencies/'
           + str(coin)
           + '/historical-data/?start='
           + str(start_date)
           + '&end='
           + str(end_date)
)

# fetch the html data and parse it via bs4
r = requests.get(cmc_url)
content = r.text
soup = BeautifulSoup(content, 'lxml')
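
As an aside: instead of concatenating the GET parameters into the url ourselves, the Requests library can also construct the query string for us, via its params keyword argument. A minimal sketch of that alternative (not used in the rest of this tutorial, which sticks to plain string concatenation):

import requests

# let requests build the '?start=...&end=...' query string itself
params = {'start': '20140101', 'end': '20180501'}
r = requests.get('https://coinmarketcap.com/currencies/bitcoin/historical-data/',
                 params=params)
print(r.url)  # the fully constructed url, including both GET parameters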

If we inspect the returned HTML (using the browser's web developer / inspector console), we find multiple <tr class="text-right"> elements, each containing 7 <td> elements, each holding daily data for "Date", "Open", "High", "Low", "Close", "Volume", and "Market Cap". Exactly the data we want to extract from the HTML page and format to our liking.

daily_ticks_list = soup.select('tr.text-right')

If we print the daily_ticks_list values, we get a list of HTML data in this form:

[<tr class="text-right">
<td class="text-left">Apr 26, 2018</td>
<td data-format-fiat="" data-format-value="8867.32">8867.32</td>
<td data-format-fiat="" data-format-value="9281.51">9281.51</td>
<td data-format-fiat="" data-format-value="8727.09">8727.09</td>
<td data-format-fiat="" data-format-value="9281.51">9281.51</td>
<td data-format-market-cap="" data-format-value="8970560000.0">8,970,560,000</td>
<td data-format-market-cap="" data-format-value="1.50736e+11">150,736,000,000</td>
</tr>, <tr class="text-right">
<td class="text-left">Apr 25, 2018</td>
<td data-format-fiat="" data-format-value="9701.03">9701.03</td>
<td data-format-fiat="" data-format-value="9745.32">9745.32</td>
<td data-format-fiat="" data-format-value="8799.84">8799.84</td>
<td data-format-fiat="" data-format-value="8845.74">8845.74</td>
<td data-format-market-cap="" data-format-value="11083100000.0">11,083,100,000</td>
<td data-format-market-cap="" data-format-value="1.64893e+11">164,893,000,000</td>
</tr>]

Now we need to filter and reformat each element in daily_ticks_list from HTML (well, from a list of bs4 objects, each containing parseable HTML) into a list of dictionaries holding only the data I'm interested in (in this case: only the dates and opening prices), which we can then store in a .json file for later processing.

A few things to note here:

  • since the resulting HTML data is, per day, stored as a list element, for the day_string we first need to select the appropriate <td> HTML element with class="text-left", take the 0-indexed element from the resulting list, and then take its contents value, again via its 0-indexed element;
  • the date format presented in the CMC HTML is of the form Apr 26, 2018, which I'd like to convert to the form 20180426. This could be done pretty easily using various (built-in) Python date/time conversion methods, but since it's a (good?) habit of mine throughout the Learn Python Series not to use technical tricks I haven't covered before, I'll use other means to convert the date to my preferred string format;
  • luckily, inspecting the presented date data shows that every date string begins with a 3-character month abbreviation, then a space, then a 2-character day representation, then a comma, then another space, and finally a 4-character year representation. As a result, we can simply use string slicing (as covered in the handling strings episodes), a self-defined months_dict month-conversion dictionary, and string concatenation to go from Apr 26, 2018 to 20180426 (see the small slicing sketch directly after this list);
  • the opening price <td> doesn't have an HTML class name, but it is the second child <td> of each <tr>, so we'll follow the same 0-indexed element mechanism as we did when filtering out the date, but this time using the bs4 selection value select('td:nth-of-type(2)').
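
To make those slicing indices concrete, here's a minimal sketch applied to one sample date string (the months_dict month-conversion dictionary it mentions is defined in the code further below):

# a sample date string, exactly as presented in the CMC HTML
day_string = 'Apr 26, 2018'
print(day_string[:3])   # 'Apr'  -> months_dict will map this to '04'
print(day_string[4:6])  # '26'   (the 2-character day)
print(day_string[8:])   # '2018' (the 4-character year)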

Let's just create an empty day_ticks list, and append daily dictionaries to it holding date and open_price_usd key:value pairs, like so:

# an empty list to hold one dictionary per day
day_ticks = []

# a self-defined month conversion dictionary
months_dict = {
                'Jan': '01', 'Feb': '02', 'Mar': '03',
                'Apr': '04', 'May': '05', 'Jun': '06',
                'Jul': '07', 'Aug': '08', 'Sep': '09',
                'Oct': '10', 'Nov': '11', 'Dec': '12'
}

# filter, reformat, and append the dates and opening prices
for day_tick in daily_ticks_list:
    day_string = day_tick.select('td.text-left')[0].contents[0]
    year = day_string[8:]
    month = months_dict[day_string[:3]]
    day = day_string[4:6]
    date = year + month + day
    open_price_usd = day_tick.select('td:nth-of-type(2)')[0].contents[0]
    day_data = {'date': date, 'open_price_usd': open_price_usd}
    day_ticks.append(day_data)

Now the BTC price data in the day_ticks list is in descending order, from new (today) back to old (start_date = 20140101). Since I want to store this data in the opposite order, from old to new (why will become clear in the remainder of this tutorial), I'll first reverse the list, like so:

day_ticks.reverse()

Next we'll save the old-to-new BTC price data to a .json file, in the same way we've been doing in the previous handling JSON tutorial:

file_name = coin + '_usd.json'
with open(file_name, 'w') as f:
    json.dump(day_ticks, f, indent=4)

As a result, if we now look into our current working directory, we find a newly generated file named bitcoin_usd.json, and if we open it using our favorite text editor, we'll find data in the following form:

[
    {
        "date": "20140101",
        "open_price_usd": "754.97"
    },
    {
        "date": "20140102",
        "open_price_usd": "773.44"
    },
    
    ...
    ...
    ...
    
    {
        "date": "20180425",
        "open_price_usd": "9701.03"
    },
    {
        "date": "20180426",
        "open_price_usd": "8867.32"
    }
]

Cool! Exactly what we were looking for!
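
And since the whole point of storing this file is processing it later, here's a minimal sketch of reading the data back into a Python list using json.load(), mirroring what we covered in the handling JSON episode:

import json

# read the stored price data back from disk
with open('bitcoin_usd.json', 'r') as f:
    day_ticks_loaded = json.load(f)

# the first (oldest) and last (newest) day ticks
print(day_ticks_loaded[0])
print(day_ticks_loaded[-1])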

Creating a function to process historical CMC pricing data on multiple coins at once

Next up, we'll create a function to process historical CMC pricing data on multiple coins - in our example case regarding BTC, Steem and SBD - all in one go, like so:

import json
import requests
from bs4 import BeautifulSoup

coins = ['bitcoin', 'steem', 'steem-dollars']
months_dict = {
                'Jan': '01', 'Feb': '02', 'Mar': '03',
                'Apr': '04', 'May': '05', 'Jun': '06',
                'Jul': '07', 'Aug': '08', 'Sep': '09',
                'Oct': '10', 'Nov': '11', 'Dec': '12'
}

def store_historical_price_data(coin):
    
    # construct the precise CMC url
    start_date = 20140101
    end_date = 20180501
    cmc_url = ('https://coinmarketcap.com/currencies/' 
        + str(coin) + '/historical-data/?start=' 
        + str(start_date) 
        + '&end=' 
        + str(end_date)
    )
    
    # fetch the html data and parse it via bs4
    r = requests.get(cmc_url)
    content = r.text
    soup = BeautifulSoup(content, 'lxml')
    
    # filter all historical coin pricing
    daily_ticks_list = soup.select('tr.text-right')
    
    # create an empty list `day_ticks`
    day_ticks = []

    # filter, reformat, and append the dates and opening prices
    # to `day_ticks`
    for day_tick in daily_ticks_list:
        day_string = day_tick.select('td.text-left')[0].contents[0]
        year = day_string[8:]
        month = months_dict[day_string[:3]]    
        day = day_string[4:6]    
        date = year + month + day
        open_price_usd = day_tick.select('td:nth-of-type(2)')[0].contents[0]
        day_data = {'date': date, 'open_price_usd': open_price_usd}
        day_ticks.append(day_data)
        
    # reverse the data, from old-to-new
    day_ticks.reverse()
    
    # store the data to a named .json file placed in the current working directory
    file_name = coin + '_usd.json'
    with open(file_name, 'w') as f:
        json.dump(day_ticks, f, indent=4)

# initiate `store_historical_price_data()` for all elements in `coins`
for coin in coins:
    store_historical_price_data(coin)

As a result, we now have 3 files (bitcoin_usd.json, steem-dollars_usd.json, and steem_usd.json), where the steem-dollars_usd.json file looks like:

[
    {
        "date": "20160718",
        "open_price_usd": "0.957820"
    },
    {
        "date": "20160719",
        "open_price_usd": "1.47"
    },
    
    ...
    ...
    ...
    
    {
        "date": "20180425",
        "open_price_usd": "3.48"
    },
    {
        "date": "20180426",
        "open_price_usd": "3.59"
    }
]

and the steem_usd.json file looks like:

[
    {
        "date": "20160418",
        "open_price_usd": "0.642902"
    },
    {
        "date": "20160419",
        "open_price_usd": "0.877911"
    },
    
    ...
    ...
    ...
    
    {
        "date": "20180425",
        "open_price_usd": "3.50"
    },
    {
        "date": "20180426",
        "open_price_usd": "4.15"
    }
]

So, concluding:

  • our BTC historical data ranges from "20140101" to "20180426",
  • our Steem historical data ranges from "20160418" to "20180426",
  • our SBD historical data ranges from "20160718" to "20180426".
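
One optional refinement, in case you want to run this crawler on more coins or on a schedule: check the HTTP status of each response and pause briefly between requests, so a failed fetch doesn't silently produce an empty .json file and we don't hammer the CMC server. A minimal sketch (the one-second pause is an arbitrary, polite assumption on my part, not a documented CMC requirement):

import time
import requests

for coin in coins:
    r = requests.get('https://coinmarketcap.com/currencies/' + coin
                     + '/historical-data/?start=20140101&end=20180501')
    # raise an exception if CMC returned an HTTP error status
    r.raise_for_status()
    # ... parse and store the data, as in store_historical_price_data() ...
    # be polite to the server: wait a second between requests
    time.sleep(1)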

What did we learn, hopefully?

In this episode, I showed you how to combine web crawling techniques and JSON serialization to fetch historical pricing data from a web page where an API providing the same type of data is absent, and store the JSON data for later processing.

In the next tutorial, we'll further process, combine and analyze the historical data we got in this tutorial episode, and see if we might be able to draw a few interesting financial conclusions from it!

Thank you for your time!


