Learn Python Series (#16) - Mini Project - Developing a Web Crawler Part 3
What Will I Learn?
- You will learn how to fetch historical pricing data from the coinmarketcap.com ("CMC") website using web crawling techniques, since (as far as I'm aware) a historical CMC price API is absent;
- how to filter, reformat and reorder the fetched HTML into a serializable dictionary data format;
- how to serialize and store the historical pricing data dictionary to a JSON file (for later file processing, data analysis, et cetera).
Requirements
- A working modern computer running macOS, Windows or Ubuntu
- An installed Python 3(.6) distribution, such as (for example) the Anaconda Distribution
- The ambition to learn Python programming
Difficulty
Intermediate
Curriculum (of the Learn Python Series):
- Learn Python Series - Intro
- Learn Python Series (#2) - Handling Strings Part 1
- Learn Python Series (#3) - Handling Strings Part 2
- Learn Python Series (#4) - Round-Up #1
- Learn Python Series (#5) - Handling Lists Part 1
- Learn Python Series (#6) - Handling Lists Part 2
- Learn Python Series (#7) - Handling Dictionaries
- Learn Python Series (#8) - Handling Tuples
- Learn Python Series (#9) - Using Import
- Learn Python Series (#10) - Matplotlib Part 1
- Learn Python Series (#11) - NumPy Part 1
- Learn Python Series (#12) - Handling Files
- Learn Python Series (#13) - Mini Project - Developing a Web Crawler Part 1
- Learn Python Series (#14) - Mini Project - Developing a Web Crawler Part 2
- Learn Python Series (#15) - Handling JSON
In the previous Learn Python Series episodes, regarding both Developing a Web Crawler (Parts 1 & 2) and Handling JSON, we learned how to use the Requests and BeautifulSoup4 libraries to get and parse web data, and how to store data to disk, either as "regular" text files or as JSON data. We also briefly looked at how to fetch JSON data using a publicly available API from coinmarketcap.com ("CMC"), in order to get and store the "latest" Bitcoin, Steem and SBD price ticks.
Regarding the latter: the CMC API doesn't (currently) provide us with historical data, and in order to make an "educated guess" about whether prices will probably go up or down in the short to mid term, we do need historical data. In this episode, we're going to combine the knowledge gathered from the three previous Learn Python Series episodes, by using a web crawling technique to get historical price data, and by using JSON techniques to store that data to disk (for further processing later on).
Using the CMC historical data web interface
Even though the coinmarketcap.com API currently has no option to provide us with historical data, at least not that I'm aware of right now (if you do know of a CMC historical-data API endpoint, please mention it in the comments!), CMC does show historical data via its web interface. So let's use that for this episode, applying our web crawler techniques.
Fetching historical BTC-USD price data
In order to get daily CMC historical price data on BTC shown in a web browser, please visit https://coinmarketcap.com/currencies/bitcoin/historical-data/?start=20140101&end=20180501
This url is "configurable" with a couple of parameters we can automate:
- "bitcoin" is the name of the coin we're interested in right now. In the previous tutorial, I already showed that the values "steem" and "steem-dollars" (rather than an abbreviation like "sbd") can be used here as well;
- "start" is a GET parameter we can pass a start date to, in the form "20140101", meaning we're interested in historical data beginning at Jan. 01, 2014;
- "end" is a GET parameter we can pass an end date to, so let's use a date in the near future from today (Friday April 27, 2018): May 01, 2018.
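As an aside, these configurable parts can also be assembled programmatically with the standard library's urlencode; here's a minimal sketch (the helper name build_cmc_url is my own, not anything CMC provides):

```python
from urllib.parse import urlencode

def build_cmc_url(coin, start_date, end_date):
    # Glue the coin name into the path, then append the GET parameters
    base = 'https://coinmarketcap.com/currencies/' + coin + '/historical-data/'
    return base + '?' + urlencode({'start': start_date, 'end': end_date})

print(build_cmc_url('bitcoin', 20140101, 20180501))
```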
As a result, a web page loads, holding HTML data with daily values of the fields:
- Date
- Open
- High
- Low
- Close
- Volume
- Market Cap
Let's work with that!
import json
import requests
from bs4 import BeautifulSoup

coin = 'bitcoin'
start_date = 20140101
end_date = 20180501

cmc_url = ('https://coinmarketcap.com/currencies/'
           + str(coin)
           + '/historical-data/?start='
           + str(start_date)
           + '&end='
           + str(end_date)
           )

r = requests.get(cmc_url)
content = r.text
soup = BeautifulSoup(content, 'lxml')
If we inspect the returned HTML (using the browser's web developer / inspector console), we find multiple <tr class="text-right"> elements, each containing 7 <td> elements holding the daily data for "Date", "Open", "High", "Low", "Close", "Volume", and "Market Cap". Exactly the data we want to extract from the HTML page and format to our liking.
daily_ticks_list = soup.select('tr.text-right')
If we print the daily_ticks_list values, we get a list of HTML data in this form:
[<tr class="text-right">
<td class="text-left">Apr 26, 2018</td>
<td data-format-fiat="" data-format-value="8867.32">8867.32</td>
<td data-format-fiat="" data-format-value="9281.51">9281.51</td>
<td data-format-fiat="" data-format-value="8727.09">8727.09</td>
<td data-format-fiat="" data-format-value="9281.51">9281.51</td>
<td data-format-market-cap="" data-format-value="8970560000.0">8,970,560,000</td>
<td data-format-market-cap="" data-format-value="1.50736e+11">150,736,000,000</td>
</tr>, <tr class="text-right">
<td class="text-left">Apr 25, 2018</td>
<td data-format-fiat="" data-format-value="9701.03">9701.03</td>
<td data-format-fiat="" data-format-value="9745.32">9745.32</td>
<td data-format-fiat="" data-format-value="8799.84">8799.84</td>
<td data-format-fiat="" data-format-value="8845.74">8845.74</td>
<td data-format-market-cap="" data-format-value="11083100000.0">11,083,100,000</td>
<td data-format-market-cap="" data-format-value="1.64893e+11">164,893,000,000</td>
</tr>]
Now we need to filter and reformat each element in daily_ticks_list from HTML (well, a list of bs4 objects, each containing parseable HTML) into a list of dictionaries holding only the data we're interested in (in this case: only the dates and opening prices), which we can then store to a .json file for later processing.
A few things to note here:
- Since the resulting HTML data is, per day, stored as a list element, for the day_string we first need to select the appropriate <td> HTML element with class="text-left", filter the 0-indexed element from that, and then filter out its contents value again via its 0-indexed element;
- the date format presented in the CMC HTML is in the form of Apr 26, 2018, which I'd like to convert to the form 20180426. This could be done pretty easily using various (built-in) Python date/time conversion methods, but since it's a (good?) habit of mine in the entire Learn Python Series not to use techniques I haven't covered before, I'll use other means to convert the date to my preferred string format;
- luckily, after inspecting the presented date data, every date string begins with a 3-character month abbreviation, then a space, then a 2-character day representation, then a comma and a space, and finally a 4-character year representation. As a result, we can simply use string slicing (as covered in the handling strings episodes), a self-defined months_dict month conversion dictionary, and string concatenation to go from Apr 26, 2018 to 20180426;
- the opening price <td> doesn't have an HTML class name, but it's the second child <td> of each <tr>, so we'll follow the same 0-indexed element mechanism as we did when filtering out the date, but this time using the bs4 selection select('td:nth-of-type(2)').
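For completeness, here's what the conversion would look like with the built-in date/time methods mentioned above; a small sketch using datetime.strptime (which we haven't formally covered yet in this series):

```python
from datetime import datetime

def to_compact_date(day_string):
    # Parse "Apr 26, 2018" and re-emit it in the "20180426" form
    return datetime.strptime(day_string, '%b %d, %Y').strftime('%Y%m%d')

print(to_compact_date('Apr 26, 2018'))  # → 20180426
```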
Let's just create an empty day_ticks list, and append to it daily dictionaries holding date and open_price_usd key:value pairs, like so:
day_ticks = []

months_dict = {
    'Jan': '01', 'Feb': '02', 'Mar': '03',
    'Apr': '04', 'May': '05', 'Jun': '06',
    'Jul': '07', 'Aug': '08', 'Sep': '09',
    'Oct': '10', 'Nov': '11', 'Dec': '12'
}

for day_tick in daily_ticks_list:
    day_string = day_tick.select('td.text-left')[0].contents[0]
    year = day_string[8:]
    month = months_dict[day_string[:3]]
    day = day_string[4:6]
    date = year + month + day
    open_price_usd = day_tick.select('td:nth-of-type(2)')[0].contents[0]
    day_data = {'date': date, 'open_price_usd': open_price_usd}
    day_ticks.append(day_data)
Now the order of the BTC price data in the day_ticks list is descending, from new (today) to old (start_date = 20140101). Since I want to store this data in reversed order, from old to new (the reason will become clear in the remainder of this tutorial), I'll first reverse the list order, like so:
day_ticks.reverse()
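Since our date strings are zero-padded YYYYMMDD, lexicographic order equals chronological order, so as an alternative sketch we could also sort explicitly instead of relying on CMC's new-to-old page ordering (the two sample ticks below are just illustrative):

```python
# Two sample ticks in CMC's new-to-old order
ticks = [
    {'date': '20180426', 'open_price_usd': '8867.32'},
    {'date': '20180425', 'open_price_usd': '9701.03'},
]

# Zero-padded YYYYMMDD strings sort chronologically as plain strings
ticks_old_to_new = sorted(ticks, key=lambda tick: tick['date'])
print(ticks_old_to_new[0]['date'])  # → 20180425
```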
Next, we'll save the old-to-new BTC price data to a .json file, in the same way we did in the previous Handling JSON tutorial:
file_name = coin + '_usd.json'
with open(file_name, 'w') as f:
    json.dump(day_ticks, f, indent=4)
As a result, if we now look in our current working directory, we find a newly generated file named bitcoin_usd.json. If you open it using your favorite text editor, you'll find data in the following form:
[
{
"date": "20140101",
"open_price_usd": "754.97"
},
{
"date": "20140102",
"open_price_usd": "773.44"
},
...
...
...
{
"date": "20180425",
"open_price_usd": "9701.03"
},
{
"date": "20180426",
"open_price_usd": "8867.32"
}
]
Cool! Exactly what we were looking for!
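For when we want to process this file later, json.load mirrors json.dump; here's a minimal round-trip sketch (the sample_usd.json filename and the sample values are just illustrative):

```python
import json

# A small sample in the same shape as bitcoin_usd.json
sample_ticks = [
    {'date': '20140101', 'open_price_usd': '754.97'},
    {'date': '20140102', 'open_price_usd': '773.44'},
]

# Write it out exactly as the tutorial does ...
with open('sample_usd.json', 'w') as f:
    json.dump(sample_ticks, f, indent=4)

# ... and read it back in for later processing
with open('sample_usd.json', 'r') as f:
    loaded_ticks = json.load(f)

print(loaded_ticks[0]['open_price_usd'])  # → 754.97
```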
Creating a function to process historical CMC pricing data on multiple coins at once
Next up, we'll create a function to process historical CMC pricing data on multiple coins - in our example case regarding BTC, Steem and SBD - all in one go, like so:
import json
import requests
from bs4 import BeautifulSoup

coins = ['bitcoin', 'steem', 'steem-dollars']

months_dict = {
    'Jan': '01', 'Feb': '02', 'Mar': '03',
    'Apr': '04', 'May': '05', 'Jun': '06',
    'Jul': '07', 'Aug': '08', 'Sep': '09',
    'Oct': '10', 'Nov': '11', 'Dec': '12'
}

def store_historical_price_data(coin):
    # construct the precise CMC url
    start_date = 20140101
    end_date = 20180501
    cmc_url = ('https://coinmarketcap.com/currencies/'
               + str(coin) + '/historical-data/?start='
               + str(start_date)
               + '&end='
               + str(end_date)
               )

    # fetch the html data and parse it via bs4
    r = requests.get(cmc_url)
    content = r.text
    soup = BeautifulSoup(content, 'lxml')

    # filter all historical coin pricing
    daily_ticks_list = soup.select('tr.text-right')

    # create an empty list `day_ticks`
    day_ticks = []

    # filter, reformat, and append the dates and opening prices
    # to `day_ticks`
    for day_tick in daily_ticks_list:
        day_string = day_tick.select('td.text-left')[0].contents[0]
        year = day_string[8:]
        month = months_dict[day_string[:3]]
        day = day_string[4:6]
        date = year + month + day
        open_price_usd = day_tick.select('td:nth-of-type(2)')[0].contents[0]
        day_data = {'date': date, 'open_price_usd': open_price_usd}
        day_ticks.append(day_data)

    # reverse the data, from old-to-new
    day_ticks.reverse()

    # store the data to a named .json file in the current working directory
    file_name = coin + '_usd.json'
    with open(file_name, 'w') as f:
        json.dump(day_ticks, f, indent=4)

# run `store_historical_price_data()` for all elements in `coins`
for coin in coins:
    store_historical_price_data(coin)
As a result, we now have 3 files (bitcoin_usd.json, steem-dollars_usd.json, and steem_usd.json), where the steem-dollars_usd.json file looks like:
[
{
"date": "20160718",
"open_price_usd": "0.957820"
},
{
"date": "20160719",
"open_price_usd": "1.47"
},
...
...
...
{
"date": "20180425",
"open_price_usd": "3.48"
},
{
"date": "20180426",
"open_price_usd": "3.59"
}
]
and the steem_usd.json file looks like:
[
{
"date": "20160418",
"open_price_usd": "0.642902"
},
{
"date": "20160419",
"open_price_usd": "0.877911"
},
...
...
...
{
"date": "20180425",
"open_price_usd": "3.50"
},
{
"date": "20180426",
"open_price_usd": "4.15"
}
]
So, concluding:
- our BTC historical data ranges from "20140101" to "20180426",
- our Steem historical data ranges from "20160418" to "20180426",
- our SBD historical data ranges from "20160718" to "20180426".
What did we learn, hopefully?
In this episode, I showed you how to combine web crawling techniques and JSON serialization to fetch historical pricing data from a web page where an API providing the same type of data is absent, and store the JSON data for later processing.
In the next tutorial, we'll further process, combine and analyze the historical data we got in this tutorial episode, and see if we might be able to draw a few interesting financial conclusions from it!
Thank you for your time!