Steemit Followers - Data extraction - Part 1

shaunmza (62)in #steemdev • 8 years ago (edited)

A little while back I decided to try and build something to help me visualise the activity against my account in terms of people who have followed, unfollowed and muted me. I then wrote a post on that, here is the link,
https://steemit.com/steem/@shaunmza/do-you-want-to-know-who-followed-stopped-following-you-on-steemit

I mentioned I would also write a post on how I got the data out, that is the subject of this post.

Blockchain data, bypassing the blockchain

My desire was to get a working prototype up quickly and with as little hurdles to get over before I could see results. My first choice was practically made for me, 'Where to get the data?'.

I wrote a post some time back on using STEEMSQL (http://steemsql.com/) to query Steem data, https://steemit.com/steemit/@shaunmza/querying-steem-blockchain-data-getting-started and I used a similar methodology to get the follower data. I am absolutely blown away that someone would put in all the effort to take all of the Steem transactions and put them into a database, then make that very same database publicly accessible! 🙌

The other option is to use the Steemd API, which I am in the process of getting to grips with now, however there is a greater learning curve in getting going on that. I have enough SQL experience to have had a fair idea of how to extract and manipulate the data to serve my needs.

We know where, but where... ?

So I knew the source of my data, but where exactly it sat was a mystery to me, lets look at the transactions that get stored on the blockchain, shall we?

SQL to get this data, SELECT DISTINCT(type) FROM Transactions;

account_create
account_create_with_delegation
account_update
account_witness_proxy
account_witness_vote
cancel_transfer_from_savings
change_recovery_account
claim_reward_balance
comment
comment_options
convert
custom
custom_json
delegate_vesting_shares
delete_comment
escrow_approve
escrow_dispute
escrow_release
escrow_transfer
feed_publish
limit_order_cancel
limit_order_create
pow
pow2
recover_account
request_account_recovery
set_withdraw_vesting_route
transfer
transfer_from_savings
transfer_to_savings
transfer_to_vesting
vote
withdraw_vesting
witness_update

That's quite a list, take some time to find the transaction type for 'follows', I will wait 😉
...
...
Yes, none of those relate to following, directly.
I spent some time going through the tables, running selects etc. let me spare you the details on that though. Our reward lies in the 'custom_json' transaction type.

The Transactions table stored only the required data for a transaction, any metadata is stored in a related table, in our case we are interested in the data held within the TxCustoms table, just take my word for it 😀

Now I am not even going to pretend that I know what the custom_json transactions are all used for, but if we take a look at the TxCustoms table, we will see the following transaction type id's in there (I am making < up as I don't know the correct terminology, sub transaction maybe?)
¯\_(ツ)_/¯

SQL for this list is SELECT DISTINCT(tid) FROM TXCustoms;

account_history
bookchain
booyah
charlieshrem
com.steemit.community
follow
message
opinion
reblog
task
test
test_community
testchain

#6 is the one we want, lets see what goes into that shall we?

SQL for the following is SELECT TOP 20 * FROM TXCustoms;

ID	tx_id	tid	json_metadata
1	294602	follow	{"follower":"steemit","following":"steem","what":["posts"]}
2	316479	follow	{"follower":"red","following":"piedpiper","what":["posts"]}
3	316498	follow	{"follower":"red","following":"piedpiper","what":["posts"]}
4	316499	follow	{"follower":"red","following":"piedpiper","what":["posts"]}
5	316508	follow	{"follower":"jsc","following":"officialfuzzy","what":["posts"]}
6	316509	follow	{"follower":"jsc","following":"officialfuzzy","what":["posts"]}
7	316530	follow	{"follower":"jsc","following":"officialfuzzy","what":["posts"]}
8	316625	follow	{"follower":"jsc","following":"cryptogee","what":["posts"]}
9	318083	follow	{"follower":"jsc","following":"ico","what":["posts"]}
10	318087	follow	{"follower":"jsc","following":"ico","what":[]}
11	324105	follow	{"follower":"jsc","following":"max-keiser","what":["blog"]}
12	324123	follow	{"follower":"jsc","following":"max-keiser","what":["blog","posts"]}
13	324127	follow	{"follower":"jsc","following":"wmougayar","what":["posts"]}
14	331143	follow	{"follower":"arhag","following":"dantheman","what":["blog"]}
15	331159	follow	{"follower":"arhag","following":"dantheman","what":[]}
16	331167	follow	{"follower":"arhag","following":"dantheman","what":["blog"]}
17	331169	follow	{"follower":"nextgencrypto","following":"dantheman","what":["blog"]}
18	331497	follow	{"follower":"summon","following":"schnitzler","what":["blog"]}
19	331544	follow	{"follower":"tuck-fheman","following":"cryptoctopus","what":["blog"]}
20	331689	follow	{"follower":"pixellated","following":"tuck-fheman","what":["blog"]}

Note

The what portion changed from posts to blog, I assume this was in the very beginning and the developers decided to stick with blog
When what is empty it implies a user was unfollowed OR unmuted

So looking at that, we can see that what we want is in the database, but seeing as it is locked up in a JSON object within a field, extracting it is going to be troublesome 😣 So close though!!

After this, I spent some more time digging around, fortunately, the software I use, Valentina Studio, has a handy Schema browser, which means I was able to see that there were views as well. Taking a look at the existing views I saw a promising one named TxCustomsFollows, the views structure filled me with glee, I had found my source!

SQL for the following data SELECT TOP 5 * FROM TxCustomsFollows WHERE following = 'shaunmza'; (replace shaumza with your own username, so easy!)

timestamp	follower	following	what
Jan 26 2017 11:20:09:000PM	dimimp	shaunmza	["ignore"]
Jan 28 2017 05:03:09:000PM	luke490	shaunmza	["ignore"]
Jan 28 2017 11:37:54:000PM	zedikaredirect	shaunmza	["blog"]
Jan 29 2017 03:10:30:000AM	soundlegion	shaunmza	["blog"]
Jan 29 2017 04:27:42:000PM	foti	shaunmza	["blog"]

SO AWESOME!!!!!

However I noticed a little issue right off the cuff, the time it took to run that query was, well, excessive. 6.820 seconds!
Obviously the view is doing the heavy lifting of parsing the JSON string and extracting the data for us. This means it is doing a full table scan on the TxCustoms (which in SQL is not great), so each time we run a query against the view, we are causing a lot of work to be done on the database.

In the interest of being good users of this service, which is provided free of charge, it would be much better to extract this data, so that we can query to our hearts content without having negative effects on STEEMSQL.

Roger. We are GO for extraction!

The language I am most familiar with is PHP, so I used it to write a quick and dirty import script, to run the queries on STEEMSQL (MS SQL database) and then insert the data retrieved into a local MySQL database. The data is broken up and manipulated to fit into my local data structures, which were designed in such a way as to lend themselves to the queries I thought I would be writing.

Here is the query I run from within PHP;

    SELECT TOP 100000 
        timestamp, 
        CAST(follower AS TEXT) as follower, 
        CAST(following AS TEXT) AS following, 
        CAST(what AS TEXT) AS what 
    FROM TxCustomsFollows 
    WHERE timestamp > '" . $row['date_created'] . "' 
    ORDER BY timestamp ASC";

The 100000 is an arbitrary number I settled on, so that I could leave it on a 10 minute cron for a few hours and let it catch up.
Most of the query time seems to come from the view building up the data, so the difference between fetching 10 records and 100 000 is negligible.

This was the first part done! Now to extract the data, from the data!

I love these memes, always chuckle

Summary

I went through many iterations in my mind on this, from using Go instead of PHP, saving to a Graph Database instead of MySQL etc.

In the end I really needed to see some progress as I was itching to have a prototype up, so the simple straightforward path worked out the best.

I was hoping this post would be more about the whole process, but I am cutting it short here as it is already lengthy, so i will cover the data querying and getting it ready for a graph in my next post, stay tuned!