Steemit Followers - Data extraction - Part 1
A little while back I decided to try and build something to help me visualise the activity against my account in terms of people who have followed, unfollowed and muted me. I then wrote a post on that, here is the link,
https://steemit.com/steem/@shaunmza/do-you-want-to-know-who-followed-stopped-following-you-on-steemit
I mentioned I would also write a post on how I got the data out, that is the subject of this post.
Blockchain data, bypassing the blockchain
My desire was to get a working prototype up quickly and with as little hurdles to get over before I could see results. My first choice was practically made for me, 'Where to get the data?'.
I wrote a post some time back on using STEEMSQL (http://steemsql.com/) to query Steem data, https://steemit.com/steemit/@shaunmza/querying-steem-blockchain-data-getting-started and I used a similar methodology to get the follower data. I am absolutely blown away that someone would put in all the effort to take all of the Steem transactions and put them into a database, then make that very same database publicly accessible! 🙌
The other option is to use the Steemd API, which I am in the process of getting to grips with now, however there is a greater learning curve in getting going on that. I have enough SQL experience to have had a fair idea of how to extract and manipulate the data to serve my needs.
We know where, but where... ?
So I knew the source of my data, but where exactly it sat was a mystery to me, lets look at the transactions that get stored on the blockchain, shall we?
SQL to get this data, SELECT DISTINCT(type) FROM Transactions;
- account_create
- account_create_with_delegation
- account_update
- account_witness_proxy
- account_witness_vote
- cancel_transfer_from_savings
- change_recovery_account
- claim_reward_balance
- comment
- comment_options
- convert
- custom
- custom_json
- delegate_vesting_shares
- delete_comment
- escrow_approve
- escrow_dispute
- escrow_release
- escrow_transfer
- feed_publish
- limit_order_cancel
- limit_order_create
- pow
- pow2
- recover_account
- request_account_recovery
- set_withdraw_vesting_route
- transfer
- transfer_from_savings
- transfer_to_savings
- transfer_to_vesting
- vote
- withdraw_vesting
- witness_update
That's quite a list, take some time to find the transaction type for 'follows', I will wait 😉
...
...
Yes, none of those relate to following, directly.
I spent some time going through the tables, running selects etc. let me spare you the details on that though. Our reward lies in the 'custom_json' transaction type.
The Transactions
table stored only the required data for a transaction, any metadata is stored in a related table, in our case we are interested in the data held within the TxCustoms
table, just take my word for it 😀
Now I am not even going to pretend that I know what the custom_json
transactions are all used for, but if we take a look at the TxCustoms
table, we will see the following transaction type id's in there (I am making < up as I don't know the correct terminology, sub transaction maybe?)
¯\_(ツ)_/¯
SQL for this list is SELECT DISTINCT(tid) FROM TXCustoms;
- account_history
- bookchain
- booyah
- charlieshrem
- com.steemit.community
- follow
- message
- opinion
- reblog
- task
- test
- test_community
- testchain
#6 is the one we want, lets see what goes into that shall we?
SQL for the following is SELECT TOP 20 * FROM TXCustoms;
ID | tx_id | tid | json_metadata |
---|---|---|---|
1 | 294602 | follow | {"follower":"steemit","following":"steem","what":["posts"]} |
2 | 316479 | follow | {"follower":"red","following":"piedpiper","what":["posts"]} |
3 | 316498 | follow | {"follower":"red","following":"piedpiper","what":["posts"]} |
4 | 316499 | follow | {"follower":"red","following":"piedpiper","what":["posts"]} |
5 | 316508 | follow | {"follower":"jsc","following":"officialfuzzy","what":["posts"]} |
6 | 316509 | follow | {"follower":"jsc","following":"officialfuzzy","what":["posts"]} |
7 | 316530 | follow | {"follower":"jsc","following":"officialfuzzy","what":["posts"]} |
8 | 316625 | follow | {"follower":"jsc","following":"cryptogee","what":["posts"]} |
9 | 318083 | follow | {"follower":"jsc","following":"ico","what":["posts"]} |
10 | 318087 | follow | {"follower":"jsc","following":"ico","what":[]} |
11 | 324105 | follow | {"follower":"jsc","following":"max-keiser","what":["blog"]} |
12 | 324123 | follow | {"follower":"jsc","following":"max-keiser","what":["blog","posts"]} |
13 | 324127 | follow | {"follower":"jsc","following":"wmougayar","what":["posts"]} |
14 | 331143 | follow | {"follower":"arhag","following":"dantheman","what":["blog"]} |
15 | 331159 | follow | {"follower":"arhag","following":"dantheman","what":[]} |
16 | 331167 | follow | {"follower":"arhag","following":"dantheman","what":["blog"]} |
17 | 331169 | follow | {"follower":"nextgencrypto","following":"dantheman","what":["blog"]} |
18 | 331497 | follow | {"follower":"summon","following":"schnitzler","what":["blog"]} |
19 | 331544 | follow | {"follower":"tuck-fheman","following":"cryptoctopus","what":["blog"]} |
20 | 331689 | follow | {"follower":"pixellated","following":"tuck-fheman","what":["blog"]} |
Note
- The what portion changed from posts to blog, I assume this was in the very beginning and the developers decided to stick with blog
- When what is empty it implies a user was unfollowed OR unmuted
So looking at that, we can see that what we want is in the database, but seeing as it is locked up in a JSON object within a field, extracting it is going to be troublesome 😣 So close though!!
After this, I spent some more time digging around, fortunately, the software I use, Valentina Studio, has a handy Schema browser, which means I was able to see that there were views as well. Taking a look at the existing views I saw a promising one named TxCustomsFollows
, the views structure filled me with glee, I had found my source!
SQL for the following data SELECT TOP 5 * FROM TxCustomsFollows WHERE following = 'shaunmza';
(replace shaumza with your own username, so easy!)
timestamp | follower | following | what |
---|---|---|---|
Jan 26 2017 11:20:09:000PM | dimimp | shaunmza | ["ignore"] |
Jan 28 2017 05:03:09:000PM | luke490 | shaunmza | ["ignore"] |
Jan 28 2017 11:37:54:000PM | zedikaredirect | shaunmza | ["blog"] |
Jan 29 2017 03:10:30:000AM | soundlegion | shaunmza | ["blog"] |
Jan 29 2017 04:27:42:000PM | foti | shaunmza | ["blog"] |
SO AWESOME!!!!!
However I noticed a little issue right off the cuff, the time it took to run that query was, well, excessive. 6.820 seconds!
Obviously the view is doing the heavy lifting of parsing the JSON string and extracting the data for us. This means it is doing a full table scan on the TxCustoms
(which in SQL is not great), so each time we run a query against the view, we are causing a lot of work to be done on the database.
In the interest of being good users of this service, which is provided free of charge, it would be much better to extract this data, so that we can query to our hearts content without having negative effects on STEEMSQL.
Roger. We are GO for extraction!
The language I am most familiar with is PHP, so I used it to write a quick and dirty import script, to run the queries on STEEMSQL (MS SQL database) and then insert the data retrieved into a local MySQL database. The data is broken up and manipulated to fit into my local data structures, which were designed in such a way as to lend themselves to the queries I thought I would be writing.
Here is the query I run from within PHP;
SELECT TOP 100000
timestamp,
CAST(follower AS TEXT) as follower,
CAST(following AS TEXT) AS following,
CAST(what AS TEXT) AS what
FROM TxCustomsFollows
WHERE timestamp > '" . $row['date_created'] . "'
ORDER BY timestamp ASC";
The 100000
is an arbitrary number I settled on, so that I could leave it on a 10 minute cron for a few hours and let it catch up.
Most of the query time seems to come from the view building up the data, so the difference between fetching 10 records and 100 000 is negligible.
This was the first part done! Now to extract the data, from the data!
I love these memes, always chuckle
Summary
I went through many iterations in my mind on this, from using Go instead of PHP, saving to a Graph Database instead of MySQL etc.
In the end I really needed to see some progress as I was itching to have a prototype up, so the simple straightforward path worked out the best.
I was hoping this post would be more about the whole process, but I am cutting it short here as it is already lengthy, so i will cover the data querying and getting it ready for a graph in my next post, stay tuned!
Other posts that may interest you in a similar vein;
All of it just went over my head:D
Shoot ok, maybe it was a bit dry?
Next post I will try and inject some humour, explain more on why I am doing something, rather than showing 'what'.
Thanks for the feedback!
No no. The fault is mine, I didnot get the technicalities in there so..
Excellent post!
Thanks for this great service for the community.
Looking at https://api.steemdata.com/health it seems mongodb is around 19 hours behind.
However when trying to get my following it is empty. Tested and working with older usernames.
res = s.Accounts.find_one({'name': 'codewithcheese'}, projection={'following': 1, '_id': 0})