Steemit Followers - Data extraction - Part 1

in #steemdev8 years ago (edited)

code.png

A little while back I decided to try and build something to help me visualise the activity against my account in terms of people who have followed, unfollowed and muted me. I then wrote a post on that, here is the link,
https://steemit.com/steem/@shaunmza/do-you-want-to-know-who-followed-stopped-following-you-on-steemit

I mentioned I would also write a post on how I got the data out, that is the subject of this post.

Blockchain data, bypassing the blockchain

My desire was to get a working prototype up quickly and with as little hurdles to get over before I could see results. My first choice was practically made for me, 'Where to get the data?'.

I wrote a post some time back on using STEEMSQL (http://steemsql.com/) to query Steem data, https://steemit.com/steemit/@shaunmza/querying-steem-blockchain-data-getting-started and I used a similar methodology to get the follower data. I am absolutely blown away that someone would put in all the effort to take all of the Steem transactions and put them into a database, then make that very same database publicly accessible! 🙌

The other option is to use the Steemd API, which I am in the process of getting to grips with now, however there is a greater learning curve in getting going on that. I have enough SQL experience to have had a fair idea of how to extract and manipulate the data to serve my needs.

We know where, but where... ?

So I knew the source of my data, but where exactly it sat was a mystery to me, lets look at the transactions that get stored on the blockchain, shall we?

SQL to get this data, SELECT DISTINCT(type) FROM Transactions;

  1. account_create
  2. account_create_with_delegation
  3. account_update
  4. account_witness_proxy
  5. account_witness_vote
  6. cancel_transfer_from_savings
  7. change_recovery_account
  8. claim_reward_balance
  9. comment
  10. comment_options
  11. convert
  12. custom
  13. custom_json
  14. delegate_vesting_shares
  15. delete_comment
  16. escrow_approve
  17. escrow_dispute
  18. escrow_release
  19. escrow_transfer
  20. feed_publish
  21. limit_order_cancel
  22. limit_order_create
  23. pow
  24. pow2
  25. recover_account
  26. request_account_recovery
  27. set_withdraw_vesting_route
  28. transfer
  29. transfer_from_savings
  30. transfer_to_savings
  31. transfer_to_vesting
  32. vote
  33. withdraw_vesting
  34. witness_update

That's quite a list, take some time to find the transaction type for 'follows', I will wait 😉
...
...
Yes, none of those relate to following, directly.
I spent some time going through the tables, running selects etc. let me spare you the details on that though. Our reward lies in the 'custom_json' transaction type.

The Transactions table stored only the required data for a transaction, any metadata is stored in a related table, in our case we are interested in the data held within the TxCustoms table, just take my word for it 😀

Now I am not even going to pretend that I know what the custom_json transactions are all used for, but if we take a look at the TxCustoms table, we will see the following transaction type id's in there (I am making < up as I don't know the correct terminology, sub transaction maybe?)
¯\_(ツ)_/¯

SQL for this list is SELECT DISTINCT(tid) FROM TXCustoms;

  1. account_history
  2. bookchain
  3. booyah
  4. charlieshrem
  5. com.steemit.community
  6. follow
  7. message
  8. opinion
  9. reblog
  10. task
  11. test
  12. test_community
  13. testchain

#6 is the one we want, lets see what goes into that shall we?

SQL for the following is SELECT TOP 20 * FROM TXCustoms;

IDtx_idtidjson_metadata
1294602follow{"follower":"steemit","following":"steem","what":["posts"]}
2316479follow{"follower":"red","following":"piedpiper","what":["posts"]}
3316498follow{"follower":"red","following":"piedpiper","what":["posts"]}
4316499follow{"follower":"red","following":"piedpiper","what":["posts"]}
5316508follow{"follower":"jsc","following":"officialfuzzy","what":["posts"]}
6316509follow{"follower":"jsc","following":"officialfuzzy","what":["posts"]}
7316530follow{"follower":"jsc","following":"officialfuzzy","what":["posts"]}
8316625follow{"follower":"jsc","following":"cryptogee","what":["posts"]}
9318083follow{"follower":"jsc","following":"ico","what":["posts"]}
10318087follow{"follower":"jsc","following":"ico","what":[]}
11324105follow{"follower":"jsc","following":"max-keiser","what":["blog"]}
12324123follow{"follower":"jsc","following":"max-keiser","what":["blog","posts"]}
13324127follow{"follower":"jsc","following":"wmougayar","what":["posts"]}
14331143follow{"follower":"arhag","following":"dantheman","what":["blog"]}
15331159follow{"follower":"arhag","following":"dantheman","what":[]}
16331167follow{"follower":"arhag","following":"dantheman","what":["blog"]}
17331169follow{"follower":"nextgencrypto","following":"dantheman","what":["blog"]}
18331497follow{"follower":"summon","following":"schnitzler","what":["blog"]}
19331544follow{"follower":"tuck-fheman","following":"cryptoctopus","what":["blog"]}
20331689follow{"follower":"pixellated","following":"tuck-fheman","what":["blog"]}

Note

  • The what portion changed from posts to blog, I assume this was in the very beginning and the developers decided to stick with blog
  • When what is empty it implies a user was unfollowed OR unmuted

So looking at that, we can see that what we want is in the database, but seeing as it is locked up in a JSON object within a field, extracting it is going to be troublesome 😣 So close though!!

After this, I spent some more time digging around, fortunately, the software I use, Valentina Studio, has a handy Schema browser, which means I was able to see that there were views as well. Taking a look at the existing views I saw a promising one named TxCustomsFollows, the views structure filled me with glee, I had found my source!

SQL for the following data SELECT TOP 5 * FROM TxCustomsFollows WHERE following = 'shaunmza'; (replace shaumza with your own username, so easy!)

timestampfollowerfollowingwhat
Jan 26 2017 11:20:09:000PMdimimpshaunmza["ignore"]
Jan 28 2017 05:03:09:000PMluke490shaunmza["ignore"]
Jan 28 2017 11:37:54:000PMzedikaredirectshaunmza["blog"]
Jan 29 2017 03:10:30:000AMsoundlegionshaunmza["blog"]
Jan 29 2017 04:27:42:000PMfotishaunmza["blog"]

SO AWESOME!!!!!

However I noticed a little issue right off the cuff, the time it took to run that query was, well, excessive. 6.820 seconds!
Obviously the view is doing the heavy lifting of parsing the JSON string and extracting the data for us. This means it is doing a full table scan on the TxCustoms (which in SQL is not great), so each time we run a query against the view, we are causing a lot of work to be done on the database.

In the interest of being good users of this service, which is provided free of charge, it would be much better to extract this data, so that we can query to our hearts content without having negative effects on STEEMSQL.

Roger. We are GO for extraction!

The language I am most familiar with is PHP, so I used it to write a quick and dirty import script, to run the queries on STEEMSQL (MS SQL database) and then insert the data retrieved into a local MySQL database. The data is broken up and manipulated to fit into my local data structures, which were designed in such a way as to lend themselves to the queries I thought I would be writing.

Here is the query I run from within PHP;

    SELECT TOP 100000 
        timestamp, 
        CAST(follower AS TEXT) as follower, 
        CAST(following AS TEXT) AS following, 
        CAST(what AS TEXT) AS what 
    FROM TxCustomsFollows 
    WHERE timestamp > '" . $row['date_created'] . "' 
    ORDER BY timestamp ASC";

The 100000 is an arbitrary number I settled on, so that I could leave it on a 10 minute cron for a few hours and let it catch up.
Most of the query time seems to come from the view building up the data, so the difference between fetching 10 records and 100 000 is negligible.

This was the first part done! Now to extract the data, from the data!

1o7lw5.jpg
I love these memes, always chuckle

Summary

I went through many iterations in my mind on this, from using Go instead of PHP, saving to a Graph Database instead of MySQL etc.

In the end I really needed to see some progress as I was itching to have a prototype up, so the simple straightforward path worked out the best.

I was hoping this post would be more about the whole process, but I am cutting it short here as it is already lengthy, so i will cover the data querying and getting it ready for a graph in my next post, stay tuned!

Follow me!

Other posts that may interest you in a similar vein;

Sort:  

All of it just went over my head:D

Shoot ok, maybe it was a bit dry?

Next post I will try and inject some humour, explain more on why I am doing something, rather than showing 'what'.

Thanks for the feedback!

No no. The fault is mine, I didnot get the technicalities in there so..

Excellent post!

Thanks for this great service for the community.

Looking at https://api.steemdata.com/health it seems mongodb is around 19 hours behind.

However when trying to get my following it is empty. Tested and working with older usernames.

res = s.Accounts.find_one({'name': 'codewithcheese'}, projection={'following': 1, '_id': 0})

Coin Marketplace

STEEM 0.27
TRX 0.21
JST 0.039
BTC 97426.90
ETH 3593.28
USDT 1.00
SBD 3.88