I.T. Spices The LINUX Way

Simplified Approach To The STEEM Blockchain Data Gathering - Post #42

JOSH and our cute puppy COFFEE……...

--

INFORMATION IS KEY TO THE FUTURE

I have spent the last two weeks trying to understand and fully grasp the STEEM blockchain and its underlying processes. In this series of tutorials I will try my best to make everything as dummified as possible, as I always intend to empower the masses in the STEEMIT community, so that they can at least gather some basic information and possibly gain more interest in this wonderful platform.

Before anything else, let me say in advance that the very first thing a STEEM data gatherer needs is the ability to “parse” a JSON-formatted text file. So to the learners here, it is a very wise initiative to give yourself the chance to learn the format of a JSON file, as I can clearly see this is the future of all data formats, one which will reduce our present databases to mere “pointers” in the near future of Big Data Mining.
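
To give the learners a head start, below is a minimal sketch of such parsing in python; the file name blocks.json is just a made-up example here, any file holding one JSON record per line will do:

++++++++++++++++++++++++++++++++++++++++++

#!/usr/bin/env python3
# A minimal sketch of JSON parsing; blocks.json is a made-up example
# of a file holding one JSON record per line.
import json

with open('blocks.json', 'r') as f:
    for line in f:
        block = json.loads(line)      # one text line becomes a python dict
        print(block['timestamp'])     # any field is now directly reachable

++++++++++++++++++++++++++++++++++++++++++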

Imagine every aspect of life and business in the future where everything runs on ones and zeros: your identity, your transactions, your biography, your health records, everything inside a huge file. I believe the file format of that future is JSON, so what better time to start learning than now.

--

THE STEEM BLOCKCHAIN SPREAD OVER 1 GIGABYTE FILES

I finally have a smooth flow of gathering the blockchain from public nodes (thanks to gtg’s fast server), slowly but surely spreading it across my disk in 1-gigabyte files for some play-arounds later.

Here is my simple python code (I slightly modified @loki’s approach) that uses steem-python (not piston), as it has the “get_blocks_range” call which I find much, much faster:

++++++++++++++++++++++++++++++++++++++++++

#!/usr/bin/env python3
import sys
from steem import Steem
from pprint import pprint

current_block = int(sys.argv[1])   # first block in the range
limit_block = int(sys.argv[2])     # last block in the range

nodes = ['https://gtg.steem.house:8090',]

s = Steem(nodes)

# one call fetches the whole range of blocks
block = s.get_blocks_range(current_block, limit_block)
pprint(block)

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
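
For a quick test, this fetcher can be run on its own; something like /root/steem-python/bin/fetch-steem-block-to-JSON 1 10 should print the blocks in that range to the screen in python-dict form.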

Here is my BASH script from which the python script above is being “managed”; notice the sys.argv calls in the python script, as these are intended to accept the variables passed from the BASH environment:

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
#!/bin/sh

start=$1    # first block of the current range
end=$2      # last block of the current range
limit=$3    # stop once this block number is reached
step=$4     # how many blocks to move forward per cycle

while [ $end -le $limit ]; do
    echo "$(date) ----- Started processing blocks $start to $end....." >/dev/tty
    #####THIS IS THE ONLY REAL FETCH; SAVE TO FILE
    /root/steem-python/bin/fetch-steem-block-to-JSON $start $end | /root/steem-python/bin/convert-block-to-REALJSON | jq -c '.[]'
    echo "$(date) ----- Finished processing blocks $start to $end....." >/dev/tty
    start=$(($start + $step))
    end=$(($end + $step))
    if [ $start -ge $limit ]; then
        exit
    fi
    if [ $end -ge $limit ]; then
        end=$limit
    fi
done
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
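
Putting it all together, I just redirect the BASH script’s output to the target file; something like myscript 1 10000 1000000 10000 > steem-blocks-0001.json (the script and file names here are only examples) marches from block 1 towards block 1000000 in steps of 10000, and only the clean JSON lines land in the file since the progress messages go straight to /dev/tty.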

If you observe the BASH script above, it calls another python parser that converts any data gathered from the STEEM blockchain into real JSON format (I got this from an internet URL and tested it to be perfect!!!):

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
#!/usr/bin/env python3

import json
import sys

def dump(s):
    # eval turns the python-dict text back into an object; json.dumps makes it real JSON
    print(json.dumps(eval(s)))

def main(args):
    if not args:
        dump(''.join(sys.stdin.readlines()))
    else:
        for arg in args:
            dump(''.join(open(arg, 'r').readlines()))
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
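
To see why this converter is needed: the fetcher pprints python-dict text with single quotes, which jq will not accept, while json.dumps emits strict JSON with double quotes. A tiny illustration using a made-up two-field record:

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

#!/usr/bin/env python3
# Illustration only: python-dict text versus strict JSON.
import json
from pprint import pformat

record = {'witness': 'gtg', 'block_num': 1000000}   # made-up sample record
print(pformat(record))     # {'block_num': 1000000, 'witness': 'gtg'}  <-- jq rejects this
print(json.dumps(record))  # {"block_num": 1000000, "witness": "gtg"}  <-- real JSON

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++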

The way the above scripts operate can be described as:

-The BASH script, as the overall manager of the STEEM blockchain fetch, runs the python script so that the get_blocks_range moves forward each cycle. BASH is the “controller and incrementer” while python is the “fetcher”. The BASH script accepts these arguments:

*$1 is the start block in the range
*$2 is the end block in the range; this block will not be included in the fetch
*$3 is the limit block; once this is reached, the BASH script needs another file for the next million-record batch
*$4 is the increment; as the start and end blocks finish fetching at each cycle, the BASH script moves the start and end of the range forward until the limit is reached

-Another python script is called by BASH during each fetch cycle so that every record becomes unique, real JSON, one blockchain record per line as saved in the 1GB files.

-A very powerful, fast and versatile program, jq, is deployed to “parse” and play around with the JSON-formatted blockchain stream. This will be the real data cruncher moving forward in this fun exercise of mine as I play around with the STEEM blockchain data. The jq utility can be further helped by the sed and grep utilities for more precision in mining the data.
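
As a small taste of jq at work, a filter like jq -c 'select(.witness == "gtg")' steem-blocks-0001.json (file name again just an example) pulls out only the blocks produced by that witness, one JSON line each.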

--

THIS APPROACH MAY MEAN MUCH IN THE FUTURE

A very simple approach really, but one that carries huge potential. You wanna know how huge? Hmmmm, what if I told you that it is possible that in the future all things bigdata will be reversed? Meaning, all real data will live in plain files, and all our very fast databases of today will be reduced to mere pointers, with Web APIs answering only what the files are, where they are located, and which lines to read, the files themselves held identically by parallel processing computers.

I have a very nice benchmark, preliminary but promising. On a million-line file of JSON blockchain data, no matter how I play around with the volumes of data in layers, a query never even reaches 0.01 of a second as timed. It is worth noting that my disk is an external hard disk, and both the Linux Mint OS and the 1GB of JSON data sit on it.
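
For anyone who wants to try a similar timing, the standard time utility is enough; something like time jq -c '.witness' steem-blocks-0001.json > /dev/null (file name illustrative) times a full pass over every record in the file.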

Imagine what the new datacenter-grade NAS storage on pure flash disks can do. Moreover, I can already foresee a “parallel processing” approach to data mining done this way: say a hundred computers (even commodity ones), each holding the same 1-million-line JSON file and trying to outrun the others every time a query arrives. Moving way ahead, since it is very clear to all of us that the blockchain is peer-to-peer, what if a thousand “seeds” all process in parallel? Surely it will cause no headaches for such processors, as all each one needs to hold is just a million-line JSON file.

I really wonder what the future processors and hard disks hold in pure power and capacity, and with the possibilities described here, one can clearly see why I spent so much time these past two weeks attaining the simplest, fastest and surest way to collect the STEEM blockchain into 1-million-line JSON files.

What do you think my friend? All good comments welcome.

“There Is Always A Better Way In Doing Things……...”

