I.T. Spices The LINUX Way

Python In The Shell: The STEEMIT Ecosystem – Post #96

THE BEAUTY OF PYTHON AND LINUX SHELL COMBINED

Continuing from the previous post here:

https://steemit.com/blockchain/@lightingmacsteem/3rnbel-i-t-spices-the-linux-way

And to finish our discussion of the short Python code below:

1 ###FIND THE LINE NUMBER OF THE BLOCK_ID IN THE INSERTFILE
2 if maxid != 0:
3   findthis = ('"block_id": "' + blockid[1] + '"')
4   ccc = ("grep -m1 -n '" + findthis + "' " + insertfile + " | cut --delimiter=':' -f1")   
5   line = int((os.popen(ccc).read()).strip()) + 1
6   lastline = int(line) + int(299999)
7   endline = int(lastline) + 1
8 else:
9   line = int(1)
10  lastline = line + int(299999)
11  endline = int(lastline) + 1


Getting Specific Lines Out Of Millions

6   lastline = int(line) + int(299999)

Line 6 computes the lastline variable by adding 299999 to the specific line number parsed previously out of the multi-million-line source file. The 299999 counter simply means that we want 300 thousand lines in total, counting the starting line itself.

7   endline = int(lastline) + 1

Line 7 sets endline to one line past lastline, the boundary where this batch ends. By now we can already picture that we are defining the start and end lines of the new records that we are going to insert into our database; the short check below walks through the arithmetic with a hypothetical starting line.
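
To make the arithmetic concrete, here is a minimal check; the starting line number 1500001 is made up purely for illustration, it is not a value from the actual run:

###A QUICK CHECK OF THE BATCH ARITHMETIC (the starting line number is hypothetical)
line = 1500001                       # pretend the grep above found the last saved block here
lastline = int(line) + int(299999)   # 1800000, the last line of this batch
endline = int(lastline) + 1          # 1800001, one line past the batch

print(lastline - line + 1)           # 300000 lines, counting the starting line itself
print(endline)                       # 1800001, the boundary after the batch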


A Clearer Explanation Up To This Point

Let us pause and restate what we are really trying to accomplish here.

There is a very large file containing JSON-formatted data from the STEEMIT blockchain. Inside that file are records that we want to insert into the database. This would have been easy if not for the sheer number of lines in the file, so we devised a way to do it in batches, going back and forth between the huge JSON file and the database.

We only process 300 thousand lines at each cycle, starting from the last line already saved in the database. That last record is very important because it tells us, without a doubt, exactly how far into the huge source file we have gone, so each cycle simply continues from there, keeping the routines very simple yet fast.
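
As a rough sketch of that "where did we stop" lookup, here is one way it could be done with Python's built-in sqlite3 module. The database file, table name and columns are hypothetical; the actual series may well use a different database engine and schema, but the idea of fetching the last saved record (the maxid and block_id used by the snippet above) is the same:

###HYPOTHETICAL LOOKUP OF THE LAST SAVED RECORD (sqlite3 used only for illustration)
import sqlite3

conn = sqlite3.connect("steem.db")      # hypothetical database file
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS steem_blocks (id INTEGER PRIMARY KEY, block_id TEXT)")
cur.execute("SELECT id, block_id FROM steem_blocks ORDER BY id DESC LIMIT 1")
row = cur.fetchone()
if row:
    maxid, last_block_id = row          # last record already in the database
else:
    maxid, last_block_id = 0, None      # empty database, so we start from line 1
conn.close()
print(maxid, last_block_id)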

We also need to remember that the huge JSON file runs into the millions of lines! At 300 thousand lines per cycle, each batch is big enough for our purpose but not so huge as to choke the very system running this code.

We iterate through the lines of the file: we take it line by line, examine the data, then insert it into the database. We do this by extracting the needed lines first and saving them into a memory-backed file, which makes the repeated reads much faster. It is this memory file that we read over and over, because reading the huge source file directly would be slow.
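
A minimal sketch of that line-per-line examination, assuming the batch has already been dumped into a RAM-backed file under /dev/shm (a common memory-backed path on Linux); the file name is made up, and the '"block_id":' marker is simply the same pattern the grep above searches for:

###READ THE BATCH LINE PER LINE FROM A MEMORY-BACKED FILE (the path is hypothetical)
memfile = "/dev/shm/steem_batch.json"

with open(memfile) as handle:
    for rawline in handle:
        rawline = rawline.strip()
        if '"block_id":' in rawline:
            blockid_value = rawline.split('"')[3]   # the value between the third and fourth quotes
            print(blockid_value)
            # ...examine the record further, then insert it into the database...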

Python does the overall control and manipulation, while the Linux shell does the heavy job in one shot, examining the huge file in the fastest way possible and extracting only the 300 thousand lines needed at each cycle.
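
As a hedged sketch of that division of labour, the one-shot extraction could look something like the sed command below, built and launched from Python the same way the grep above is. This is not the author's exact command; the file names and starting line are hypothetical:

###ONE-SHOT EXTRACTION OF THE 300 THOUSAND LINES INTO A MEMORY FILE (names are hypothetical)
import os

insertfile = "steem_blocks.json"          # the huge JSON source file
memfile = "/dev/shm/steem_batch.json"     # the fast, memory-backed working copy

line = 1500001                            # start of the batch, as found by the grep above
lastline = int(line) + int(299999)        # end of the batch

ccc = ("sed -n '" + str(line) + "," + str(lastline) + "p' " + insertfile + " > " + memfile)
os.popen(ccc).read()                      # the shell scans the huge file once and dumps the batch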

The next post will finish off this topic on parsing the huge file.


Man-made Systems Can Always Be Improved By The Same Men

Given such tools, I really believe any combination of problems relating to data manipulation and decision-making can be solved, no exceptions.

I do mean those that are made by men, of course.


“Brotherhood Ends When The Cost Of Sacrifice Begins……...”
