Part 27: Analysing The Amount Of Transactions On The STEEM Blockchain In Parallel

in utopian-io •  4 months ago

*(image: steem-python.png)*

This tutorial is part of a series where different aspects of programming with steem-python are explained. Links to the other tutorials can be found in the curriculum section below. This part explains how to perform a multi-threaded block analysis of the STEEM blockchain with a MySQL database.


Repository

https://github.com/steemit/steem-python

What will I learn

  • Store data in parallel
  • MySQL blocking
  • Reduce RAM consumption
  • Plotting data

Requirements

  • Python3.6
  • steem-python
  • MySQL
  • PHPMyAdmin
  • Matplotlib

Difficulty

  • intermediate

Tutorial

Preface

The tutorial builds on the previous two tutorials: Part 21: Use Multi Threading To Analyse The Steem Blockchain In Parallel and Part 26: Retrieving Data From The STEEM Blockchain And Storing Into A MySQL Database.

STEEM has a block time of 3 seconds, which means that 28 800 new blocks are created every day. Analysing those serially takes a long time; doing so in parallel drastically improves performance, but also introduces new problems, which will be discussed in this tutorial.
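A quick sanity check of these numbers:

```python
# STEEM targets one block every 3 seconds
BLOCK_TIME = 3  # seconds

blocks_per_day = 24 * 60 * 60 // BLOCK_TIME
blocks_per_week = blocks_per_day * 7

print(blocks_per_day)   # 28800
print(blocks_per_week)  # 201600
```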

This tutorial will show how to analyse a large number of blocks and count the number of transactions occurring per minute, per hour and per day, storing all the data in a database and plotting it afterwards.

Setup

Download the files from GitHub. There are 3 files: analysis.py, which is the main file; database.py, which contains the interactions with the MySQL database; and plot.py, which is used to plot the generated data. analysis.py takes two arguments that set the amount_of_blocks and thread_amount. It is important that amount_of_blocks is an even multiple of thread_amount. plot.py takes one argument, which sets the date to plot the data for.
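The argument handling can be sketched as follows; the repository's actual parsing may differ, and `parse_args` is just an illustrative helper:

```python
def parse_args(argv):
    """Read amount_of_blocks and thread_amount from the command line."""
    amount_of_blocks = int(argv[1])
    thread_amount = int(argv[2])
    # amount_of_blocks must be an even multiple of thread_amount so that
    # every thread gets an equally sized share of the work
    if amount_of_blocks % thread_amount != 0:
        raise ValueError('amount_of_blocks must be a multiple of thread_amount')
    return amount_of_blocks, thread_amount

# e.g. python analysis.py 65536 16
amount_of_blocks, thread_amount = parse_args(['analysis.py', '65536', '16'])
print(amount_of_blocks // thread_amount)  # 4096 blocks per thread
```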

Run the scripts as follows:
> python analysis.py 65536 16
> python plot.py 2018-07-06

Configure the database

The code is set up to store the data in a database steem with 3 different tables: txs_minute, txs_hour and txs_day. It also uses a user test with the password Test!234#. Executing the code below will set everything up in the same manner. If you wish to use your own settings, make the appropriate changes in database.py.

Login as the root user:

mysql -u root -p

Create the test account that is used in the code:

GRANT ALL PRIVILEGES ON *.* TO 'test'@'localhost' IDENTIFIED BY 'Test!234#';

Create a new database called steem:

CREATE DATABASE steem;

Create the tables for storing the data:

CREATE TABLE `steem`.`txs_minute` ( `id` INT NOT NULL AUTO_INCREMENT , `date` DATETIME NOT NULL , `amount` INT NOT NULL , PRIMARY KEY (`id`), UNIQUE (`date`)) ENGINE = InnoDB;

CREATE TABLE `steem`.`txs_hour` ( `id` INT NOT NULL AUTO_INCREMENT , `date` DATETIME NOT NULL , `amount` INT NOT NULL , PRIMARY KEY (`id`), UNIQUE (`date`)) ENGINE = InnoDB;

CREATE TABLE `steem`.`txs_day` ( `id` INT NOT NULL AUTO_INCREMENT , `date` DATETIME NOT NULL , `amount` INT NOT NULL , PRIMARY KEY (`id`), UNIQUE (`date`)) ENGINE = InnoDB;

Exit mysql:

\q


*(image: Screenshot 2018-07-12 18.06.15.png)*

Store data in parallel

The number of transactions for a given date, hour and minute will be stored in 3 different tables, which allows the analysis afterwards to be done at different resolutions. The timestamp is used as the unique key in a dict that acts as a frequency counter. All of this information is taken from the timestamp inside each block, which is converted to a datetime_object.

timestamp = block['timestamp']
datetime_object = datetime.strptime(timestamp, '%Y-%m-%dT%H:%M:%S')

hour = datetime_object.hour
minute = datetime_object.minute

for transaction in block['transactions']:
    self.process_transaction(f'{hour}:{minute}:00', self.data_minute)
    self.process_transaction(f'{hour}:00:00', self.data_hour)
    self.process_transaction('00:00:00', self.data_day)



The way the parallelisation works in this code is that all the blocks to be analysed are divided evenly among the threads. This means that several threads can be working on the same unique key, for example the same date.

def process_transaction(self, string, data):
    key = str(self.date) + ' ' + string
    if key in data:
        data[key] += 1
    else:
        data[key] = 1
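The even split itself can be sketched like this; the function name and the head-block handling are illustrative, not taken from the repository:

```python
def split_block_range(head_block, amount_of_blocks, thread_amount):
    """Divide the last amount_of_blocks blocks evenly over thread_amount
    threads; returns one (start, end) pair per thread, end exclusive."""
    blocks_per_thread = amount_of_blocks // thread_amount
    first = head_block - amount_of_blocks + 1
    return [(first + n * blocks_per_thread,
             first + (n + 1) * blocks_per_thread)
            for n in range(thread_amount)]

# 8 blocks divided over 2 threads, with head block 100
print(split_block_range(100, 8, 2))  # [(93, 97), (97, 101)]
```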


When storing the data from memory to the database, multiple threads may want to add the same unique key to the table. To prevent this, a check first has to be done to see if the key is already in the table. If so, the record must be updated instead of inserted.

if self.check_if_record_exist(date, table):
    update_record(amount, date, table)
else:
    insert_record(amount, date, table)

MySQL blocking

Now imagine two threads accessing the database at the same time. Thread 1 and thread 2 both check whether the date 2018-07-06 is already in the database. If one of the threads reads the database just before the other thread performs an insert or update, it sees the old data. It would then proceed to use that old data to update the record, wiping out the work of the other thread. This can be prevented by locking the table.

cursor.execute(f"LOCK TABLES {table} WRITE;")



By locking the table, no other thread can perform any writing actions on it. And because each thread acquires the lock before doing its own read, stale reads are prevented as well. A thread that requests a lock which is already held simply waits until the lock is released, so from the Python side this is handled gracefully.

cursor.execute("UNLOCK TABLES;")
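Putting the lock and the check-then-write together, the critical section could be sketched as follows. The cursor is any DB-API cursor (for example from pymysql), and the column and helper names are illustrative rather than copied from the repository:

```python
def store_amount(cursor, table, date, amount):
    """Upsert one (date, amount) record while holding a WRITE lock, so no
    other thread can read or write between the check and the write."""
    cursor.execute(f"LOCK TABLES {table} WRITE;")
    try:
        cursor.execute(f"SELECT amount FROM {table} WHERE date = %s;", (date,))
        if cursor.fetchone():
            # the record exists: add the new count to the stored one
            cursor.execute(
                f"UPDATE {table} SET amount = amount + %s WHERE date = %s;",
                (amount, date))
        else:
            cursor.execute(
                f"INSERT INTO {table} (date, amount) VALUES (%s, %s);",
                (date, amount))
    finally:
        # always release the lock, even if one of the queries raises
        cursor.execute("UNLOCK TABLES;")
```

The try/finally ensures the lock is released even when a query fails; without it, a crashed thread would leave the table locked for every other thread.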

Reduce RAM consumption

When performing large scale analysis on the blockchain, RAM usage can increase significantly. Reducing RAM usage while keeping performance in mind is therefore important. Depending on the type of analysis being done, it is recommended to clear the data buffers regularly, but not too often, to limit overhead. With 28 800 blocks per day, 1 week amounts to 201 600 blocks and 4 weeks to 806 400 blocks. At the time of writing each block contains about 50 transactions, so the number of operations adds up quickly, especially when storing more variables.

In this tutorial, after every hour of blocks analysed, each thread inserts its data into the database and clears its buffers. This also spreads out the interaction with the database, so that threads are less likely to have to wait on each other.

if datetime_object.hour != self.hour:
    self.dump_data()
    self.data_hour = {}
    self.data_minute = {}
    self.data_day = {}
    self.hour = datetime_object.hour

Plotting data

pyplot, which is part of matplotlib, is a very useful tool for plotting graphs. To plot the number of transactions per hour for a given date, the query would be as follows. BETWEEN allows a specific date range to be set.

query = (f"SELECT `date`,`amount` FROM `{table}` WHERE `date` " +
         f"BETWEEN '{start_date}' AND '{end_date}' " +
         "ORDER BY `date` ASC;")
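Building the query with f-strings works here because the dates come from datetime objects rather than user input, but letting the driver substitute the values is the safer habit. A parameterised sketch (only values can be placeholders; the table name still has to be formatted in):

```python
table = 'txs_hour'

query = (f"SELECT `date`, `amount` FROM `{table}` "
         "WHERE `date` BETWEEN %s AND %s "
         "ORDER BY `date` ASC;")

# the DB-API driver fills in the placeholders:
# cursor.execute(query, (start_date, end_date))
print(query)
```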

datetime can be used with timedelta to calculate the end_date from a start_date that looks like %Y-%m-%d as follows:

start_date = datetime.strptime(date, '%Y-%m-%d').date()
end_date = start_date + timedelta(days=1)

The x and y values for the graph are fetched from the database and put into separate lists: x is used to plot the times, while y plots the number of transactions.

x, y = db.get_data(start_date, 'txs_hour')

for result in self.cur.fetchall():
    x.append(result[0])
    y.append(result[1])

return x, y

The values are then plotted using pyplot. Labels are added for the x and y axis. Finally a grid is added and tight_layout() is called so that everything fits on the generated graph.

plt.plot(x, y)
plt.xlabel('time in [UTC]')
plt.ylabel('transactions/hour')
plt.tight_layout()
plt.grid()

The x values are datetime objects, so the dates module from matplotlib is used to process them. major_locator sets which x values are displayed on the graph, for which hours are used. For formatting, %H:%M is used to omit the date and the seconds. xlim() sets the start and end values for the x axis, and xticks(rotation=45) rotates the x labels so they stay readable.

hours = mdates.HourLocator()
t_fmt = mdates.DateFormatter('%H:%M')
plt.gca().xaxis.set_major_locator(hours)
plt.gca().xaxis.set_major_formatter(t_fmt)
plt.xlim([x[0], x[-1]])
plt.xticks(rotation=45)

Running the script

Running analysis.py with the arguments amount_of_blocks and thread_amount starts analysing that number of blocks, starting from the current head_block - amount_of_blocks. It then stores the number of transactions occurring in all of those blocks for each minute, hour and day. Depending on the amount of blocks being analysed, this may still take a while.

> python analysis.py 262144 16

This took at least half an hour and resulted in the following table for days:


*(image: Screenshot 2018-07-12 18.06.29.png)*

Taking a closer look, running plot.py for the date 2018-07-06 returned the following graph:
> python plot.py 2018-07-06


*(image: myfig.png)*

Curriculum

Set up:

The code for this tutorial can be found on GitHub!

This tutorial was written by @juliank.
