A python script to synchronise copies of a log (like the blockchain)

in #sysadmin · 7 years ago (edited)

I have been busy this morning working out a way to reduce the disk activity required to maintain a periodically updated copy of the Steem blockchain block_log file, for the purpose of providing a BitTorrent download of the current version.

Just to explain the use case: each time I take a snapshot of the block_log, I can generate a new .torrent file for it and seed it. By making a daily snapshot and rotating the directory (keeping 7 copies, with a torrent created for each of the last 7 days), a user can be sure that for 7 days they will be able to download the whole thing.

If that time elapses, they can remove the torrent from their client (without deleting the files) and add the latest version. Their BitTorrent client will then check the existing data and, like my update script, only download and rewrite the parts of the files that have changed.

There is no simple existing solution, especially over a network, for minimising the disk activity involved in dealing with very large files. Later I will write a script that uses the RPC to query and assemble all blocks for one day or one week (or whatever interval seems most sensible) and produce a folder with these parts, plus another script that joins them together and lets you transmit or regenerate the whole file.
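The block-assembly script described above could be sketched roughly as follows. This is only an outline under my own assumptions: `assemble_chunk` and `fake_fetch` are hypothetical names, and a real version would replace the stub fetcher with a JSON-RPC call to a Steem node (for example `condenser_api.get_block`):

```python
import json

def assemble_chunk(fetch_block, start, end):
    """Concatenate the serialised blocks start..end into one chunk.

    fetch_block(n) must return block number n as a dict; in a real
    script it would POST a JSON-RPC request to a Steem node.
    """
    parts = []
    for n in range(start, end + 1):
        block = fetch_block(n)
        # One block per line, so the chunk can be split apart again.
        parts.append(json.dumps(block, sort_keys=True))
    return ('\n'.join(parts) + '\n').encode('utf-8')

# Demo with a stub fetcher standing in for the RPC call.
def fake_fetch(n):
    return {'block_num': n, 'transactions': []}

chunk = assemble_chunk(fake_fetch, 1, 3)
print(len(chunk.splitlines()))  # prints 3
```

A joining script would then simply concatenate the chunk files in block-number order to regenerate the log.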

In fact, about the only widely available application these days that lets you synchronise very large files while avoiding rewriting the unchanged parts is BitTorrent.

So as part of my witness services, I will be providing 7 daily-updated torrents of the block_log, rotated so there is always a copy you can spend up to a week downloading before you need to fetch a newer .torrent file, remove the old torrent and re-add the new one, leaving the existing file in place. The scripts below show how this progressive updating will be applied, and I will configure Nginx to also serve the 7 .torrent files for the last 7 days so you can easily find them.

The scripts

First, for testing purposes, a script that takes a file and makes a copy truncated by an arbitrary number of bytes, so you can demonstrate the reassembly process performed by the second script:

truncate.py

#!/usr/bin/python3
# This script makes a copy of a file minus an arbitrary number of bytes at the end
import os
import argparse

parser = argparse.ArgumentParser ( description = 'Copies a file except for some number of bytes at the end' )
parser.add_argument ( 'source', metavar = 'SOURCE', type = str, help = 'The file you want to copy sans data at the end' )
parser.add_argument ( 'dest', metavar = 'DEST', type = str, help = 'The name of the new truncated copy' )
parser.add_argument ( 'trunc', metavar = 'TRUNC', type = int, help = 'The amount in bytes you want to truncate from the original file' )

args = parser.parse_args ( )

print( 'Creating a truncated copy of ' + args.source + ' to file ' + args.dest + ' with ' + str ( args.trunc ) + ' bytes truncated' )

source = open ( args.source, 'rb' )
dest = open (args.dest, 'wb' )
statinfo = os.stat ( args.source )
print ( "Source file is " + str ( statinfo.st_size ) + ' bytes in size' )
payloadsize = statinfo.st_size - args.trunc
print ( "Copying " + str ( payloadsize ) + " of " + args.source )
source.seek ( 0 )
dest.write ( source.read ( payloadsize ) )

source.close ( )
dest.close ( )

The next script takes the 'current' and 'old' (up-to-date and out-of-date) copies of a file, and appends the new extra data from the 'current' file to the end of the 'old' version, bringing it up to date:

update-append.py

#!/usr/bin/python3
# This script copies the additional data in a log (added but never changed)
# from a current version to the end of an out of date version, making
# both files the same
import os
import argparse

parser = argparse.ArgumentParser ( description = 'Updates a log file with new content appended to it' )
parser.add_argument ( 'current', metavar = 'CURRENT', type = str, help = 'The file with new additional content' )
parser.add_argument ( 'old', metavar = 'OLD', type = str, help = 'The file you want to append new content to' )

args = parser.parse_args ( )

print( 'Copying new data from ' + args.current + ' to file ' + args.old )

current = open ( args.current, 'rb' )
old = open (args.old, 'ab' )
currentinfo = os.stat ( args.current )
oldinfo = os.stat ( args.old )
print ( 'current is ' + str ( currentinfo.st_size ) + ' bytes' )
print ( 'old is ' + str ( oldinfo.st_size ) + ' bytes' )
seekstart = oldinfo.st_size
copysize = currentinfo.st_size - oldinfo.st_size
print ( 'Copying from byte ' + str ( seekstart ) + ' of ' + args.current + ' and appending to file ' + args.old + ' a total of ' + str ( copysize ) + ' bytes' )
current.seek ( seekstart )
old.write ( current.read ( copysize ) )

current.close ( )
old.close ( )

Here is the result of running a test sequence with these two scripts on an arbitrary file:

 loki@vaioe  ~  ./truncate.py inception.tar.xz inception.tar.xz.1 10000
Creating a truncated copy of inception.tar.xz to file inception.tar.xz.1 with 10000 bytes truncated
Source file is 43601920 bytes in size
Copying 43591920 of inception.tar.xz
 loki@vaioe  ~  ls -l inception.tar.xz*
-rw-rw-r-- 1 loki loki 43601920 feb 24 08:03 inception.tar.xz
-rw-rw-r-- 1 loki loki 43591920 mrt  6 09:56 inception.tar.xz.1
 loki@vaioe  ~  sha1sum inception.tar.xz*
d69f893b56ff6bfcc20146be9c01765f2eac6de9  inception.tar.xz
a3ff6ce812c666bbf54bc2ff520288272144a9ca  inception.tar.xz.1
 loki@vaioe  ~  ./update-append.py inception.tar.xz inception.tar.xz.1 
Copying new data from inception.tar.xz to file inception.tar.xz.1
current is 43601920 bytes
old is 43591920 bytes
Copying from byte 43591920 of inception.tar.xz and appending to file inception.tar.xz.1 a total of 10000 bytes
 loki@vaioe  ~  sha1sum inception.tar.xz*                           
d69f893b56ff6bfcc20146be9c01765f2eac6de9  inception.tar.xz
d69f893b56ff6bfcc20146be9c01765f2eac6de9  inception.tar.xz.1
 loki@vaioe  ~  

Update:

The script works perfectly on differently updated copies of the Steem block_log. Below is a test I performed on (a copy of) the current live version from the RPC node and the copy from the torrent in my earlier post. Next I will look into how to do this across a network connection; I think it only requires altering the script to write to stdout instead of a file, so the data can be streamed through a pipe over ssh. A modified version could achieve the same result with a partial HTTP or FTP transfer as well.

The main thing is that this works on the blockchain file, so there is now a way to rapidly sync a copy that has fallen out of date.

loki@projectinception:~/test$ update-append.py current old
Copying new data from current to file old
current is 8194130879 bytes
old is 8159354102 bytes
Copying from byte 8159354102 of current and appending to file old a total of 34776777 bytes
loki@projectinception:~/test$ sha1sum 
current  old      
loki@projectinception:~/test$ sha1sum *
311f613137c9300a56ebcce343f08330cb6dffed  current
311f613137c9300a56ebcce343f08330cb6dffed  old
loki@projectinception:~/test$ 
😎


We can't code here! This is Whale country!

Vote #1 l0k1

Go to steemit.com/~witnesses to cast your vote by typing l0k1 into the text entry at the bottom of the leaderboard.

(note, my username is spelled El Zero Kay One or Lima Zero Kilo One, all lower case)
