Steemd v0.16.0 & ChainBase - I/O Issues & Possible Solutions

bhuz (60)in #witness-category • 9 years ago (edited)

New Steemd Version - v0.16.0

It's been a week since the last HardFork took place. As you all know by now, the new version brought a few changes to some key aspects of the Steem economy.

Still, you may not know about another more technical change brought by the new version.
As a matter of fact, v0.16.0 also officially introduced ChainBase, described in its GitHub repo as "a fast version controlled, transactional database, based upon memory mapped boost::multi_index containers".

I am not going into how ChainBase works or any other details. Let's just say that it is a core part of how the new steemd manages blocks and everything else needed for both consensus and non-consensus information.

Besides various benefits, unfortunately, there seems to be at least one specific drawback.
As some of you may have noticed, over the last few days the number of random missed blocks has been considerably higher than usual, both for top witnesses and for backup nodes.

It is important to point out that this did not affect the stability of the network or the Steem blockchain.

Still, from a witness's perspective, it is pretty annoying behavior.

The following should mainly be considered as opinions backed by some test results, which may be influenced by my system components (both hardware type and software version).
What seems meaningful on my system could be totally meaningless on yours, and some of the suggested changes that work for me could even make things worse on your system.

The Root Cause And The Affected Systems

To keep things simple, we can say that, due to the use of a memory-mapped file, ChainBase performs a lot of disk I/O. Input/output operations are well known to be among the slowest types of operation, and usually the main bottleneck, in modern systems.

Since the sync and replay processes rely heavily on I/O, it is reasonable to expect some decrease in performance, meaning an increase in the time needed to complete those tasks.

Although to a lesser degree, these operations keep going during the normal "production" phase too. For this very reason, concurrently with high I/O spikes, steemd may be slow to respond. This seems to be the cause that leads the node to miss random blocks.
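One way to check whether missed blocks coincide with write-back activity is to watch how much dirty data the kernel is holding at that moment. The following is only a minimal sketch, assuming a Linux system that exposes the usual Dirty and Writeback fields in /proc/meminfo. The helper name meminfo_kb is made up for this example and is not part of steemd.

```shell
# Print the value (in kB) of one field from a meminfo-style file.
# $1 = field name (e.g. Dirty, Writeback)
# $2 = file to read (default: /proc/meminfo)
# meminfo_kb is a hypothetical helper name, not a steemd tool.
meminfo_kb() {
    awk -v key="$1:" '$1 == key { print $2 }' "${2:-/proc/meminfo}"
}

# Possible usage, run next to the steemd log (stop it with Ctrl-C):
#   while sleep 5; do
#       echo "$(date +%T) dirty=$(meminfo_kb Dirty)kB writeback=$(meminfo_kb Writeback)kB"
#   done
```

If the Dirty figure drops sharply right around the time a block is missed, that would be consistent with the explanation above, although it does not prove it.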

Systems on HDD

Specifically, on systems that rely on a common hard disk for storage (magnetic, non-SSD), both sync and replay can easily take days, literally.
These drives have fairly low read/write speeds and low IOPS (input/output operations per second), which can easily slow down the processes that rely on them.

Steemit, in one of their posts, suggests some "Platform Configuration Optimizations", mainly addressed to Linux users. These optimizations are:

(1) --flush = 100000
(2) echo    75 | sudo tee /proc/sys/vm/dirty_background_ratio 
    echo  1000 | sudo tee /proc/sys/vm/dirty_expire_centisecs
    echo    80 | sudo tee /proc/sys/vm/dirty_ratio
    echo 30000 | sudo tee /proc/sys/vm/dirty_writeback_centisecs

What are they and what do they do?

(1) --flush = 100000 - A steemd parameter that flushes the entire memory-mapped file to disk (via msync, which synchronizes a file with its memory map) approximately once every n blocks: about every 100k blocks with the default/recommended value.

Running some tests, other witnesses and I came to the conclusion that flush doesn't really do much. We couldn't see any correlation between the flush interval and the I/O spikes, so flush appears to have no practical effect.
For this reason, I would not bother with it, and I would simply set it to 0.

(2) - These are commands that change kernel settings in an attempt to better tune the virtual memory subsystem, specifically for steemd.

dirty_background_ratio - The percentage of system memory that, once dirty, causes the kernel to start writing to disk. This is a background process doing asynchronous writes, a non-blocking event.

dirty_ratio - The percentage of system memory that, once dirty, forces the process doing the writes to start flushing to disk itself. At this point, all new I/O blocks until the dirty pages have been written to disk. This is the process/application itself doing synchronous writes, a blocking event.

dirty_expire_centisecs - How long data can stay in cache before it has to be written to disk, in hundredths of a second.

dirty_writeback_centisecs - How often the kernel threads responsible for writing dirty pages to disk (pdflush) wake up to check whether there is work to do, in hundredths of a second.
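Before changing anything, it can help to record the values your system is currently using, so that you can go back to them later. A minimal sketch, assuming the four tunables are readable as plain files under /proc/sys/vm (the same paths used in the commands above). The function name show_dirty_settings is hypothetical.

```shell
# Print the four write-back tunables, one "name = value" per line.
# $1 = directory holding the tunables (default: /proc/sys/vm)
# show_dirty_settings is a hypothetical helper name.
show_dirty_settings() {
    dir="${1:-/proc/sys/vm}"
    for name in dirty_background_ratio dirty_ratio \
                dirty_expire_centisecs dirty_writeback_centisecs; do
        printf '%s = %s\n' "$name" "$(cat "$dir/$name")"
    done
}
```

Calling show_dirty_settings with no argument reads the live kernel values. Redirecting its output to a file gives you a simple record to restore from.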

Recap:

For performance reasons, data is usually not written out to disk immediately but is temporarily kept in cache instead (dirty pages).
Every dirty_writeback_centisecs, pdflush wakes up and writes to disk (for real) those dirty pages that have been in memory longer than dirty_expire_centisecs.
If, due to high I/O, dirty pages keep growing and hit dirty_background_ratio, the kernel will start writing to disk regardless of the above parameters (in the background, asynchronously).
If, in spite of the kernel's background flushing, dirty pages hit dirty_ratio, the application doing the writes (the one generating the high I/O that makes the dirty pages grow and hit the limit) will block, pausing all I/O until dirty pages fall back below the dirty_ratio value.
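To get a feel for what these percentages mean in practice, here is a rough worked example. It is a sketch under two stated assumptions: it applies the ratio to a fixed memory size (the kernel actually works from the memory it considers available, so real thresholds are lower), and it takes 10 and 20 as the usual upstream defaults for the two ratios, which your distribution may override. The 16 GB machine and the function name dirty_limit_mb are illustrative only.

```shell
# Rough size (in MB) at which a percentage threshold is reached.
# $1 = memory in MB, $2 = ratio in percent. Integer arithmetic, rounds down.
# dirty_limit_mb is a hypothetical helper name.
dirty_limit_mb() {
    echo $(( $1 * $2 / 100 ))
}

# Example for a 16 GB (16384 MB) machine.
# Values 75/80 are the recommended ones above; 10/20 are assumed defaults.
echo "recommended: background=$(dirty_limit_mb 16384 75) MB, blocking=$(dirty_limit_mb 16384 80) MB"
echo "defaults:    background=$(dirty_limit_mb 16384 10) MB, blocking=$(dirty_limit_mb 16384 20) MB"
# prints:
#   recommended: background=12288 MB, blocking=13107 MB
#   defaults:    background=1638 MB, blocking=3276 MB
```

In other words, with the recommended values the kernel may let roughly 12 GB of dirty data pile up before background flushing is forced, against well under 2 GB with the assumed defaults.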

Therefore:

In my understanding, the intent of these changes is to hold the kernel back and prevent it from writing cached data to disk too often. Compared to the default settings, these changes allow more data/dirty pages to remain in memory before the flushing process kicks in. Your disk drive will have to handle bigger writes, but much less frequently than before.

Do these changes really help?

Yes and No.

Let's start by considering the sync and replay tasks.
First, you should not expect miracles: hardware will still dictate your overall performance, so you will still be heavily limited by your HDD.

That said, from some tests it seems that these new settings can help reduce the time needed to perform sync and replay (again, do not expect miracles), but you will need more than 8GB of RAM to actually see some improvement. If your system has 8GB of RAM or less, you will probably see little meaningful difference, if any.

What about steemd in "production" mode?
It is my opinion that the recommended changes do not fit well once steemd enters its usual execution mode. I actually believe these changes to be "harmful" when running in production.

During sync and replay we can accept, and even benefit from, occasional slowdowns or freezes caused by large amounts of data being written to disk in one flush (instead of many flushes with less data each). I believe this same behavior would increase the chance of missing random blocks once sync/replay is complete.

Allowing more dirty pages to be kept in memory means that, once they need to be written out, the disk has to handle more data and needs more time to do so. If in the meantime steemd needs to read or write data on disk in order to generate a block, it will find the disk too busy to satisfy the requests in time. This leads to a failure to generate the block within the allowed slot time, and so to a shiny new missed block.

I think a scenario in which the disk has more frequent I/O but with less data to handle each time would be better for production mode. Each I/O would complete faster, and the overall disk load would be distributed more evenly over time.

Therefore, I would suggest lowering at least dirty_writeback_centisecs to a value between 500 and 3000 (5 and 30 seconds, respectively).
Another option would be simply switching back to the default kernel settings.
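The two options can be written down as named sets of values, so that switching between sync/replay and production becomes a single command. This is only a sketch: the profile names and the function vm_profile are made up for this example, the "production" set simply keeps the recommended values and lowers dirty_writeback_centisecs to 500, and the "defaults" set assumes the common upstream kernel defaults, which you should check against the values your own system shipped with.

```shell
# Print sysctl-style settings for a named profile (nothing is changed).
#   replay     = the values recommended above for sync/replay
#   production = same, but dirty_writeback_centisecs lowered to 500 (5 s)
#   defaults   = assumed upstream kernel defaults; verify on your distro
# vm_profile and the profile names are hypothetical.
vm_profile() {
    case "$1" in
        replay)     set -- 75 80 1000 30000 ;;
        production) set -- 75 80 1000 500 ;;
        defaults)   set -- 10 20 3000 500 ;;
        *) echo "usage: vm_profile replay|production|defaults" >&2; return 1 ;;
    esac
    printf 'vm.dirty_background_ratio=%s\n' "$1"
    printf 'vm.dirty_ratio=%s\n' "$2"
    printf 'vm.dirty_expire_centisecs=%s\n' "$3"
    printf 'vm.dirty_writeback_centisecs=%s\n' "$4"
}

# Possible usage (requires root):
#   vm_profile production | while read -r kv; do sudo sysctl -w "$kv"; done
```

Printing the settings rather than applying them directly keeps the function harmless to run and lets you review the values first.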

Systems on SSD

SSDs have far higher IOPS than HDDs. This allows them to satisfy many more I/O requests, and faster, than an HDD could.

If it is true that the new version brings more I/O load, and that this load may be high enough to cause issues on an HDD, it is also true that a common SSD should be able to handle that same load without too much effort, for the reason stated above.

This means that if your system is backed by a solid state drive, you probably did not experience all these I/O issues, and there is probably no reason to bother with VM kernel settings at all.
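If you are not sure which kind of drive your server is using, Linux exposes a hint in sysfs. The sketch below reads the queue/rotational attribute of a block device (1 for a spinning disk, 0 otherwise). Treat the answer as a hint only: on a VPS the virtual disk may report a value that does not reflect the real hardware. The function name disk_kind is hypothetical, and the device name (for example sda) is something you need to supply.

```shell
# Report whether a block device looks like a spinning disk or an SSD.
# $1 = device name (e.g. sda), $2 = sysfs root (default: /sys/block)
# disk_kind is a hypothetical helper name.
disk_kind() {
    flag_file="${2:-/sys/block}/$1/queue/rotational"
    [ -r "$flag_file" ] || { echo "unknown"; return 1; }
    case "$(cat "$flag_file")" in
        1) echo "hdd" ;;
        0) echo "ssd" ;;
        *) echo "unknown" ;;
    esac
}

# Possible usage:
#   disk_kind sda
```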

It is worth noting, however, that some VPS providers limit the number of I/O operations allowed in a given time. So if you are not on a dedicated server, even if your system has an SSD, you could still run into I/O-related issues because of those limits.
Check with your VPS provider to see if that is the case, and increase your limits if that is an option.

Even if you are backed by a quality SSD on a dedicated server, I would still suggest you keep reading and look at the option below.

Shared Memory - /run/shm

Another option is to try to avoid the disk I/O issue entirely by storing the memory-mapped file directly in RAM. At the time of writing, it is my opinion that this is the best way to run steemd, and it could probably become a best practice.

I find this method funny because pre-v0.16.0, and so pre-ChainBase, steemd used to store the blockchain state in RAM. With v0.16.0 and the introduction of ChainBase, the devs decided (for good reasons) to change the architecture and store the blockchain state on disk instead, through a memory-mapped file.

Now, what other witnesses and I are suggesting you do is take that file and put it back in RAM! This is easy to do, thanks to the --shared-file-dir parameter, which allows you to specify the directory to use for the mapped files.

How To

For a witness node built as low_memory_node and with only the witness plugin activated, I recommend:

At least 8GB of RAM
At least 4GB of Swap

Check the size of /run/shm with:
df -h /run/shm
(by default, it should be equal to 50% of your RAM)

You probably need to mount it, and if its size is smaller than 12G, you should resize it too:
sudo mount -o remount,size=12G /run/shm

At this point, you should be ready to go. You can start steemd with:
./steemd --shared-file-dir /run/shm
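The size check above can be scripted so that it is easy to repeat after a reboot. A minimal sketch, assuming a df that supports the POSIX -P and -k options. The 12G target is the one used above, while the function names fs_size_gb and shm_advice are made up for this example.

```shell
# Total size of the filesystem holding $1, in whole GB (rounded down).
# fs_size_gb is a hypothetical helper name.
fs_size_gb() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $2 / 1024 / 1024 }'
}

# Compare a size against the 12G target and say what to do.
# $1 = current size in GB, $2 = mount point (default: /run/shm)
# Only prints advice; it never remounts anything itself.
shm_advice() {
    if [ "$1" -ge 12 ]; then
        echo "ok: ${2:-/run/shm} is ${1}G"
    else
        echo "resize: sudo mount -o remount,size=12G ${2:-/run/shm}"
    fi
}

# Possible usage:
#   shm_advice "$(fs_size_gb /run/shm)"
```

Keeping the measurement and the decision in separate functions means the decision can be checked with plain numbers, without touching the real mount.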

N.B.

The first time you run steemd with the "/run/shm" method, you will need to:

Copy shared_memory.bin and shared_memory.meta files from your data folder to /run/shm
(default data folder: witness_node_data_dir/blockchain)

or else

Start steemd forcing a replay that will rebuild the mapped files from your blockchain data (block_log file): ./steemd --shared-file-dir /run/shm --replay
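The choice between the two first-run options can be sketched as a small helper that looks for the two mapped files and prints the matching commands. It only prints them, so nothing is copied or started until you run the output yourself. It assumes steemd is stopped while the files are copied. The function name first_run_plan is hypothetical, and the data folder path is whatever your setup uses.

```shell
# Decide how to start steemd the first time the mapped files live in RAM.
# $1 = blockchain data folder (e.g. witness_node_data_dir/blockchain)
# $2 = shared-memory directory (default: /run/shm)
# Prints the commands instead of running them, so they can be reviewed first.
# first_run_plan is a hypothetical helper name.
first_run_plan() {
    src="$1"
    dst="${2:-/run/shm}"
    if [ -f "$src/shared_memory.bin" ] && [ -f "$src/shared_memory.meta" ]; then
        # Both mapped files exist: copy them, then start normally.
        echo "cp $src/shared_memory.bin $src/shared_memory.meta $dst/"
        echo "./steemd --shared-file-dir $dst"
    else
        # Nothing to copy: rebuild the mapped files from the block_log.
        echo "./steemd --shared-file-dir $dst --replay"
    fi
}

# Possible usage:
#   first_run_plan witness_node_data_dir/blockchain
```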

Additional Info

You can find other useful information in the following posts and in their comments too:
Best Practice Running Steemd v0.16.0 - by @abit
Witness update - by @aizensou that includes a nice trick by @smooth

Improvements To Come

There is one specific GitHub issue, and its related Pull Request, that I am really looking forward to seeing resolved and merged.

Why?

Currently, blockchain data is stored both in the block_log (the actual blockchain file) and in the shared_memory.bin mapped file (which contains the blockchain-state data).
This means:

There are useless I/O operations due to storing the same block data in both files, with the likely consequence of unnecessarily increasing the number of dirty pages that need to be flushed, which may lead to additional I/O.
The mapped file size is largely (and unnecessarily) affected by the blockchain size.

Once this issue is resolved and the code changes are merged, I think we can expect some reduction in overall I/O. Furthermore, running steemd with the /run/shm method should require less RAM: 4GB could be enough at that point.
(This sounds especially good for backup witnesses.)

I would like to thank my fellow witnesses who spent a lot of time running tests and with whom I have been able to compare data and results. A very special thanks to (sorted by name):
@abit, @arhag, @smooth.

#steem #steemd #chainbase