Exploring Steem Scalability
In this post, we will address some of the concerns that have been raised regarding the increasing RAM usage of steemd nodes, as well as our future scaling plans. While the challenges associated with scaling are not something we will ever take lightly, we also think that many of the concerns have been raised due to some misunderstandings about how to properly/optimally operate steemd nodes. We will provide some guidance on this in the sections below, and we will also talk about several changes that we have in the pipeline for addressing our future projected growth.
What is Scalability?
The Steem community is growing rapidly, and the Steem blockchain is growing with it. Growth is great, but it brings scaling challenges. Other projects (such as Bitcoin and Ethereum) have been stuck at a standstill with their scaling problems for years - unable to adopt any significant changes to meet the growing demands that increased usage has placed on their blockchains. Steem, on the other hand, has continued to evolve rapidly and is meeting these challenges head on, which enables it to process more transactions than all other blockchains combined. In other words, the majority of blockchain transactions occurring globally are being done on Steem.
We’ve been able to do this because our team is made up of an ever-growing roster of the most talented and innovative blockchain engineers on the planet. This doesn’t make us cocky; it makes us acutely aware of the scaling challenges in front of us, and we want to assure you that we are adequately prepared to deal with them. While we are confident in our strategy, we are also eager to hear your thoughts, objections, and insights in the comments.
A Brief History of Scaling
The most critical decision with respect to scaling is where you start. The more scalable the foundation, the more scalable the stack. A stack’s capacity tends to grow, at best, exponentially from its starting point. It is incredibly rare for an architecture to go from being able to support 3,000 people to 3,000,000 people overnight. Instead, it doubles step by step: from 3 to 6 to 12, and so on. Starting from an architecture that was already far ahead of the pack in terms of scalability (Graphene) was a critical component of the scaling strategy. Those that failed to make similar decisions now find themselves in the difficult position of having to rebuild their foundation without damaging the entire ecosystem that was built on top of it.
ChainBase and AppBase
The first major scalability-related upgrade was the replacement of Graphene with ChainBase. Thanks to its faster load and exit times, and increased robustness against crashes, ChainBase was critical to enabling Steem to process its current volume of transactions.
The next major improvement that is nearing completion (thanks to the hard work of @vandeberg and the blockchain team) is AppBase, which further improves Steem’s overall scalability through modularization. AppBase will allow many components of the Steem blockchain to run independently, which will permit steemd to take better advantage of the multithreaded nature of computers, and even enable different components of the blockchain to be run on different servers - reducing the need to run the Steem blockchain on individual “high powered, high cost” servers.
Optimizing Steemd Nodes: Block Log + State File
To operate a steemd node today, it is critical to understand that Steem requires two data stores: the block log and the state file. The block log is the blockchain itself, written to disk. It is accessed infrequently, but is critical to verifying the integrity of new blocks and reindexing the state file if needed.
The state file contains the current state of Steem objects, such as account balances, posts, and votes. It is backed by disk, but accessed via a technique called memory-mapped files. This technique was introduced in December 2016 with the release of ChainBase.
Many node operators are suggesting that servers should have enough RAM to hold the entire Steem state file, due to the fact that Steem's performance drops when the operating system begins “paging” Steem's memory, which is a common memory management technique. We want to be very clear that it is not required to run a steemd node in this way. This is certainly a valid technique for increasing the performance of reindexing the node and servicing API calls, but is only useful in a limited number of cases. In the majority of cases (including with witness, seed, and exchange nodes), it is sufficient to store the shared memory file on a fast SSD or NVMe drive, instead of in RAM.
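The memory-mapped file technique described above can be illustrated with a minimal sketch. This is not how ChainBase is implemented; it only uses Python's standard mmap module to show a file that lives on disk but is read and written as if it were memory. The function name and the temporary file are hypothetical and exist only for this illustration.

```python
import mmap
import os
import tempfile


def demo_memory_mapped_state(size_bytes: int = 4096) -> bytes:
    """Write through a memory mapping, then read the bytes back from disk."""
    # Hypothetical stand-in for a state file; not steemd's actual file.
    fd, path = tempfile.mkstemp(suffix=".bin")
    try:
        # Pre-size the file: a mapping cannot extend past the file's length.
        os.ftruncate(fd, size_bytes)
        with mmap.mmap(fd, size_bytes) as mapped:
            # This looks like a memory write. The operating system decides
            # when the touched pages are written to (or paged in from) disk.
            mapped[0:5] = b"steem"
            # Ask the OS to write dirty pages to the backing file now.
            mapped.flush()
        os.close(fd)
        fd = None
        # Read through the ordinary file API to show the data is on disk.
        with open(path, "rb") as f:
            return f.read(5)
    finally:
        if fd is not None:
            os.close(fd)
        os.remove(path)
```

When the mapped file is larger than the available RAM, the operating system has to page parts of it in and out, which is why the speed of the backing drive matters so much for nodes that do not hold the whole state file in memory.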
Witness and Seed Node RAM Requirements
When running a steemd node with only the witness plugin enabled (the common configuration for witness and seed nodes), Steemit recommends 16 GB of RAM, although 8 GB is likely sufficient if your node does not need to reindex often. If the shared memory file is stored in /dev/shm/, then additional RAM would be needed to hold the entire state file, but this is not a recommended configuration. To avoid the need for extra RAM, the shared memory file can be stored directly on a fast SSD or NVMe drive.
A server with 8-16 GB of RAM will be slow to reindex, but it will function properly as a seed/witness node once it is up to date with the latest block. Running on a 32 GB server is ideal for optimal replay times, but it is not a requirement for a witness/seed node to operate properly.
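The RAM guidance for witness and seed nodes can be summarized as a small lookup. This is only a sketch of the figures quoted in this section; the function name and the wording of the returned labels are hypothetical, and the thresholds will change as the chain grows.

```python
def witness_node_ram_assessment(ram_gb: float) -> str:
    """Map a witness/seed node's RAM to the guidance given in this post."""
    if ram_gb < 8:
        # The post gives no guidance below 8 GB.
        return "below the figures discussed"
    if ram_gb < 16:
        return "likely sufficient if reindexing is rare; slow to reindex"
    if ram_gb < 32:
        return "meets the 16 GB recommendation; reindex slower than on 32 GB"
    return "ideal for replay times"
```

For example, `witness_node_ram_assessment(16)` reports that the node meets the recommendation, while still flagging that a 32 GB server would replay faster.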
Shared Memory File Size
The default configuration for a steemd node stores the shared memory file in the data/blockchain directory. As long as this location is on a fast enough (SSD or NVMe) drive with sufficient space, then the default setting should work.
The current recommendation is to have at least 150 GB of fast SSD storage, which includes the block_log (currently around 90 GB) and shared_memory.bin (currently around 33 GB). These amounts will increase over time.
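The storage figures above leave some headroom, which can be checked with simple arithmetic. The sketch below uses the sizes quoted in this post; the helper names are hypothetical, and treating a GB as 10^9 bytes is an assumption, since the post does not say whether it means GB or GiB.

```python
import shutil

# Assumption: decimal gigabytes. The post does not specify GB vs GiB.
BYTES_PER_GB = 10**9


def storage_headroom_gb(block_log_gb: float, shared_memory_gb: float,
                        provisioned_gb: float) -> float:
    """Return how much provisioned space remains after both files."""
    return provisioned_gb - (block_log_gb + shared_memory_gb)


def has_enough_free_space(path: str, required_gb: float) -> bool:
    """Check whether the drive holding `path` has the required free space."""
    usage = shutil.disk_usage(path)
    return usage.free >= required_gb * BYTES_PER_GB
```

With the current figures, `storage_headroom_gb(90, 33, 150)` gives 27 GB of headroom, which is the margin available for growth before the 150 GB recommendation has to be raised.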
Until now, whenever the shared memory file has grown beyond the size configured in the config.ini file, it has been necessary to update the configuration to a larger size and restart the node. The next release (Steem 19.4) will include a change that automatically increases this limit as needed, without the need to restart the node. This behavior will be configurable and can be turned off entirely if you want to keep your state file in
“Full Node” Requirements
Nodes that are running additional API plugins (especially account history) will require more RAM to support a larger state file. A “full node” (one that is running all of the plugins) can technically run on a 64 GB server, but it will be extremely slow to reindex, and it will be slow at serving API calls because the operating system paging algorithm does not handle memory-mapped files very well. A node with 64-256 GB RAM and a fast SSD/NVMe drive may be adequate for many use cases, depending on the load.
Increasing Performance on High Use Nodes
For more heavily used nodes, the best way (currently) to increase performance is to have enough RAM to hold the entire database. This skips the need for paging altogether, which technically defeats the purpose of having a memory-mapped file. For a node running all of the plugins except account history, this currently requires 256 GB RAM on a pre-AppBase node.
A technique that we have been using to lower the memory requirements of a “full node” (one with everything including account history) is to split the API node into two servers. One server runs only “account history,” and the other server runs everything else. This allows both servers to use less than 256 GB RAM, instead of running everything on a 512 GB RAM server. We strongly recommend running account history on a dedicated node if you want a complete history for all accounts, since it eliminates the need to have a single 512 GB RAM server.
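The two-server split described above implies that something in front of the nodes has to decide which server answers each API call. A minimal sketch of that decision follows. The endpoint URLs are hypothetical placeholders, and the set of methods that depend on the account history plugin is an assumption made for illustration; it is not a description of Steemit's actual routing setup.

```python
# Hypothetical endpoints for the two servers; replace with real addresses.
ACCOUNT_HISTORY_NODE = "http://history.example.internal:8090"
GENERAL_API_NODE = "http://api.example.internal:8090"

# Assumption: the API methods that need the account history plugin.
ACCOUNT_HISTORY_METHODS = frozenset({"get_account_history"})


def pick_backend(method: str) -> str:
    """Return the server that should handle the given API method name."""
    # Accept both bare names and dotted "api_name.method" forms.
    name = method.rsplit(".", 1)[-1]
    if name in ACCOUNT_HISTORY_METHODS:
        return ACCOUNT_HISTORY_NODE
    return GENERAL_API_NODE
```

Keeping the routing rule in one small function means the account history server only ever sees the calls it is provisioned for, and every other call goes to the server running the remaining plugins.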
Optimizing the use case of a “full node” is a top priority of ours, and one that we will talk about more in the next section. If you only need history for certain accounts though, or only care about certain operations, the hardware requirements may already be significantly reduced.
Future Scaling Plans
We are currently working on several projects that will reduce the memory requirements of “full nodes” by moving much of the API logic into non-consensus plugins such as HiveMind and SBDS. This will allow a lot of the functionality to be run off of SSD storage rather than in RAM, which will lower operating costs. By offloading data to HiveMind/SBDS and/or RocksDB (below), we should be able to bring the requirements for a full node down to those of a consensus/seed node, which is an important goal of ours.
In addition to the non-consensus plugins, we have begun research on using alternative data stores and moving away from ChainBase. One such data store that has shown promise is RocksDB.
RocksDB is a fast on-disk data store with an advanced caching layer, which could further minimize latency when reading from and writing to disk, as it is optimized for fast, low-latency storage. Used in production systems at multiple web-scale enterprises (Facebook, Yahoo, LinkedIn), RocksDB is based on LevelDB but offers increased performance thanks to its ability to exploit multiple CPU cores and SSD storage for input/output-bound workloads. Its use in MyRocks, for example, led to less SSD storage use, longer SSD endurance, and more available IO capacity for handling queries.
We are also working to modularize the blockchain beyond even what was originally planned for the initial AppBase implementation, for example, by having separate services that can be run on different servers. This will allow processes to be further spread across many small servers, increasing flexibility and decreasing cost.
As blockchain projects continue to become more mainstream, scalability is going to become more and more of a concern. Being a scalable blockchain is not just about being able to make a one-time fix to meet the current resource challenges. It is about being prepared to meet the future challenges as well.
Steem has already proven itself as the fastest and most heavily transacted public blockchain in existence, and scalability remains a top focus of ours. We know that scaling challenges will never completely go away, which is why we plan to continue innovating so that we'll be ready for whatever growth comes our way.
P.S. Don't forget to share your thoughts, objections, and insights in the comments!