Sparkster Development Update: Week Ending August 10th, 2018.


In this series, we will detail to the reader the challenges and research we are undertaking to build our blockchain technology; this will serve as a status update and provide the reader with a clear outline of where we are headed and where we have been.


The language

The first question that arose during the preliminary research phase was that of which programming language to use. We had two possibilities here:

(1) an interpreted language, and
(2) a “native” or “compiled” language.

We evaluated many languages of both types against benchmark specifications and found that the well-known rule holds: compiled languages generally outperform interpreted ones. For this reason, we eliminated Python, Java, and the .NET languages as viable choices for building our blockchain components.

We ended up with two possibilities at this point: Go or C++. Both are compiled languages, compiling down to native machine code that the processor executes directly. Go was developed by Google and aims to eliminate many of the complexities associated with C++ by providing automatic memory management through garbage collection and by doing away with the often obscure syntax C++ is known for. In other words, Go is a managed language with performance close to that of C++ and syntax nearly as simple as Python's. Arguably, this lets developers work more efficiently.

While C++ is generally not preferred over Go for rapid application development, our goal was to achieve the best possible performance, and we were willing to sacrifice simplicity to realize it. We therefore turned our attention to C++. The consensus was that although both Go and C++ compile down to native code, C++ still outperforms Go in many cases because it omits managed features such as bounds checking on arrays to prevent buffer overflows. Further, C++ does not perform automatic garbage collection, which improves performance considerably: there is no point in a C++ program's execution cycle where the program must stop to lock an object table and clear unused objects (except when using shared_ptr, which is optional).
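
To make that last point concrete, here is a minimal C++ sketch (the Record type and variable names are hypothetical, not taken from our codebase): reference counting only happens when the developer opts into shared_ptr, while ordinary objects and unique_ptr carry no such overhead and no collection pauses.

```cpp
#include <memory>

struct Record { int age = 0; };

int main() {
    // No garbage collector: this object is destroyed deterministically when
    // it goes out of scope, with no collection pause and no bookkeeping.
    std::unique_ptr<Record> exclusive = std::make_unique<Record>();
    exclusive->age = 50;

    // Reference counting is opt-in: only shared_ptr maintains a count, and
    // only for the objects that actually use it.
    std::shared_ptr<Record> shared = std::make_shared<Record>();
    shared->age = 50;
    std::shared_ptr<Record> alias = shared; // use count becomes 2
    return 0;
}
```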

So while C++ lacks the safety checks and other conveniences of managed languages, its performance benefits from their absence. Moreover, bounds checking is often unnecessary: when the memory addresses being accessed are derived directly from the loops that access them, a carefully written loop cannot overrun its buffer. This does require the developer to design loops more carefully than they would in a managed environment, which can slow development somewhat, but since we are aiming for performance over built-in memory management, we have chosen C++ as our primary language.
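
As an illustration of the loop argument (a minimal sketch with hypothetical names, not production code): because the loop bound is derived from the container's own size, every index dereferenced is guaranteed valid, so a per-access bounds check would only add cost.

```cpp
#include <cstddef>
#include <vector>

// Sum a buffer without per-element bounds checks. The loop condition is
// tied directly to the vector's size, so the index can never overrun the
// underlying memory.
long long sum(const std::vector<int>& values) {
    long long total = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        total += values[i]; // operator[] performs no bounds check
    }
    return total;
}
```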

Distributed Data Store

The next challenge we faced was the distributed datastore. Running code on the blockchain is a simple task, since we can write a program to execute code, but storing data cannot be handled the way it is in a centralized cloud computing environment. We needed a decentralized datastore, and not a pseudo-decentralized one: in a distributed database, the data is spread across multiple systems, but a central controller routes every request to the node that holds the data. As the name suggests, such a system is still centralized because of the controller.

For our purposes, we needed an actual decentralized datastore where any node can be asked to fetch data from the network; we are now running such a technology.

The challenge with a decentralized datastore is how to efficiently retrieve data and store data, since we do not know ahead of time where the data lives on the network, and whether or not the data even exists. The first problem is easily solved, so we will turn our attention to the second problem, that of the existence of the data itself.

Suppose we want to retrieve a record A from the datastore. We represent all data as a directed acyclic graph (DAG). We have two options here: we can either visit every node on the graph, performing a depth-first search until we find the data we want, or we can index the data.

If we were to visit and check every node to retrieve one piece of data, the retrieval time would grow directly with the amount of data stored. The problem with this O(m + n) complexity, where m is the number of nodes and n is the number of edges, is that eventually the graph becomes large enough that retrieval takes an impossibly long time, so we dropped this approach.
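
For reference, such a search looks roughly like the sketch below (the node and key types are placeholders, not our actual datastore structures). In the worst case it touches every node and every edge, which is where the O(m + n) cost comes from.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Placeholder node: each node holds a record key and edges to child nodes.
struct Node {
    std::string key;
    std::vector<const Node*> children;
};

// Depth-first search for a record. The worst case visits every node (m)
// and traverses every edge (n), hence O(m + n).
const Node* find(const Node* node, const std::string& wanted,
                 std::unordered_set<const Node*>& visited) {
    if (node == nullptr || !visited.insert(node).second) return nullptr;
    if (node->key == wanted) return node;
    for (const Node* child : node->children) {
        if (const Node* hit = find(child, wanted, visited)) return hit;
    }
    return nullptr;
}
```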

Next, we turned our attention to indexing. When data is indexed, an efficient structure is created that points to the data (most commonly a hash table). When a system wants to retrieve a record given some criteria, it looks in the index and uses it to (hopefully) access the collection of data in O(1) time on average. Of course, this depends on how well the data is indexed and whether an index is even available. It is important to note that distributed datastores do not automatically support indexing, so we had to build it ourselves.
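
A minimal way to picture such an index (illustrative only; the field and identifier types are assumptions) is a hash table mapping an indexed field name to the locations of the records that carry that field, giving average-case O(1) access to the collection:

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical record identifier within the datastore.
using RecordId = std::string;

// Index: field name -> ids of records that contain that field.
// Average-case O(1) lookup of the collection for a given field.
std::unordered_map<std::string, std::vector<RecordId>> fieldIndex;

void addToIndex(const std::string& field, const RecordId& id) {
    fieldIndex[field].push_back(id);
}

const std::vector<RecordId>* lookup(const std::string& field) {
    auto it = fieldIndex.find(field);        // O(1) on average
    return it == fieldIndex.end() ? nullptr : &it->second;
}
```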

Once the set of data associated with a particular index is known, we again have two options. To find records matching a condition such as "where age = 50", we can easily get the list of all age values by looking in the index; but to find the records where age equals 50, we can either visit every collection that has a unique age value or search the indexed values more efficiently. In the first case, we again run into the problem of execution time growing linearly, this time with the number of unique age values, giving a complexity of O(p) where p is the number of unique age values.

A second approach is to perform a binary search on the unique values. Because of the way we have designed our indexes, they are always in sorted order, so a search that performs best on sorted data is ideal: a binary search halves the collection until the particular piece of data is found or it is determined that the data does not exist. This approach gives a complexity of O(log p), meaning the run-time cost of the search grows logarithmically with the number of unique values on a particular index.
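
In C++ terms, the lookup over the sorted unique values can be sketched roughly as follows (a hypothetical example, not our implementation); std::lower_bound halves the search range at each step, giving the O(log p) behaviour described above.

```cpp
#include <algorithm>
#include <vector>

// The unique values of an indexed field, kept in sorted order.
// Returns true if 'wanted' exists among them, using binary search: O(log p).
bool containsValue(const std::vector<int>& sortedUniqueValues, int wanted) {
    auto it = std::lower_bound(sortedUniqueValues.begin(),
                               sortedUniqueValues.end(), wanted);
    return it != sortedUniqueValues.end() && *it == wanted;
}
```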

At this point, the reader might ask why we did not use a hash table to store the unique values, as we did for the index field names themselves. The answer lies in range queries. Suppose we wanted to fetch the collection of records where age is less than 50. We could not do this efficiently with a hash table, because hash tables are unordered by nature; ordered hash tables certainly exist, but accessing their data is not as efficient as with an unordered hash table. So if we inserted the unique age values into a hash table, we would incur a penalty whenever we looked up a region of data. The binary search, by contrast, has a slow enough growth curve that we chose it for all lookups on unique index values, since we believe that most of the time we will want a collection of data rather than a single record. In short, hash tables are efficient for index field lookups, but for unique-value lookups on an index we often need the data ordered so that we can quickly collect a region of the datastore.
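
The range case is where the sorted index pays off. A rough sketch (hypothetical types again): to collect everything with age less than 50, one binary search finds the cut-off position, and the answer is simply the contiguous prefix before it, rather than a probe for every possible value as a hash table would require.

```cpp
#include <algorithm>
#include <vector>

// Given the sorted unique age values, return all values strictly less than
// 'limit'. One O(log p) binary search locates the boundary, and the result
// is the prefix of the sorted array before that point.
std::vector<int> valuesBelow(const std::vector<int>& sortedAges, int limit) {
    auto cut = std::lower_bound(sortedAges.begin(), sortedAges.end(), limit);
    return std::vector<int>(sortedAges.begin(), cut);
}
```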

We will continue to update the reader every week with new progress on the development of our blockchain.

Munawar Bijani
Core Developer, Sparkster.me
