Sparkster Development Update. WE August 17th 2018.

sparkster (35)in #development • 6 years ago (edited)

Last week, we detailed some challenges and problems we were solving with respect to our block chain. In this article, we will continue this discussion and focus specifically on the challenges and progress we have made this week to update on where we stand with the engineering process of our block chain technology.

We ended last week with a theoretical discussion on indexes and how we intend to handle quick retrieval and storage of data in our block chain. We will proceed to detail this process further, coinciding with our implementation.

In order to successfully utilize our indexes, we have to take a C-style expression like (“amount”<50) and evaluate it against our datastore.

This expression’s goal is to give us all records that contain an amount field whose value is less than 50. The problem here is that the expression is fed to the system in what is known as infix form. This is no problem for a simple expression such as what we have shown above. Imagine, however, if we had an expression such as (“age”<30 OR “height” > 5 feet AND “hometown” = US). Following the rules of Boolean Algebra, the AND takes precedence over the OR. This is no problem for us as humans, but poses issues for a computer. The reason is that the computer must first scan the expression and find the highest precedence operator (AND.) Then it must compute “height” > 5 feet AND “hometown” = US (we’ll call this B); after this, it must scan the expression again, looking for the next highest-precedence operator (the OR.) It will then compute “age” < 30 OR B.

As we can see, the computer must perform two passes on the expression to properly compute the result, a process that can become expensive as the expression grows. We can solve this by separating the expression into parts delimited by logical operators (so part A will be (“age”<30), part B will be (“height”>5 feet), and part C will be (“hometown”=US)); however, we can further introduce complex precedence rules such as parentheses, causing the computer to scan the expression even more times, and rendering the method of tokenizing by logical operators insufficient, because now the computer has to consider if the operators appear inside sets of parentheses before tokenizing them.

A better solution to this problem is to convert the expression from its infix form to its postfix form. In postfix form, operators are ordered left to right according to their precedence, and appear after the operands on which they operate. The computer only has to perform one pass over the expression to evaluate it, saving many CPU cycles.

Consider the expression (“age”<30 OR “height” > 5 feet AND “hometown” = US) again. In postfix form, the expression looks like: "age" 30 < "height" 5feet > "hometown" US = AND OR. Here, the computer will consume the “age” and 30 operands. When it finds the less-than symbol, it will compute “is age less than 30?” We’ll call this result A. Next it consumes the “height” and 5 feet operands. When it finds the greater-than symbol, it will compute “is height greater than 5 feet?” We’ll call this result B. Next, it will consume the “hometown” and US operands. When it finds the equals-to symbol, it will compute “is the hometown equal to US?” We’ll call this result C. Next, the computer consumes the AND symbol, so it computes (B AND C.) Finally, it consumes an OR symbol, so it computes (A OR [B AND C]).

It follows from this discussion that if we want an efficient filter condition evaluator, our first goal is to convert our infix expression to a postfix expression. The rules for doing this are well documented, so we will leave it up to the reader to look up the details of the procedure and will not elaborate on them here.

Once we convert our filter conditions to postfix form, we have to build an expression tree. An expression tree is a representation of an expression in tree-like form. The benefit to constructing an expression tree is that the expression is now in a form such that all the operands are along the bottom of the tree (called the leaf nodes.) Evaluating the expression becomes as simple as traveling down the tree and evaluating each left and right subtree, performing the operation dictated by the parent tree whose left and right children we are traveling down.

With respect to indexing, one more problem we have to solve is the case of filtering by fields that are not indexed. If a field is indexed, we can perform a quick look-up on the index; however, if the user wants to filter by “age” and age is not a field with an index, we run into a problem because there is no quick way for us to achieve a filter by age at this point.

Our solution is to prioritize based on index. For example, suppose the user wants to filter by “hometown” and “age.” The field “hometown” has an index, and “age” does not. In this case, we will first filter by “hometown,” which will give us the set of all records whose “hometown” values are set to “US.” In order to achieve this, we must know, before we begin to evaluate the expression, which fields are indexed and which fields are not indexed. We do this by examining the datastore while converting our expression into postfix form. We proceed to mark those fields which have indexes. We then execute a partial expression against the datastore and defer the rest of the filter expression to only be executed on the set of records returned by the partial filter.

One last area of optimization we need to complete is called short-circuit evaluation. While almost all programming languages implement it, we are building our own expression evaluator so we must write the logic for this manually. In short-circuit evaluation, if a computer determines that a logic expression will not hold before it is finished evaluating it, it will not continue to evaluate the expression further. An example is the expression A and B. Suppose here that A evaluates to false. There is no need to evaluate B because A is false, and the expression (false and B) is always false no matter what B evaluates to. Another example is the expression A or B. If A is true, we do not need to evaluate B because (true or B) is always true. In our case, implementing short-circuit evaluation will amount to not evaluating the right subtree of a logical operator in our expression tree, and will eliminate an extra hit to our datastore.

While some of us are working on the indexing problem, some of us are hard at work on our virtual machine. The virtual machine’s job is to execute code on our block chain. We’ve completed draft one of the VM and are able to execute code quickly, but we’re still optimizing the execution and trying to have the code execute quicker than it does now, even though we’re already seeing millisecond execution times.

Our next task with the virtual machine is to work on the communication channel between it and the storage nodes, since the storage nodes will be responsible for data IO. We must select a communications protocol that is both fast and reliable, so standard HTTP is not an option for us.

This concludes our update for this week; we will update the reader again with another progress report next week.

#sparkster #blockchain #decentralized #cloud