Bifurcation Array Search Trees - a novel memory efficient self balancing binary search tree algorithmsteemCreated with Sketch.

in #programming6 years ago (edited)

I am using Steem primarily as a kind of developer journal now, and I had delayed writing too much about this new idea before I got it quite clear exactly how it works.

What is this for?

First thing to explain is that a binary search tree is primarily a way to keep a dynamic collection of data in an order that can be searched. You could of course just keep it in a list, but every time you add something, you would have to re-sort it. This would get ridiculously expensive the larger the tree gets.

Even if you used a linked list so you can just slot it into the chain, you would have to walk the whole list until you come to the two members that are lesser and greater than the new item you want to add. Maybe you can cut the total distance in half by knowing where the middle was, and starting at either end.

If you knew what your maximum possible set was, with every possible item (say, 32 bit integers) you could collapse the entire data structure into bits marking each location, and yes you would gain a 32-fold reduction in memory versus the entire set. The entire set is 232 which is 4,294,967,296 and would require 17,179,869,184 bytes of storage. 16 Gigabytes, so you could drop this down to 500Mb at the cost of having to walk that entire 500Mb of memory in order to collate your set. Probably this would not turn out to be terribly efficient either.

Enter the Self Balancing Binary Search Tree

The standard method for working with sets of data that you can sort from smallest to greatest is the Binary Search Tree and to improve search efficiency you want the branches of the tree to be as short as possible, thus the Self-Balancing part.

The problem is that if you are going to sort a piece of data that is quite small, for instance, less than 256 bits (32 bytes) you have to spend at least 8 bytes if you implement a one-way linked tree, that you always have to start from the root to climb it. More usually you put 3 references on each node, the parent, and the left child and the right child.

The cost of extra memory utilisation is entirely absurd when the size of your data is equal or smaller in size than the 32 bit references you use to store the structure. This would be using array indices, or otherwise boxing yourself into a 32 bit memory limit instead of using full 64 bit pointer references.

The maximum theoretical size of a standard 3-vector binary search tree storing 32 bits of data per node would weigh in at 16gb for every possible 32 bit value.

Why I am inventing a new Binary Search Tree Algorithm

I have invented a new graph-theoretic proof of work algorithm, based roughly on John Tromp's Cuckoo Cycle. In my variant, you start with a random 64 bit value, and then iteratively hash this value, and the value thus produced, and you are looking for a specific number of these that have the second half of one being the same as the first half of the next, around in a circle, without branching.

To find a solution you precisely have to sort your chain of hashes, and you have to search the hashes for half collisions , where one half of two separate elements of the hash chain have the same value (coordinate). Sort and search. What binary trees are made for.

Furthermore, as well as having to search for a specific number of hashchain elements forming a cycle, the cycle has to appear before a consensus maximum, as verification of the solutions has a bit of a cost on verifiers, so it's no good to find a solution that has any vector from a hashchain element that requires more iterations than the set limit. It is not expensive to generate values for a hashchain, since the verifier code will not have to retrieve or write any main memory, and because a miner has to search every element until the limit or they find a solution, the cost of searching is far far higher than verification.

It would of course be possible to use a simple sort algorithm that sorts your list only considering the first or last half of each vector, but this would be inordinately memory-expensive as well as computation expensive, compared to using a Binary Search Tree. However, a Binary Search tree is going to use a lot more memory than a solver that simply spews a hashchain into a huge chunk of memory, and then runs whatever is the optimal linear sort algorithm.

So, the options are, sort an array. Sort a linked list that will be 3× the amount of memory taken up by the data. Use a standard binary search tree like Scapegoat, and you quadruple the memory requirement. If you use a Red/Black tree, most you need even a little more memory to store the node colours, commonly a whole extra byte, though you could implement a bit-list instead, at the extra cost of bit-twiddling.

Or, you look at all the options, and realise

There must be a better way!

The goal here is to use a lot of memory, let's not pussy-foot around. The whole point of Cuckoo Cycle and my avian-theme-named variant 'Hummingbird' is exactly to force miners to have to use stupid amounts of memory. Tromp targeted the average 4gb memory found in average miner video cards.

I want to make it impossible to use a video card altogether, by requiring around 20Gb of memory, which will make any GPU implementation bottleneck on the main memory bus, except maybe for three video cards connected via Crossfire or SLI able to share their memory at PCI-Express 8 lane speed or better. This is not a very cost effective competitor to a simple CPU and 32Gb of DDR memory, 4 lanes will only bring it up to a bit over par with a CPU memory Front Side Bus.

I am setting the baseline parameters at a 64 bit hash chain, and you have to sort potentially 8 gigabytes of heads and tails to find the chains and the branches. The fastest way to search this number-space will of course be a Binary Search Tree, but conventional implementations will see you potentially chewing up 32Gb of memory to store the tree, and getting heavily bogged down in random memory access walking the tree.

So I had the thought - what if I instead structure the tree as a set of progressively doubling slots, a binary tree consists of a number of rows, called the Depth, and I would use the same amount of memory, at worst, for sorting 32 bit values, as if I used a simple array, but by storing each row progressively and filling up the slots, and balancing the tree to minimise the time required when searching for already discovered duplicates.

Note that there is three buckets. The elements that have not either branched or chained, the half hashchain elements that were found twice in the same half, and a list of candidate chains that have not yet reached the network consensus length (if they exceed it, they go into the trash pile with the branching halves).

So, our un-branching un-chaining tree will steadily shed nodes as new elements are found that either go into the trash or into the candidates bucket, this requires two 64 bit Bifurcation Array Search Tree structures, sorted by head and tail, the trash bucket only has to store one tree of 32 bit values, as any hashchain member with a collision, the non-colliding parts also form part of a branchy tangle so you want them broken apart and pinned to the rejects tree, and the candidates, because they are inherently not sortable, are stored instead in a simple two dimensional array, and the component 32 bit coordinates are stored with references to the array element they are stored in the candidates.

So my implementation requires two 64 bit BAST's, sorted by head and by tail, one 32 bit BAST for the trash, a 2d array plus another BAST, perhaps consisting of 48 bits (I doubt that there will be more than 16 bits of index consumed by the candidates ttree) with the 32 bit values and a pointer to where they can be found in the candidates array (this is so that newly generated hashchains can search here after first searching the trash to make sure the new element isn't also trash, and of course if a branch is found in a candidate it has to be hurled into the trash as well).

This is about the most optimal model I can think of for performing this search, and for reasons of the competitive nature of mining solvers (such that the best ones are closed source with developer fees), I want to make the reference solver as good as it can possibly be, and even possibly perhaps pretty much impossible to make any faster.

Sorting using buckets or any other linear sort is inherently going to result in a heavy, un-minimisable cost in the initial sorts, and not only that, one does not really know exactly when enough hashchain elements have been generated if one generates and then sorts after. You really need to be iterating generate, sort, search at each new element created. This way the moment the solution is found will be as early as it can possibly be.

The unique problems of the Bifurcation Array Search Tree

To avoid rapidly growing the size of the array, with each row a double quantity of memory is required, the insert and delete algorithms need to try and keep the depth of the tree minimal. This is exactly the same problem that self-balancing algorithms try to solve. However, where they differ is in that in this search data structure, rebalancing requires only to usually shift around the pointers in each node. By its nature, the physical memory distance between each node can be quite large and random.

So, initially I was going to call this a 'Top Heavy Equal Weight Binary Search Tree' because every time this tree would become imbalanced such that there is two more elements on one side of the root than the other, you would quite possibly then also have to allocate an expensive, yuge chunk of memory and every time you do this, you have to allocate more.

Since the distribution of the values is pretty much guaranteed to be random, fairly quickly before the tree gets too big, the root will pretty much be the median value in the collection of nodes. But more than this, the other thing has to do with the structure of the tree being an array that you count rows downwards from element 0, when you have to rebalance it, quite often the parts of the tree that have to be rearranged to maintain balance, are all very close to each other, and often sequential pages, so much of the memory access required by this algorithm is linear memory retrieval and more than likely the memory pages will remain within the CPU cache while they are being rearranged to keep the balance.

So, at least in theory, this algorithm will essentially max out the linear memory access of a CPU main memory, as opposed to how Tromp implemented Cuckoo Cycle, which has a tiny or nearly nonexistent cache in the GPU processing modules, even if you were able to minimise the nonlinearity of memory retrieval, the caches are probably too small to be able to gain any boost from the high speed of on-die cache memory. This is of course not the case with a CPU. I can pretty much count on at least 4Mb of cache memory for quite low end processors now, and thus, most of the main memory latency is avoided, and instead tends towards streaming chunks of memory in proximity to each other, both in and out for read/write.

Linear access means our performance bottleneck is not latency but throughput. Memory that is being read and written sequentially happens quite fast. I did a test with my i5-7600 and DDR4-2400 memory, and it was able to spew zeros onto 16Gb of memory in about 1m45s about 160Mb/s, and that was probably not as fast as is possible, I think that time counted the cost of finding all that memory first, which I think was probably something like 3/4 of the total. Or in other words, once all the memory is allocated, that time doesn't need to be spent again. There is always initial costs with this because we are working with multi-tasking operating systems. In theory I suppose a single process running as its own kernel (this would be the ASIC version of Hummingbird), would be able to freely dance around memory as it chooses, without having to ask anyone else. But I'm not going to implement such a kernel...

The rebalancing pattern

The most interesting thing about this algorithm is the way that the tree is balanced. In the algorithm as I have written it so far, upon the first comparison with the root node, we know whether we might be adding to the left or right. If the data successfully inserts (no match found) we increment the weight balance count of left and right. If there is more than two more on one side versus the other, we then zig-zag (like a hummingbird) through the bifurcation array back up to the root, and then out the other side so we essentially shift the central item in the list one left, and thus place the weight on the opposite side than we would have if we didn't rebalance it (and be forced to do more expensive memory allocations.

The rebalancing is literally, factually a zig-zag pattern, it goes up to the edge of the tree, then walks back down to find the next-most node, which when going left from an insertion point is the right-most node somewhere coming off a parent of its parent. Eventually the shuffle pushes the old root over to the other side, and it is shoved around until the endpoint falls out onto an empty slot, and the tree is balanced.

There is a fair number of copies that need to be made of data while doing this, but generally it will quickly ascend to the root and usually fall out the other side in fairly short order. Until the tree is more than 20 deep, likely on a system dedicated to doing this solving, the entire tree for a 32 bit node payload will remain in cache, and so each shuffle will only take about as long as the actual operations it requires.

The walking formulae

Because we are remapping an pyramid-sorta-shaped geometry into a linear array, there is a simple formula that tells you where parent, left and right child are, relative to any slot in the array. For your edification, here it is:

// Up - returns the parent of a node at i
func (b *Bast) Up(i Index) Index {
    return i>>1 - (i+1)%2
}

// Left - returns the left child given the index of a node
func (b *Bast) Left(i Index) Index {
    return i<<1 + 1
}

// Right - returns the right child given the index of a node
func (b *Bast) Right(i Index) Index {
    return i<<1 + 2
}

In a standard binary search tree implementation, conceptually it is simpler, in that you just read a value from a variable in a structure, that points you to the structure you are looking for. But in reality, this walking could cost a LOT of time in a very large tree, spanning a very large amount of memory.

Instead, as you can see there, walking down only requires two operations, walking up requires four. 4 cpu cycles is a pretty low cost compared to waiting for memory to show up, and if you are walking upwards, the amounts of memory required keep getting smaller, and then over to the other side, the memory is probably also already cached, and I think many cpu's now, especially these fancy Ryzen chips with their neural network predictors, probably will know that a particular program likes to do zigzags up and down a memory structure and it will be on its way into the cache by the time you need to mess around with it.

Conclusion

So, I now have managed to implement and tested the search and insert-without-rebalance, and you can go look at it at https://github.com/l0k1verloren/bast - I have been doing lots of scribbly hypotheticals to exactly characterise the zigzag pattern of rebalancing. The reason I called it Bifurcation Array Search Tree was that Top Heavy Weight Balanced Binary Search Tree, THWBBST, is absolutely pig ugly, and extremely long winded. Bifurcation means splitting into two, and it's an array, and it is for searching, and it is geometrically shaped like a tree.

It has been quite amusing, and why I decided to sit down and write this explanation of what I am doing, to discover that the search tree algorithm I have conceived uses fast zig-zag movements to rebalance always to the top, for a proof of work algorithm that I am calling 'Hummingbird' for a broader project named homophonically from the russian word for Hummingbird, Kolibri - pronounced precisely by standard english pronunciation, Calibrae. Well, and BAST is also the name of the cat headed egyptian god, who is in charge of farms and forests.

I believe this algorithm could have other, useful applications, such as lookup tables for databases. I am not aware of such a design, but it would surely massively accelerate the search and insert/delete functions in a database, especially SQL type, with those big long lookup indexes that make graph databases so appealing for complex searching (the linking between one table and another requires a search and the search is fastest on a top heavy balanced binary tree). But anyway, it is written originally for my purpose, a proof of work. I may end up writing my own database engine at some point so it's nice to know I have an option for a very fast index.

PS

I have also realised that the tree could become quite imbalanced in the differential between long and short paths, so, after I have implemented the balancing operations described roughly above (I have further elaborated them in the README.md, I also need to look at how to correct imbalances vertically, where only children are children of only children (of only children, maybe). I am calling these patterns 'tentacles'. Well, maybe 'tendrils' would be a better word for it.

The beginnings of my idea for this optimisation is to create a simple array with indices corresponding to each row of the tree, and increment and decrement these values when new nodes are inserted or removed. It doesn't really matter too much if paths have moderate differences since most search, insert and delete operations will be fairly random. But it is simple to identify full rows because they are 2rownumber if completely full. If a row is 2rownumber-1 we know that it is very sparse, and probably somewhere in the row we can pull these tendrils up and turn them into new little branches, pulling the sparse chains of children up and fanning them out to distribute them better and reduce path length differentials.

Clearly, in the case of deletions of a terminal node, the children of its sibling become part of a section of tendril, and this would be a good opportunity to bunch the branch upwards. Addition of a new node at the end of an only child also implies the benefit of rotating the tendril, equalising the distance of these nodes from the root.

The goal is to make the tree 'top-heavy' as the more densely packed the centre and top of the tree is, the shorter the paths from the root. In the two cases above, the operation is very cheap and simple to identify, and will be added immediately after the left-right displacement operation is implemented.

Sort:  

Congratulations @l0k1verloren! You received a personal award!

Happy Birthday! - You are on the Steem blockchain for 1 year!

You can view your badges on your Steem Board and compare to others on the Steem Ranking

Do not miss the last post from @steemitboard:

Are you a DrugWars early adopter? Benvenuto in famiglia!
Vote for @Steemitboard as a witness to get one more award and increased upvotes!

Congratulations @l0k1verloren! You received a personal award!

Happy Steem Birthday! - You are on the Steem blockchain for 2 years!

You can view your badges on your Steem Board and compare to others on the Steem Ranking

Do not miss the last post from @steemitboard:

Downvote challenge - Add up to 3 funny badges to your board
Vote for @Steemitboard as a witness to get one more award and increased upvotes!

Coin Marketplace

STEEM 0.28
TRX 0.11
JST 0.031
BTC 69279.36
ETH 3870.43
USDT 1.00
SBD 3.73