Blockhain Series: What are Hash Tables and Hash Functions

quantalysus (60)in #blockchain • 6 years ago

Screen Shot 2018-05-23 at 3.31.15 PM

Public key cryptography has enabled a wide range of applications from password enabled web applications to blockchain technology. In this article I'm going to to dive into hash tables and hashing functions, key ingredients enabling Bitcoin and other cryptocurrency projects. The hashing functions allow data to be recorded in groups called blocks. Trying to change the data stored in that block would create a different output. Any observer following the paper trail would notice the output had changed. Hashing then can be considered a feature or tool to ensure the data is tamper proof, truthful, and in a sense a digital fingerprint. We'll start with describing what a Hash Table and then move onto defining a Hashing Function.

Hash Table

A hashtable is a data structure that allows for the very fast retrieval of data no matter how much data there is. Imagine an incredibly large library with 10 floors with millions of books. Searching for a specific book would require an incredibly organized data system, bookshelf labeling, and constant verification the book is where it is supposed to be. Hash tables are widely used in database indexing, caching, program compilation, error checking, password authentication and blockchain.

Consider a simple one dimensional array, or a list of items stored in a bin. Each bin's position reads left to right starting at 0. To find an item in the list, you can employ a brute force approach such as a linear list. For a very big array, this could take a very long time.

Here is an example of an oversimplified single array with 6 positions, starting our count at 0. For now we will leave the values blank. We'll return to this example.

Index	0	1	2	3	4	5
Value

Now suppose you want to retrieve a value in the array where you already know the index. You can retrieve that value very quickly. The time it takes to look up the array if you know the index ahead of time is independent of the array size and the position of the array. But what happens when you know the value you're looking for but do not know the index ahead of time? The index can be calculated using the value itself. Such a method would require tying the value and the index together.

So let's come up with a method to insert values into an index position in our six index array.

Adding ASCII values up

To insert values into our index we will come up with a formula. Our algorithm will take the sum of the ASCII values of each letter in our message and take the modulo of the size of our array. Alright let's come up with our words:

KYLE
JEFF
ZOMBIE
QUANTALYSUS
BITCOIN
NEO

You can look up ASCII values here or use the image below.

Screen Shot 2018-05-23 at 2.05.51 PM.png

A quick word on the 'modulo' operator before we begin. If you do not recall what modulo is it's best just to use an example. Let's say we want to calculate 5 / 3. The answer would be 1 2/3. Since we are short of a whole number, we use the value of 2 as our modulus. Alright let's keep going.

Now going back to our words we can derive:

KYLE = K (75) + Y (89) + L (76) + E (69) = 309 % 6 (number of positions in array) = 3
JEFF = 1
CRYPTO ZOMBIES = 0 (include space for the calculation)
QUANTALYSUS = 4
BITCOIN = 4
NEO = 4

Next let's populate the array with the index we calculated using our algorithm we created up above.

Position	0	1	2	3	4	5
KEY	CRYPTO ZOMBIES	JEFF		KYLE	QUANTALYSUS

After filling the above index we run into an issue with the terms BITCON, and NEO because the index position they are in already has a value parked there. We'll get back to how to resolve this issue but let's keep going.

Searching for CRYPTO ZOMBIES

We successfully stored our first three values of KYLE (the name of my business partner by the way), JEFF (the author of this article), CRYPTO ZOMBIES (Kyle's successful cryptocurrency YouTube channel) and QUANTALYSUS (my website and brand's name). But we ran into issues storing BITCOIN and NEO. As mentioned above, if we knew the index of a value we could quickly find it. Let's call our search value find_item() to perform an array lookup.

find_item(0) = CRYPTO ZOMBIES

find_item(1) = JEFF

find_item(3) = KYLE

find_item(4) = QUANTALYSUS

Hash Graphs

Arrays are incredibly useful for more than just storing simple words. Let's say we wanted to other pieces of information such as favorite numbers or favorite colors, we could do that. Let's call these values. So now we have key-value pairs. For the key of KYLE we have two values: favorite number is 9 and favorite color is green. Our table might look like the the table below. This is a hash table, also known as a hash graph.

Position	0	1	2	3	4	5
KEY	CRYPTO ZOMBIES	JEFF		KYLE	QUANTALYSUS
VALUE 1	8	7		9	4
VALUE 2	BLUE	RED		GREEN	GREEN

Hash Algorithm

Now let's assume you need to find the index of a given key or key-value pair. In our simple example above we assumed you were looking for a key or a key value pair based on already know the index ahead of time. This time around we have the key or key-value pair but do not have the index. In the example of the library it would be the equivalent of trying to return the book to its rightful place. In our example above we could simply re-engineer our simple algorithm of adding up the ASCII values modulo the array size.

There are many methods to speed up the calculation of the hashing algorithm and there are much more complex and secure hashing algorithms.

BITCOIN and NEO collide

In our original example BITCOIN and NEO did not make the cut because the key of QUANTALYSUS had already taken index 4. This is known as a collision. To find these two keys a space in the index we could set up what's called Open Addressing since every open space is allowed to store a key and its corresponding values. Open Addressing itself has several techniques to resolve collisions. For this example we will use Linear Search/Probing, a technique that will probe the array for the next available space and potentially cycle from the beginning until it finds a space. BITCOIN collides with index 4 so the linear search places the key in index 5. The search cycles around until Neo can be stored in open space number 2. Our table now looks like this:

Index	0	1	2	3	4	5
Value	CRYPTO ZOMBIES	JEFF	NEO	KYLE	QUANTALYSUS	BITCOIN

Other approaches to Open Addressing aside from Linear Search include "plus 3 rehash", "quadratic probing", and "double hashing".

Load Factor

The more items we store the increased likelihood of collisions. One way to solve this is to make the array sizes very large. The ratio between the number of open addresses and the sizes of the array is called the Load Factor. If the load factor is reasonably low, then open addressing using linear probing should work to help you discover key-values very quickly. In some architectures, the size of the array is dynamic and can adjust depending on the needed space to store data. But designing a large array comes at the cost of slowing down search speed.

Chaining

Another way of dealing with collisions is Closed Addressing otherwise known as Chaining. Rather than using open addressing and linear search to place BITCOIN in the first open index, Chaining would create what is called a Linked List. It would look like the table below. QUANTALYSUS points to BITCOIN and in turns points to NEO.

	Value	Value	Value
0	CRYPTO ZOMBIES
1	JEFF
2
3	KYLE
4	QUANTALYSUS >>>>>	BITCOIN >>>>>	NEO >>>>>
5

Instances where the load factor is high, creating linked lists may be a faster method of data retrieval than linear search. When the load factor is low, closed addressing data together may not be as efficient as open addressing.

Clustering

Another concern to address in both forms of addressing is clustering. Let's say for example we have a 100 sized array but all of the values are sitting in indexes 45 - 75. There are plenty of open addresses in spaces 0 to 44 and 76 to 99 (remember the array indexing starts at 0). When a linear search is performed the search will comb through many blank spaces before it comes across the cluster. Other methods of open addressing mentioned above such as Quadratic Probing and Double Hashing attempt to space out key-values in the array.

Final Words of Hash Function

In blockchains such as Bitcoin the entire "database" of all transactions between addresses are stored on blocks chained together. This ledger is distributed to every node on the network. This distributed ledger is effectively a very large array that allows one to search for information such as "how much Bitcoin" does an address have and what transactions took place in block 1,476,203. Hashing algorithms and data storage structures allows the network to store data and retrieve valuable pieces of information quickly. The structure of how data is stored combined with how to locate data via a hashing function requires great care in a few following areas:

Resolve and minimize collisions
Highly distributed hash values (reduce clustering)
Easy to verify

The beauty in Bitcoin is the use of cryptography to generate addresses and private keys. Knowing the private key to an address applied to Bitcoin's hashing algorithm allows the user to access the address and to use the protocol to send and receive Bitcoin. Therein also lies the dangers in that losing or sharing your private key unceremoniously affords you no protection on a distributed, decentralized network. There is no IT department to go to for recourse. If you haven't already, please read our post The History of Bitcoin as well.

Thanks for reading.

Other resources:

Thank you for coming to the site. Quantalysus publishes blockchain research and analysis for the crypto community. Please follow on Twitter, Steem (please follow and upvote if you can – thanks!), Telegram channel (New!), and Medium to stay up to date.

If you want to earn Aelf (ELF) tokens for just using Twitter and Reddit, sign up for their candy / bounty program.

If you learned something: