Merkle Patricia Tries Made Easy

cseberino (58)in #ethereumclassic • 6 years ago (edited)

Tries are tree structures that represent sets of key value pairs in software. Tries are composed of connected nodes that store key values. Different nodes imply different keys based on their positions. Merkle Patricia tries add special nodes to minimize the number of unused nodes as well as hashes for security. I will describe Merkle Patricia tries and provide an implementation.

Tries

In the trie diagram above, the dots represent nodes and line segments represent connections. Note that many dots have left and right line segments below them. The key corresponding to each node is found as follows:

Find the shortest path from the top to the node (dot).
For each left line segment along the path, add a “0” to the key.
For each right line segment along the path, add a “1” to the key.

Therefore, the yellow dot corresponds to the key “010”. The blue dot corresponds to the key “11”. A similar procedure is used for all tries to determine keys.

The diagram above represents a trie where many nodes connect to 16 nodes below them. Each line segment corresponds to one of 16 possible nibbles added to the keys. Therefore, the yellow dot corresponds to the key with the hexadecimal representation “0f2”. The first trie diagram represents 15 nodes. This one represents 4369 nodes!

Compression

Merkle Patricia trie nodes can connect to up to 16 nodes below them as in the last trie diagram. To avoid thousands of unused nodes, Merkle Patricia tries introduce special node types. If the hexadecimal representations of keys have corresponding segments in common, those segments are represented by special nodes. For example, consider the following list of hexadecimal representations of keys:

0xa1111bc1111def1111ghij
0xa2222bc2222def2222ghij
0xa3333bc3333def3333ghij
0xa4444bc4444def4444ghij

Notice all the keys have the corresponding “0xa”, “0xbc”, “0xdef” and “0xghij” segments in common. Therefore, they would all be represented by special nodes in a Merkle Patricia trie for the set. Consider also this next list of hexadecimal representations of keys:

0xa1111bc1111def1111ghij
0xa2222bc2222def2222
0xa3333bc3333
0xa4444

All the keys are not long enough to have the same corresponding segments in common. However, the subsets of keys of sufficient length do have them in common. For example, only the first two keys are long enough to possibly have the corresponding “0xdef” segments in common and they do. Also, only the first three keys are long enough to possibly have the corresponding “0xbc” segments in common and they do. Therefore, the same four segments would still be represented by special nodes in a Merkle Patricia trie for the set.

Prefixes

Special nodes prepend one or two nibbles to the key segments they represent. This helps distinguish between key segments occurring at the ends of keys and elsewhere. It also leads to only whole numbers of bytes (even numbers of nibbles) being represented by special nodes. The following table gives the prefixes for even and odd numbers of key segment nibbles:

special node type	even	odd
not key ending	0x00	0x1
key ending	0x20	0x3

For example, the aforementioned “0xghij” key segments always occur at the ends of keys. They also have an even number of nibbles. Therefore, “0x20” is prepended giving “0x20ghij”. The aforementioned “0xa” key segments never occur at the ends of keys. They have an odd number of nibbles. Therefore, “0x1” is prepended giving “0x1a”. Tries that utilize this method of compression are sometimes referred to as Patricia tries.

Security

For security reasons, Merkle Patricia tries are represented using hashes. All nodes, except the top node and unused nodes, are replaced with the Keccak 256 hashes of their Recursive Length Prefix (RLP) encodings if the lengths of the RLP encodings are the hash length or greater. Unused nodes are replaced with empty strings. This transformation is applied recursively. The final representation is similar to a Merkle tree.

Code

Here is a Merkle Patricia trie Python implementation that uses the PySHA3 package. The first file implements RLP:

#!/usr/bin/env python3

import math

BYTE_LEN = 8

def n_bytes(integer):
        """
        Finds the numbers of bytes needed to represent integers.
        """

        return math.ceil(integer.bit_length() / BYTE_LEN)

def get_len(input, extra):
        """
        Finds the lengths of the longest inputs using the given extra values.
        """

        n_bytes = input[0] - extra

        return 1 + n_bytes + int.from_bytes(input[2:2 + n_bytes], "big")

def encode(input):
        """
        Recursive Length Prefix encodes inputs.
        """

        if isinstance(input, bytes):
                body = input
                if   (len(body) == 1) and (body[0] < 128):
                        header = bytes([])
                elif len(body) < 56:
                        header = bytes([len(body) + 128])
                else:
                        len_   = len(body)
                        len_   = len_.to_bytes(n_bytes(len_), "big")
                        header = bytes([len(len_) + 183]) + len_
                result = header + body
        else:
                body = bytes([])
                for e in input:
                        body += encode(e)
                if len(body) < 56:
                        header = bytes([len(body) + 192])
                else:
                        len_   = len(body)
                        len_   = len_.to_bytes(n_bytes(len_), "big")
                        header = bytes([len(len_) + 247]) + len_
                result = header + body

        return result

def decode(input):
        """
        Recursive Length Prefix decodes inputs.
        """

        if   input[0] < 128:
                result = input
        elif input[0] < 184:
                result = input[1:]
        elif input[0] < 192:
                result = input[1 + (input[0] - 183):]
        else:
                result = []
                if input[0] < 248:
                        input = input[1:]
                else:
                        input = input[1 + (input[0] - 247):]
                while input:
                        if   input[0] < 128:
                                len_ = 1
                        elif input[0] < 184:
                                len_ = 1 + (input[0] - 128)
                        elif input[0] < 192:
                                len_ = get_len(input, 183)
                        elif input[0] < 248:
                                len_ = 1 + (input[0] - 192)
                        else:
                                len_ = get_len(input, 247)
                        result.append(decode(input[:len_]))
                        input = input[len_:]

        return result

The next file contains the merkle_patricia function which converts Python dictionaries to Merkle Patricia tries. It also contains the similar patricia function which omits the aforementioned security minded representation. The info function can be used to display detailed information:

#!/usr/bin/env python3

import sha3
import rlp
import pprint

HASH_LEN = 32
HEXADEC  = 16

def remove(dict_, segment):
        """
        Removes initial key segments from the keys of dictionaries.
        """

        return {k[len(segment):] : v for k, v in dict_.items()}

def select(dict_, segment):
        """
        Selects dictionary elements with given initial key segments.
        """

        return {k : v for k, v in dict_.items() if k.startswith(segment)}

def find(dict_):
        """
        Finds common initial segments in the keys of dictionaries.
        """

        segment = ""
        for i in range(min([len(e) for e in dict_.keys()])):
                if len({e[i] for e in dict_.keys()}) > 1:
                        break
                segment += list(dict_.keys())[0][i]

        return segment

def patricia_r(dict_):
        """
        Creates Patricia tries that begin with regular nodes.
        """

        pt = (HEXADEC + 1) * [None]
        if "" in dict_:
                pt[-1] = dict_[""]
                del(dict_[""])
        for e in {e[0] for e in dict_.keys()}:
                pt[int(e, HEXADEC)] = patricia(remove(select(dict_, e), e))

        return pt

def patricia_s(dict_):
        """
        Creates Patricia tries composed of one key ending special node.
        """

        pt = list(dict_.items())[0]
        if len(pt[0]) % 2 == 0:
                pt = (bytes.fromhex("20" + pt[0]), pt[1])
        else:
                pt = (bytes.fromhex("3"  + pt[0]), pt[1])

        return pt

def patricia(dict_):
        """
        Creates Patricia tries from dictionaries.
        """

        segment = find(dict_)
        if   len(dict_) == 1:
                pt = patricia_s(dict_)
        elif segment:
                dict_ = remove(dict_, segment)
                if len(segment) % 2 == 0:
                        pt = [bytes.fromhex("00" + segment), patricia_r(dict_)]
                else:
                        pt = [bytes.fromhex("1"  + segment), patricia_r(dict_)]
        else:
                pt = patricia_r(dict_)

        return pt

def merkle(element):
        """
        Encodes Patricia trie elements using Keccak 256 hashes and RLP.
        """

        if   not element:
                merkle_ = b""
        elif isinstance(element, str):
                merkle_ = bytes.fromhex(element)
        elif isinstance(element, bytes):
                merkle_ = element
        else:
                merkle_ = [merkle(e) for e in element]
                rlp_    = rlp.encode(merkle_)
                if len(rlp_) >= HASH_LEN:
                        merkle_ = sha3.keccak_256(rlp_).digest()

        return merkle_

def merkle_patricia(dict_):
        """
        Creates Merkle Patricia tries from dictionaries.
        """

        return [merkle(e) for e in patricia(dict_)]

def info(dict_):
        """
        Prints info about the Merkle Patricia tries created from dictionaries.
        """

        print("\nPatricia trie:\n")
        pprint.pprint(patricia(dict_))
        print("\nMerkle Patricia trie:\n")
        pprint.pprint(merkle_patricia(dict_))
        print("\nHash of the RLP encoding of the Merkle Patricia trie:\n")
        print(sha3.keccak_256(rlp.encode(merkle_patricia(dict_))).hexdigest())

Here are some examples of usage. Key ending special nodes are represented by Python tuples. Other special nodes are represented by two item Python lists. Regular nodes are represented by 17 item Python lists. Python dictionary keys should be string hexadecimal representations of Merkle Patricia trie keys. Python dictionary key values should be byte representations of Merkle Patricia trie key values. (Save both files in an accessible location with the names rlp.py and mpt.py respectively.):

>>> import mpt

>>> mpt.info({"a" : b"dog"})

Patricia trie:

(b':', b'dog')

Merkle Patricia trie:

[b':', b'dog']

Hash of the RLP encoding of the Merkle Patricia trie:

6bde2e12a5d4e44d4c814f3fd9450cafa8e66e8c7af5de33eec8aca0be0386d7

>>> mpt.info({"c" : b"This long key value is part of the top node."}) 

Patricia trie:

(b'<', b'This long key value is part of the top node.')

Merkle Patricia trie:

[b'<', b'This long key value is part of the top node.']

Hash of the RLP encoding of the Merkle Patricia trie:

9e84fb4b9ab392a2839bfbfb746d98740fb8b50b608df72ba134732842b7197d

>>> mpt.info({"2" : b"apple", "a" : b"dog", "b" : b"cat"})

Patricia trie:

[None,
 None,
 (b' ', b'apple'),
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 (b' ', b'dog'),
 (b' ', b'cat'),
 None,
 None,
 None,
 None,
 None]

Merkle Patricia trie:

[b'',
 b'',
 [b' ', b'apple'],
 b'',
 b'',
 b'',
 b'',
 b'',
 b'',
 b'',
 [b' ', b'dog'],
 [b' ', b'cat'],
 b'',
 b'',
 b'',
 b'',
 b'']

Hash of the RLP encoding of the Merkle Patricia trie:

e6ed7131239ec0e3a9919819f7dc1c6ce199e32f0db0720a352f126161ded326

>>> mpt.info({"2" : b"apple", "a" : b"dog", "b" : b"This long key value is NOT part of the top node."})

Patricia trie:

[None,
 None,
 (b' ', b'apple'),
 None,
 None,
 None,
 None,
 None,
 None,
 None,
 (b' ', b'dog'),
 (b' ', b'This long key value is NOT part of the top node.'),
 None,
 None,
 None,
 None,
 None]

Merkle Patricia trie:

[b'',
 b'',
 [b' ', b'apple'],
 b'',
 b'',
 b'',
 b'',
 b'',
 b'',
 b'',
 [b' ', b'dog'],
 b'\xe6\xa1E\xec\x16\xca\x8e\xd6\x15\xa8\xe4\x16@O:\xcc\xf3\x9c\x85&'
 b'\x8c\xf9\xe8\xbe\x91\xd7\tJ\xf1\xb5\x8a\xe8',
 b'',
 b'',
 b'',
 b'',
 b'']

Hash of the RLP encoding of the Merkle Patricia trie:

0e6c255840d143f718a65bad138c340be1331600a91e192508dfab9d0587501d

>>> dict_ = {b"do".hex() : b"verb", b"dog".hex() : b"puppy", b"doge".hex() : b"coin", b"horse".hex() : b"stallion"}

>>> mpt.info(dict_)

Patricia trie:

[b'\x16',
 [None,
  None,
  None,
  None,
  [b'\x00o',
   [None,
    None,
    None,
    None,
    None,
    None,
    [b'\x17',
     [None,
      None,
      None,
      None,
      None,
      None,
      (b'5', b'coin'),
      None,
      None,
      None,
      None,
      None,
      None,
      None,
      None,
      None,
      b'puppy']],
    None,
    None,
    None,
    None,
    None,
    None,
    None,
    None,
    None,
    b'verb']],
  None,
  None,
  None,
  (b' orse', b'stallion'),
  None,
  None,
  None,
  None,
  None,
  None,
  None,
  None]]

Merkle Patricia trie:

[b'\x16',
 b'\xbd>\xe5\x07\xe6\xc6|\xfe\xfc\xa9\x8f\x84\xbeG\xc1\xbb\xc0\t1_\xab\xc4@]'
 b'\xb4\xba2\x19\x03tW*']

Hash of the RLP encoding of the Merkle Patricia trie:

5991bb8c6514148a29db676a14ac506cd2cd5775ace63c30a4fe457715e9ac84

Conclusion

Merkle Patricia tries compactly and securely represent sets of key value pairs. Hopefully between my description and my implementation you can begin to master this topic.