Understanding Blockchain Databases - From Card Files to the Global Database

in #blockchain8 years ago

B


lockchains are not well understood by most people and the majority of people have many misconceptions about them. So I am going to lay out the pathway of development of databases, and what things converged to give us the modern blockchain database.

Tables and Card Files

The most primordial entity in the realm of databases is the table. Tables are a collection of usually named lists containing a collection of items which can be anything from even a single bit (yes or no) to a whole video file.

It is a useful analogy to talk about old card files, like such as libraries used to have until about 20 years ago.

Each card contains a primary key, the item that determines the ordering of the cards, which would in the case of a library catalogue be Dewey Decimal codes for the subject of the book, or in a smaller database it might just be the author, publication date or title. Indeed often when a collection of items has the same Dewey code, then maybe the cards might be sorted by author.

A table entry is like the cards, it contains all of the attributes you might want to know, in this case, Subject, Genre, Author, Title, Year, and maybe a cover or summary.

The main purpose is to have the data broken down into sections, in an order familiar to the user, so that one can more quickly find where an item can be found.

Electronic Databases, Sorting and Searching

It can be said that our drawers full of cards are ordered by what we consider to be the most useful criteria. But in the cards there is a natural, unmentioned property, the individual cards are physically distinct.

In a computer database, this can be represented by a sequential or random hash or the hash of one or more records. If you know the primary key, the record can be very quickly retrieved.

To speed up searching, desired search table fields can be separately indexed. This is like having a second set of cards that lets you quickly find by author when they write in several categories, which like Dewey numbers, you can then find by author and each record for an author might have cards with a list of the category codes where the author can be found.

Relations Between Records, and Graphs

A graph is a map that shows relationships between things, along an axis of time or space in some way.

In the examples above, we have axes defined by each table field, such as author's name, Dewey code, year of publication, and many others. From A to Z, from 000 to 999, from 10000BC to this year. Each forms a new map above the sort order of recording, and the entire database could be sorted in any one of these orders.

Thus for example you could say a node exists on 101 Dewey code, or Smith, John, which you can represent as a graph. From the nodes formed by database records, you have edges, the references, that point to other nodes or particular records.

These paths created by references or commonalities between records create multidimensional relationships between entries in the tables. Through the use of SQL and indexes of each possible search criteria, you at last come to the current state of a simple database within a single computing and storage domain, prior to the internet and clustering, which was necessary for such large databases like Facebook or Twitter, fast and searchable.

Blockchains are about bringing this scale and reliability with veracity without requiring trust.

This collection of primary tables and index tables gives you the structure of what is called a Relational Database. These are designed to speed up searching of databases.

I will not go into full text search databases because it is a somewhat narrow and actually quite redundant application of the same basic principles where the body of a text is indexed and sorted in other ways, like as visually represented in word clouds - by frequency of occurrence, or the basis of Google's ranking system, counting inbound references to give a metric of relevance in an otherwise cluttered jumble of text matches.

The same principle applies so it does not need to be repeated. You are reading this because you want to get a grasp on Blockchains, and first you need to understand keys, sorting and indexing.

Databases built on Graphs - First comes the Log

So, obviously, all databases have graph properties, built on tables, and linked together with index tables. From a conceptual basis, you can see that if you instead encode the data as graphs, you gain a lot of flexibility in structure, and a very important feature for databases: veracity. By distributing the data to other parties, and making it difficult to make a fake transaction, it makes crimes of fraud much harder to perform, which is a social good.

You can easily expand a database like this with new record types, that even may remove or substitute fields as the data piles up. Graph databases integrate schema and indexes together into a single log, and follow the MVC (model, viewer, controller) principle of separating representation from storage and delivery. The data is recomposed into caches for sorting and searching the data according to the specific needs of users.

Blockchains are built first and foremost a log, the word "chain" in the name is a reference to the primary sort order of out database - in order the records are added.

Blockchains then add a feature intended to eliminate false records, and to converge the whole database as accessible from any node, to a single state. This is done by verifying signatures on records, and signing the record to affirm that the signature is correct.

A lot of people think blockchains are about mathematical puzzle competitions called Proof of Work. Proof of Work is not about veracity of data, only about distributing the right to assign ownership to a new token, without a central arbitrator.

In fact, a blockchain could store any kind of data, such as IP addresses accessing a webpage, but the key is that more than one database server exists, and all nodes cooperate to ensure a single canonical state exists in the database. It is possible for separate parts of the blockchain can diverge and disagree, but using various criteria, the network consensus is discovered and eventually recognised by all nodes.

So, when someone says the Bitcoin blockchain contains some 90 gigabytes, this is complete, network consensus order, of every new entry in the database, from the first record added to it, to now.

Network Routing Protocols

There already exists a much smaller, lighter weight distributed database system used by internet routers to discover and propagate accurate maps with which routers can forward packets of data and for them to find their destination with the shortest path traveled.

Each router has their own version of the database, and only stores information relating to address regions and which port out of the router the packet must go to follow the correct path for that region.

Compared to Blockchains, routing databases are quite small. They contain references to segments of the number-space available in the addresses, 32 bit for IP version 4 and 128 bits for version 6.

A routing table database thus can never grow beyond the maximum number of nodes, plus some arbitrary number of paths between them. This is optimised further by not requiring each node to know anything more than the address ranges for which they have a complete table, and the addresses of routers to pass packets to for other ranges of addresses.

Why did I mention this particular type of database?

Because it is a graph. Between each router in the network, there is two edges (a single direction of transit) and through a process of broadcasting changes in this network topology, routers get every packet from one place to another by the shortest path.

So what is the minimum definition of a Blockchain?

Essentially, a database made of fixed sized blocks in log format, assembled by a number of untrusted computers using a verification protocol that allows all nodes to be able to provide data from a common database, while replicating it and rapidly and reliably building a consensus of the state of the database, so the data can be retrieved from any node, and verified from any other.

The first blockchain was Bitcoin. It has a fairly restrictive set of record primitives, new found keys, signatures asserting ownership, and a protocol to reassign ownership as well.

Being log based and the potential existing for introducing new primitives, and the necessary growth of code that comes from it, Blockchain databases are becoming increasingly complex and the number of types of records is growing.

It is for this reason they are increasingly becoming graph databases. Graph databases can have new record types added and changed over time, but the code for relating and certifying them is added into another type of database which Ethereum was an attempt to move the code into a form of data within the database itself. Blockchains like Steem and Bitshares and Dash keep the code for the protocols within the code. The code understands legacy data in blocks and retains relevant code. Ethereum attempted to move it all onto the blockchain. The graph structuring principle was pioneered with Ethereum, and is most developed in Graphene, the foundation of Steem's blockchain.

So Blockchain is a fluid term at present, and people are finding new ways to build upon them. But essentially it is a log form database of transactions, with hierarchic branches of relations mainly relating to signatures of reassignment and token splitting. This fluidity has to do with an increasing convergence towards increasing reliability and availability, and blockchains are being connected together more and more.

Graphs vs Tables and Keys

You probably have heard that the Steem blockchain uses a database engine called Graphene. It is so named because it understands the graph structure of data.

Graph databases have specific reasons for being used, but the primary reason is enabling the increase of record types, but as the databases scale and more transactions per second are occurring, the graph structure is becoming necessary for an emerging purpose related to scaling.

This has to do with a process of consensus building, which I am calling convergence. I am taking this term from network switching and routing, where convergence means that every switch within a network, or every router in an internetwork, has a consistent map of the network, that describes the way to move a packet from any node to any other reliably.

Convergence in Blockchains is called consensus, and divergence is called "forking". The word fork is used because when there is 2 or more different tracks at the head of the chain, to maintain consensus, the network has to agree which fork path is agreed and which is incorrect, and won't be included in the log.

Graphs and the need for increased throughput

It is a topic that is quite under discussion at the moment as the prospect of growth of a single database like Steem to service a very large number of users. The graphs within the database have regions due to regularity of transactions between users that can be divided up more than it currently is.

Most likely the necessary level of division to split the blockchain, both the leading edge in process of converging, but also the old data, which as it ages has less data that ensures the chain of custody is unbroken, whereas some transactions may have a very big gap in time until a transactions relating to them. Many will be obsolete, probably most, but the old blocks have to be kept to enable some reasonable degree of auditing. Ideally the trail back to the genesis block where an account or token's inception is auditable.

Obviously it is not going to work to let the blockchains keep growing while not permitting nodes to only store some of the data, indefinitely. Storage is increasing but data increases geometrically by network effects from adoption, and further, by the increase of applications and venues for using the blockchain.

The challenge is to disperse the data in such a way that clients can see all transactions relevant to them, without having to trust the nodes, but to be able to trust the data.

I will discuss more how I see this unfolding in the near future, but I will just say that ultimately reliable record keeping without the problems of data loss, corruption and fraud, and ideally, scalable to any number of users and enabling many different kinds of computers to form some part of the network.

Blockchain is just the beginning. The goal is a globally connected computation and data storage system that is proof against data loss, private, and yet available anywhere to a keyholder to access and change. I personally think the systems will converge into a protocol, and then into an appliance that underpins the very fabric of human society.


We can't stop here! This is Whale country!

steemit.com/@l0k1

Loki was born in Australia, now is wandering Amsterdam again after 9 months in Sofia, Bulgaria. IT generalist, physics theorist, futurist and cyber-agorist. Loki's life mission is to establish a secure, distributed layer atop the internet, and enable space migration, preferably while living in a beautiful mountain house somewhere with a good woman, and lots of farm animals and gardens, where he can also go hunting and camping.

"I'm a thoughtocaster, a conundrummer in a band called Life Puzzler. I've flipped more lids than a monkey in a soup kitchen, of the mind."
- Xavier, Renegade Angel

All images in the above post are either original from me, or taken from Google Image Search, filtered for the right of reuse and modification, and either hotlinked directly, or altered by me

Sort:  

Genuinely impressive post!

One thing that you implied, but that I would like to make explicit is that

It is possible for separate parts of the blockchain can diverge and disagree, but using various criteria, the network consensus is discovered and eventually recognised by all nodes.

That consensus has a probability. Each block in the chain has a certain level of uncertainty surrounding the consensus of that block. The further back in the chain you go the less uncertain you are about the veracity of that block. An example is as you described where you would have a divergence of consensus on the latest block, until the network decides. Rinse and repeat.

Thank you for this! I understand it much better now. :-)

This post has been ranked within the top 25 most undervalued posts in the second half of Dec 02. We estimate that this post is undervalued by $13.25 as compared to a scenario in which every voter had an equal say.

See the full rankings and details in The Daily Tribune: Dec 02 - Part II. You can also read about some of our methodology, data analysis and technical details in our initial post.

If you are the author and would prefer not to receive these comments, simply reply "Stop" to this comment.

Coin Marketplace

STEEM 0.19
TRX 0.13
JST 0.030
BTC 63595.77
ETH 3415.98
USDT 1.00
SBD 2.49