A very nice overview and contrasting of the four main NoSQL database categories: (i) Key-value stores, (ii) BigTable clones, (iii) Document databases, and (iv) Graph databases. One particular insight I picked up from the presentation is the pervasiveness of key-value data representations across all four categories. Quoting Eifrem:
[Document databases – e.g. CouchDB and MongoDB] are inspired by Lotus Notes. CouchDB was founded by the guy who wrote Lotus Notes at IBM, and it basically has a data model of a collection of key-value pairs that they call documents – a JSON document – and then a collection of those, sometimes hierarchically organized. …
The fourth category is Graph databases … The data model here is nodes, with relationships between nodes. And then key-value pairs that you can attach to both nodes and relationships.
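The property-graph model Eifrem describes can be sketched in a few lines. This is a minimal illustration, not Neo4j's actual API – the class and field names here are my own:

```python
# A minimal sketch of the property-graph data model: nodes and
# relationships, each carrying its own key-value pairs.

class Node:
    def __init__(self, **properties):
        self.properties = properties      # key-value pairs on the node
        self.relationships = []           # outgoing relationships

class Relationship:
    def __init__(self, start, end, rel_type, **properties):
        self.start, self.end = start, end
        self.type = rel_type
        self.properties = properties      # key-value pairs on the relationship
        start.relationships.append(self)

# Two nodes connected by a typed, property-carrying relationship.
alice = Node(name="Alice")
bob = Node(name="Bob")
knows = Relationship(alice, bob, "KNOWS", since=2009)

print(alice.relationships[0].end.properties["name"])  # prints "Bob"
```

Note that key-value pairs appear twice: once on nodes and once on relationships, which is exactly the pervasiveness Eifrem points out.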
The prominence of the Key-Value structure relates to two fundamental advantages of NoSQL datastores: (i) their ability to manage flexible schemas and complex data structures, and (ii) their ability to scale. Eifrem makes the case that the different families of NoSQL databases make different design decisions that trade off ease of scaling against ease of representing complex data. This is illustrated by the slide below:
Again quoting Eifrem from the presentation:
If you look at [these 4 families of NoSQL databases] they’re all about scaling. But there are two aspects to scaling: data complexity … and scaling to size. If you map these models you see they’re positioned differently along [the two axes].
We have Key Value stores [at the top left] – an extremely simple data model, which means it’s poor at handling complex data. It’s just a hash table, right? But the fact that it has such a simple data model means that it’s really easy to scale out …
The BigTable clones have a less simple, more capable data model that can capture [semi-structured data], but they have slightly less ability to scale to size. … It’s more difficult to get HBase to scale than to get Voldemort to scale to insane size.
Further over to the right we have the document databases. A more capable data model, but you can’t push it to scale to the size [of the previous models].
And finally all the way out to the right are the Graph databases. It’s the data model which is most capable of dealing with complexity. It’s easiest to model complex domains. But it’s most challenging to get it to scale to size.
The interesting thing about these data models is that they’re all isomorphic. If you have data, you can squeeze it into a graph database or into a key value store, or into [the other two models].
For example, we sometimes jokingly say about document databases that if you want a document database, just take a graph database and remove the relationships. The nodes are key-value pairs just like the documents. So from a data model perspective, a graph database is clearly a superset of a document database.
And one document is sort of like the entire key value store in the key value store model. Now this is a pretty theoretical exercise … so when it comes down to specifics, obviously there are a bunch of things that [differentiate] a document database from a graph database, in terms of the REST API, in terms of how we handle indexes, and other things like that.
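Eifrem's isomorphism point can be made concrete by expressing the same record in all three simpler models. The structures below are illustrative plain-Python stand-ins, not any product's actual storage format:

```python
# The same record in three data models.

# Key-value store: one opaque value per key -- the store knows nothing
# about the value's internal structure.
kv_store = {"user:1": '{"name": "Alice", "city": "Malmo"}'}

# Document database: the value is itself a structured set of key-value
# pairs (a document), and the store holds collections of such documents.
doc_store = {"users": [{"_id": 1, "name": "Alice", "city": "Malmo"}]}

# Graph database: the document's key-value pairs sit on a node. Dropping
# every relationship leaves exactly a collection of documents -- which is
# the "remove the relationships" joke in code form.
graph = {
    "nodes": {1: {"name": "Alice", "city": "Malmo"}},
    "edges": [],  # a "document database" is this with edges always empty
}

# The same fact is recoverable from each representation.
assert doc_store["users"][0]["name"] == graph["nodes"][1]["name"]
```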
To understand more about the fundamentals of the Neo4j database, please see The Neo Database: a Technology Introduction.
Slides for the talk can be found here. There is one particular slide in the presentation that I found extremely beautiful. It is this one:
What is so extremely elegant about this structure is that the Index structures are of the same form (i.e. a Graph) as the relational structure of the core domain model. Quoting Rodriguez:
So now this is what a Graph Database starts to look like. You have your domain model, this is the human world that we think about. And then you have these other structures on top – that’s how you are partitioning that world. And that’s more the computer’s interpretation of the world.
And again it’s just nodes and edges, it’s one atomic entity.
I can’t speak to the computational efficiency of this model. But clearly there’s a conceptual elegance that feels very natural. I highly recommend watching the entire presentation.