Informative presentation by Sean Cribbs of Basho Technologies on schema design when writing data-driven apps on top of the NoSQL database Riak.
A nice overview of thinking about schemas for key-value store databases. Presentation slides can be found here.
Cribbs also gives a brief overview of Links and Link Walking in Riak in the short video below:
This is a pretty simple form of link traversal, but it does inspire me to understand more about graph-oriented databases and traversal mechanisms.
Nice presentation by Eben Hewitt on Apache Cassandra at Strange Loop 2010.
Probably the best introduction I’ve seen to NoSQL databases:
Insightful presentation by Matthew Aslett of The 451 Group on the changing open source database landscape. Aslett’s presentation is delivered at the beginning of the Cloudera webinar How AOL Accelerates Ad Targeting Decisions with Hadoop and Membase Server, which can be accessed at the bottom of this link.
And here’s how the picture looks as we head into 2011:
I also like the following slide, which illustrates the differing requirements for real-time transactional data processing, and large-scale batch-processing of data – aka Big Audience vs. Big Data:
Fascinating to hear panel members talk about how Hadoop and NoSQL architectures have ripped apart the traditional relational database model into distinct, open layers. Here’s some of the discussion.
Selected Transcript from the panel
Owen Thomas (Moderator): What do you all make of NoSQL? What does it really mean? What are people really saying when they say NoSQL? And how does it relate to what you’re doing with Big Data?
Amr Awadallah, Cloudera: [NoSQL] implies a message of agility. And that’s really what this is about. It’s about being agile in two dimensions. Agile on the dimension of how you store your data. Because in traditional database systems, before you can load your data in, you have to create a schema first. And the schema has to have columns, and types for these columns.
And that is good. It implies structure, it implies control, and it implies governance and a common language that you can use across your organization. It has many many benefits.
However, it also causes a loss of agility. You can’t move as fast. Because every time now you want to change that schema, you have to call up the schema architect – or the governator for that schema – and ask and plead to him to please add that new column for you. Which can take months, because there are lots of committees that have to approve that.
And then after that you have to talk to the ETL folks to load the data into that new column for you from source, which again is going to take a few more months. So it causes a big loss of agility.
With Hadoop and similar systems, it’s the other way around – in the sense that you don’t define the schema when you’re writing the data, you define the schema on the fly when you’re reading the data. So you just copy the files in. You stick any files you would like in whether they be images, videos, log files from Apache, log files from a Java application server, whatever.
And then you apply the lens of how you want to parse that file and extract the schema you would like at read time. Which of course means that read performance will be lower, because now you [have that] parsing overhead. However, it gives you the agility now because you don’t have to wait. If you’re launching a new feature somewhere, you stick it in the logs, it shows up and you start analyzing it without having to wait.
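As an illustration of the schema-on-read idea described above (not something shown by the panel), here is a minimal Python sketch. It assumes a handful of made-up Apache-style log lines; the names `RAW_LOG`, `LINE_PATTERN`, `read_with_schema` and `hits_per_path` are hypothetical. The raw lines are stored untouched, and the structure is imposed only when they are read, which is also where the parsing cost is paid.

```python
import re
from collections import Counter

# Raw lines are kept exactly as written; no schema is declared at write time.
# These sample lines are invented for illustration.
RAW_LOG = [
    '10.0.0.1 - - [01/Nov/2010:10:00:00 +0000] "GET /sports HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Nov/2010:10:00:01 +0000] "GET /news HTTP/1.1" 200 1024',
    '10.0.0.1 - - [01/Nov/2010:10:00:02 +0000] "GET /sports HTTP/1.1" 404 0',
    'not a log line at all',
]

# The "schema" lives in the reader: a pattern applied at read time.
LINE_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+)'
)


def read_with_schema(lines):
    """Yield one dict per line that matches the pattern; skip the rest."""
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match:
            row = match.groupdict()
            row["status"] = int(row["status"])
            row["size"] = int(row["size"])
            yield row


def hits_per_path(lines):
    """Count requests per path, parsing every line on each call."""
    return Counter(row["path"] for row in read_with_schema(lines))
```

A different question about the same lines only needs a different reader; nothing has to be reloaded. The trade-off is that every query re-parses the raw text.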
So that’s the first dimension of agility, agility of data types going beyond the relational model and the chains that can sometimes impose. The second form of agility [concerns] SQL itself, the language.
Joshua Klahr, Yahoo!: The need for the agility of the data itself is something that I deal with on a day-to-day basis … Not every page on Yahoo! looks the same. Not every piece of information that someone who’s managing the Sports property vs. a Front Page property vs. a Social Networking property – they all want to ask different questions. And if they launch a new feature, they don’t want to have to go and figure out how do I instrument this new element and have it flow nicely into a database. They want some level of flexibility.
James Phillips, Northscale: I agree with everything that’s just been said. At the end of the day, Northscale is a NoSQL database company. If you look at the Membase project, it is about providing a very efficient place to store large quantities of data without ex-ante decisions about schemas.
Joydeep Sen Sarma, Facebook: … The core issue is what James was talking about. We’ve taken the whole RDBMS stack – which had how you stored the data, how you retrieved the data, how you process it, how you perform transactions, how you replicate it, how you query [the data] – and we’ve taken this, and we’ve completely torn it apart and said every one of these layers in the system is now an open and independent layer.
So if you look at Hadoop, we start off with a file system. And the file system is just a stream of bytes. And there’s no proprietary data storage method or format. You can store data in whatever format you choose to. … And now you’ve got this kind of democratization of the data format. All kinds of people can submit data into this platform which they could not do in a classic, traditional RDBMS that had a proprietary data format that you were locked into …
And so on and so forth. And then you take the processing stack and say OK, well I’m just going to give you like a very [simple] data processing primitive. So in the case of online processing, I’ll just give you an Index. And if you talk to database architects, they’ll tell you there’s an index at the heart of every database engine – a key-value paradigm is sort of fundamental to how databases are built out.
And so we’ve taken that part of the system, and sort of created that as a separate layer that can be used independent of all the other parts. Similarly, on the Analytics side we have built these primitives Map and Reduce that again are found in all the relational databases – Sort-Merge-Join is one of the oldest ways of doing Join, and then there’s Hash Join, and all this stuff. We’ve taken this out and said, “Hey, that’s a primitive that goes into the RDBMS stack. And now it’s available separately.”
And now we have all these components that were earlier sort of glued together in this thing that you couldn’t tease apart, and you were locked in and couldn’t do stuff. And now we’ve taken all these out, put them on a board and said, “Hey, rack and stack”, right? Choose what you want, what’s best for you. You’re doing text processing, take the text file, take Map Reduce, do grep or whatever and you’re done. If you have more structured data, well then you put it in a tab-separated file. And if you’re using Hive you put a schema on top of it, and you start calling them columns, and you do SQL. And if one of those columns happens to be a text column, well the data scientist who just wants to do NLP on that column can get at it by a different means.
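To make the Map and Reduce primitives a little more concrete, here is a small single-process Python sketch. It is not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration. The shuffle step sorts the pairs by key and then groups them, loosely echoing the sort step of the sort-merge join mentioned above.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(records, mapper):
    """Apply the mapper to every record and emit its (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Sort pairs by key, then group the values that share a key."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reduce_phase(grouped, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in grouped}


def word_count(lines):
    """Classic word count expressed with the three steps above."""
    def mapper(line):
        return ((word.lower(), 1) for word in line.split())

    def reducer(key, values):
        return sum(values)

    return reduce_phase(shuffle(map_phase(lines, mapper)), reducer)
```

Calling `word_count(["big data", "Big audience"])` counts "big" twice and the other words once. In a real cluster the map and reduce steps run on many machines and the shuffle moves data between them; this sketch only shows the shape of the primitives.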
So to me, that’s what I’ve learned from looking at the systems, is to take existing architectures, well-understood principles, and to tease them apart and expose them as primitives, which I think is a much better way of building things.
In Summary …
Wow. OK, so now I get it. Awesome.
This is the fourth in a series of posts exploring the topic of Big Data. The previous posts in this series are:
- Designing Large-Scale Retrieval Systems at Google – Jeff Dean from 2009
- Introduction to Hadoop – understanding Big Data
- Keen Insight on Big Data from Cloudera CEO Mike Olson
This post provides two videos in which Facebook’s David Recordon discusses Facebook’s architectural stack as a platform that must scale to massive amounts of data and traffic. The first is a short video in which Recordon discusses Facebook’s use of the LAMP stack at OSCON 2010:
On Database Technology and NoSQL Databases at Facebook
In the first video, Recordon begins with how Facebook uses database technology generally, and then turns to the topic of NoSQL databases. Says Recordon:
The primary way that we store data – all of our user data that you’re going and accessing when we’re working on the site, with the exception of some services like newsfeed – is actually stored in MySQL.
So we run thousands of nodes of a MySQL cluster – but we largely don’t care that MySQL is a relational database. We generally don’t use it for joins. We’re not going and running complex queries that are pulling multiple tables together inside a database using views or anything like that.
But the fundamental idea of a relational database from the ’70s hasn’t gone away. You still need those different components.
Recordon says that there are really three different layers Facebook thinks about when working with data, illustrated in the following visual:
You have the database, which is your primary data store. We use MySQL because it’s extremely reliable. [Then you have] Memcache and our web servers.
So we’re going and getting the data from our database. We’re actually using our web server to combine the data and do joins. And this is some of where HipHop becomes so important, because our web server code is fairly CPU-intensive because we’re going and doing all these different sorts of things with data.
And then we use Memcache as our distributed secondary index.
These are all the components that you would traditionally use a relational database for:
[These are the same layers that were] talked about 30-40 years ago in terms of database technology, but they’re just happening in different places.
And so whether you’re going and using MySQL, or whether you’re using a NoSQL database, you’re not getting away from the fact that you have to go and combine data together, that you’re needing to have a way to look it up quickly, or any of those things that you would traditionally use a database for.
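The division of labor described above (database as primary store, web server doing the joins, cache acting as a secondary index) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: plain dicts stand in for MySQL tables and for Memcache, and the table contents, key format and function names are hypothetical, not Facebook's code.

```python
# Primary store: rows keyed by primary key (a stand-in for MySQL tables).
USERS = {1: {"name": "Ann"}, 2: {"name": "Bob"}}
POSTS = {
    10: {"author_id": 1, "text": "hello"},
    11: {"author_id": 2, "text": "hi"},
    12: {"author_id": 1, "text": "again"},
}

# Cache: a plain dict standing in for a distributed cache.
cache = {}


def post_ids_by_author(author_id):
    """Secondary index lookup: author id -> post ids, built on first use."""
    key = f"posts_by_author:{author_id}"
    if key not in cache:
        cache[key] = sorted(
            pid for pid, post in POSTS.items() if post["author_id"] == author_id
        )
    return cache[key]


def posts_with_author_name(author_id):
    """Join users and posts in application code, not inside the database."""
    user = USERS[author_id]
    return [
        {"author": user["name"], "text": POSTS[pid]["text"]}
        for pid in post_ids_by_author(author_id)
    ]
```

The database is only ever asked for rows by primary key; the lookup by author and the combining of rows happen in the application tier, which is one reason that tier ends up CPU-intensive. A real system would also need to invalidate the cached index when posts change, which this sketch leaves out.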
On the topic of NoSQL databases, Recordon says:
And then when you dig into the NoSQL technology stack, there are a number of different families of NoSQL databases which you can go and use. You have document stores, you have column family stores, you have graph databases, you have key-value pair databases.
And so the first question that you really have is what problem am I trying to solve, and what family of NoSQL database do I want to go and use.
And then even when you dig into one of these categories – if we just go and look at Cassandra and HBase – there are a number of differences inside of this one category of database. Cassandra and HBase make a number of different tradeoffs from a consistency perspective, from a relationship perspective. And so overall you really go and think about what problem am I trying to solve; how can I pick the best database to do that, and use it.
While we store the majority of our user data inside of MySQL, we have about 150 terabytes of data inside Cassandra, which we use for Inbox search on the site. And over 36 petabytes of uncompressed data in Hadoop overall.
On the topic of Big Data
So that leads me into Big Data. We run a Hadoop cluster with a little over 2,200 servers, about 23,000 CPU cores inside of it. And we’ve seen the amount of data which we go and store and process growing rapidly – it’s increased about 70 times over the past 2 years. And by the end of the year, we expect to be storing over 50 petabytes of uncompressed information – which is more than all the works of mankind combined together.
The diagram below shows Facebook’s Big Data infrastructure:
So this is the infrastructure which we use. I’ll take a minute to walk through it.
With all our web servers we use an open source technology we created called Scribe to go and take the data from tens of thousands of web servers, and funnel them into HDFS and into our Hadoop warehouses. The problem that we originally ran into was too many web servers going and trying to send data to one place. And so Scribe really tries to break it out into a series of funnels collecting this data over time.
This data is pushed into our Platinum Hadoop Cluster about every 5-to-15 minutes. And then we’re also going and pulling in data from our MySQL clusters on about a daily basis. Our Platinum Hadoop Cluster is really what is vital to the business. It is the cluster where if it goes down, it directly affects the business. It’s highly maintained, it’s highly monitored. Every query that’s being run across it, a lot of thought has gone into it.
We also then go and replicate this data to a second cluster which we call the Silver Cluster – which is where people can go and run ad-hoc queries. We have about 300 to 400 people who are going and running Hadoop and Hive jobs every single month, many of them outside of engineering. We’ve tried to make this sort of data analysis to help people throughout the company make better product decisions really accessible.
And so that’s one of the other technologies which we use, Apache Hive, which gives you an SQL interface on top of Hadoop to go and do data analysis. And all of these components are open source.
So when Facebook thinks about how their stack has evolved over the past few years, it looks something like this:
Where the major new component is the Hadoop technology stack and its related components to manage massive amounts of data, and do data analysis over top of that data.
A deeper look at Scaling challenges at Facebook
The second video, a presentation delivered by David Recordon and Scott MacVicar – both Facebook software engineers – at FOSDEM in February 2010, provides a deeper look into Facebook’s use of open source technology to provide a massively scalable infrastructure:
The question that I am interested in, and that isn’t answered in these videos, is how Facebook implements its Open Graph data model in its infrastructure. That would be very interesting to learn. For more specifics about Facebook’s Open Graph technology, please see Facebook’s Open Graph and the Semantic Web – from Facebook F8.
Very interesting stuff.