Analyzing Big Data – Wonderful SDForum panel on NoSQL and Big Data
Fascinating to hear panel members talk about how Hadoop and NoSQL architectures have ripped apart the traditional relationship database model into distinct, open layers. Here’s some of the discussion.
Selected Transcript from the panel
Owen Thomas (Moderator): What do you all make of NoSQL? What does it really mean? What are people really saying when they say NoSQL? And how does it relate to what you’re doing with Big Data?
Amr Awadallah, Cloudera: [NoSQL] implies a message of agility. And that’s really what this is about. It’s about being agile in two dimensions. Agile on the dimension of how you store your data. Because in traditional database systems, before you can load your data in, you have to create a schema first. And the schema has to have columns, and types for these columns.
And that is good. It implies structure, it implies control, and it implies governance and a common language that you can use across your organization. It has many many benefits.
However, it also causes a loss of agility. You can’t move as fast. Because every time now you want to change that schema, you have to call up the schema architect – or the governator for that schema – and ask and plead to him to please add that new column for you. Which can take months, because there are lots of committees that have to approve that.
And then after that you have to talk to the ETL folks to load the data into that new column for you from source, which again is going to take a few more months. So it causes a big loss of agility.
With Hadoop and similar systems, it’s the other way around – in the sense that you don’t define the schema when you’re writing the data, you define the schema on the fly when you’re reading the data. So you just copy the files in. You stick any files you would like in whether they be images, videos, log files from Apache, log files from a Java application server, whatever.
And then you apply the lens of how you want to parse that file and extract the schema you would like at read time. Which of course means that read performance will be lower, because now you [have that] parsing overhead. However, it gives you the agility now because you don’t have to wait. If you’re launching a new feature somewhere, you stick it in the logs, it shows up and you start analyzing it without having to wait.
So that’s the first dimension of agility, agility of data types going beyond the relational model and the chains that can sometimes impose. The second form of agility [concerns] SQL itself, the language.
Joshua Klahr, Yahoo!: The need for the agility of the data itself is something that I deal with on a day-to-day basis … Not every page on Yahoo! looks the same. Not every piece of information that someone who’s managing the Sports property vs. a Front Page property vs. a Social Networking property – they all want to ask different questions. And if they launch a new feature, they don’t want to have to go and figure out how do I instrument this new element and have it flow nicely into a database. They want some level of flexibility.
James Phillips, Northscale: I agree with everything that’s just been said. At the end of the day, Northscale is a NoSQL database company. If you look at the Membase project, it is about providing a very efficient place to store large quantities of data without ex-ante decisions about schemas.
Joydeep Sen Sarma, Facebook: … The core issue is what James was talking about. We’ve taken the whole RDBMS stack – which had how you stored the data, how you retrieved the data, how you process it, how you perform transactions, how you replicate it, how you query [the data] – and we’ve taken this, and we’ve completely torn it apart and said every one of these layers in the system is now an open and independent layer.
So if you look at Hadoop, we start off with a file system. And the file system is just a stream of bytes. And there’s no proprietary data storage method or format. You can store data in whatever format you choose to. … And now you’ve got this kind of democratization of the data format. All kinds of people can submit data into this platform which they could not do in a classic, traditional RDBMS that had a proprietary data format that you were locked into …
And so on and so forth. And then you take the processing stack and say OK, well I’m just going to give you like a very [simple] data processing primative. So in the case of online processing, I’ll just give you an Index. And if you talk to database architects, they’ll tell you there’s an index at the heart of every database engine – a key-value paradigm is sort of fundamental to how databases are built out.
And so we’ve taken that part of the system, and sort of created that as a separate layer that can be used independent of all the other parts. Similarly, on the Analytics side we have built these primatives Map and Reduce that again are found in all the relational databases – Sort-Merge-Join is one of the oldest ways of doing Join, and then there’s Hash Join, and all this stuff. We’ve taken this out and said, “Hey, that’s a primitive that goes into the RDBMS stack. And now it’s available separately.”
And now we have all these components that were earlier sort of glued together in this thing that you couldn’t tease apart, and you were locked in and couldn’t do stuff. And now we’ve taken all these out, put them on a board and said, “Hey, rack and stack”, right? Choose what you want, what’s best for you. You’re doing text processing, take the text file, take Map Reduce, do GREB or whatever and you’re done. If you have more structured data, well then you put it in a tab-separated file. And if you’re using Hive you put a schema on top of it, and you start calling them columns, and you do SQL. And if one of those columns happens to be a text column, well the data scientist who just wants to do NLP on that column can get at it by a different means.
So to me, that’s what I’ve learned from looking at the systems, is to take existing architectures, well-understood principles, and to tease them apart and expose them as primitives, which I think is a much better way of building things.
In Summary …
Wow. OK, so now I get it. Awesome.