Archive

Archive for the ‘Big Data’ Category

Large-scale Machine Learning and Data Mining using Hadoop – Hadoop World 2010

March 11, 2011

A couple of interesting videos on large-scale machine learning and data mining using Hadoop, from Hadoop World 2010.

1 – Large-Scale Text Analytics at AOL

The first is a presentation on Text Analysis from AOL:

Slides for the presentation can be found here.

AOL’s high-level text analytics architecture – built on top of HDFS – is shown in the visual below:

Related presentations on AOL’s use of Hadoop for Content Analytics and Ad Targeting can be seen below:

The Text Analytics modules perform analysis that is then fed into important AOL applications. Two targeted advertising examples – shown below – are Location-Aware Contextual Advertising and User Aware Ad Targeting:

2 – Sentiment Analysis at GE

The second presentation is from GE on large-scale Sentiment Analysis using Hadoop:

glenn

Changing Open Source Database Landscape – The 451 Group

March 10, 2011

Insightful presentation by Matthew Aslett of The 451 Group on the changing open source database landscape. Aslett’s presentation is delivered at the beginning of the Cloudera webinar How AOL Accelerates Ad Targeting Decisions with Hadoop and Membase Server, which can be accessed at the bottom of the linked page.

A few interesting slides presented by Aslett illustrate both the current dramatic growth of NoSQL databases and the rapid adoption of the Hadoop platform across industry verticals.

And here’s how the picture looks as we head into 2011:

I also like the following slide, which illustrates the differing requirements for real-time transactional data processing, and large-scale batch-processing of data – aka Big Audience vs. Big Data:

glenn

Categories: Big Data, NoSQL Databases

Big Data as a source of competitive advantage at Bank of America – Abhishek Mehta at Hadoop World 2010

March 10, 2011

Very nice presentation by Abhishek Mehta of Bank of America on adopting Big Data technologies and Hadoop at Bank of America. Here’s the presentation:

Data as the new competitive advantage

My favorite insight from the presentation follows:

The last piece of it, which kind of goes with the fact that Data Quants [or Data Scientists] are really undervalued in the market today, was the fact that Modeling Quants are overvalued …

Our learning was this: [Algorithms] are only as good as what you feed in them. So if you truly had a discipline built around data – collected over multiple sources, structured and unstructured, many variables going back many years – the [algorithms] become less important. Because simple models over Big Data are more powerful than the most complex model using some approximation behind it. And we fully believe in it.

Secondly, with this massive democratization of data and the tools around it – and the next phase of it being the algorithms also probably following the same track. Most of the best [algorithms] that you need to use are already open-sourced – they have been written. So the algorithms, or the art of writing algorithms, is no longer proprietary. That is NOT your competitive advantage. The advantage you have is going to be how you apply the algorithm to a particular business problem, and that’s going to be the competitive differentiator.

Now that’s very interesting stuff.

In reference to the above slide, Mehta has this to say:

As an example, the graph algorithm – the same algorithm:

  1. Powers the People You May Know app at LinkedIn
  2. Can be used and deployed to classify people into behavioral tribes, or behavioral networks
  3. And the same algorithm can be used to look at risk concentration ratios

The algorithm has already been written in MapReduce. … But which problem you apply it to – between people you may know, marketing tribes or risk concentrations – is what is going to be the competitive advantage.

So we spend a lot more time in building what I call Data Algorithms, or data modeling, than actually in writing algorithms. And that is a massive change in the science around data, and big data.

Data Factories – the next Industrial Revolution

The closing segment of Mehta’s talk begins with the following slide:

And he has some provocative thoughts on the topic:

I believe that we are witnessing the birth of the next industrial revolution. It’s going to be powered by data. It’s already begun. It’s still very early, but it’s already begun.

And the concept around data factories, of the ability to take data in, automate – just like factories do today – automate the data pipeline, and produce data products that can then be fed to solve multiple problems, is truly game changing. What we are building at Bank of America is the first data factory in financial services – to do exactly that.

Now data factories exist today. … Google and Facebook are some of the well-known data factories, as I classify them. Some of the not-so-well-known ones are comScore and Zynga. They do the exact same thing. Data is their core asset. They know how to monetize it, and they’ve tried to build an automated process to take raw data and push it forward.

… Buy into the fact that [Big Data] is going to change the world, and massively disrupt existing economic models. I look at Hadoop today as Linux was 20 years ago. We all have seen what Linux has done in the enterprise software space. It’s been massively disruptive. Hadoop will do the same. It’s not a question of if, it’s a question of when … across all verticals, not just in web properties.

Fantastic presentation.

glenn

HBase in Production at Facebook – Jonathan Gray at Hadoop World 2010

March 9, 2011

Interesting presentation from Facebook’s Jonathan Gray at Hadoop World 2010 on Facebook’s current and future plans for using HBase in their data platform. Here’s the video:

A couple of slides that position the role of HBase within Facebook. First, Facebook’s core application/data platform – a LAMP stack with Memcache and Hadoop/Hive:

And then a slide that hints at the impact HBase has on various elements of the stack:

Note that HBase does not actually replace any of these elements of the stack, but rather plays an interesting intermediate role between online transactional and offline batch data processing.

In his presentation, Gray speaks to the advantages of HBase. Here are a few snippets:

And then on the Data Analysis side, HBase doesn’t actually do data analysis. And it doesn’t actually store data. But HDFS stores data, and Hive does the analysis. But with HBase in the middle you can do random access and you can do incremental updating.

You also have fast Index writes. HBase is a sorted store. So every single table is sorted by row, every single row is sorted by the column. And then columns are sorted by versions. That’s a really powerful thing that you can build inverted search indexes on, you can build secondary indexes on. So you can do a lot more than just what you can do with a key-value store. So it has a very powerful data structure.

And lastly, there’s real tight integration with Hadoop. And my favorite thing: It’s an interesting product that kind of bridges this gap between the online world and the offline world – the serving, transactional world and the offline, batch-processing world.

HBase Use Case #1: Near real-time Incremental updates to the Data Warehouse

Says Gray:

Right now (at Facebook), we’re doing nightly updates of UDBs into the data warehouse. And the reason we’re doing that is because HDFS doesn’t have incremental operations. I can only append to something. I can’t edit something, I can’t delete something. So merging in the changes of transactional data, you basically have to rewrite the entire thing.

But with HBase, what we’re able to do is, all of our MySQL data is already being replicated. So we already have existing replication streams. So we can actually hook directly into those replication streams, and then write them into HBase. And then HBase then allows us to expose Hive tables, so we can actually have completely up-to-date UDB data in the data warehouse. So now we can have UDB data in our data warehouse in minutes, rather than in hours or a day.

HBase Use Case #2: High Frequency Counters and Real-time Analytics

Again quoting Gray:

The second use case is around high-frequency counters, and then real-time analytics of those counters. This is something I think a lot of people have used HBase for for a long time. …

It’s a really interesting use case. Counters aren’t writes, they’re read-modify-writes. So they’re actually a real expensive operation. In a relational database, I’d actually have to do a read lookup, and then write that data back. So it’s a real expensive thing. And if you’re talking about billions of counter updates per hour – or I think right now on one of our clusters it’s about 100,000 updates per second. So doing 100,000 increments a second on a SQL machine, it’s a cluster of machines now – a lot of machines.

And then the other part is, well now that I’m taking all this increment data, I want to be able to do analysis on it. If I’m taking click-stream data, I want to say What’s the most popular link today? And this past hour? And of all time? So I want to be able to do all that stuff, and if I have all my data sitting in MySQL or HDFS, it’s not necessarily very efficient to compute these things.

So the way we do it now is Scribe Logs. So every time you click on something, for example, that’s going into Scribe as a log line saying this user clicked this link. That’s being fed into HDFS. And then periodically we’re saying, OK once an hour, or once a day or whatever, let’s take all of our click data and do some analysis on it. Let’s do sums and max-mins and averages and group-bys, and different kinds of queries like that. And then once we have our computations, let’s feed it back into the UDBs so people can read it.

So looking at this flow here, we have Scribe going downline into HDFS. And then once we’re in HDFS, we’re writing things as Hive tables so we can get the dimensions that we need. And then we’re doing these huge MapReduce joins to join everything by URL. So it takes a long time to do that job. It’s really, really inefficient. It uses lots and lots of I/O. And it’s not real-time. If this job takes an hour to run, we’ll always have at least an hour of stale data.

But with HBase what we’re doing is we’re going from Scribe directly into HBase. Which means that as soon as that edit comes into HBase, it’s available. You can read it. You can randomly read it. You can do MapReduce on it. You can do whatever you want. And like I was talking about before, you can do real-time reads of it. And I could say “How many clicks have there been for newyorktimes.com today?” and I can just grab that data out of HBase.

Or, if I want to do things like “What are the top 10 links across all domains?” Well, the way we do that is kind of like through a trigger-based system. Because the increments are so efficient, I can actually increment 10 things for each increment. So I can say “Increment this domain. But then increment the link. Increment it for today. Increment it for these demographics.” Just do a whole bunch of increments because the increments are so efficient.

So you can actually pre-compute a lot of this stuff. And then when you want to do big aggregations, you can do a MapReduce directly on HBase. And then when you’re done and you have your results, rather than having to feed them back to the UDBs, you just put them in HBase and they’re there.

So it’s really, really cool – storage, serving, analysis as one system. And we’re able to basically keep up with huge, huge numbers of increments and operations, and at the same time do analytics on it.
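To make the increment pattern concrete, here’s a rough sketch of what that fan-out of counter updates looks like against the HBase Java client API of that era. The table name, column family, and row keys are my own invention for illustration, not Facebook’s schema:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class ClickCounters {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical table: row key = domain or full link, one column family "counts"
        HTable table = new HTable(conf, "click_counters");

        byte[] counts = Bytes.toBytes("counts");
        String domain = "newyorktimes.com";
        String link   = "newyorktimes.com/some/article";
        String today  = "2010-10-12";

        // One click fans out into several cheap, atomic increments,
        // so the aggregates are pre-computed at write time.
        table.incrementColumnValue(Bytes.toBytes(domain), counts, Bytes.toBytes("total"), 1L);
        table.incrementColumnValue(Bytes.toBytes(domain), counts, Bytes.toBytes(today), 1L);
        table.incrementColumnValue(Bytes.toBytes(link),   counts, Bytes.toBytes("total"), 1L);
        table.incrementColumnValue(Bytes.toBytes(link),   counts, Bytes.toBytes(today), 1L);

        // Reading "clicks for newyorktimes.com today" is then a single cell lookup.
        table.close();
    }
}
```

Because each increment is atomic on the server side, there is no read-modify-write round trip in the client, which is what makes this fan-out so cheap compared to doing the same thing against a relational database.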

HBase Use Case #3: User-facing Database for Write-intensive workloads

And the third scenario:

The last case I want to talk about … is using HBase as a user-facing database – almost as a transactional database, and specifically for Write workloads. When you have lots and lots of Writes and very few Reads, or you have a huge amount of data like I talked about before. If I’m storing 500K, I don’t necessarily want to put that in my UDB. …

I’m not going to elaborate further on this use case. Please listen to the presentation for a full discussion.

HBase and Hive Integration

On production development of HBase at Facebook, Gray has this to say:

But the first thing we did was the Hive integration. … This unlocks a whole new potential, and not just the way I was describing it earlier that we can now randomly write into our data warehouse. You can also randomly read into the data warehouse. So for certain kinds of joins, for example, rather than having to stream the joins we can actually do point lookups into HBase tables. So it unlocks a whole new bunch of ways that we can potentially optimize Hive queries.

But the base of the Hive integration is really that HBase tables become Hive tables. So you can map Hive tables into HBase. You can use that then as an ETL data target, meaning that we can write our data into it. It can also be a Query data source so we can read data from it. And like I was saying, the Hive integration supports different read and write patterns.

So on the Write side, it supports API random writing like we would do with UDBs. It also supports this bulk load facility through something called HFile output format. So HFile is the on-disk format that HBase uses. And it just looks like a sequence file or a Map file or anything else, but it has some special facilities for HBase.

And we are doing this extensively, which is taking data, writing it out as HFiles, which basically means you’re writing into HBase at the same speed you write to HDFS. And then you just kind of hit a button, and HBase loads all those files in. And now you have really efficient random access to all that data. We’re using that extensively.

Also, on the Read side. You can randomly read into stuff. Or you can do full table scans. Or you can do range scans. All that kind of stuff through Hive.
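Gray’s “hit a button” description maps onto the HFileOutputFormat / LoadIncrementalHFiles pair in HBase’s MapReduce package. Here’s a hedged sketch of what such a bulk-load job might look like – the input format (tab-separated lines), table name, and paths are placeholders I made up, not Facebook’s actual pipeline:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoad {
    // Parses hypothetical tab-separated lines "rowkey<TAB>value" into HBase Puts.
    static class TsvMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable key, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split("\t", 2);
            byte[] row = Bytes.toBytes(parts[0]);
            Put put = new Put(row);
            put.add(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes(parts[1]));
            ctx.write(new ImmutableBytesWritable(row), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "hfile-bulk-load");
        job.setJarByClass(BulkLoad.class);
        job.setMapperClass(TsvMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        FileInputFormat.addInputPath(job, new Path("/data/raw"));   // placeholder input
        Path hfiles = new Path("/tmp/hfiles");                      // placeholder staging dir
        FileOutputFormat.setOutputPath(job, hfiles);

        HTable table = new HTable(conf, "warehouse_table");         // hypothetical target table
        // Sets the partitioner, reducer and output format so the job writes HFiles
        // aligned with the table's current region boundaries.
        HFileOutputFormat.configureIncrementalLoad(job, table);

        if (!job.waitForCompletion(true)) System.exit(1);
        // The "hit a button" step: move the finished HFiles into the live table.
        new LoadIncrementalHFiles(conf).doBulkLoad(hfiles, table);
    }
}
```

The appeal of this pattern is exactly what Gray says: the MapReduce job writes at raw HDFS speed, and the final load is little more than a file move.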

In Summary

Another great presentation from Hadoop World.

glenn

Architecting AOL’s Data Layer for Content Analytics – Ian Holsman, October 2010

March 9, 2011

In the presentation below from Hadoop World 2010, AOL’s Ian Holsman talks about AOL’s implementation of a new data layer for personalizing content and increasing clickthroughs on news pages, and how this led to AOL’s involvement with Big Data technologies.

Starting with a business goal – increase relevance of content to increase clickthroughs

Holsman starts off:

What the Data Layer project was about initially was trying to make sense of the data that we have coming in at AOL. Before AOL, I worked at CNET … and they [had done] a project where they had Top Stories (or most popular stories) at the top of various pages. And that was getting 1-4% clickthroughs …

And what they then decided to do was personalize that a bit. And they [increased the clickthrough percentage] from 4% to 20% clickthrough. … So the whole aim of this project was to try to implement this at AOL. So instead of going just for most popular stories, we tried to get more related stories.

So that’s the background on how [AOL’s involvement with Big Data] started. … This started in 2008, so we’ve been doing this for a while now. And it’s morphed from what we originally started into something much bigger.

Holsman continues:

So it started with a question, “Can we do better than a Top Stories link?” So like I said before, at CNET they did studies and they [increased their clickthrough rates on news stories] from 4% to 20%. And I thought we could do something like that [at AOL]. And so I put a proposal through.

What that morphed into in the business requirements was to increase recirculation of the pages – basically trying to get users to click through more. And also to improve the revenue per page. The COO at the time basically had a chat with Yahoo!, and he asked us why Yahoo! has been able to get much higher value-adds on their pages. And one of the reasons for that is they know more about the user.

So the three major goals of this initiative were (i) to get better [i.e. more relevant] ads on the page, (ii) to get better reader engagement [with content on the page], and (iii) to enable the user to click through [on a piece of content and discover other related content on AOL].

AOL at the time had 72 major properties. Most people probably only know about 3 or 4 of them. So ideally we’d be able to push you through to other places and you’ll start using more of the AOL network.

And what we translated the mission to was a Related Page module. Initially [the scope] was site-specific … so if you were on Shopping we didn’t push you to Real Estate. But the plan was to eventually make it network-wide.

Architecting the Data Analytics Platform v.1

So, how did AOL approach this? Again from Holsman:

Most of the stuff we did at the time was bleeding edge technology. … We got a guy to write some Javascript for us to start measuring things. That’s probably the first thing to get your toe wet, is you have to start measuring things, and start getting the data into the cluster itself.

We wrote a custom Apache module to do third-party cookies. So the problems we had were (i) getting the data, (ii) making sure we can identify the user across sites – so we created a custom module to create a cookie which is shared across multiple domains.

We wrote a custom load processing module to push the data every 15 minutes to a Hadoop cluster. And we wrote MapReduce jobs to get the data, crunch through it, and produce reports and MySQL databases with the aggregated data so other groups can use it.

Holsman adds that one of the major aims AOL had when it began collecting data was around privacy issues. He elaborates:

[We try to make sure that people’s personal data – i.e. people’s names, addresses, e-mail addresses] (a) isn’t collected, and (b) isn’t made available to anybody – internal and external.

So we tried to keep it anonymous. We basically decided to ditch the IP numbers completely. So if you look at our data collection, we use something called WOEIDs, which is geographic location. … So the IP number was basically never sent to disk anywhere.

Most of the stuff we do with data has a Privacy guy involved. And that’s probably important … when you’re dealing with large amounts of data, you have to think about the privacy. What happens if this gets out – if an internal user starts exposing it, or we have an Oops, and we have it [made public]? Especially with this level of data, and the amount of data you’re collecting.

The following diagram presents AOL’s initial architecture circa 2008 (sorry, I know it’s a bit small and hard to read):

Key elements of the architecture – split along East and West coasts – include:

  • Beacons – collect interaction data. Beacons are provided by various analytics data gatherers/tools like Omniture and Comscore.
  • Web Server Logs – AOL captures this data in Web Server logs, and sends the log data to the Hadoop cluster using the Hadoop protocol
  • Hadoop Cluster – where data is processed using basically ETL-type transforms, that’s where all the jobs run
  • Processed Analytics data – Data processed in the Hadoop cluster is sent to MySQL databases for real-time application access, as well as AOL’s Netezza data warehouse for enterprise analytics reporting

This data flow – from Web Server to Hadoop to real-time MySQL databases, available for use by Web Servers – was happening every 15 minutes. AOL is currently redesigning the architecture to process this data in real-time (remember, this is 2008).

Holsman elaborates on the concept of Web Beacons:

If you ever look on a web page from a major web site, you’ll find that there’s various web collection “bugs” or beacon servers on their pages. So one of the ones we use is Omniture, and they give us page views. What this project was designed to do is grab this [beacon information] from Omniture and integrate it into our existing [infrastructure]. There’s also Comscore and various [advertising-related beacons] – it’s kind of scary how many beacons there are on most web pages.

Our initial goal was not to replace Omniture for page-view information, it was originally designed to collect “related site” information. We started to learn we might be able to replace Omniture with this infrastructure, but that was never our goal. There’s also advanced analytics things that Omniture does that we could start doing, but we’re not at that stage yet. That’s a big, probably multi-year, project to do that.

How AOL’s data team got started

Holsman goes on to talk about how they started. It basically started as a skunkworks project with some spare machines lying around: they installed Hadoop on the servers, installed a beacon on the Real Estate site, and started collecting data. It was important, says Holsman, that the data team didn’t wait for enterprise consensus from all the players in the organization. Rather, they started with a single channel – the Real Estate channel – and then began collecting data.

It took AOL 2-3 months to get the infrastructure installed and the data logs coming through. They then started rolling it out to other sites across the enterprise. AOL now has all of its 200+ sites feeding data into its data analytics platform.

Says Holsman:

At the time, Hadoop was pretty new to AOL. So we also used the Hadoop platform to build applications that the business people could see and derive value from. And then let the business drive [investment in the platform based on their needs].

When we looked at a web analytics platform for Bebo, it was cost-prohibitive to use Omniture. So we used our Hadoop cluster to track basic page views, unique views, and other basic information.

With the Bebo implementation, it became time to make this initiative a “real project” at AOL. At this point, AOL ran into a few problems. The most significant was that MapReduce was slow to write and inflexible (note that this was before Pig was released, which greatly simplifies writing data transformations on top of the Hadoop platform). And Hadoop kept on hanging, which was a headache. The other issue the AOL team had was upgrading Hadoop – from 0.18 to 0.19, and then to 0.20. Holsman adds that most of this stuff has since been fixed, and these issues are no longer a problem for AOL. But at the time they presented challenges.

Finding people knowledgeable about working with Hadoop technologies was also a challenge. The AOL team didn’t have access to a consulting group at the time to assist them with their deployment. But this problem was also addressed through learning and training.

Yahoo! open-sources Pig

Then around 2008/2009, Yahoo! open-sourced Pig. Says Holsman:

Pig actually solved a lot of the people issues we had. It was much easier to use. Training was much simpler. And we could then basically push [development on the Hadoop platform] out to regular developers. Before Pig came out, we had a central team that [did all the application development on Hadoop]. And they were basically a bottleneck. We had 5 people that basically wrote MapReduce jobs all day … and there basically weren’t enough people to [properly service the business].

When Pig came along, we could then hand off the jobs to [other AOL developers], and they could write their own scripts and run them on our cluster. And what we then became – rather than a central processing house – was a data provider. We provided the data. We provided the machines [for the Hadoop cluster to run on]. We provided training for internal teams on how to use the stuff. And then we let them go wild.

The Result … Business Innovation

Holsman continues:

And “letting them go wild” was kind of risky, because it was like “Oh my goodness, they’re going to hang the cluster …”

But what it actually did lead to was a lot of new innovations. I mean the channel developers are really smart guys in their own areas. They knew the business better than we did. We knew the data. They knew the business requirements.

So we basically opened it up, let them have access to their [data], and showed them how to use it. And they used it in ways that we never expected.

So the Hadoop Analytics platform basically became a source of innovation and product development at AOL. Here’s an example:

Again this is a bit hard to see. But the Auto Channel GUI designers are able to see a heatmap of the page showing where users are clicking through, what links they are clicking on, and how good the page was. The Auto design team could then do A/B testing to see which pages produced better results. They could launch a new page, and within 15 minutes see where activity was happening on the page. And this was the #1 use case that prompted business-driven adoption of the Hadoop Analytics platform throughout the enterprise.

Analytics-driven Applications built on top of Hadoop Data

AOL has also built applications on top of Hadoop. For example, AOL developed a Shopping Recommendations site using the Mahout machine learning and data mining library. Holsman elaborates:

At the time, we were looking at some [Shopping Recommendation] vendors. The Shopping site actually wanted to use an external vendor for this. We had two people at the time, and we wrote [a Shopping Recommendation Engine] internally. We used A/B testing to compare our results with 3rd-party results.

And we did better just using the algorithms that were available in Mahout, which we just downloaded. There are no PhDs working in the group. We understand Clustering to a certain degree. But we just downloaded the clustering algorithms, and just ran them. And they produced better results than what was available [from third-parties].

… We deployed the system on one site. Got it working. And now we can basically use the same algorithms on other sites.
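As a sense of how little code a basic recommender takes, here’s a minimal sketch using Mahout’s Taste API (the non-distributed side of the library). The input file, similarity measure, neighborhood size, and user ID are assumptions on my part – the talk doesn’t say which Mahout algorithms AOL actually used:

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class ShoppingRecommender {
    public static void main(String[] args) throws Exception {
        // Hypothetical input: CSV of userId,itemId[,preference] rows.
        DataModel model = new FileDataModel(new File("purchases.csv"));

        // User-based collaborative filtering with a log-likelihood similarity.
        UserSimilarity similarity = new LogLikelihoodSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(25, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 5 product recommendations for (hypothetical) user 1234.
        List<RecommendedItem> items = recommender.recommend(1234L, 5);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}
```

Swapping in a different similarity measure or an item-based recommender is a one-line change, which is presumably what made A/B testing against the third-party engine so easy.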

AOL also built a User Recommendation capability (which had not been released at the time the talk was given) to recommend personalized news content to users leveraging the Hadoop data platform.

At the time of the talk, Holsman commented that AOL had the content side of the platform working. And that AOL was currently working to integrate Advertising and Lifestream platforms into an overall Analytics/Targeting platform.

Moving Forward …

Here are the current goals for the Data Analytics team at AOL:

  1. Get more information about our customers
  2. Build metrics into our platform
  3. Build intelligence on the page – for example: Collaborative Filtering, Product Recommendations, Top-K Type Lists
  4. Make the analytics platform closer to real-time

AOL’s Data Analytics Infrastructure today (circa 2010)

Here’s a diagram of AOL’s data layer infrastructure today:

Elements that are included in this infrastructure that are not seen in the 2008 version include:

  • Publishing Platforms
  • Advertising Web Servers – in addition to web servers that deliver content
  • Relegince – AOL’s in-house semantic content platform
  • Pig – For writing data flow/transformation scripts (not shown in this diagram)
  • 2 Cassandra databases – 1 for storing and serving real-time user information, and 1 for storing some type of clustering information
  • Redis Key-value data store – not sure what its place in the architecture is

In Summary

A very insightful talk! A quite fascinating glimpse into how AOL is architecting a near real-time semantic content and ad-serving platform, as well as a data analytics platform that powers the real-time semantic content platform.

glenn

Categories: Big Data

Hadoop at Twitter – Dmitriy Ryaboy from April 2010

March 9, 2011

Another in a series of posts on Big Data, NoSQL, and Hadoop. Previous recent related posts include:

In the video below, Dmitriy Ryaboy of Twitter provides an overview of the Hadoop stack at Twitter (from April 2010):

In a nutshell, here’s Twitter’s Big Data/Hadoop technology stack:

There’s also some nice discussion of the use of Pig for large-scale data analysis in Hadoop (without having to write sequences of data manipulations in MapReduce). The following slide summarizes the benefits of using Pig:

And here’s a summary slide highlighting why Twitter uses Pig over SQL for expressing data query and transformation syntax:

glenn

Categories: Big Data

Analyzing Big Data – Wonderful SDForum panel on NoSQL and Big Data

March 7, 2011

A wonderful panel on Big Data and NoSQL at the SDForum from May 2010:

Fascinating to hear panel members talk about how Hadoop and NoSQL architectures have ripped apart the traditional relational database model into distinct, open layers. Here’s some of the discussion.

Selected Transcript from the panel

Owen Thomas (Moderator): What do you all make of NoSQL? What does it really mean? What are people really saying when they say NoSQL? And how does it relate to what you’re doing with Big Data?

Amr Awadallah, Cloudera: [NoSQL] implies a message of agility. And that’s really what this is about. It’s about being agile in two dimensions. Agile on the dimension of how you store your data. Because in traditional database systems, before you can load your data in, you have to create a schema first. And the schema has to have columns, and types for these columns.

And that is good. It implies structure, it implies control, and it implies governance and a common language that you can use across your organization. It has many many benefits.

However, it also causes a loss of agility. You can’t move as fast. Because every time now you want to change that schema, you have to call up the schema architect – or the governator for that schema – and ask and plead to him to please add that new column for you. Which can take months, because there are lots of committees that have to approve that.

And then after that you have to talk to the ETL folks to load the data into that new column for you from source, which again is going to take a few more months. So it causes a big loss of agility.

With Hadoop and similar systems, it’s the other way around – in the sense that you don’t define the schema when you’re writing the data, you define the schema on the fly when you’re reading the data. So you just copy the files in. You stick any files you would like in whether they be images, videos, log files from Apache, log files from a Java application server, whatever.

And then you apply the lens of how you want to parse that file and extract the schema you would like at read time. Which of course means that read performance will be lower, because now you [have that] parsing overhead. However, it gives you the agility now because you don’t have to wait. If you’re launching a new feature somewhere, you stick it in the logs, it shows up and you start analyzing it without having to wait.

So that’s the first dimension of agility, agility of data types going beyond the relational model and the chains that can sometimes impose. The second form of agility [concerns] SQL itself, the language.

Joshua Klahr, Yahoo!: The need for the agility of the data itself is something that I deal with on a day-to-day basis … Not every page on Yahoo! looks the same. Not every piece of information that someone who’s managing the Sports property vs. a Front Page property vs. a Social Networking property – they all want to ask different questions. And if they launch a new feature, they don’t want to have to go and figure out how do I instrument this new element and have it flow nicely into a database. They want some level of flexibility.

James Phillips, Northscale: I agree with everything that’s just been said. At the end of the day, Northscale is a NoSQL database company. If you look at the Membase project, it is about providing a very efficient place to store large quantities of data without ex-ante decisions about schemas.

Joydeep Sen Sarma, Facebook: … The core issue is what James was talking about. We’ve taken the whole RDBMS stack – which had how you stored the data, how you retrieved the data, how you process it, how you perform transactions, how you replicate it, how you query [the data] – and we’ve taken this, and we’ve completely torn it apart and said every one of these layers in the system is now an open and independent layer.

So if you look at Hadoop, we start off with a file system. And the file system is just a stream of bytes. And there’s no proprietary data storage method or format. You can store data in whatever format you choose to. … And now you’ve got this kind of democratization of the data format. All kinds of people can submit data into this platform which they could not do in a classic, traditional RDBMS that had a proprietary data format that you were locked into …

And so on and so forth. And then you take the processing stack and say OK, well I’m just going to give you like a very [simple] data processing primitive. So in the case of online processing, I’ll just give you an Index. And if you talk to database architects, they’ll tell you there’s an index at the heart of every database engine – a key-value paradigm is sort of fundamental to how databases are built out.

And so we’ve taken that part of the system, and sort of created that as a separate layer that can be used independent of all the other parts. Similarly, on the Analytics side we have built these primitives Map and Reduce that again are found in all the relational databases – Sort-Merge-Join is one of the oldest ways of doing Join, and then there’s Hash Join, and all this stuff. We’ve taken this out and said, “Hey, that’s a primitive that goes into the RDBMS stack. And now it’s available separately.”

And now we have all these components that were earlier sort of glued together in this thing that you couldn’t tease apart, and you were locked in and couldn’t do stuff. And now we’ve taken all these out, put them on a board and said, “Hey, rack and stack”, right? Choose what you want, what’s best for you. You’re doing text processing, take the text file, take MapReduce, do grep or whatever and you’re done. If you have more structured data, well then you put it in a tab-separated file. And if you’re using Hive you put a schema on top of it, and you start calling them columns, and you do SQL. And if one of those columns happens to be a text column, well the data scientist who just wants to do NLP on that column can get at it by a different means.

So to me, that’s what I’ve learned from looking at the systems, is to take existing architectures, well-understood principles, and to tease them apart and expose them as primitives, which I think is a much better way of building things.
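Awadallah’s “apply the lens at read time” point is easy to picture in code. The raw log line below would sit untouched in HDFS; the regex is the schema, and it exists only in the job that reads the data. This is just an illustrative sketch using the standard Apache Common Log Format, not anything shown by the panel:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Schema-on-read in miniature: the raw log line is stored as-is,
// and the "schema" (this regex) is applied only when the data is read.
public class ApacheLogLens {
    // Common Log Format: host ident authuser [date] "request" status bytes
    private static final Pattern CLF = Pattern.compile(
        "^(\\S+) (\\S+) (\\S+) \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) (\\S+)");

    public static void main(String[] args) {
        String raw = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
                   + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326";
        Matcher m = CLF.matcher(raw);
        if (m.find()) {
            // Columns exist only at read time; a different job could apply a different lens.
            System.out.println("host=" + m.group(1)
                             + " request=" + m.group(5)
                             + " status=" + m.group(6));
        }
    }
}
```

A different team could run a different regex, or no regex at all, over exactly the same files – which is the agility the panel is describing.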

In Summary …

Wow. OK, so now I get it. Awesome.

glenn

Categories: Big Data, NoSQL Databases

Facebook’s Architectural Stack – designing for Big Data

March 6, 2011

This is the fourth in my series of posts exploring the topic of Big Data. The previous posts in this series are:

This post provides two videos in which Facebook’s David Recordon discusses Facebook’s architectural stack as a platform that must scale to massive amounts of data and traffic. The first is a short video from OSCON 2010 in which Recordon discusses Facebook’s use of the LAMP stack:

On Database Technology and NoSQL Databases at Facebook

In the first video, Recordon addresses how Facebook implements database technology generally, and then the topic of NoSQL databases. Says Recordon:

The primary way that we store data – all of our user data that you’re going and accessing when we’re working on the site, with the exception of some services like newsfeed – is actually stored in MySQL.

So we run thousands of nodes of a MySQL cluster – but we largely don’t care that MySQL is a relational database. We generally don’t use it for joins. We’re not going and running complex queries that are pulling multiple tables together inside a database using views or anything like that.

But the fundamental idea of a relational database from the ’70s hasn’t gone away. You still need those different components.

Recordon says that there are really three different layers Facebook thinks about when working with data, illustrated in the following visual:

Continues Recordon:

You have the database, which is your primary data store. We use MySQL because it’s extremely reliable. [Then you have] Memcache and our web servers.

So we’re going and getting the data from our database. We’re actually using our web server to combine the data and do joins. And this is some of where HipHop becomes so important, because our web server code is fairly CPU-intensive because we’re going and doing all these different sorts of things with data.

And then we use Memcache as our distributed secondary index.

These are all the components that you would traditionally use a relational database for:

Recordon continues:

[These are the same layers that were] talked about 30-40 years ago in terms of database technology, but they’re just happening in different places.

And so whether you’re going and using MySQL, or whether you’re using a NoSQL database, you’re not getting away from the fact that you have to go and combine data together, that you’re needing to have a way to look it up quickly, or any of those things that you would traditionally use a database for.
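Facebook’s actual serving path is far more elaborate, but the division of labor Recordon describes – MySQL as the primary store, Memcache as a distributed secondary index, the web tier doing the joins – is essentially the cache-aside pattern. Here’s a generic sketch using the spymemcached client and JDBC; the hostnames, table, and key scheme are all invented for illustration and have nothing to do with Facebook’s code:

```java
import java.net.InetSocketAddress;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import net.spy.memcached.MemcachedClient;

public class CacheAsideLookup {
    public static void main(String[] args) throws Exception {
        // Hypothetical endpoints.
        MemcachedClient cache = new MemcachedClient(new InetSocketAddress("memcache01", 11211));
        Connection db = DriverManager.getConnection("jdbc:mysql://mysql01/app", "app", "secret");

        long userId = 42L;
        String key = "user:name:" + userId;

        // 1. Try the distributed cache first (the "secondary index" role).
        String name = (String) cache.get(key);
        if (name == null) {
            // 2. On a miss, fall back to the primary store ...
            PreparedStatement ps = db.prepareStatement("SELECT name FROM users WHERE id = ?");
            ps.setLong(1, userId);
            ResultSet rs = ps.executeQuery();
            if (rs.next()) {
                name = rs.getString(1);
                // 3. ... and repopulate the cache (TTL of one hour here).
                cache.set(key, 3600, name);
            }
        }
        System.out.println(name);
        cache.shutdown();
        db.close();
    }
}
```

The point Recordon is making is that these three layers are doing the jobs – storage, indexing, joining – that a single relational database traditionally did on its own.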

On the topic of NoSQL databases, Recordon says:

And then when you dig into the NoSQL technology stack, there are a number of different families of NoSQL databases which you can go and use. You have document stores, you have column family stores, you have graph databases, you have key-value pair databases.

And so the first question that you really have is what problem am I trying to solve, and what family of NoSQL database do I want to go and use.

And then even when you dig into one of these categories – if we just go and look at Cassandra and HBase – there are a number of differences inside of this one category of database. Cassandra and HBase make a number of different tradeoffs from a consistency perspective, from a relationship perspective. And so overall you really go and think about what problem am I trying to solve; how can I pick the best database to do that, and use it.

While we store the majority of our user data inside of MySQL, we have about 150 terabytes of data inside Cassandra, which we use for Inbox search on the site. And over 36 petabytes of uncompressed data in Hadoop overall.

On the topic of Big Data

Recordon:

So that leads me into Big Data. We run a Hadoop cluster with a little over 2,200 servers, about 23,000 CPU cores inside of it. And we’ve seen the amount of data which we go and store and process growing rapidly – it’s increased about 70 times over the past 2 years. And by the end of the year, we expect to be storing over 50 petabytes of uncompressed information – which is more than all the works of mankind combined together.

And I think this is really both the combination of the increase in terms of user activity on Facebook … But also just in terms of how important data analysis has become to running large, successful websites.

The diagram below shows Facebook’s Big Data infrastructure:

Says Recordon:

So this is the infrastructure which we use. I’ll take a minute to walk through it.

With all our web servers we use an open source technology we created called Scribe to go and take the data from tens of thousands of web servers, and funnel them into HDFS and into our Hadoop warehouses. The problem that we originally ran into was too many web servers going and trying to send data to one place. And so Scribe really tries to break it out into a series of funnels collecting this data over time.

This data is pushed into our Platinum Hadoop Cluster about every 5-to-15 minutes. And then we’re also going and pulling in data from our MySQL clusters on about a daily basis. Our Platinum Hadoop Cluster is really what is vital to the business. It is the cluster where if it goes down, it directly affects the business. It’s highly maintained, it’s highly monitored. Every query that’s being run across it, a lot of thought has gone into it.

We also then go and replicate this data to a second cluster which we call the Silver Cluster – which is where people can go and run ad-hoc queries. We have about 300 to 400 people who are going and running Hadoop and Hive jobs every single month, many of them outside of engineering. We’ve tried to make this sort of data analysis – to help people throughout the company make better product decisions – really accessible.

And so that’s one of the other technologies which we use, Apache Hive, which gives you an SQL interface on top of Hadoop to go and do data analysis. And all of these components are open source.
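For a flavor of what those ad-hoc Hive jobs look like from a developer’s point of view, here’s a small sketch that runs a HiveQL aggregation over a hypothetical page_views table via the Hive JDBC driver. Note this uses the current HiveServer2 driver class and URL; the interface in use back in 2010 differed in details, and the table and host names are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveAdHocQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical HiveServer2 endpoint; the query compiles down to MapReduce jobs.
        Connection con = DriverManager.getConnection(
            "jdbc:hive2://hive-gateway:10000/default", "analyst", "");
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery(
            "SELECT dt, COUNT(*) AS views FROM page_views GROUP BY dt ORDER BY dt");
        while (rs.next()) {
            System.out.println(rs.getString("dt") + "\t" + rs.getLong("views"));
        }
        con.close();
    }
}
```

The SQL-like surface is what lets hundreds of non-engineers run analysis jobs without ever writing MapReduce by hand.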

So when Facebook thinks about how their stack has evolved over the past few years, it looks something like this:

Where the major new component is the Hadoop technology stack and its related components to manage massive amounts of data, and to do data analysis on top of that data.

A deeper look at Scaling challenges at Facebook

The second video is a presentation delivered by David Recordon and Scott MacVicar – both Facebook software engineers – at FOSDEM in February 2010. It provides a deeper look into Facebook’s use of open source technology to build a massively scalable infrastructure:

The question that I am interested in, and that isn’t answered in these videos, is how Facebook implements its Open Graph data model in its infrastructure. That would be very interesting to learn. For more on Facebook’s Open Graph technology specifically, please see Facebook’s Open Graph and the Semantic Web – from Facebook F8.

Very interesting stuff.

glenn

Keen Insight on Big Data from Cloudera CEO Mike Olson

March 5, 2011

Interesting insights on the origins of, and trends in, Big Data from Cloudera CEO Mike Olson:

For more on Big Data, see my previous post: Introduction to Hadoop – Understanding Big Data.

glenn

Categories: Big Data

Introduction to Hadoop – understanding Big Data

March 5, 2011

This is my second post on my journey of understanding Big Data. My first post looked at the design of large-scale retrieval systems at Google. This post looks at Hadoop – a framework for processing massive data sets across multiple nodes, inspired by Google’s MapReduce and GFS architectures.
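To give a concrete feel for the MapReduce programming model that Hadoop implements, here is the canonical word-count job, roughly as it appears in Hadoop’s own tutorials and in White’s book (the input and output paths below are placeholders):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);   // emit (word, 1) for every token
            }
        }
    }

    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));   // total occurrences of each word
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/data/input"));    // placeholder paths
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

The framework handles splitting the input, shuffling the (word, 1) pairs to reducers, and re-running failed tasks – which is exactly the heavy lifting the videos below walk through.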

There are two nice video introductions to Hadoop – both sponsored by O’Reilly Media and both featuring Tom White, author of Hadoop: The Definitive Guide. The first webcast is from July 2009 titled An Introduction to Hadoop:

The second webcast is from September 2010 titled The State of Hadoop:

For a wonderful overview of the state of Big Data, please see Making Sense of Big Data, a PwC technology forecast from 2010.

More to come on Hadoop and Big Data in future posts.

glenn

Categories: Big Data