Architecting AOL’s Data Layer for Content Analytics – Ian Holsman, October 2010
In the presentation below at Hadoop World 2010, AOL’s Ian Holsman talks about AOL’s implementation of a new data layer for personalizing content and increasing clickthroughs on news pages, and how this led to AOL’s involvement with Big Data technologies.
Starting with a business goal – increase relevance of content to increase clickthroughs
Holsman starts off:
What the Data Layer project was about initially was trying to make sense of the data that we have coming in at AOL. Before AOL, I worked at CNET … and they [had done] a project where they had Top Stories (or most popular stories) at the top of various pages. And that was getting 1-4% clickthroughs …
And what they then decided to do was personalize that a bit. And they [increased the clickthrough percentage] from 4% to 20% clickthrough. … So the whole aim of this project was to try to implement this at AOL. So instead of going just for most popular stories, we tried to get more related stories.
So that’s the background on how [AOL’s involvement with Big Data] started. … This started in 2008, so we’ve been doing this for a while now. And it’s morphed from what we originally started into something much bigger.
So it started with a question, “Can we do better than a Top Stories link?” So like I said before, at CNET they did studies and they [increased their clickthrough rates on news stories] from 4% to 20%. And I thought we could do something like that [at AOL]. And so I put a proposal through.
What that morphed into in the business requirements was to increase recirculation of the pages – basically trying to get users to click through more. And also to improve the revenue per page. The COO of the time basically had a chat with Yahoo!, and he asked us why has Yahoo! been able to get much more higher value-adds on their pages? And one of the reasons for that is they know more about the user.
So the three major goals of this initiative were (i) to get better [i.e. more relevant] ads on the page, (ii) get better reader engagement [with content on the page], and (iii) enable the user to click through [on a piece of content and discover other related content on AOL].
AOL at the time had 72 major properties. Most people probably only know about 3 or 4 of them. So ideally we’d be able to push you through to other places and you’ll start using more of the AOL network.
And what we translated the mission to was a Related Page module. Initially [the scope] was site-specific … so if you were on Shopping we didn’t push you to Real Estate. But the plan was to eventually make it network-wide.
Architecting the Data Analytics Platform v.1
So, how did AOL solve this? Again from Holsman:
We wrote a custom Apache module to do third-party cookies. So the problems we had were (i) getting the data, (ii) making sure we can identify the user across sites – so we created a custom module to create a cookie which is shared across multiple domains.
We wrote a custom load processing module to push the data every 15 minutes to a Hadoop cluster. And we wrote MapReduce jobs to get the data, crunch through it, and produce reports and MySQL databases with the aggregated data so other groups can use it.
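To make the batch step concrete, here is a minimal sketch of a MapReduce-style aggregation in plain Python. It is not AOL's code: the record fields (`site`, `ts`) and the function names are assumptions made up for illustration, and it runs in a single process rather than on a Hadoop cluster. It only shows the shape of the job: map each log record to a (site, 15-minute window) key, then sum the counts per key.

```python
from collections import defaultdict

# Batch interval taken from the talk: data is pushed every 15 minutes.
WINDOW_SECONDS = 15 * 60


def window_start(ts):
    """Round a Unix timestamp down to the start of its 15-minute window."""
    return int(ts // WINDOW_SECONDS) * WINDOW_SECONDS


def map_record(record):
    """Map step: emit one ((site, window), 1) pair per log record.

    The 'site' and 'ts' field names are hypothetical.
    """
    yield (record["site"], window_start(record["ts"])), 1


def reduce_counts(pairs):
    """Reduce step: sum the values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def run_job(records):
    """Chain map and reduce over an in-memory list of records."""
    pairs = (pair for record in records for pair in map_record(record))
    return reduce_counts(pairs)
```

The resulting dictionary is the kind of aggregated table that could then be loaded into a database for other groups to query.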
Holsman adds that one of the major aims AOL had when it began collecting data was addressing privacy issues. He elaborates:
[We try to make sure that people’s personal data – i.e. people’s names, addresses, e-mail addresses] (a) isn’t collected, and (b) isn’t made available to anybody – internal and external.
So we tried to keep it anonymous. We basically decided to ditch the IP numbers completely. So if you look at our data collection, we use something called WOEIDs, which is geographic location. … So the IP number was basically never sent to disk anywhere.
Most of the stuff we do with data has a Privacy guy involved. And that’s probably important … when you’re dealing with large amounts of data, you have to think about the privacy. What happens if this gets out – if an internal user starts exposing it, or we have an Oops, and we have it [made public]? Especially with this level of data, and the amount of data you’re collecting.
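The idea of never writing the IP address to disk can be sketched roughly as follows. This is an illustration under stated assumptions, not AOL's implementation: the lookup table, the location IDs (placeholder integers standing in for WOEIDs), and the field names are all hypothetical. The point is only that the IP is swapped for a coarse geographic ID, and personal fields are dropped, before the record is stored.

```python
import ipaddress

# Hypothetical lookup table mapping networks to location IDs.
# The networks are reserved documentation ranges and the IDs are
# placeholders, not real WOEID values.
GEO_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), 1001),
    (ipaddress.ip_network("198.51.100.0/24"), 1002),
]
UNKNOWN_LOCATION = 0

# Fields that must never reach storage (hypothetical field names).
PII_FIELDS = {"ip", "name", "address", "email"}


def lookup_location_id(ip):
    """Return the location ID for an IP, or UNKNOWN_LOCATION if unmatched."""
    addr = ipaddress.ip_address(ip)
    for network, location_id in GEO_TABLE:
        if addr in network:
            return location_id
    return UNKNOWN_LOCATION


def anonymize(hit):
    """Build the record that gets stored: no IP, no personal fields."""
    stored = {k: v for k, v in hit.items() if k not in PII_FIELDS}
    stored["location_id"] = lookup_location_id(hit["ip"])
    return stored
```

Doing this at collection time, rather than scrubbing logs later, means the sensitive value never exists in the stored data.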
The following diagram presents AOL’s initial architecture circa 2008 (sorry, I know it’s a bit small and hard to read):
Key elements of the architecture – split along East and West coasts – include:
- Beacons – collect interaction data. Beacons are provided by various analytics data gatherers/tools like Omniture and Comscore.
- Web Server Logs – AOL captures this data in Web Server logs, and sends the log data to the Hadoop cluster using the Hadoop protocol
- Hadoop Cluster – where data is processed using basically ETL-type transforms, that’s where all the jobs run
- Processed Analytics data – Data processed in the Hadoop cluster is sent to MySQL databases for real-time application access, as well as AOL’s Netezza data warehouse for enterprise analytics reporting
This data flow – from Web Server to Hadoop to real-time MySQL databases, available for use by Web Servers – was happening every 15 minutes. AOL is currently redesigning the architecture to process this data in real-time (remember, this is 2008).
Holsman elaborates on the concept of Web Beacons:
If you ever look on a web page from a major web site, you’ll find that there’s various web collection “bugs” or beacon servers on their pages. So one of the ones we use is Omniture, and they give us page views. What this project was designed to do is grab this [beacon information] from Omniture and integrate it into our existing [infrastructure]. There’s also Comscore and various [advertising-related beacons] – it’s kind of scary how many beacons there are on most web pages.
Our initial goal was not to replace Omniture for page-view information, it was originally designed to collect “related site” information. We started to learn we might be able to replace Omniture with this infrastructure, but that was never our goal. There’s also advanced analytics things that Omniture does that we could start doing, but we’re not at that stage yet. That’s a big, probably multi-year, project to do that.
How AOL’s data team got started
Holsman goes on to talk about how they started. It basically started as a skunkworks project with some spare machines lying around. The team installed Hadoop on the servers, installed a beacon on the Real Estate site, and started collecting data. It was important, says Holsman, that the data team didn’t wait for enterprise consensus from all the players in the organization. Rather, they started with a single channel – the Real Estate channel – and then began collecting data.
It took AOL 2-3 months to get the infrastructure installed and the data logs coming through. And then we started rolling it out to other sites across the enterprise. AOL now has all of its 200+ sites feeding data into its data analytics platform.
At the time, Hadoop was pretty new to AOL. So we also used the Hadoop platform to build applications that the business people could see and derive value from. And then let the business drive [investment in the platform based on their needs].
When we looked at a web analytics platform for Bebo, it was cost-prohibitive to use Omniture. So we used our Hadoop cluster to track basic page views, unique views, and other basic information.
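As a rough illustration of the "basic page views, unique views" tracking described here, the sketch below counts both from a stream of (cookie ID, page) events. The event shape and function name are assumptions, and it runs in memory on one machine. On a cluster the same logic would be split into map and reduce steps.

```python
from collections import Counter, defaultdict


def page_stats(events):
    """Count page views and unique visitors per page.

    `events` is an iterable of (cookie_id, page) tuples; this shape
    is hypothetical and chosen only for the example.
    """
    views = Counter()
    visitors = defaultdict(set)
    for cookie_id, page in events:
        views[page] += 1                # every hit is a page view
        visitors[page].add(cookie_id)   # a set keeps each visitor once
    return {
        page: {
            "page_views": views[page],
            "unique_visitors": len(visitors[page]),
        }
        for page in views
    }
```

Keeping a set of cookie IDs per page is exact but memory-hungry. At large scale, an approximate distinct-count structure would likely be needed instead.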
With the Bebo implementation, it became time to make this initiative a “real project” at AOL. At this point, AOL ran into a few problems. The most significant was that MapReduce was slow to write and inflexible (note that this was before the release of Pig, which greatly simplifies writing data transformations on top of the Hadoop platform). And Hadoop kept on hanging, which was a headache. The other issue the AOL team had was upgrading Hadoop – from 0.18 to 0.19, and then to 0.20. Holsman adds that most of this stuff has been fixed, and these issues are no longer a problem for AOL. But at the time they presented challenges.
Finding people knowledgeable in Hadoop technologies was also a challenge. The AOL team didn’t have access to a consulting group at the time to assist them with their deployment. But this problem was also addressed through learning and training.
Yahoo! open-sources Pig
Then around 2008/2009, Yahoo! open-sourced Pig. Says Holsman:
Pig actually solved a lot of the people issues we had. It was much easier to use. Training was much simpler. And we could then basically push [development on the Hadoop platform] out to regular developers. Before Pig came out, we had a central team that [did all the application development on Hadoop]. And they were basically a bottleneck. We had 5 people that basically wrote MapReduce jobs all day … and there basically weren’t enough people to [properly service the business].
When Pig came along, we could then hand off the jobs to [other AOL developers], and they could write their own scripts and run them on our cluster. And what we then became – rather than a central processing house – was a data provider. We provided the data. We provided the machines [for the Hadoop cluster to run on]. We provided training for internal teams on how to use the stuff. And then we let them go wild.
The Result … Business Innovation
And “letting them go wild” was kind of risky, because it was like “Oh my goodness, they’re going to hang the cluster …”
But what it actually did lead to was a lot of new innovations. I mean the channel developers are really smart guys in their own areas. They knew the business better than we did. We knew the data. They knew the business requirements.
So we basically opened it up, let them have access to their [data], and showed them how to use it. And they used it in ways that we never expected.
So the Hadoop Analytics platform basically became a source of innovation and product development at AOL. Here’s an example:
Again this is a bit hard to see. But the Auto Channel GUI designers are able to see a Heatmap of the page where users are clicking through, what links they are clicking on, and how good the page was. The Auto design team could then do A/B testing to see which pages produced better results. They could launch a new page, and within 15 minutes see where activity was happening on the page. And this was the #1 use case that prompted business-driven adoption of the Hadoop Analytics platform throughout the enterprise.
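The two ideas in this use case, a click heatmap and an A/B comparison, can be sketched in a few lines of Python. This is a simplified illustration with assumed inputs (pixel coordinates for clicks, and per-variant click and view totals), not a description of AOL's tool. The function names and the 100-pixel cell size are hypothetical.

```python
from collections import Counter


def click_heatmap(clicks, cell_size=100):
    """Bin (x, y) click coordinates into square grid cells.

    Returns a Counter mapping (column, row) to the number of clicks
    that landed in that cell.
    """
    grid = Counter()
    for x, y in clicks:
        grid[(x // cell_size, y // cell_size)] += 1
    return grid


def clickthrough_rate(clicks, views):
    """Clicks divided by views, or 0.0 when there were no views."""
    return clicks / views if views else 0.0


def better_variant(stats):
    """Pick the variant name with the highest clickthrough rate.

    `stats` maps a variant name to a (clicks, views) tuple.
    """
    return max(stats, key=lambda name: clickthrough_rate(*stats[name]))
```

Note that `better_variant` only compares raw rates. A real A/B test would also check whether the difference is statistically significant before declaring a winner.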
Analytics-driven Applications built on top of Hadoop Data
AOL has also built applications on top of Hadoop. For example, AOL developed a Shopping Recommendations site using the Mahout machine learning and data mining library. Holsman elaborates:
At the time, we were looking at some [Shopping Recommendation] vendors. The Shopping site actually wanted to use an external vendor for this. We had two people at the time, and we wrote [a Shopping Recommendation Engine] internally. We used A/B testing to compare our results with 3rd-party results.
And we did better just using the algorithms that were available in Mahout, which we just downloaded. There’s no PhDs working in the group. We understand Clustering to a certain degree. But we just downloaded the clustering algorithms, and just ran them. And they produced better results than what was available [from third-parties].
… We deployed the system on one site. Got it working. And now we can basically use the same algorithms on other sites.
AOL also built a User Recommendation capability (which had not been released at the time the talk was given) to recommend personalized news content to users leveraging the Hadoop data platform.
At the time of the talk, Holsman commented that AOL had the content side of the platform working, and that AOL was working to integrate its Advertising and Lifestream platforms into an overall Analytics/Targeting platform.
Moving Forward …
Here are the current goals for the Data Analytics team at AOL:
- Get more information about our customers
- Build metrics into our platform
- Build intelligence on the page – for example: Collaborative Filtering, Product Recommendations, Top-K Type Lists
- Make the analytics platform closer to real-time
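To give a feel for what “intelligence on the page” might involve, here is a small Python sketch of a Top-K list and a very simple co-occurrence form of collaborative filtering (“users who viewed this also viewed that”). It is an assumption-laden illustration, not the method AOL or Mahout uses. The function names and the input shape (a mapping from user to viewed items) are made up for the example.

```python
import heapq
from collections import Counter, defaultdict
from itertools import combinations


def top_k(counts, k):
    """Return the k (item, count) pairs with the highest counts."""
    return heapq.nlargest(k, counts.items(), key=lambda item: item[1])


def co_occurrence(histories):
    """Count how often each pair of items appears in the same history.

    `histories` maps a user ID to the list of items that user viewed.
    """
    related = defaultdict(Counter)
    for items in histories.values():
        # De-duplicate so a repeated view does not inflate the counts.
        for a, b in combinations(sorted(set(items)), 2):
            related[a][b] += 1
            related[b][a] += 1
    return related


def recommend(related, item, k=3):
    """Recommend the k items most often seen alongside `item`."""
    return [other for other, _ in top_k(related[item], k)]
```

Raw co-occurrence counts favor globally popular items. Production recommenders usually normalize the counts or use a similarity measure to correct for that.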
AOL’s Data Analytics Infrastructure today (circa 2010)
Here’s a diagram of AOL’s data layer infrastructure today:
Elements that are included in this infrastructure that are not seen in the 2008 version include:
- Publishing Platforms
- Advertising Web Servers – in addition to web servers that deliver content
- Relegence – AOL’s in-house semantic content platform
- Pig – For writing data flow/transformation scripts (not shown in this diagram)
- 2 Cassandra databases – 1 for storing and serving real-time user information, and 1 for storing some type of clustering information
- Redis Key-value data store – not sure what its place in the architecture is
A very insightful talk, and a fascinating glimpse into how AOL is architecting a near real-time semantic content and ad-serving platform, as well as the data analytics platform that powers it.