Archive

Posts Tagged ‘Machine Learning’

Large-scale Machine Learning and Data Mining using Hadoop – Hadoop World 2010

March 11, 2011 Leave a comment

A couple of interesting videos on large-scale machine learning and data mining using Hadoop from Hadoop World 2010.

1 – Large-Scale Text Analytics at AOL

The first is a presentation on Text Analysis from AOL:

Slides for the presentation can be found here.

AOL’s high-level text analytics architecture – built on top of HDFS – is shown in the visual below:

Related presentations on AOL’s use of Hadoop for Content Analytics and Ad Targeting can be seen below:

The Text Analytics modules perform analysis that is then fed into important AOL applications. Two targeted advertising examples – shown below – are Location-Aware Contextual Advertising and User Aware Ad Targeting:

2 – Sentiment Analysis at GE

The second presentation is from GE on large-scale Sentiment Analysis using Hadoop:

glenn

Hyperlocal – Key Technologies

February 14, 2010 3 comments

This is the fourth in a series of posts on key dimensions of Hyperlocal. Other posts in this series are:

In this post we consider key enabling technologies that many of the hyperlocal platforms mentioned in previous posts will leverage.

Key Enabling Technologies

The initial post in this series identified the following key enabling technologies for Hyperlocal solutions:

  1. Identity and Personalization
  2. Social Media/Social Web
  3. Real-time Web
  4. Geolocation
  5. Search
  6. Mobile
  7. Machine Learning
  8. Structured Data/Semantic Web

Let’s explore each in turn.

*** Update January 5 2010 ***

It looks like ReadWriteWeb concurs with my identification of key enabling technologies for emerging web-based applications. See ReadWriteWeb’s Top 5 Web Trends of 2009. I think leaving out Geolocation is a fairly important omission on RWW’s part. I didn’t make reference to the Internet of Things in my list, but have referred to Web Meets World (another name for the same thing), and its impact on HyperLocal, in previous posts.
*** End of Update ***

Identity and Personalization

Identity is a key part of any online platform these days. Not only does Identity represent one’s online presence, but it’s the basis for relating to others in the context of one’s social graph.

Chris Messina has some great insights into the emergence of Identity as a platform – here’s video of his Identity is the Platform presentation from October 2009, and the slideshow accompanying his talk.

The two key players positioned to dominate the Identity Platform space are:

Identity forms the foundation by which to deliver and manage personalized content for a user. I’m not going to discuss Personalization strategies in detail here, but ReadWriteWeb has an excellent piece on the topic.

Social Media and Social Web

I’m not sure too much needs to be said here. Obviously, Social Media and Social Networks – or what’s often referred to as the Social Graph – are key features of the Web today. If you’re going to host and service a Community on your website, you won’t get very far if you don’t design your website for the social web.

Interestingly, the Identity Platforms mentioned in the previous section – OpenID and Facebook Connect – allow you to import the Social Graph from external platforms into your Community site. You may also want to promote your content on other sites on the Social Web – including Twitter and Facebook.

Another important concept to be aware of in the context of the Web and HyperLocal is that of the Social Object. The Social Object is any piece of Content or information that a community might potentially socialize around. So for example, Twitter posts, news articles, photos, business listings, videos, URLs, movies … all are potential social objects that a community might share and discuss.

Social Media is any form of publishing that facilitates social collaboration and sharing of information, content, and conversation. Social Networking sites, Blogs, Wikis, Microblogging platforms etc. all fall under this category.

The following are just a few of the more popular platforms on the social web:

It’s important to enable key forms of social behavior on your website, including sharing and bookmarking content, commenting, rating and reviewing, and so on. These are features that any social website should support, and that the key community platform players, such as Jive, Pluck, and Lithium, all support.

Real-time Web

With the viral adoption of Twitter, the real-time web has really taken off of late. To understand the state of the Real-time Web heading into 2010, see the following:

The Real-time Web can be viewed from a number of different angles. Three of these are:

Real-time Feeds/Streams

This is the core of the Real-time Web – the underlying real-time feed protocol. Please see:

Real-time Search

Here, see:

Real-time Geo, or Geo-streams

Here, see:

For more on real-time geo and geolocation trends, see the Geolocation section that follows.

Managing the Real-time Firehose of Information

With the Real-time Web, information bursts forth as a massive stream – or firehose – of information, which is then filtered or consumed according to one’s particular social filters and interests. It can be overwhelming at first, as Nova Spivak discusses here.

Geolocation

… This post is a work-in-progress. Please return later to view the completed post.

glenn

Algorithmic Journalism – a “deep trend”

January 3, 2010 Leave a comment

Thought I’d muse today about a topic I’m going to call Algorithmic Journalism. I’ve noticed a fair bit of discussion lately on the use of algorithms (typically machine-learning algorithms) to make sense of, understand the relevance of, aggregate, and distribute news.

First off, the use of machine-learning algorithms and collective intelligence to determine the relevance of search results and content is very commonplace today. Such algorithms form the basis of Google’s search, and are heavily used by Amazon, Netflix, etc. However, machine-learning in Newsrooms is another matter. And it’s the discussion of machine learning in the context of the News Media business whose waves are starting to wash up against the shorelines of my personal information space (i.e. Twitter and the real-time Web!)

Here are some of the articles/blog posts from the past few months that speak to this topic:

Note these articles were all written in the past few months. So the topic appears to be only recently breaking into the broader consciousness of the Journalism community.

I’d also point out that the evolution of Algorithmic Journalism is highly dependent on Semantic Web technologies. So look for the influence of the Semantic Web to continue to penetrate the Journalism industry.

Anyway, a topic to keep an eye on in 2010.

glenn

Google Goggles – Visual Search technology from Google

December 18, 2009 1 comment

Man, those folks at Google are innovating at breakneck speed. In yet another cool application of artificial intelligence technology, check out Google’s new Google Goggles visual search applications:

Amazing.

glenn

Collective Intelligence – Part 5: Extracting Intelligence from Tags

November 17, 2009 4 comments

This is the fifth of a series of posts on the topic of programming Collective Intelligence in web applications. This series of posts will draw heavily from Satnam Alag’s excellent book Collective Intelligence in Action.

These posts will present a conceptual overview of key strategies for programming CI, and will not delve into code examples. For that, I recommend picking up Alag’s book. You won’t be disappointed!

Click on the following links to access previous posts in this series:

Introduction

So far in this series of posts, we’ve been introduced to some basic algorithms in CI, looked at various forms of user interaction, and explored how term vectors and similarity matrices are used to calculate the similarity between users, between items, and between items and users. In this post, we’ll explore how to gather intelligence from tags.

Alag introduces the topic of gathering intelligence from tags as follows:

Users tagging items—adding keywords or phrases to items—is now ubiquitous on the web. This simple process of a user adding labels or tags to items, bookmarking items, sharing items, or simply viewing items provides a rich dataset that can translate into intelligence, for both the user and the items. This intelligence can be in the form of finding items related to the one tagged; connecting with other users who have similarly tagged items; or drawing the user to discover alternate tags that have been associated with an item of interest and through that finding other related items.

With that introduction, let’s begin.

Introduction to Tagging

Quoting Alag:

Tagging is the process of adding freeform text, either words or small phrases, to items. These keywords or tags can be attached to anything in your application—users, photos, articles, bookmarks, products, blog entries, podcasts, videos, and more.

[Previously] we looked at using term vectors to associate metadata with text. Each term or tag in the term vector represents a dimension. The collective set of terms or tags in your application defines the vocabulary for your application. When this same vocabulary is used to describe both the user and the items, we can compute the similarity of items with other items and the similarity of the item to the user’s metadata to find content that’s relevant to the user.

In this case, tags can be used to represent metadata. Using the context in which they appear and to whom they appear, they can serve as dynamic navigation links.

In essence, tags enable us to:

  1. Build a metadata model (term vector) for our users and items. The common terminology between users and items enables us to compute the similarity of an item to another item or to a user.
  2. Build dynamic navigation links in our application, for example, a tag cloud or hyperlinked phrases in the text displayed to the user.
  3. Use metadata to personalize and connect users with other users.
  4. Build a vocabulary for our application.
  5. Bookmark items, which can be shared with other users.

Content-based vs. Collaborative-based Metadata

Alag emphasizes the distinction between content-based and collaborative-based sources of metadata. Quoting Alag:

In the content-based approach, metadata associated with the item is developed by analyzing the item’s content. This is represented by a term vector, a set of tags with their relative weights. Similarly, metadata can be associated with the user by aggregating the metadata of all the items visited by the user within a window of time.

In the collaborative approach, user actions are used for deriving metadata. User tagging is an example of such an approach. Basically, the metadata associated with the item can be computed by computing the term vector from the tags—taking the relative frequency of the tags associated with the item and normalizing the counts.

When you think about metadata for a user and item using tags, think about a term vector with tags and their related weights.
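Alag implements all of this in Java in his book. Just to make the idea concrete, here’s a rough Python sketch (my own, not Alag’s code) of the collaborative approach described above – turning raw tag counts into a term vector with normalized weights. The tags and counts are made up:

```python
from collections import Counter

def tag_term_vector(tags):
    """Build a term vector from raw tags: count each tag, then normalize to unit length."""
    counts = Counter(tags)
    norm = sum(c * c for c in counts.values()) ** 0.5
    return {tag: count / norm for tag, count in counts.items()}

# Hypothetical tags that five users attached to the same article:
print(tag_term_vector(["python", "ml", "python", "tagging", "python"]))
# ≈ {'python': 0.905, 'ml': 0.302, 'tagging': 0.302}
```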

Categorizing Tags based on how they are generated

We can categorize tags based on who generated them. There are three main types of tags: professionally generated, user-generated, and machine-generated.

Professionally generated Tags

Again quoting Alag:

There are a number of applications that are content rich and provide different kinds of content—articles, videos, photos, blogs—to their users. Vertical-centric medical sites, news sites, topic-focused group sites, or any site that has a professional editor generating content are examples of such sites.

In these kinds of sites, the professional editors are typically domain experts, familiar with the content domain, and are usually paid for their services. The first type of tags we cover is tags generated by such domain experts, which we call professionally generated tags.

Tags that are generated by domain experts have the following characteristics:

  • They bring out the concepts related to the text.
  • They capture the associated semantic value, using words that may not be found in the text.
  • They can be authored to be displayed on the user interface.
  • They can provide a view that isn’t centered around just the content of interest, but provides a more global overview.
  • They can leverage synonyms—similar words.
  • They can be multi-term phrases.
  • The set of words used can be controlled, with a controlled vocabulary.

Professionally generated tags require a lot of manpower and can be expensive, especially if a large amount of new content is being generated, perhaps by the users. These characteristics can be challenging for an automated algorithm.

User-generated Tags

Back to Alag:

It’s now common to allow users to tag items. Tags generated by the users fall into the category of user-generated tags, and the process of adding tags to items is commonly known as tagging.

Tagging enables a user to associate freeform text to an item, in a way that makes sense to him, rather than using a fixed terminology that may have been developed by the content owner or created professionally.

[For example, consider the tagging process] at del.icio.us. Here, a user can associate any tag or keyword with a URL. The system displays a list of recommended and popular tags to guide the user.

Letting users create tags in your application is a great example of leveraging the collective power of your users. Items that are popular will tend to be frequently tagged. From an intelligence point of view, for a user, what matters most is which items people similar to the user are tagging.

User-generated tags have the following characteristics:

  • They use terms that are familiar to the user.
  • They bring out the concepts related to the text.
  • They capture the associated semantic value, using words that may not be found in the text.
  • They can be multi-term phrases.
  • They provide valuable collaborative information about the user and the item.
  • They may include a wide variety of terms that are close in meaning.

User-generated tags will need to be stemmed to take care of plurals and filtered for obscenity. Since tags are freeform, variants of the same tag may appear. For example, collective intelligence and collectiveintelligence may appear as two tags.

[Additionally,] you may want to offer recommended tags to the user based on the dictionary of tags created in your application and the first few characters typed by the user.
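As a quick sketch of that last idea (again mine, not Alag’s – the tag dictionary and usage counts are invented), recommending tags can be as simple as prefix-matching against the application’s tag dictionary and ranking by popularity:

```python
def suggest_tags(prefix, tag_counts, limit=5):
    """Recommend existing tags that start with what the user has typed so far."""
    prefix = prefix.lower()
    matches = [t for t in tag_counts if t.startswith(prefix)]
    # Rank candidates by how often each tag has been used in the application
    return sorted(matches, key=lambda t: tag_counts[t], reverse=True)[:limit]

print(suggest_tags("col", {"collective intelligence": 42, "collaboration": 17, "color": 3}))
# ['collective intelligence', 'collaboration', 'color']
```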

Machine-generated Tags

Tags or terms generated through an automated algorithm are known as machine-generated tags. Alag provides several examples in his book of extracting tags using an automated algorithm – for example, generating tags by analyzing the textual content of a document.

Again from Alag:

An algorithm generates tags by parsing through text and detecting terms and phrases.

Machine-generated tags have the following characteristics:

  • They use terms that are contained in the text, with the exception of injected synonyms.
  • They’re usually single terms—multi-term phrases are more difficult to extract and are usually detected using a set of predefined phrases. These predefined phrases can be built using either professional or user-generated tags.
  • They can generate a lot of noisy tags—tags that can have multiple meanings based on the context, including polysemy and homonyms. For example, the word gain can have a number of meanings—height gain, weight gain, stock price gain, capital gain, amplifier gain, and so on. Again, detecting multi-term phrases, which are a lot more specific than single terms, can help solve this problem.

In the absence of user-generated and professionally generated tags, machine-generated tags are the only alternative. This is especially true for analyzing user-generated content.
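To make this concrete, here’s a deliberately naive Python sketch of machine-generated tagging – just the most frequent non-stop-word terms in a piece of text. A real implementation would add stemming, phrase detection, and a proper stop-word list:

```python
import re
from collections import Counter

STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}  # tiny list, for illustration

def machine_tags(text, n=5):
    """Naive machine-generated tags: the most frequent non-stop-word terms."""
    terms = re.findall(r"[a-z]+", text.lower())
    counts = Counter(t for t in terms if t not in STOP_WORDS)
    return [term for term, _ in counts.most_common(n)]

print(machine_tags("Collective intelligence in action: the intelligence of the collective"))
# ['collective', 'intelligence', 'action']
```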

How to leverage Tags in your application

Alag leads off this section of his book with the following:

It’s useful to build metadata by analyzing the tags associated with an item and placed by a user. This metadata can then be used to find items and users of interest for the user. In addition to this, tagging can be useful to build dynamic navigation in your application, to target search, and to build folksonomies. In this section, we briefly review these three use cases.

I’m not going to explore the specific use cases that Alag covers in his book. Again, you know where to find the details. 🙂

Other topics

Alag concludes his chapter on extracting intelligence from tagging with:

  1. An example that illustrates the process of extracting intelligence from user tagging, and
  2. Thoughts on building a scalable persistence architecture for tagging

Exploring the tagging example and Alag’s thoughts on a persistence architecture for tagging is beyond the introductory scope of this post. Please see Alag’s book for more information.

In Summary

Hopefully this post has given you a bit of a flavor of how Tags are used to surface collective intelligence in a social web application. In the final post in this series, I’ll be exploring extracting intelligence from textual content.

Also in this series

Collective Intelligence – Part 4: Calculating Similarity

November 17, 2009 4 comments

This is the fourth of a series of posts on the topic of programming Collective Intelligence in web applications. This series of posts will draw heavily from Satnam Alag’s excellent book Collective Intelligence in Action.

These posts will present a conceptual overview of key strategies for programming CI, and will not delve into code examples. For that, I recommend picking up Alag’s book. You won’t be disappointed!

Click on the following links to access previous posts in this series:

Determining Similarity using a Similarity Matrix

The essential task in developing collective intelligence is determining similarity between things – between users and items, between different items, and between groups of users.

In Collective Intelligence, this typically involves computing similarities in the form of a similarity matrix (or similarity table), where each cell holds the similarity computed by comparing the Term Vectors of a pair of entities. Please refer to this previous post for a brief introduction to terms and term vectors.

In chapter 2 of his book, Alag calculates similarity tables using 3 basic approaches:

  1. Cosine-based similarity
  2. Correlation-based similarity
  3. Adjusted-cosine-based similarity

I’m not going to get into the specific differences between the different methods, but I will provide a general example (from Alag’s book) to illustrate the approach.

User similarity in rating Photos

The example Alag gives involves 3 different users rating 3 different photos. They express their rating of a photo as a number between 1 and 5. These ratings are displayed in the table below:

If we were to calculate a similarity matrix (using the cosine-based approach) comparing how similar the photos are to each other, we’d get the following table:

This table tells us that Photo1 and Photo2 are very similar. The closer to 1 a value in the similarity table is, the more similar the items are to each other.

You can use the same approach to calculate the similarity between users’ preferences for the photos. If we do the calculations, we get the following results:

Here we see that Jane and Doe are very similar.
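To give a flavor of the cosine-based calculation, here’s a small Python sketch. The rating values are hypothetical (Alag’s actual table has its own numbers), but they’re chosen so that Jane and Doe come out most similar:

```python
import math

def cosine(u, v):
    """Cosine similarity between two rating vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical ratings (1-5) of three photos by three users
ratings = {"John": [3, 4, 2], "Jane": [2, 2, 4], "Doe": [1, 3, 5]}

users = list(ratings)
for i, a in enumerate(users):
    for b in users[i + 1:]:
        print(a, b, round(cosine(ratings[a], ratings[b]), 3))
# John Jane 0.834 / John Doe 0.785 / Jane Doe 0.966
```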

In Alag’s book, he details the specific algorithm for calculating each of the above similarity tables, and shows the different results obtained using the 3 methods listed above (i.e. cosine-based, correlation-based, and adjusted cosine-based methods). He also provides examples based on user ratings for photos, as well as user rankings of articles based on which articles they bookmarked. However, the basic approach is the same as illustrated above.

In Summary

In this post we looked at the basic task of calculating similarity between items and users. In the next post, we’ll look at the specific scenario of extracting intelligence from tags.

Also in this series

Collective Intelligence – Part 3: Gathering Intelligence from User Interaction

November 17, 2009 4 comments

This is the third of a series of posts on the topic of programming Collective Intelligence in web applications. This series of posts will draw heavily from Satnam Alag’s excellent book Collective Intelligence in Action.

These posts will present a conceptual overview of key strategies for programming CI, and will not delve into code examples. For that, I recommend picking up Alag’s book. You won’t be disappointed!

Click on the following links to access previous posts in this series:

Introduction – Applying CI in your Application

Alag states that there are three things that need to happen to apply collective intelligence in your application.

You need to:

  1. Allow users to interact with your site and with each other, learning about each user through their interactions and contributions.
  2. Aggregate what you learn about your users and their contributions using some useful models.
  3. Leverage those models to recommend relevant content to your users.

This post will focus on the first of these steps: specifically the different forms of user interaction that capture the raw data used to derive collective intelligence in social web applications.

In Alag’s book, he provides persistence models for capturing this user interaction data. In this post, however, I will not be discussing the specific persistence models that model these user interactions. Please pick up a copy of Alag’s book if you are interested in the details of how the data collected from these user interactions are captured in underlying persistence models.

Gathering Intelligence from User Interaction

Quoting Alag:

To extract intelligence from a user’s interaction in your application, it isn’t enough to know what content the user looked at or visited. You need to quantify the quality of the interaction. A user may like the article or may dislike it, these being two extremes. What one needs is a quantification of how much the user liked the item relative to other items.

Remember, we’re trying to ascertain what kind of information is of interest to the user. The user may provide this directly by rating or voting for an article, or it may need to be derived, for example, by looking at the content the user has consumed. We can also learn about the item that the user is interacting with in the process.

In this section, we look at how users provide quantifiable information through their interactions. … Some of the interactions such as ratings and voting are explicit in the user’s intent, while other interactions, such as clicks, are noisy – the intent of the user isn’t perfectly known and is implicit.

Alag discusses 6 examples of user interaction from which collective intelligence data might be extracted. These are:

  1. Rating and Voting
  2. E-mailing or Forwarding a Link
  3. Bookmarking and Saving
  4. Purchasing Items
  5. Click-stream
  6. Reviews

I would generalize “e-mailing and forwarding a link” to “forwarding and sharing content”, of which “e-mailing and forwarding a link” is one variation.

This post will provide a very light treatment of some of the forms of user interaction from which collective intelligence is derived. As mentioned above, I will not be exploring the persistence models that capture the user data from these interactions.

So, first up, rating and voting.

Rating and Voting

Quoting Alag:

Asking the user to rate an item of interest is an explicit way of getting feedback on how well the user liked the item. The advantage with a user rating content is that the information provided is quantifiable and can be used directly.

Alag has a very nice section on the specific data and persistence models that underlie the rating and voting data captured from user interaction. Please refer to his book for this additional detail.

Forwarding and Sharing Content

Forwarding and sharing is another activity that can be considered a positive vote for an item. Alag briefly discusses a variation of this activity in the form of a user e-mailing or forwarding a link.

Bookmarking and Saving

A few quick comments from Alag:

Online bookmarking services such as del.icio.us allow users to store and retrieve URLs, also known as bookmarks. Users can discover interesting links that other users have bookmarked through recommendations, hot lists, and other such features. By bookmarking URLs, a user is explicitly expressing interest in the material associated with the bookmark. URLs that are commonly bookmarked bubble up higher in the site.

The process of saving an item or adding it to a list is similar to bookmarking and provides similar information.

Bookmarking and saving is another user interaction activity for which Alag explores the underlying persistence model.

Purchasing Items

In an e-commerce site, when users purchase items, they’re casting an explicit vote of confidence in the item – unless the item is returned after purchase, in which case it’s a negative vote. Recommendation engines, for example the one used by Amazon, can be built from analyzing the procurement history of users. Users that buy similar items can be correlated and items that have been bought by other users can be recommended to a user.

Click-stream

Quoting Alag:

So far we’ve looked at fairly explicit ways of determining whether a user liked or disliked a particular item, through ratings, voting, forwarding, and purchasing items. When a list of items is presented to a user, there’s a good chance that the user will click on one of them based on the title and description. But after quickly scanning the item, the user may find the item to be not relevant and may browse back or search for other items.

A simple way to quantify an article’s relevance is to record a positive vote for any item clicked. This approach is used by Google News to personalize the site. To further filter out the noise, such as items the user didn’t really like, you could look at the amount of time the user spent on the article. Of course, this isn’t foolproof. For example, the user could have left the room to get some coffee or been interrupted when looking at the article. But on average, simply looking at whether an item was visited and the time spent on it provides useful information that can be mined later.

You can also gather useful statistics from this data:

  • What is the average time a user spends on a particular item?
  • For a user, what is the average time spent on any given article?
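A quick sketch of how those two statistics might be computed from raw click records (the record format here is my own assumption, not Alag’s persistence model):

```python
from collections import defaultdict

# Hypothetical click-stream records: (user, item, seconds spent on the item)
clicks = [("u1", "article-9", 45), ("u2", "article-9", 80), ("u1", "article-3", 10)]

by_item, by_user = defaultdict(list), defaultdict(list)
for user, item, seconds in clicks:
    by_item[item].append(seconds)
    by_user[user].append(seconds)

avg_time_per_item = {i: sum(t) / len(t) for i, t in by_item.items()}  # article-9 -> 62.5
avg_time_per_user = {u: sum(t) / len(t) for u, t in by_user.items()}  # u1 -> 27.5
```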

Reviews

Web 2.0 is all about connecting people with similar people. This similarity may be based on similar tastes, positions, opinions, or geographic location. Tastes and opinions are often expressed through reviews and recommendations. These have the greatest impact on other users when:

  • They’re unbiased
  • The reviews are from similar users
  • They’re from a person of influence

Depending on the application, the information provided by a user may be available to the entire population of users, or may be privately available only to a select group of users.

Perhaps the biggest reasons why people review items and share their experiences are to be discovered by others and for boasting rights. Reviewers enjoy the recognition, and typically like the site and want to contribute to it. Most of them enjoy doing it. A number of applications highlight the contributions made by users, by having a Top Reviewers list. Reviews from top reviewers are also typically placed toward the top and featured more prominently. Sites may also feature one of their top reviewers on the site as an incentive to contribute.

Here again, Alag provides additional commentary around the persistence model underlying Reviews. See the book for details.

In Summary

In this post, we (very) briefly explored forms of user interaction that provide the raw data that applications use to derive collective intelligence and provide useful and relevant content to their users. In future posts in this series, we’ll explore how collective intelligence algorithms are used to aggregate this content, and provide useful insight and information to the users of a social web application.

Also in this series

Collective Intelligence – Part 2: Basic Algorithms

November 16, 2009 5 comments

This is the second of a series of posts on the topic of programming Collective Intelligence in web applications. This series of posts will draw heavily from Satnam Alag’s excellent book Collective Intelligence in Action.

These posts will present a conceptual overview of key strategies for programming CI, and will not delve into code examples. For that, I recommend picking up Alag’s book. You won’t be disappointed!

Click on the following links to access previous posts in this series:

Introduction

Quoting Alag (which I’ll be doing a lot of!):

In order to correlate users with content and with each other, we need a common language to compute relevance between items [or Social Objects], between users, and between users and items. Content-based relevance is anchored in the content itself, as is done by information retrieval systems. Collaborative-based relevance leverages the user interaction to discern meaningful relationships. Also, since a lot of content is in the form of unstructured text, it’s helpful to understand how metadata can be developed from unstructured text. In this section, we cover these three fundamental concepts of learning algorithms.

We begin by abstracting the various types of content, so that the concepts and algorithms can be applied to all of them.

Users and Items

Quoting Alag:

As shown in [the figure below], most applications generally consist of users and items. Items may be articles (both user-generated and professionally developed), videos, photos, blog entries, questions and answers posted on message boards, or products and services sold in your application. If your application is a social-networking application, or if you’re looking to connect one user with another, then a user is also a type of item.

Alag continues:

Associated with each item is metadata, which may be in the form of professionally-developed keywords, user-generated tags, keywords extracted by an algorithm after analyzing the text, ratings, popularity ranking, or just about anything that provides a higher level of information about the item and can be used to correlate items together.

When an item is a user, in most applications there’s no content associated with a user (unless your application has a text-based descriptive profile of the user). In this case, metadata for a user will consist of profile-based data and user-action based data.

There are three main sources of developing metadata for an item: (i) attribute-based, (ii) content-based, and (iii) user-action based. Alag discusses these next.

Attribute-based

Quoting Alag:

Metadata can be generated by looking at the attributes of the user or the item. The user attribute information is typically dependent on the nature of the domain of the application. It may contain information such as age, sex, geographical location, profession, annual income, or education level. Similarly, most nonuser items have attributes associated with them. For example, a product may have a price, the name of the author or manufacturer, the geographical location where it’s available, and so on.

Content-based

Metadata can be generated by analyzing the contents of a document. As we see in the following sections, there’s been a lot of work done in the area of information retrieval and text mining to extract metadata associated with unstructured text. The title, subtitles, keywords, frequency counts of words in a document and across all documents of interest, and other data provide useful information that can then be converted into metadata for that item.

User-action based

Metadata can be generated by analyzing the interactions of users with items. User interactions provide valuable insight into preferences and interests. Some of the interactions are fairly explicit in terms of their intentions, such as purchasing an item, contributing content, rating an item, or voting. Other interactions are a lot more difficult to discern, such as a user clicking on an article and the system determining whether the user liked that item or not. This interaction can be used to build metadata about the user and the item.

Alag advises thinking of users and items as having an associated vector of metadata attributes. The similarity or relevance between two users, two items, or a user and an item can then be measured by looking at the similarity between the two vectors.

Content-based Analysis and Collaborative Filtering

Alag explains that User-centric applications aim to make the application more valuable for users by applying CI to personalize the site. There are two basic approaches to personalization: content-based and collaboration-based.

Content-based Analysis

Again, quoting Alag:

Content-based approaches analyze the content to build a representation for the content. Terms and phrases (multiple terms in a row) appearing in the document are typically used to build this representation. Terms are converted into their basic form by a process known as stemming. Terms with their associated weights, commonly known as term vectors, then represent the metadata associated with the text. Similarity between two content items is measured by measuring the similarity between their term vectors.

A user’s profile can also be developed by analyzing the set of content the user interacted with. In this case, the user’s profile will have the same set of terms as the items, enabling you to compute the similarities between a user and an item. Content-based recommendation systems do a good job of finding related items, but they can’t predict the quality of the item – how popular an item is or how a user will like the items. This is where collaborative-based methods come in.

Collaborative Filtering

A collaborative-based approach aims to use the information provided by the interactions of users to predict items of interest to a user. For example, in a system where users rate items, a collaborative-based approach will find patterns in the way items have been rated by the user and other users to find additional items of interest for a user. This approach aims to match a user’s metadata to that of other similar users and recommend items liked by them. Items that are liked by or popular with a certain segment of your user population will appear often in their interaction history – viewed often, purchased often, and so forth. The frequency or occurrence of ratings provided by users is indicative of the quality of the item to the appropriate segment of your user population. Sites that use collaborative filtering include Amazon, Google, and Netflix.

Continuing:

There are two main approaches in collaborative filtering: memory-based and model-based. In memory-based systems, a similarity measure is used to find similar users and then make a prediction using a weighted average of the ratings of the similar users. This approach can have scalability issues and is sensitive to data sparseness. A model-based approach aims to build a model for prediction using a variety of approaches: linear algebra, probabilistic methods, neural networks, clustering, latent classes, and so on. They normally have fast runtime predicting abilities.
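Here’s a minimal sketch (mine, not Alag’s) of the memory-based flavor: predict a user’s rating for an item as a similarity-weighted average of other users’ ratings. The similarity function is assumed to be supplied – for example, a cosine measure like the one discussed in the similarity-matrix post:

```python
def predict_rating(target_user, item, ratings, similarity):
    """Memory-based collaborative filtering: weighted average of similar users' ratings.

    ratings: dict of user -> {item: rating}
    similarity: function (user_a, user_b) -> float
    """
    numerator = denominator = 0.0
    for other, their_ratings in ratings.items():
        if other == target_user or item not in their_ratings:
            continue
        sim = similarity(target_user, other)
        numerator += sim * their_ratings[item]
        denominator += abs(sim)
    return numerator / denominator if denominator else None
```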

Since a lot of information that we deal with is in the form of unstructured text, Alag proceeds to review basic concepts about how intelligence is extracted from unstructured text.

Representing Intelligence from Unstructured Text

Alag begins this section as follows:

This section deals with developing a representation for unstructured text by using the content of the text. Fortunately, we can leverage a lot of work that’s been done in the area of information retrieval. This section introduces you to terms and term vectors, used to represent metadata associated with text.

Continuing:

Let’s consider an example where the text being analyzed is the phrase “Collective Intelligence in Action.”

In its most basic form, a text document consists of terms—words that appear in the text. In our example, there are four terms: Collective, Intelligence, in, and Action. When terms are joined together, they form phrases. Collective Intelligence and Collective Intelligence in Action are two useful phrases in our document.

The Vector Space Model representation is one of the most commonly used methods for representing a document. A document is represented by a term vector, which consists of terms appearing in the document and a relative weight for each of the terms. The term vector is one representation of metadata associated with an item. The weight associated with each term is a product of two computations: term frequency and inverse document frequency.

Term frequency (TF) is a count of how often a term appears. Words that appear often may be more relevant to the topic of interest. Given a particular domain, some words appear more often than others. For example, in a set of books about Java, the word Java will appear often. We have to be more discriminating to find items that have these less-common terms: Spring, Hibernate, and Intelligence. This is the motivation behind inverse document frequency (IDF). IDF aims to boost terms that are less frequent.
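Alag doesn’t spell out the exact weighting in this excerpt, but the standard TF-IDF formulation (my addition) is the following, where N is the total number of documents and df_t is the number of documents containing term t:

```latex
w_{t,d} = \mathrm{tf}_{t,d} \times \log\frac{N}{\mathrm{df}_t}
```

Continuing with Alag: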

Commonly occurring terms such as a, the, and in don’t add much value in representing the document. These are commonly known as stop words and are removed from the term vector. Terms are also converted to lowercase. Further, words are stemmed—brought to their root form—to handle plurals. For example, toy and toys will be stemmed to toi. The position of words, for example whether they appear in the title, keywords, abstract, or the body, can also influence the relative weights of the terms used to represent the document. Further, synonyms may be used to inject terms into the representation.

To recap, here are the four steps Alag presents for analyzing text:

  1. Tokenization – Parse the text to generate terms. Sophisticated analyzers can also extract phrases from text.
  2. Normalize – Convert the terms into a normalized form, such as lowercase.
  3. Eliminate stop words – Eliminate terms that appear very often.
  4. Stemming – Convert the terms into their stemmed form to handle plurals.
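Here’s a minimal Python sketch of those four steps. The stop-word list is truncated, and the “stemmer” is a crude plural-stripper standing in for a real Porter stemmer:

```python
import re

STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}  # truncated, for illustration

def analyze(text):
    terms = re.findall(r"\w+", text)                    # 1. Tokenization
    terms = [t.lower() for t in terms]                  # 2. Normalize (lowercase)
    terms = [t for t in terms if t not in STOP_WORDS]   # 3. Eliminate stop words
    return [t[:-1] if t.endswith("s") else t for t in terms]  # 4. Crude stemming

print(analyze("Collective Intelligence in Action"))
# ['collective', 'intelligence', 'action']
```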

Computing Similarities

Quoting Alag:

So far we’ve looked at what a term vector is and have some basic knowledge of how they’re computed. Let’s next look at how to compute similarities between them. An item that’s very similar to another item will have a high value for the computed similarity metric. An item whose term vector has a high computed similarity to that of a user’s will be very relevant to a user—chances are that if we can build a term vector to capture the likes of a user, then the user will like items that have a similar term vector.

A term vector is a vector whose direction and magnitude are determined by the weights for each of the terms. The term vector has multiple dimensions—thousands to possibly millions, depending on your application.

Multidimensional vectors are difficult to visualize, but the principles used can be illustrated by using a two-dimensional vector, as shown below.

Alag, again:

Given a vector representation, we normalize the vector such that its length is of size 1 and compare vectors by computing the similarity between them. Chapter 8 develops the Java classes for doing this computation. For now, just think of vectors as a means to represent information with a well-developed math to compute similarities between them.
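In that spirit, here’s a bare-bones sketch of the two operations Alag describes – normalizing a term vector to unit length, then computing the similarity of two vectors as a dot product. This is my own illustration; Alag’s chapter 8 classes are in Java:

```python
import math

def normalize(term_vector):
    """Scale a term vector (term -> weight) so its length is 1."""
    length = math.sqrt(sum(w * w for w in term_vector.values()))
    return {term: w / length for term, w in term_vector.items()}

def similarity(v1, v2):
    """Dot product of two normalized term vectors, i.e. cosine similarity."""
    a, b = normalize(v1), normalize(v2)
    return sum(w * b.get(term, 0.0) for term, w in a.items())

print(similarity({"collective": 2, "intelligence": 1}, {"collective": 1, "action": 1}))  # ~0.632
```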

Types of Datasets

In this section of the book, Alag discusses the difference between densely- and sparsely-populated datasets. The difference?

  • A densely-populated dataset has more rows than columns, with a value for each cell. The classic example of a densely-populated dataset is a database table, where every record has an entry for every, or nearly every, field.
  • A sparsely-populated dataset is one where each row has entries for only a few of its columns. For example, an Amazon customer may potentially be associated with any book in Amazon’s inventory. In this example, each book in Amazon’s universe would potentially be a field in the customer’s record (or vector). However, a record representing the books that a customer had viewed or bought would only contain entries for a very few of these many books. Thus, the table that associated all Amazon users with potentially all of Amazon’s books would be a “sparse” dataset.
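In code, the usual trick is to store only the non-empty cells. A sketch with made-up identifiers:

```python
# One dict per customer, holding only the books they actually viewed or bought
purchases = {
    "customer-1": {"book-17": 1, "book-532": 1},
    "customer-2": {"book-532": 1, "book-9001": 2},
}

# Books a customer never touched simply have no entry, so the "table" stays small
book_532_buyers = [c for c, items in purchases.items() if "book-532" in items]
print(book_532_buyers)  # ['customer-1', 'customer-2']
```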

Well, that about wraps it up for this blog post. In the next blog post in this series, we’ll look at the many forms of user interaction in a social application, and how they are converted into collective intelligence.

glenn

Also in this series

Defining Requirements for Social Web Applications – Part 6: Collective Intelligence

November 15, 2009 6 comments

This is the 6th post in a series on Defining Requirements for Social Web Applications. As with previous posts in this series, the content is largely borrowed from Joshua Porter’s book Designing for the Social Web. Porter’s book is a gem, and if the topic of social web design is of interest to you, I highly recommend you pick up a copy.

This post also borrows significantly from Satnam Alag’s book Collective Intelligence in Action.

Click on the following links to access previous posts in this series:

Introduction

To my knowledge, the term Collective Intelligence was first coined – in the sense we mean it here – in Tim O’Reilly’s seminal paper What is Web 2.0, published in September 2005. In this paper, O’Reilly states the following:

The central principle behind the success of the giants born in the Web 1.0 era who have survived to lead the Web 2.0 era appears to be this, that they have embraced the power of the web to harness collective intelligence

I rather like Joshua Porter’s comments, which come close to capturing, IMO, the essence of Collective Intelligence. Porter states that Collective Intelligence is all about:

[Aggregating] the individual actions of many people in order to surface the best or most relevant content. … Collective Intelligence is based on the idea that by aggregating the behavior of many people, we can gain novel insights.

Satnam Alag in his excellent book Collective Intelligence in Action, comments that the Collective Intelligence of Users in essence is:

  • The intelligence that’s extracted out from the collective set of interactions and contributions made by your users.
  • The use of this intelligence to act as a filter for what’s valuable in your application for a user.

The common thread is “aggregated opinion”. Quoting Porter:

Digg and other aggregation systems rely on the fact that while no individual is right all the time, in the collective a large number of users can be amazingly accurate in their decisions and behavior. Amazon, Digg, Google, Netflix, and many other sites base their recommendations of products, news, sites, movies, etc. on aggregated opinion.

One result of Web 2.0-style applications that use Collective Intelligence is that, to quote Tim O’Reilly, the applications get better the more people use them.

The insights and patterns gleaned from Collective Intelligence are the product of algorithms of various degrees of sophistication. Alag lists the following ways to harness Collective Intelligence in your application:

  • Aggregate information lists
  • Ratings, reviews, and recommendations
  • User-generated content: blogs, wikis, message boards
  • Tagging, bookmarking, voting, saving
  • Tag Cloud navigation
  • Analyze content to build user profiles
  • Clustering and predictive models
  • Recommendation engines
  • Search
  • Harness external content – provide relevant information from the blogosphere and external sites

Alag comments that:

Web applications that leverage Collective Intelligence develop deeper relationships with their users, provide more value to users who return more often, and ultimately offer more targeted experiences for each user according to her personal need.

Amazon, Yelp, Netflix, Google Search, Google News, del.icio.us, and Digg are just some of the more popular sites that leverage Collective Intelligence to target relevant content to their users.

Applying Collective Intelligence in your application

Alag states that there are three things that need to happen to apply collective intelligence in your application.

You need to:

  1. Allow users to interact with your site and with each other, learning about each user through their interactions and contributions.
  2. Aggregate what you learn about your users and their contributions using some useful models.
  3. Leverage those models to recommend relevant content to your users.

Joshua Porter refers to these three steps as:

  1. Initial Action
  2. Display
  3. Feedback

He provides the following table to illustrate the different forms these three steps take at various popular social websites:


Let’s see what Joshua Porter has to say about these 3 steps.

Initial Action

The first step is for users to add content. Porter takes Digg as his case study.

On Digg, like on many social sites, you need an account to submit stories. Then, the process of submitting stories has two steps.

The first step is to enter the link you’re submitting. This is a normal URL. You also choose the type of content it is: a news story, image, or video. Digg helps people by providing a nice set of guidelines.

After you click “Continue” in step 1, Digg takes a moment to analyze the link to see if it’s a duplicate. This helps keep the system clean. When Digg thinks you’ve submitted duplicate content, it notifies you that the story has already been submitted.

Porter continues:

If the submission is not a duplicate, Digg analyzes the page and grabs any relevant content from it, including the page title, a description, and any images on the page. It then allows you to choose which elements are appropriate as part of your submission. This step makes it much easier to digg content, as you don’t have to do any heavy lifting of grabbing the content yourself.

Finally, Digg checks to make sure that the submitter of content is indeed a human being.

The initial action on Digg is a crucial step in the system. It determines what content is allowed, makes sure the content is unique, adds data that supports the story, and determines who can and cannot submit content. These decisions act as a barrier of entry to the system. The quality of the content that gains entry into the Digg system depends on the checks at this stage.

Adding Tags

Some services allow people to tag content, which allows aggregation of the content in additional, helpful ways. Porter uses the example of del.icio.us, which lets you add tags to bookmarks as you enter them into the system.

Aggregate Display

Quoting Porter:

The display of content is crucial to how people will interact with it. If content is displayed prominently then people will consider it more important. Content displayed less prominently will be considered less important.

In general, content is deemed more important when it is displayed:

  • On a home page. The home page is visited the most of any page, and therefore it garners the most attention from both site owners and readers.
  • More often. The more content is displayed and repeated, the more it is considered valuable.
  • At the top of a page. Just like on the front of a newspaper, above the fold is the prime real estate. The top of a web page is where the most important content is placed.
  • Higher in ranked displays. When content is ranked, such as in a “most emailed” list, the content at the top is deemed most valuable.

Porter continues:

When content first gets added to an adaptive system, it is usually displayed in an appropriately less prominent location. Digg, for example, has what they call an Upcoming page, which displays all new submissions into the system in reverse-chronological order. These freshly-submitted stories stay on the Upcoming page for a short period of time, getting pushed off in favor of even fresher content. The Upcoming page is crucial to the functioning of the Digg site because it forces each story to gain its own popularity.

All of these stories aspire to reach the Digg home page, the ultimate place for grabbing attention, where they will be seen by thousands of people in a very short period of time. In fact, the burst of attention resulting from being on the Digg homepage often makes the site unreachable. So many people visit the site from Digg that the web server is overwhelmed and either slows to a crawl or breaks outright.

Types of Aggregation Order

Porter goes on to list some of the more popular ways that applications built for collective intelligence display content to their users to ensure that it is relevant and compelling to their audience. These are:

  • Chronological order
  • Popularity within a time range
  • Participant ranking
  • Collaborative filtering – filtering content based on your preferences and the recommendations of others
  • Relevance
  • Social – displaying content based on who it’s from
  • User-based views – so the user can see their own content

Feedback

Types of Feedback

Finally, social applications that leverage Collective Intelligence are dependent on feedback to provide value. Porter highlights some different types of feedback: Implicit vs. Explicit, and Positive vs. Negative.

I’ll quote Porter’s comments on Implicit vs. Explicit feedback:

Typically, a combination of implicit and explicit feedback is used to create a picture of popularity. For example, Amazon’s bestseller list (based on implicit feedback) also shows ratings (based on explicit feedback).

Implicit feedback is based on user behavior that is captured while someone moves through a site. Examples include downloading, bookmarking, and purchases.

Explicit feedback comes from someone’s explicitly-declared preferences, including ratings, reviews, and comments. While this sort of feedback tends to be more accurate in reflecting user taste, it also requires more work from the user and so less data can be collected.
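One hypothetical way that combination might look in code – blending implicit counts with an average explicit rating, with weights you’d tune for your own application (this is my sketch, not Porter’s or Amazon’s formula):

```python
def popularity_score(purchases, bookmarks, ratings, w_implicit=1.0, w_explicit=2.0):
    """Blend implicit signals (counts) with explicit feedback (average rating)."""
    implicit = purchases + bookmarks
    explicit = sum(ratings) / len(ratings) if ratings else 0.0
    return w_implicit * implicit + w_explicit * explicit

# An item purchased 10 times, bookmarked 4 times, and rated [5, 4, 4]:
print(popularity_score(10, 4, [5, 4, 4]))  # 14.0 + 2 * 4.33 ≈ 22.67
```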

Make Feedback easy

Finally, Porter has a few words to say about the importance of making feedback an easy, simple task for the user.

In Summary

Wow, that was a decent-sized post as well. So that’s a brief journey into how some of the more popular sites on the web leverage collective intelligence to keep their users engaged, and deliver interesting and relevant content.

In the next post, we’ll look at one more chapter from Porter’s book, this one devoted to application functionality designed to make it easy to share content with your friends and the world.

glenn

Also in this series

Collective Intelligence – Part 1: Introduction

November 8, 2009 7 comments

This will be the first of a series of posts I plan to publish over the next few weeks discussing the topic of Collective Intelligence. This series of posts will draw heavily from Satnam Alag’s excellent book Collective Intelligence in Action.

These posts will present a conceptual overview of key strategies for programming CI, and will not delve into code examples. For that, I recommend picking up Alag’s book. You won’t be disappointed!

At a bare minimum, I’d like to discuss the following topics:

  1. Basic algorithms for applying Collective Intelligence
  2. Gathering Intelligence from User Interaction
  3. Calculating Similarity
  4. Extracting Intelligence from Tags
  5. Extracting Intelligence from Content

Time allowing, I’d also like to address some specific CI algorithms such as Content Filtering, Collaborative Filtering, Search, and Recommendation Engines.

This initial post will serve as an introduction to the world of Collective Intelligence (or CI). Future posts will focus more on programming Collective Intelligence in social web applications. These posts will be primarily of interest to web developers, but should also offer insight to anyone interested in how collective intelligence works in web applications.

What is Collective Intelligence?

To my knowledge, the term Collective Intelligence was first coined – in the sense we mean it here – in Tim O’Reilly’s seminal paper What is Web 2.0, published in September 2005. In this paper, O’Reilly states the following:

The central principle behind the success of the giants born in the Web 1.0 era who have survived to lead the Web 2.0 era appears to be this, that they have embraced the power of the web to harness collective intelligence

I rather like Joshua Porter’s comments, which come close to capturing, IMO, the essence of Collective Intelligence. In his book Designing for the Social Web, Porter states that Collective Intelligence is all about:

[Aggregating] the individual actions of many people in order to surface the best or most relevant content. … Collective Intelligence is based on the idea that by aggregating the behavior of many people, we can gain novel insights.

Satnam Alag in his excellent book Collective Intelligence in Action, comments that the Collective Intelligence of Users in essence is:

  • The intelligence that’s extracted out from the collective set of interactions and contributions made by your users.
  • The use of this intelligence to act as a filter for what’s valuable in your application for a user.

The common thread is “aggregated opinion”. Quoting Porter:

Digg and other aggregation systems rely on the fact that while no individual is right all the time, in the collective a large number of users can be amazingly accurate in their decisions and behavior. Amazon, Digg, Google, Netflix, and many other sites base their recommendations of products, news, sites, movies, etc. on aggregated opinion.

One result of Web 2.0-style applications that use Collective Intelligence is that, to quote Tim O’Reilly, the applications get better the more people use them.

What has changed about the Web to make CI so important?

Once again, I’ll defer to Satnam Alag to set the context for the increasing importance of CI:

Web applications are undergoing a revolution.

In this post-dot-com era, the web is transforming. Newer web applications trust their users, invite them to interact, connect them with others, gain early feedback from them, and then use the collected information to constantly improve the application. Web applications that take this approach develop deeper relationships with their users, provide more value to users who return more often, and ultimately offer more targeted experiences for each user according to her personal need.

Web users are undergoing a transformation.

Users are expressing themselves. This expression may be in the form of sharing their opinions on a product or a service through reviews or comments; through sharing and tagging content; through participation in an online community; or by contributing new content.
This increased user interaction and participation gives rise to data that can be converted into intelligence in your application. The use of collective intelligence to personalize a site for a user, to aid him in searching and making decisions, and to make the application more sticky are cherished goals that web applications try to fulfill.

In a nutshell, the Web has become social.

How is Collective Intelligence used in Social Web Applications?

The insights and patterns gleaned from Collective Intelligence are the product of algorithms of various degrees of sophistication. Alag lists the following ways to harness Collective Intelligence in your application:

  • Aggregate information lists
  • Ratings, reviews, and recommendations
  • User-generated content: blogs, wikis, message boards
  • Tagging, bookmarking, voting, saving
  • Tag Cloud navigation
  • Analyze content to build user profiles
  • Clustering and predictive models
  • Recommendation engines
  • Search
  • Harness external content – provide relevant information from the blogosphere and external sites

Alag comments that:

Web applications that leverage Collective Intelligence develop deeper relationships with their users, provide more value to users who return more often, and ultimately offer more targeted experiences for each user according to her personal need.

Amazon, Yelp, Netflix, Google Search, Google News, del.icio.us, and Digg are just some of the more popular sites that leverage Collective Intelligence to target relevant content to their users.

Applying Collective Intelligence in your application

Alag states that there are three things that need to happen to apply collective intelligence in your application.

You need to:

  1. Allow users to interact with your site and with each other, learning about each user through their interactions and contributions.
  2. Aggregate what you learn about your users and their contributions using some useful models.
  3. Leverage those models to recommend relevant content to your users.

Joshua Porter refers to these three steps as:

  1. Initial Action
  2. Display
  3. Feedback

He provides the following table to illustrate the different forms these three steps take at various popular social websites:


Why should I care about Collective Intelligence?

Harnessing collective intelligence is critical to web-based business strategies in the Web 2.0 world. In Tim O’Reilly’s seminal paper defining the core characteristics of a Web 2.0 application, Collective Intelligence is positioned as a critical element. Dion Hinchcliffe also views Collective Intelligence as a core pillar of Web 2.0-based business strategies, as illustrated by the slide below (from a presentation he gave at Web2.0 Expo 2009):


What are the key technologies underpinning Collective Intelligence?

The key technologies underpinning Collective Intelligence significantly derive from two streams of research: Information Retrieval and Machine Learning. I won’t delve into these research areas in this post, but will try to briefly explore these topics as they pertain to CI in later posts.

In Summary

Well, that should serve as enough of an appetizer to introduce the topic. Much more to come.

glenn

Also in this series