Collective Intelligence – Part 2: Basic Algorithms
This is the second of a series of posts on the topic of programming Collective Intelligence in web applications. This series of posts will draw heavily from Santam Alag’s excellent book Collective Intelligence in Action.
These posts will present a conceptual overview of key strategies for programming CI, and will not delve into code examples. For that, I recommend picking up Alag’s book. You won’t be disappointed!
Click on the following links to access previous posts in this series:
Quoting Alag (which I’ll be doing a lot of!):
In order to correlate users with content and with each other, we need a common language to compute relevance between items [or Social Objects], between users, and between users and items. Content-based relevance is achored in the content itself, as is done by information retrieval systems. Collaborative-based relevance leverages the user interaction to discern meaningful relationships. Also, since a lot of content is in the form of unstructured text, it’s helpful to understand how metadata can be developed from unstructured text. In this section, we cover these three fundamental concepts of learning algorithms.
We begin by abstracting the various types of content, so that the concepts and algorithms can be applied to all of them.
Users and Items
As shown in [the figure below], most applications generally consist of users and items. Items may be articles, both user-generated and professionally developed: videos, photos, blog entries, questions and answers posted on message boards, or products and services sold in your application. If your application is a social-networking application, or if you’re looking to connect one user with another, then a user is also a type of item.
Associated with each item is metadata, which may be in the form of professionally-developed keywords, user-generated tags, keywords extracted by an algorithm after analyzing the text, ratings, popularity ranking, or just about anything that provides a higher level of information about the item and can be used to correlate items together.
When an item is a user, in most applications there’s no content associated with a user (unless your application has a text-based descriptive profile of the user). In this case, metadata for a user will consist of profile-based data and user-action based data.
There are three main sources of developing metadata for an item: (i) attribute-based, (ii) content-based, and (iii) user-action based. Alag discusses these next.
Metadata can be generated by looking at the attributes of the user or the item. The user attribute information is typically dependent on the nature of the domain of the application. It may contain information such as age, sex, geographical location, profession, annual income, or education level. Similarily, most nonuser items have attributes associated with them. For example, a product may have a price, the name of the author or manufacturer, the geographical location where it’s available, and so on.
Metadata can be generated by analyzing the contents of a document. As we see in the following sections, there’s been a lot of work done in the area of information retrieval and text mining to extra metadata associated with unstructured text. The title, subtitles, keywords, frequency counts of words in a document and across all documents of interest, and other data provide useful information that can then be converted into metadata for that item.
Metadata can be generated by analyzing the interactions of users with items. User interactions provide valuable insight into preferences and interests. Some of the interactions are fairly explicit in terms of their intentions, such as purchasing and item, contributing content, rating an item, or voting. Other interactions are a lot more difficult to discern, such as a user clicking on an article and the system determining whether the user liked that item or not. This interaction can be used to build metadata about the user and the item.
Alag advises thinking about users and items having an associated vector of metadata attributes. The similarity or relevance between two users or two items or a user and item can be measured by looking at the similarity between the two vectors.
Content-based Analysis and Collaborative Filtering
Alag explains that User-centric applications aim to make the application more valuable for users by applying CI to personalize the site. There are two basic approaches to personalization: content-based and collaboration-based.
Again, quoting Alag:
Content-based approaches analyze the content to build a representation for the content. Terms and phrases (multiple terms in a row) appearing in the document are typically used to build this representation. Terms are converted into their basic form by a process known as stemming. Terms with their associated weights, commonly known as term vectors, then represent the metadata associated with the text. Similarity between two content items is measured by measuring the similiarity associated with their term vectors.
A user’s profile can also be developed by analyzing the set of content the user interacted with. In this case, the user’s profile will have the same set of terms as the items, enabling you to compute the similarities between a user and an item. Content-based recommendation systems do a good job of finding related items, but they can’t predict the quality of the item – how popular an item is or how a user will like the items. This is where collaborative-based methods come in.
A collaborative-based approach aims to use the information provided by the interactions of users to predict items of interest to a user. For example, in a system where users rate items, a collaborative-based approach will find patterns in the way items have been rated by the user and other users to find additional items of interest for a user. This approach aims to match a user’s metadata to that of other similar users and recommend items liked by them. Items that are liked by or popular with a certain segment of your user population will appear often in their interaction history – viewed often, purchased often, and so forth. The frequency or occurrence of ratings provided by users are indicative of the quality of the item or the appropriate segment of your user population. Sites that user collaborative filtering include Amazon, Google, and Netflix.
There are two main approaches in collaborative filtering: memory-based and model-based. In memory-based systems, a similarity measure is used to find similar users and then make a prediction using a weighted average of the ratings of the similar users. This approach can have scalability issues and is sensitive to data sparseness. A model-based approach aims to build a model for prediction using a variety of approaches: linear algebra, probabilistic methods, neural networks, clustering, latent classes, and so on. They normally have fast runtime predicting abilities.
Since a lot of information that we deal with is in the form of unstructured text, Alag proceeds to review basic concepts about how intelligence is extracted from unstructured text.
Representing Intelligence from Unstructured Text
Alag begins this section as follows:
This section deals with developing a representation for unstructured text by using the content of the text. Fortunately, we can leverage a lot of work that’s been done in the area of information retrieval. This section introduces you to terms and term vectors, used to represent metadata associated with text.
Let’s consider an example where the text being analyzed is the phrase “Collective Intelligence in Action.”
In its most basic form, a text document consists of terms—words that appear in the text. In our example, there are four terms: Collective, Intelligence, in, and Action. When terms are joined together, they form phrases. Collective Intelligence and Collective Intelligence in Action are two useful phrases in our document.
The Vector Space Model representation is one of the most commonly used methods for representing a document. A document is represented by a term vector, which consists of terms appearing in the document and a relative weight
for each of the terms. The term vector is one representation of metadata associated with an item. The weight associated with each term is a product of two computations: term frequency and inverse document frequency.
Term frequency (TF) is a count of how often a term appears. Words that appear often may be more relevant to the topic of interest. Given a particular domain, some words
appear more often than others. For example, in a set of books about Java, the word Java will appear often. We have to be more discriminating to find items that have these lesscommon terms: Spring, Hibernate, and Intelligence. This is the motivation behind inverse document frequency (IDF). IDF aims to boost terms that are less frequent.
Commonly occurring terms such as a, the, and in don’t add much value in representing the document. These are commonly known as stop words and are removed from the term vector. Terms are also converted to lowercase. Further, words are stemmed—brought to their root form—to handle plurals. For example, toy and toys will be stemmed to toi. The position of words, for example whether they appear in the title, keywords, abstract, or the body, can also influence the relative weights of the terms used to represent the document. Further, synonyms may be used to inject terms into the representation.
To recap, here are the four steps Alag presents for analyzing text:
- Tokenization – Parse the text to generate terms. Sophisticated analyzers can also extract phrases from text.
- Normalize – Convert them into a normalized form such as converting text into lower case.
- Eliminate stop words – Eliminate terms that appear very often.
- Stemming – Convert the terms into their stemmed form to handle plurals.
So far we’ve looked at what a term vector is and have some basic knowledge of how they’re computed. Let’s next look at how to compute similarities between them. An item that’s very similar to another item will have a high value for the computed similarity metric. An item whose term vector has a high computed similarity to that of a user’s will be very relevant to a user—chances are that if we can build a term vector to capture the likes of a user, then the user will like items that have a similar term vector.
A term vector is a vector where the direction is the magnitude of the weights for each of the terms. The term vector has multiple dimensions—thousands to possibly millions, depending on your application.
Multidimensional vectors are difficult to visualize, but the principles used can be illustrated by using a two-dimensional vector, as shown below.
Given a vector representation, we normalize the vector such that its length is of size 1 and compare vectors by computing the similarity between them. Chapter 8 develops the Java classes for doing this computation. For now, just think of vectors as a
means to represent information with a well-developed math to compute similarities between them.
Types of Datasets
In this section of the book, Alag discusses the difference between densely- and sparsely-populated datasets. The difference?
- A densely-populated dataset has more rows that columns, with a value for each cell. The classic example of a densely-populated dataset is a database table, where every record has an entry for every, or nearly-every field.
- A sparsely-populated dataset is a dataset where each row has very few entries per column. For example, an Amazon customer may potentially be associated with any book in Amazon’s inventory. In this example, each book in Amazon’s universe would potentially be a field in the customer’s record (or vector). However, a record representing the books that a customers had viewed or bought would only contain entries for a very few of these many books. Thus, the table that associated all Amazon users with potentially all of Amazon’s books would be a “sparse” dataset.
Well, that about wraps it up for this blog post. In the next blog post in this series, we’ll look at the many forms of user interaction in an social application, and how they are converted into collective intelligence.
Also in this series
- Collective Intelligence – Part 1: Introduction
- Collective Intelligence – Part 3: Gathering Intelligence from User Interaction
- Collective Intelligence – Part 4: Calculating Similarity
- Collective Intelligence – Part 5: Extracting Intelligence from Tags