Collective Intelligence – Part 4: Calculating Similarity
This is the fourth in a series of posts on programming Collective Intelligence in web applications. The series draws heavily from Satnam Alag’s excellent book Collective Intelligence in Action.
These posts will present a conceptual overview of key strategies for programming CI, and will not delve into code examples. For that, I recommend picking up Alag’s book. You won’t be disappointed!
Links to the other posts in this series are listed at the end of this post.
Determining Similarity using a Similarity Matrix
The essential task in developing collective intelligence is determining similarity between things – between users and items, between different items, and between groups of users.
In Collective Intelligence, this typically involves computing similarities in the form of a similarity matrix (or similarity table). Each entry in a similarity matrix compares two term vectors and records how similar they are, based on the values of their corresponding entries. Please refer to this previous post for a brief introduction to terms and term vectors.
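Although this series stays conceptual, a small sketch may help make the idea concrete. The Python below is not taken from Alag's book. It assumes term vectors are stored as dictionaries mapping a term to its weight, and the function names and sample vectors (`post_a`, `post_b`, `post_c`) are made up for illustration. It uses cosine similarity, the first of the approaches discussed in this post.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse term vectors (dicts of term -> weight)."""
    # The dot product only needs the terms that both vectors share.
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_a * norm_b)

def similarity_matrix(vectors):
    """Pairwise similarity table for a dict of name -> term vector."""
    names = sorted(vectors)
    return {x: {y: cosine_similarity(vectors[x], vectors[y]) for y in names}
            for x in names}

# Hypothetical term vectors, invented for this sketch.
docs = {
    "post_a": {"photo": 2.0, "rating": 1.0},
    "post_b": {"photo": 1.0, "rating": 0.5},
    "post_c": {"tag": 3.0},
}
table = similarity_matrix(docs)
```

Here `post_a` and `post_b` use the same terms in the same proportions, so their similarity is 1 (up to floating-point rounding), while `post_c` shares no terms with either and scores 0.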
In chapter 2 of his book, Alag calculates similarity tables using 3 basic approaches:
- Cosine-based similarity
- Correlation-based similarity
- Adjusted-cosine-based similarity
I’m not going to get into the specific differences between these methods, but I will provide a general example (from Alag’s book) to illustrate the approach.
User similarity in rating Photos
The example Alag gives involves 3 different users rating 3 different photos. Each user rates a photo with a number between 1 and 5. These ratings are displayed in the table below:
If we were to calculate a similarity matrix (using the cosine-based approach) comparing how similar the photos are to each other, we’d get the following table:
This table tells us that Photo1 and Photo2 are very similar. The closer a value in the similarity table is to 1, the more similar the two items are to each other.
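As a rough illustration of how such an item-to-item table could be computed, here is a minimal Python sketch. The ratings are illustrative values assumed for the sketch, the third user's name (John) is also an assumption, and the code applies plain cosine similarity to the raw ratings. Alag's exact procedure may differ in its details, so the numbers should not be read as the book's results.

```python
import math

# Illustrative ratings (1-5), assumed for this sketch.
# Rows are users, columns are photos.
ratings = {
    "John": {"Photo1": 3, "Photo2": 4, "Photo3": 2},
    "Jane": {"Photo1": 2, "Photo2": 2, "Photo3": 4},
    "Doe":  {"Photo1": 1, "Photo2": 3, "Photo3": 5},
}

def cosine(u, v):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norms = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norms if norms else 0.0

users = list(ratings)
photos = ["Photo1", "Photo2", "Photo3"]

# Each photo is represented by the column of ratings it received, one per user.
columns = {p: [ratings[u][p] for u in users] for p in photos}

# Item-to-item similarity table, rounded for readability.
item_similarity = {
    p: {q: round(cosine(columns[p], columns[q]), 3) for q in photos}
    for p in photos
}

for p in photos:
    print(p, item_similarity[p])
```

With these assumed ratings, Photo1 and Photo2 come out as the most similar pair (about 0.94), and every photo has a similarity of 1 with itself.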
You can use the same approach to calculate similarity between users’ preferences for the photos. If we do the calculations, we get the following results:
Here we see that Jane and Doe are very similar.
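The user-to-user case can be sketched the same way: instead of comparing the ratings each photo received, compare the ratings each user gave. The Python below is a minimal sketch with illustrative ratings assumed for the example (the user name John is also an assumption). It uses plain cosine similarity and ranks the user pairs from most to least similar.

```python
import math
from itertools import combinations

# Illustrative ratings, assumed for this sketch:
# each user's ratings for Photo1, Photo2, Photo3, in that order.
user_ratings = {
    "John": [3, 4, 2],
    "Jane": [2, 2, 4],
    "Doe":  [1, 3, 5],
}

def cosine(u, v):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norms = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norms if norms else 0.0

# Score every distinct pair of users, then sort from most to least similar.
pairs = sorted(
    ((cosine(user_ratings[a], user_ratings[b]), a, b)
     for a, b in combinations(user_ratings, 2)),
    reverse=True,
)

for score, a, b in pairs:
    print(f"{a} / {b}: {score:.3f}")
```

With these assumed values, Jane and Doe end up at the top of the ranking, since both rate the third photo highest.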
In his book, Alag details the specific algorithm for calculating each of the above similarity tables, and shows the different results obtained using the 3 methods listed above (i.e. the cosine-based, correlation-based, and adjusted-cosine-based methods). He also provides examples based on users’ ratings of photos, as well as on which articles users bookmarked. However, the basic approach is the same as illustrated above.
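To give a feel for how the 3 methods differ, here is a minimal Python sketch using their commonly cited definitions for item similarity. It is an assumption that these match the book's formulations exactly. Correlation-based similarity subtracts each item's own average rating before comparing. Adjusted-cosine similarity subtracts each user's average rating instead, to compensate for users who rate everything high or low. The ratings matrix and function names are made up for illustration.

```python
import math

# Illustrative ratings, assumed for this sketch.
# Rows are users, columns are Photo1, Photo2, Photo3.
ratings = [
    [3, 4, 2],
    [2, 2, 4],
    [1, 3, 5],
]

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norms = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norms if norms else 0.0

def column(matrix, j):
    return [row[j] for row in matrix]

def center(values):
    """Subtract the mean of the values from each value."""
    mean = sum(values) / len(values)
    return [x - mean for x in values]

def cosine_item_sim(matrix, i, j):
    # Plain cosine on the raw rating columns.
    return cosine(column(matrix, i), column(matrix, j))

def correlation_item_sim(matrix, i, j):
    # Pearson correlation: cosine after removing each item's own mean rating.
    return cosine(center(column(matrix, i)), center(column(matrix, j)))

def adjusted_cosine_item_sim(matrix, i, j):
    # Remove each user's mean rating (row mean) first, then compare columns.
    adjusted = [center(row) for row in matrix]
    return cosine(column(adjusted, i), column(adjusted, j))

for name, sim in [("cosine", cosine_item_sim),
                  ("correlation", correlation_item_sim),
                  ("adjusted cosine", adjusted_cosine_item_sim)]:
    print(f"{name}: {sim(ratings, 0, 1):.3f}")  # Photo1 vs Photo2
```

The three measures generally give different values for the same pair of items. Plain cosine on non-negative ratings stays between 0 and 1, while the two mean-centered variants can range from -1 to 1, so a low or negative score there signals disagreement rather than just a lack of overlap.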
In this post we looked at the basic task of calculating similarity between items and users. In the next post, we’ll look at the specific scenario of extracting intelligence from tags.
Also in this series
- Collective Intelligence – Part 1: Introduction
- Collective Intelligence – Part 2: Basic Algorithms
- Collective Intelligence – Part 3: Gathering Intelligence from User Interaction
- Collective Intelligence – Part 5: Extracting Intelligence from Tags