I previously wrote about the Netflix prize: come up with a better system to recommend movies based on a large amount of data, and win $1 million. Tim Spey has a wonderful article on the dataset and the competition (though I can’t see a couple of the graphs he talks about). It is clear the dataset is pretty strange:

- Customer 2170930 has rated 1963 titles and given each and every one a rating of one (very bad). You would think they would have cancelled their subscription by now.
- Five customers have rated over 10,000 of the 17,770 titles selected – and presumably they also have rated some of the others among the 60,000 or so titles Netflix had available when they released the ratings. Are these real people?
- Customer 305344 has rated 17654 titles. Even though Netflix makes it easy to rate titles that you have not rented from them (so they can get a handle on your preferences), can this be real?
- Customer 1664010 rated 5446 titles in a single day (October 12, 2005).
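Oddities like the ones listed above are the kind of thing a simple per-customer profile would surface. The following is only a sketch under stated assumptions: the `(customer, title, stars, date)` tuple layout, the function names, the thresholds, and the toy rows are all hypothetical and are not taken from the Netflix files or from the article.

```python
from collections import Counter, defaultdict


def profile_customers(ratings):
    """Summarise each customer; ratings holds (customer, title, stars, date) tuples."""
    by_customer = defaultdict(list)
    for customer, _title, stars, date in ratings:
        by_customer[customer].append((stars, date))

    profiles = {}
    for customer, rows in by_customer.items():
        per_day = Counter(date for _stars, date in rows)
        profiles[customer] = {
            "count": len(rows),                                   # titles rated
            "distinct_stars": len({stars for stars, _d in rows}),  # 1 means a constant rater
            "busiest_day": max(per_day.values()),                 # most ratings on one date
        }
    return profiles


def flag_outliers(profiles, max_count, max_per_day, min_for_constant):
    """Return {customer: [reasons]} for profiles that cross the (hypothetical) thresholds."""
    flagged = {}
    for customer, p in profiles.items():
        reasons = []
        if p["count"] > max_count:
            reasons.append("very many ratings")
        if p["busiest_day"] > max_per_day:
            reasons.append("single-day burst")
        if p["distinct_stars"] == 1 and p["count"] >= min_for_constant:
            reasons.append("constant rating")
        if reasons:
            flagged[customer] = reasons
    return flagged


# Hypothetical toy rows, only to exercise the functions.
toy_ratings = [
    ("u1", "T1", 1, "2005-10-12"),
    ("u1", "T2", 1, "2005-10-12"),
    ("u1", "T3", 1, "2005-10-13"),
    ("u2", "T1", 4, "2005-10-12"),
    ("u2", "T2", 2, "2005-11-01"),
]
profiles = profile_customers(toy_ratings)
flagged = flag_outliers(profiles, max_count=10, max_per_day=1, min_for_constant=3)
```

On real data the thresholds would need tuning; values such as 10,000 ratings in total or a few thousand in one day would catch the customers described in the list.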

The main point of the entry, however, is that it is unclear whether these sorts of recommender systems can be useful in predicting consumer preferences. One striking point made is that a naive algorithm (predict the average value so far) is not that much worse than the Netflix system:

A simple algorithm that uses the average rating for each title as the prediction – “let’s see, the average rating for the 104,000 customers who rated Mean Girls was 3.514, so I predict you will give it a rating of 3.514” – gets an RMSE [Root Mean Squared Error] of 1.0540. Netflix’s Cinematch algorithm has an RMSE of 0.9525. Netflix set the prize target at a 10% improvement over that, which is an RMSE of 0.8563. So the range that recommendation systems can realistically cover – from naively simple to cutting-edge research – seems to be [a] narrow band.
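To make the naive baseline concrete, here is a minimal sketch of a per-title average predictor scored by RMSE. The function names, the `(customer, title, stars)` tuple layout, and the toy ratings are assumptions made for illustration; only the three RMSE figures at the end come from the quote.

```python
import math
from collections import defaultdict


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def title_means(ratings):
    """Map each title to its mean rating; ratings holds (customer, title, stars)."""
    totals = defaultdict(lambda: [0.0, 0])  # title -> [sum of stars, count]
    for _customer, title, stars in ratings:
        totals[title][0] += stars
        totals[title][1] += 1
    return {title: total / count for title, (total, count) in totals.items()}


def relative_improvement(old_rmse, new_rmse):
    """Fractional reduction in RMSE when moving from old_rmse to new_rmse."""
    return (old_rmse - new_rmse) / old_rmse


# Hypothetical toy data, not drawn from the Netflix dataset.
train = [("a", "T1", 4), ("b", "T1", 2), ("c", "T1", 3),
         ("a", "T2", 5), ("b", "T2", 5)]
held_out = [("d", "T1", 4), ("d", "T2", 3)]

means = title_means(train)
predictions = [means[title] for _c, title, _s in held_out]
actual = [stars for _c, _t, stars in held_out]
baseline_rmse = rmse(predictions, actual)

# RMSE figures quoted in the article: title average, Cinematch, prize target.
naive_to_cinematch = relative_improvement(1.0540, 0.9525)
naive_to_target = relative_improvement(1.0540, 0.8563)
```

With the quoted figures, the step from the title-average baseline to Cinematch is a reduction of a little under 10%, and the step from the baseline all the way to the prize target is a little under 19%, which is the narrow band the quote describes.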

To put that in perspective, here is the effect that sort of decrease in error has:

Anyway, if the errors followed a normal distribution (which they don’t, but we’re talking back-of-envelope here) then if a customer actually rated a title as 2 (poor), an algorithm with an RMSE of 1.0 would predict somewhere between 1 and 3 about 70% of the time. Not bad, but not startling. If the algorithm gave ten recommended movies, then it would get on average seven out of ten within one unit of the customer’s actual rating. Meanwhile, the RMSE=0.8563 algorithm would get 7.6 out of ten. While this is an improvement, and while it may be a remarkable technical accomplishment, it does not seem to be exactly a revolutionary leap compared to the really simple algorithms as far as customers go.
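The back-of-envelope numbers in that passage can be reproduced in a few lines. This sketch takes the quote's own simplification at face value, namely that prediction errors are normally distributed, and additionally assumes they have zero mean, so the RMSE plays the role of the standard deviation. The function name `fraction_within` is made up for this example.

```python
import math


def fraction_within(tolerance, rmse):
    """P(|error| <= tolerance) for zero-mean normal errors with standard deviation rmse."""
    return math.erf(tolerance / (rmse * math.sqrt(2)))


# Expected number of "hits" (within one star) in a list of ten recommendations.
for level in (1.0540, 1.0, 0.9525, 0.8563):
    hits = 10 * fraction_within(1.0, level)
    print(f"RMSE {level:.4f}: about {hits:.1f} of 10 within one star")
```

Under these assumptions an RMSE of 1.0 gives roughly 6.8 hits out of ten (the quote rounds this to seven) and an RMSE of 0.8563 gives roughly 7.6, in line with the figures above.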

In short, would a customer even notice the difference? He concludes:

I’m no futurist, but I see little evidence from the first 300 days of the Netflix Prize that recommender systems are the magic ingredient that will reveal the wisdom of crowds.

This is an excellent blog entry that really goes to the heart of the value (or lack thereof) in these sorts of models.

## { 1 } Comments

To prevent certain inferences being drawn about the Netflix customer base, some of the rating data for some customers in the training and qualifying sets have been deliberately perturbed in one or more of the following ways: deleting ratings; inserting alternative ratings and dates; and modifying rating dates.

## { 5 } Trackbacks

[…] Michael Trick pointed me to this post about the Netflix contest. They’re offering a million dollars to anybody who can improve their algorithm to recommend movies by 10%. The part that really interested me is this (RMSE is Root Mean Square Error, kind of like a standard deviation). A perfect algorithm would predict exactly what rating every user would give to every title and would have an RMSE of zero. A random set of predictions has an RMSE of 1.95. But the actual range of action is much narrower than this 1.95 range. A simple algorithm that uses the average rating for each title as the prediction – “let’s see, the average rating for the 104,000 customers who rated Mean Girls was 3.514, so I predict you will give it a rating of 3.514” – gets an RMSE of 1.0540. Netflix’s Cinematch algorithm has an RMSE of 0.9525. Netflix set the prize target at a 10% improvement over that, which is an RMSE of 0.8563. So the range that recommendation systems can realistically cover – from naively simple to cutting-edge research – seems to be the narrow band between the middle three lines in the following diagram. […]

[…] of the fmwaves blog pointed out another contest to determine a recommender system. Unlike the Netflix contest, which might not end up with a winner, the MyStrands contest appears to guarantee $100,000, and all […]

[…] goal is to better predict customer movie ratings, and win a million dollars in doing so. I have written before about the lessons to be learned from this particular challenge (in short: while it might be a […]

[…] I have been somewhat skeptical of the Netflix Prize (in short: it seems to be showing how little information is in the data, […]

[…] of different approaches can work better than any individual approach, and also in showing how difficult it is to find economically significant models for some types of recommendation systems. I, along with many, had looked forward to the new insights (and possible visibility for […]