I previously wrote about the Netflix Prize: come up with a better system to recommend movies based on a large amount of data, and win $1 million. Tim Spey has a wonderful article on the dataset and the competition (though I can’t see a couple of the graphs he talks about). It is clear that the dataset is pretty strange:
- Customer 2170930 has rated 1,963 titles and given each and every one a rating of one (very bad). You would think they would have cancelled their subscription by now.
- Five customers have rated over 10,000 of the 17,770 titles in the dataset – and presumably they have also rated some of the roughly 60,000 other titles Netflix had available when it released the ratings. Are these real people?
- Customer 305344 has rated 17,654 titles. Even though Netflix makes it easy to rate titles you have not rented (so it can get a handle on your preferences), can this be real?
- Customer 1664010 rated 5,446 titles in a single day (October 12, 2005).
The main point of the entry, however, is that it is unclear whether these sorts of recommender systems are actually useful in predicting consumer preferences. One striking point is that a naive algorithm (predict each title’s average rating so far) is not that much worse than the Netflix system:
A simple algorithm that uses the average rating for each title as the prediction – “let’s see, the average rating for the 104,000 customers who rated Mean Girls was 3.514, so I predict you will give it a rating of 3.514” – gets an RMSE [Root Mean Squared Error] of 1.0540. Netflix’s Cinematch algorithm has an RMSE of 0.9525. Netflix set the prize target at a 10% improvement over that, which is an RMSE of 0.8563. So the range that recommendation systems can realistically cover – from naively simple to cutting-edge research – seems to be [a] narrow band.
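For the curious, here is a small sketch of that naive baseline. The customers, titles and ratings below are made up purely for illustration, and the real evaluation is of course run over the full Netflix qualifying set:

```python
import math
from collections import defaultdict

# Hypothetical (customer_id, title, rating) triples standing in for the real data.
train = [(1, "Mean Girls", 4), (2, "Mean Girls", 3), (3, "Mean Girls", 4),
         (1, "Titanic", 5), (2, "Titanic", 2)]
test = [(3, "Titanic", 4), (4, "Mean Girls", 3)]

# Naive baseline: predict each title's average rating in the training data.
sums = defaultdict(float)
counts = defaultdict(int)
for _, title, rating in train:
    sums[title] += rating
    counts[title] += 1
title_avg = {title: sums[title] / counts[title] for title in sums}

# RMSE: square root of the mean squared difference between prediction and actual rating.
squared_errors = [(title_avg[title] - rating) ** 2 for _, title, rating in test]
rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
print(f"Naive per-title-average baseline RMSE on this toy set: {rmse:.4f}")
```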
To put that in perspective, here is the effect such a decrease in error has:
Anyway, if the errors followed a normal distribution (which they don’t, but we’re talking back-of-envelope here), then if a customer actually rated a title as 2 (poor), an algorithm with an RMSE of 1.0 would predict somewhere between 1 and 3 about 70% of the time. Not bad, but not startling. If the algorithm recommended ten movies, it would on average get seven out of ten within one unit of the customer’s actual rating. Meanwhile, the RMSE = 0.8563 algorithm would get 7.6 out of ten. That is an improvement, and it may be a remarkable technical accomplishment, but as far as customers are concerned it does not seem to be a revolutionary leap over the really simple algorithms.
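Those figures are easy to reproduce. Here is the same back-of-envelope calculation done explicitly, under the (admittedly rough) assumption that prediction errors are normally distributed with standard deviation equal to the RMSE:

```python
from math import erf, sqrt

def p_within_one_unit(rmse):
    """P(|error| <= 1) for a zero-mean normal error with standard deviation = RMSE."""
    z = 1.0 / rmse
    # Standard normal CDF via the error function: Phi(z) = (1 + erf(z / sqrt(2))) / 2
    phi = (1 + erf(z / sqrt(2))) / 2
    return 2 * phi - 1

for rmse in (1.0, 0.9525, 0.8563):
    print(f"RMSE {rmse:.4f}: prediction lands within one unit of the actual rating "
          f"about {10 * p_within_one_unit(rmse):.1f} times out of 10")
```

Running this gives roughly 6.8, 7.1 and 7.6 out of ten for RMSEs of 1.0, 0.9525 and 0.8563 respectively, matching the numbers above.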
In short, would a customer even notice the difference? He concludes:
I’m no futurist, but I see little evidence from the first 300 days of the Netflix Prize that recommender systems are the magic ingredient that will reveal the wisdom of crowds.
This is an excellent blog entry that really goes to the heart of the value (or lack thereof) in these sorts of models.
As an aside, some of the oddities above may even be deliberate. The Netflix Prize documentation notes:

To prevent certain inferences being drawn about the Netflix customer base, some of the rating data for some customers in the training and qualifying sets have been deliberately perturbed in one or more of the following ways: deleting ratings; inserting alternative ratings and dates; and modifying rating dates.