It looks like the Netflix people made some good decisions when they designed their million-dollar challenge. In particular, it appears that they kept two verification test sets: one that was the basis for the public standings and one whose results no one ever saw. It is success on the latter set that determines the winner. So BellKor, which appeared to come in second based on the “public” verification test set, seems poised to be the winner based on the hidden test set. I put “public” in quotes because the test set itself was not visible; only the result of each submission on that set was visible, as an aggregate statistic.
Why is this a good design? Any data set that gets as much of a workout as the public test set does is vulnerable to techniques that try to fit the model to that particular test set. In fact, there was discussion at the Netflix competition forum that some team or teams were doing exactly that: generating hundreds of slightly different predictions in an attempt to better fit the verification set. Such an approach, however, is counterproductive when it comes to working with a new, hidden data set. Any team that overfits the first verification test set is likely to do worse on the second set than its public score suggests.
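To make the overfitting argument concrete, here is a toy simulation of my own. It is not Netflix's scoring code or any team's method. It assumes every submission is equally accurate, with Gaussian errors of the same size on both sets, so any difference in public score is pure chance. The function names, set sizes, and noise level are all hypothetical choices made for illustration.

```python
import math
import random


def rmse(errors):
    """Root mean squared error of a list of prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def simulate_leaderboard_chasing(n_candidates=100, n_public=100, n_hidden=100,
                                 noise=1.0, seed=0):
    """Pick the submission with the best public score; return both its scores.

    Assumption: every candidate is equally accurate (same noise level),
    so the ranking on the public set reflects luck, not skill.
    """
    rng = random.Random(seed)
    best = None
    for _ in range(n_candidates):
        # Each submission has independent errors on the two disjoint sets.
        public_errors = [rng.gauss(0.0, noise) for _ in range(n_public)]
        hidden_errors = [rng.gauss(0.0, noise) for _ in range(n_hidden)]
        scores = (rmse(public_errors), rmse(hidden_errors))
        if best is None or scores[0] < best[0]:
            best = scores
    return best  # (public RMSE, hidden RMSE) of the chosen submission


def average_gap(n_trials=20):
    """Mean of (hidden RMSE - public RMSE) over independent trials."""
    gaps = []
    for seed in range(n_trials):
        public_score, hidden_score = simulate_leaderboard_chasing(seed=seed)
        gaps.append(hidden_score - public_score)
    return sum(gaps) / len(gaps)


if __name__ == "__main__":
    public_score, hidden_score = simulate_leaderboard_chasing()
    print(f"chosen submission: public RMSE {public_score:.3f}, "
          f"hidden RMSE {hidden_score:.3f}")
    print(f"average gap over 20 trials: {average_gap():.3f}")
```

The submission chosen for its public score looks better than its true error level on the public set, and that advantage does not carry over to the hidden set. A real competition differs in many ways (submissions are not equally good, and the sets are far larger), but the selection effect is the same in kind.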
So we’ll wait for official word, but it seems as though Netflix has a very nicely designed evaluation system.