Journalists Should Be Required to Pass an Exam on Conditional Probability

There is nothing more grating than having a journalist toss around numbers showing no understanding of conditional probability (actually, there are 12 more grating things, but this ranks right up there).  In a nice story from NBC Chicago, journalists Dick Johnson and Andrew Greiner write about an autistic teen who has picked the first two rounds of the NCAA tournament correctly:

An autistic teenager from the Chicago area has done something almost impossible.

Nearly 48 games into an upset-filled NCAA tournament, 17-year-old Alex Herrmann is perfect.

“It’s amazing,” he says. Truly.

Yes it is amazing. But the writers get tripped up when trying to project the future:

There are still four rounds remaining, so it could fall apart — the odds of a perfect wire to wire bracket is about 1 in 35,360,000 by some measures or 1 in 1,000,000,000,000 by others.

Aaargh! Let’s let pass the factor of 28,000 or so difference in estimates. THIS IS NOT THE RELEVANT STATISTIC! We already know that Alex has picked the first two rounds correctly. We are interested in the probability he has a perfect bracket given he picked the first 48 games correctly. This is about the simplest version of conditional probability you can get.

If all he did was flip a coin for each of the remaining 15 games, he would have a one in 32,768 chance of having a perfect bracket, given where he is now. Not great odds, certainly, but nothing like the probabilities given in the quote. You can argue whether 50/50 on each of the fifteen remaining games is the right probability to use (Purdue as champion?), but there is absolutely no justification for bringing in the overall probability of a perfect bracket.  By quoting the unconditional probability (and who knows where those estimates come from), the writers vastly underestimate Alex’s chance of having a perfect bracket.
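
To make the distinction concrete, here is a minimal sketch of the two calculations, assuming (purely for illustration) a 50/50 coin flip on every game; the real per-game probabilities are of course not 50/50, but that does not change the point about conditioning:

    # Minimal sketch: unconditional vs. conditional probability of a perfect
    # bracket, assuming (for illustration only) a 50/50 chance on every game.
    games_total = 63                 # games in a 64-team bracket
    games_correct_so_far = 48        # games Alex has already picked correctly
    games_left = games_total - games_correct_so_far     # 15 games remain

    p_unconditional = 0.5 ** games_total    # perfect bracket, judged before the tournament
    p_conditional = 0.5 ** games_left       # perfect bracket, given 48 correct picks

    print(f"Unconditional: 1 in {1 / p_unconditional:,.0f}")   # 1 in 9,223,372,036,854,775,808
    print(f"Conditional:   1 in {1 / p_conditional:,.0f}")     # 1 in 32,768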

I swear I see the confusion between unconditional probabilities and conditional probabilities twice a day. I suspect the stroke that will finally get me will be caused by this sort of error.

Edit.  9:06PM March 23. From the comments on the original post, two further points:

  1. The writers also seem confused about joint probabilities. They write:

    One in 13,460,000, according to BookofOdds.com. It’s easier to win the lottery. Twice.

    No… not unless your lottery has a probability of winning of one in the square root of 13,460,000, or about one in 3,669. While there are “lotteries” with such odds, the payoffs tend to be 1000 to 1, not millions to 1. I bet they thought winning one lottery might be 1 in 7,000,000, so two lotteries “must be” 1 in 14,000,000. That is not the way it works: probabilities of independent events multiply, so two wins of a 1-in-7,000,000 lottery would be a 1 in 49,000,000,000,000 event (see the small calculation after this list).

  2. It appears that if you manage a pool on cbs.sportsline.com, you can edit the picks after the games. That might be a more reasonable explanation for picking 48 games, but it is hard to tell.
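
For the record, here is the lottery arithmetic from the first point as a small sketch; the 1-in-7,000,000 lottery is a made-up figure used only to show that probabilities of independent events multiply:

    import math

    bracket_odds = 13_460_000      # the figure quoted from BookofOdds.com

    # "Easier to win the lottery. Twice." only works if the lottery's
    # single-ticket odds are about 1 in sqrt(13,460,000), since the
    # probabilities of two independent wins multiply.
    print(f"Required lottery odds: 1 in {math.sqrt(bracket_odds):,.0f}")   # about 1 in 3,669

    # A hypothetical 1-in-7,000,000 lottery won twice is a far rarer event:
    print(f"Two wins: 1 in {7_000_000 ** 2:,}")    # 1 in 49,000,000,000,000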

So, to enumerate what journalists should be tested on, let’s go with:

  1. Conditional Probability
  2. Joint Probabilities
  3. Online editing possibilities

You are welcome to add to the certification test requirements in the comments.

Update on LRMC after first round

Sokol and his team’s Logistic Regression/Markov Chain approach had a pretty good first round in the NCAA tournament.  It correctly picked 24 of the 32 games.  On the plus side, it picked the following upsets (NCAA seeds in parens):

Northern Iowa (9) over UNLV (8), Georgia Tech (10) over Oklahoma State (7), Murray State (13) over Vanderbilt (4), Old Dominion (11) over Notre Dame (6), St. Mary’s (10) over Richmond (7)

It incorrectly predicted these upsets:

San Diego State (11) over Tennessee (6), Florida State (9) over Gonzaga (8), Utah State (12) over Purdue (4)

It missed these upsets:

Ohio (14) over Georgetown (3), Missouri (10) over Clemson (7), Washington (11) over Marquette (6), Cornell (12) over Temple (5), Wake Forest (9) over Texas (8)

Overall, favorites (higher seeds) won only 22 of the 32 first-round games, so in that sense LRMC is doing better than the Selection Committee’s rankings.  3 of LRMC’s “Sweet Sixteen” have been eliminated, but its picks are still intact from the round of eight on.

March Madness and Operations Research, 2010 Edition

Normally I do a long post on operations research and predicting the NCAA tournament.  I did so in 2009, 2008, 2007, and even in 2006 (when I think I made blog entries with an IBM Selectric typewriter).   This year, I will cede the ground to Laura McLay of Punk Rock Operations Research, who has a very nice series of OR-related entries on the NCAA tournament (like this post and the ones just previous to it).  I’d rather read her work than write my own posts.

That said, perhaps just a comment or two on Joel Sokol and his team’s LRMC picks, as covered in the Atlanta Business Chronicle.  Joel’s ranking (LRMC stands for Logistic Regression/Markov Chain) can be used to predict winners.  They have a page for their tournament picks.  Some notable predictions:

1) Their final 4 consists of 3 number 1’s (Kansas, Syracuse, and Duke) and one number 2 (West Virginia).  The remaining number 1 is Kentucky, ranked only number 9 overall by LRMC.

2)  Kansas beating Duke is the predicted final game.

3) 7th seeded BYU is ranked 4th overall, so it makes the Elite Eight before being knocked off by Syracuse (3).

4) 12th seeded Utah State is predicted to beat 5th seeded Texas A&M and 4th seeded Purdue.

5) 13th seeded Murray State over 4th seeded Vanderbilt is the biggest predicted first round upset.

Let’s see how things turn out.
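
For readers wondering what a Markov-chain ranking looks like in miniature, here is a toy sketch. It is emphatically not Sokol’s LRMC (which uses a logistic regression on game results to set the transition probabilities); it is just a plain win/loss random walk over a handful of made-up games, ranking teams by the chain’s stationary distribution:

    import numpy as np

    # Toy Markov-chain ranking sketch (NOT the actual LRMC): each game sends
    # the loser's "vote" to the winner, and teams are ranked by the stationary
    # distribution of the resulting random walk.  The games are made up.
    teams = ["A", "B", "C", "D"]
    games = [("A", "B"), ("A", "C"), ("B", "C"),      # (winner, loser)
             ("C", "D"), ("A", "D"), ("B", "D")]

    n = len(teams)
    idx = {t: i for i, t in enumerate(teams)}
    votes = np.zeros((n, n))
    for winner, loser in games:
        votes[idx[loser], idx[winner]] += 1.0

    # Row-normalize (an undefeated team gets a uniform row), then mix in a
    # small uniform term so the chain has a unique stationary distribution.
    P = np.array([row / row.sum() if row.sum() else np.full(n, 1.0 / n)
                  for row in votes])
    P = 0.85 * P + 0.15 / n

    pi = np.full(n, 1.0 / n)
    for _ in range(200):             # power iteration
        pi = pi @ P

    for team, score in sorted(zip(teams, pi), key=lambda x: -x[1]):
        print(f"{team}: {score:.3f}")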

Nurse, Nurse! Where’s my Nurse?

I am a sucker for competitions.  The people at PATAT (a conference series on the Practice and Theory of Automated Timetabling; I was co-chair for their 2004 conference, and a member of their steering committee for a few years) do a really good job with timetabling competitions.  I really liked their competition for the 2008 conference on course and exam timetabling.  I had some thoughts of using logical Benders’ decomposition to solve one of the problems but didn’t quite get my act together for it.  But I liked the problems and I liked the organization.

The next competition has just begun, though it is on an accelerated schedule.  The topic is nurse rostering, and the competition ends June 1, 2010.  This is an important practical problem:

Building good rosters for nurses in hospitals has received ample attention in recent years and is recognised as a difficult combinatorial optimisation problem with practical relevance. In hospitals much effort is spent producing solutions which are workable and of a high quality.

It is a nicely designed competition, but be aware that you can’t use commercial codes.  So those of us who use CPLEX/XPRESS/Gurobi as a linear or integer programming base are probably out of luck:

Rule 5

Participants have to implement an algorithm to tackle the problem on a single processor machine; they can use any programming language. The use of third-party software is allowed under the following restrictions:

  • it is free software
  • its behaviour is (reasonably) documented
  • it runs under a commonly-used operating system (Linux or Windows)

Working with COIN-OR is probably legal, under a broad definition of “reasonably documented”. But it is interesting that Mac operating systems are not deemed “commonly-used”.
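
To illustrate the free-software route, here is a minimal nurse-rostering sketch using the open-source PuLP modeling layer with its bundled COIN-OR CBC solver; the staffing numbers and the two constraints are invented and are nowhere near the richness of the actual competition instances:

    from pulp import LpProblem, LpVariable, LpMinimize, lpSum, LpBinary, value

    # Tiny illustrative roster: 4 nurses, 7 days, one shift per day.
    nurses = range(4)
    days = range(7)
    demand = 2          # nurses required on duty each day (made-up)
    max_days = 5        # cap on days worked per nurse (made-up)

    x = LpVariable.dicts("work", (nurses, days), cat=LpBinary)

    model = LpProblem("toy_nurse_roster", LpMinimize)
    model += lpSum(x[n][d] for n in nurses for d in days)       # minimize total shifts assigned
    for d in days:
        model += lpSum(x[n][d] for n in nurses) >= demand       # coverage each day
    for n in nurses:
        model += lpSum(x[n][d] for d in days) <= max_days       # workload cap per nurse

    model.solve()       # defaults to the COIN-OR CBC solver shipped with PuLP
    for n in nurses:
        print(f"nurse {n} works days:", [d for d in days if value(x[n][d]) > 0.5])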

Despite these caveats, I do recommend the competition, and perhaps will see you in Belfast!

Sad News on the Netflix Prize

The followup to the wildly successful Netflix Prize has been canceled due to privacy concerns.  From a blog post by Neil Hunt, Chief Product Officer for Netflix:

In the past few months, the Federal Trade Commission (FTC) asked us how a Netflix Prize sequel might affect Netflix members’ privacy, and a lawsuit was filed by KamberLaw LLC pertaining to the sequel. With both the FTC and the plaintiffs’ lawyers, we’ve had very productive discussions centered on our commitment to protecting our members’ privacy.

We have reached an understanding with the FTC and have settled the lawsuit with plaintiffs. The resolution to both matters involves certain parameters for how we use Netflix data in any future research programs.

In light of all this, we have decided to not pursue the Netflix Prize sequel that we announced on August 6, 2009.

This is very sad news.  The first Netflix Prize did a lot of good in showing how combinations of different approaches can work better than any individual approach, and also in showing how difficult it is to find economically significant models for some types of recommendation systems.  I, along with many, had looked forward to the new insights (and possible visibility for operations research) another Netflix Challenge would bring.

For now, we are left with predicting murders in Philadelphia.  Unless someone finds privacy concerns there.

Google Maps API Enhancements

Google just announced some enhancements to their Directions in Maps API.  One addition, “avoiding tolls and highways,” doesn’t really affect me much:  we have only one toll road in the area, and it is pretty well needed to go either east or west.  But the other two are exciting!

First, the API now adds bicycle routing.  I don’t know if it will let you go the wrong way down one-way streets or hop on a crowded sidewalk to make a shortcut, but it does contain at least some of the long-distance bike paths.  Check out the path from my house to my buddy Barack’s place.  There is a bike path that starts about 20 miles from Pittsburgh and runs all the way to Washington, DC.  I turn 50 this year, and have promised myself that I will do that trip.  Of course, I have broken promises before, and if I don’t start training soon, it won’t happen.  But it is nice to see Google will be by my side if I do set off.

The second, and more relevant to OR, enhancement adds Traveling Salesman routing capability.  From the announcement:

Route optimization. Have many places to go but no preference as to the order you visit them in? We can now reorder the waypoints of your route to minimize the distance and time you must travel. Very useful for traveling salesman I hear.

Now I would love to experiment with this!  There have been similar efforts before.  Gebweb has offered routing since 2007, and promises optimal solutions for up to 15 cities, with 2-opt solutions for larger problems.  I would love to know how far Google guarantees optimality before resorting to a heuristic (the demo on the announcement page limits you to 11 points).  Unlike the other additions, this has not yet made it to Google Maps, but when it does, let’s see how Google competes with Concorde.
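
To give a sense of what that waypoint reordering involves, here is a small sketch: brute-force enumeration, which is practical at the handful-of-waypoints scale where optimality could be guaranteed, followed by a plain 2-opt improvement pass of the kind Gebweb uses for larger instances. The coordinates are made up, and this is of course nothing like Google’s (or Concorde’s) actual machinery:

    import itertools, math

    # Made-up waypoints (x, y); the route starts and ends at points[0].
    points = [(0, 0), (3, 1), (1, 4), (5, 2), (2, 2), (4, 4)]

    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])

    def tour_length(order):
        stops = [0] + list(order) + [0]          # closed tour through the waypoints
        return sum(dist(points[stops[i]], points[stops[i + 1]]) for i in range(len(stops) - 1))

    # Exact answer by brute force: fine for a Google-Maps-sized handful of waypoints.
    best = min(itertools.permutations(range(1, len(points))), key=tour_length)
    print("optimal:", best, round(tour_length(best), 2))

    # 2-opt heuristic: reverse a segment whenever doing so shortens the tour.
    order = list(range(1, len(points)))
    improved = True
    while improved:
        improved = False
        for i in range(len(order) - 1):
            for j in range(i + 1, len(order)):
                candidate = order[:i] + order[i:j + 1][::-1] + order[j + 1:]
                if tour_length(candidate) < tour_length(order) - 1e-9:
                    order, improved = candidate, True
    print("2-opt:  ", tuple(order), round(tour_length(order), 2))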

Great Book on Integer Programming

I just received the book “50 Years of Integer Programming 1958-2008,” published by Springer and edited by the blue-ribbon group of Michael Jünger, Thomas Liebling, Denis Naddef, George Nemhauser, William Pulleyblank, Gerhard Reinelt, Giovanni Rinaldi, and Laurence Wolsey (full disclosure:  I received a complimentary copy, presumably in the hope that I would post something entitled “Great Book on Integer Programming”.  Of course they risked me posting “Someone’s Littering My Mailbox with Junk” if I didn’t like the book. That is the risk when you give a free book to someone with a day job:  I don’t really rely on this blog to support my family, so I can afford to just post what I think).

This book grew out of a 2008 workshop held in Aussois, France.  The “50 years” aspect dates the beginning of integer programming to Gomory’s famous cutting plane approach to integer programming.  Given the “50 years” aspect, it is appropriate that the first part of the book (about 340 pages) consists of reprints of some of the most important foundational papers in this field.  These papers begin with Dantzig, Fulkerson, and Johnson on solving a “large-scale” traveling salesman problem (49 cities), and continue with papers by Kuhn (Hungarian Algorithm for assignment), Hoffman and Kruskal (total unimodularity), Gomory (cutting planes), Land and Doig (Branch and Bound), Balinski (1965 survey), Edmonds (matroid partition), Karp (reducibility), Geoffrion (Lagrangian relaxation), and Balas (disjunctive programming).  I have read all of these papers, some as an undergraduate at Waterloo, some as a doctoral student, and some in my professional life as a professor.  Seeing them again was like seeing old friends.  The best part is the introductions to each of the papers.  We are fortunate that many of these authors are still with us and were able to provide some background on where the papers came from.  For some, the authors have passed away, but the editors have done a good job of finding people who could offer insightful perspectives.

I particularly enjoyed re-reading the paper on the Traveling Salesman Problem by Dantzig, Fulkerson, and S. Johnson.  This paper formally belongs in the pre-history of integer programming, having been published in Operations Research in 1954.  In the paper, the authors show how to find a solution to a 49-city TSP.  I was struck again by how much work this took to solve without the use of computers.  This paper appeared even before terminology was fixed (so “constraints” appear as “restraints”).  Dual variables are solved for directly by taking advantage of the structure of sparse linear programming solutions to this problem, since all the subproblems were solved by hand (over 1176 variables!).  And I loved the conclusions:

It is clear that we have left unanswered practically any question one might pose of a theoretical nature concerning the traveling-salesman problem;  however, we hope that the feasibility of attacking problems involving a moderate number of points has been successfully demonstrated, and that perhaps some of the ideas can be used in problems of a similar nature.

If only some of today’s researchers had the same level of perspective and humility!

I also loved a paragraph from Gomory’s introduction to his two papers.  After developing what would later be known as “Gomory cuts”, he decided to test it on a (1958) computer:

During the exciting weeks that followed, I finally worked out a finiteness proof and then programmed the algorithm on the E101, a pin board computer that was busy during the day but that I could use after midnight.  The E101 had only about 100 characters and the board held only 120 instructions at one time, so that I had to change boards after each simplex maximization cycle and put in a new board that generated the cut, and then put the old board back to re-maximize.  It was also hard work to get the simplex method down to 120 E101 instructions.  But the results were better and more reliable than my hand calculations, and I was able to steadily and rapidly produce solutions to four- and five-variable problems.

I thought I was hard core, reminiscing about working with punch cards, but Gomory reveals me as a slacker!
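
For readers who have not seen one, here is the kind of cut Gomory was generating on that pin board, sketched on a single invented tableau row: if a row of the optimal simplex tableau reads x_B + sum of a_j x_j = b with b fractional, then taking fractional parts gives the valid inequality sum of frac(a_j) x_j >= frac(b) for every nonnegative integer solution.

    import math

    # One invented row of an optimal simplex tableau:
    #   x1 + 0.5*x3 - 1.25*x4 = 3.75   (x1 basic; x3, x4 nonbasic; all variables integer, >= 0)
    coeffs = {"x3": 0.5, "x4": -1.25}
    rhs = 3.75

    def frac(v):
        return v - math.floor(v)       # fractional part, always in [0, 1)

    # Gomory fractional cut: sum of frac(a_j) * x_j >= frac(b)
    cut = {var: frac(a) for var, a in coeffs.items()}
    print(" + ".join(f"{c:.2f}*{v}" for v, c in cut.items()), ">=", frac(rhs))
    # -> 0.50*x3 + 0.75*x4 >= 0.75, which cuts off the fractional point x3 = x4 = 0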

The historical section is followed by three general survey articles:  Conforti, Cornuéjols, and Zambelli on polyhedral approaches, Bill Cook on combinatorial integer programming, and Vanderbeck and Wolsey on reformulations and decomposition.  The final section is a series of surveys on modern topics:  geometry of numbers, nonlinear integer programming, MIP computation, symmetry, semidefinite relaxations, and group-theoretic approaches.  The authors chosen are the leading people in each area, and the articles are extremely well done.   With these papers, this book is now the reference book for the current state of the art in integer programming.  The Amazon page for the book has the first dozen or so pages available with their “Look Inside” feature, so you can check out the full table of contents.

The book comes with two DVDs.  One contains the presentations from Section II in hour-long sessions.  The second DVD has a very entertaining discussion between moderators George Nemhauser and Bill Pulleyblank and “founders” Egon Balas, Michel Balinski, Jack Edmonds, Ralph Gomory, Art Geoffrion, and Richard Karp (Alan Hoffman, Harold Kuhn, and Ailsa Land were unable to attend).  I haven’t watched the whole thing yet, so if it ends in a brawl, don’t spoil it for me!  It was great to see that panel all together.

Overall, I haven’t enjoyed getting a book as much in a long time (certainly not since Bill Cook’s Traveling Salesman book, and I wouldn’t want to choose between these two).  I think I would even pay Springer’s list price of $99 for it, though I really wish they would price this more reasonably (though the price is not absurd:  they have many books that offer far less and cost far more).  Everyone even vaguely interested in integer programming should have this on their bookshelf.

Summer Internship at Solver Foundation

I get a fair number of emails of the sort “I am studying at ….and I would like to work under your supervision as a summer intern”.  Most of the time they go on to say things like “I want to work under your supervision on stochastic optimization, supply chains, e-commerce” or any of a dozen things that suggest they 1) don’t know what I do, and 2) don’t really care to find out.  That is probably the economically efficient choice, since I don’t take on summer interns, and have even written about it.  I don’t know where this myth of summer internships for people from around the world comes from (unless there is a large group of people who welcome this?  Let me know!).

Once in a while, however, I see a summer internship and think:  “Wow!  I wish I was 20 (or 25) again!  That looks like a great opportunity!”  The Solver Foundation team just posted such an opportunity.  From the announcement:

The Solver Foundation team is looking for an intern for this summer! A short description is below – this is a software development position, where the focus is on practical application of numerical optimization techniques. I believe in a practical, challenging, and hopefully rewarding experience where interns work on real code that relates to their areas of interest. If you’re interested, here’s what you should do:

  • Apply for the position here. Make sure you apply under the “Software and Hardware Development” category.
  • After you have applied, send a resume to natbr at microsoft dot com. We cannot guarantee a response to every inquiry due to volume.

A word of warning – this position is not for everyone. If things like simplex, L-BFGS, and KKT conditions are unfamiliar to you, this position might not be the best fit.

If you know what simplex, L-BFGS, and KKT mean, then this looks like an interesting way to spend the summer.  And they are pretty open with regards to education:

Bachelor’s, MS, or PhD students with deep knowledge of numerical optimization methods and software are particularly encouraged to apply.

This sounds a bit better than the Associate Dean gig I currently have, but I think I am stuck here.

David Johnson to Receive Knuth Prize

AT&T Labs researcher David Johnson will receive this year’s Knuth Prize from the ACM “for his contributions to theoretical and experimental analysis of algorithms”.  While Johnson is best known for his NP-Completeness book with Michael Garey, he has had an extraordinarily influential career in both the theory of approximation algorithms and in research on experimental evaluation of algorithms.  From the announcement:

The ACM Special Interest Group on Algorithms and Computation Theory (SIGACT) will present its 2009 Knuth Prize to David S. Johnson of AT&T Labs for his contributions to theoretical and experimental analysis of algorithms.  Johnson’s innovations helped lay the foundation for algorithms used to address optimization problems, which involve the process of finding the best solution from all feasible solutions.  In addition, Johnson made many contributions to the early development of NP-completeness theory, which helps identify those problems that are hard to solve efficiently. The Knuth Prize, named in honor of Donald Knuth of Stanford University who has been called the “father” of the analysis of algorithms, will be presented at the ACM Symposium on Theory of Computing (STOC) June 6-8, 2010, in Cambridge, MA, where Johnson will present the Knuth Prize Lecture.

I got to know David almost 20 years ago when I agreed to coordinate one of his DIMACS Implementation Challenges.  This cooperation resulted in a book and, a decade later, a special issue of Discrete Applied Mathematics (with Anuj Mehrotra).  These Challenges are one of the ways David has explored issues in the evaluation of algorithms.  I had the chance recently to give a presentation on the effect of those Challenges.  David has had an enormous influence on me as I think about how to provide a firmer underpinning to the experimental analysis of algorithms and the way individuals and a field report on results.  This Prize makes me wish I had more time to spend on completing a planned system for reporting and confirming computational results.

I guess I am only surprised that David hadn’t already won this prize!  Very well deserved.