Everyone Needs to Know Some Statistics Part n+1

I have previously written on how decision makers (and journalists) need to know some elementary probability and statistics to keep them from making horrendously bad decisions.  Coincidentally, Twitter's @ORatWork (John Poppelaars) has pointed to an excellent example of how easily organizations can get messed up on some very simple things.

As reported by the blog Bad Science, Stonewall (a UK-based lesbian, gay and bisexual charity) released a report stating that the average coming out age has been dropping.  This was deemed a good enough source to get coverage in the Guardian. Let’s check out the evidence:

The average coming out age has fallen by over 20 years in Britain, according to Stonewall’s latest online poll.

The poll, which had 1,536 respondents, found that lesbian, gay and bisexual people aged 60 and over came out at 37 on average. People aged 18 and under are coming out at 15 on average.

Oh dear!  I guess the most obvious point is that it would be truly astounding if people aged 18 and under had come out at age 37!  Such a survey does not, and cannot (based on that question alone), answer any questions about the average coming out age.  There is an obvious sampling bias:  asking people who are 18 when they came out ignores the gays, lesbians, and bisexuals who are 18 now but will come out at age 40!  This survey question is practically guaranteed to show a decrease in coming out age, whether the true age is decreasing, staying the same, or even increasing.  That neither the charity nor the news organizations reporting on this saw this immediately baffles me.
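To see how strong this censoring effect is, here is a toy simulation (my own illustration with a made-up distribution, not Stonewall's data): every cohort draws its coming-out age from exactly the same distribution, yet grouping respondents by current age still produces an "average coming out age" that rises steeply with age.

```python
import random

random.seed(1)

def true_coming_out_age():
    # Hypothetical distribution, identical for every generation
    return random.gauss(30, 12)

for current_age in (18, 30, 45, 60):
    observed = []
    for _ in range(100_000):
        age = true_coming_out_age()
        # Only people who have already come out can report an age
        if 0 < age <= current_age:
            observed.append(age)
    avg = sum(observed) / len(observed)
    print(f"respondents aged {current_age}: observed average coming-out age {avg:.1f}")
```

The observed averages climb with respondent age even though nothing about the underlying behavior changed.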

But people collect statistics without considering whether the data address the question they have.  They get nice reports, they pick out a few numbers, and the press release practically writes itself.  And then they publish and report nonsense.  Bad Science discusses how the question "Are people coming out earlier?" might actually be addressed.

I spent this morning discussing the MBA curriculum for the Tepper School, with an emphasis on the content and timing of our quantitative, business analytics skills.  This example goes to the top of my “If a student doesn’t get this without prodding, then the student should not get a Tepper MBA” list.

Added December 2. Best tweet I saw on this issue (from @sarumbear):

#Stonewall ‘s survey has found that on average, as people get older, they get older. http://bit.ly/gT731O #fail

Continuing to be impressed with the 28 year old me

In the wake of coming across my network and matching notes, where the 28 year old me taught the 50 year old me some things I never realized I knew (and left the 50 year old me breathless, at least at the amount of time the 28 year old me had), I was delighted to see a new survey, published in the Communications of the ACM no less, about complexity and voting systems that says some very nice things about the 28 year old me.  The article, written by Piotr Faliszewski, Edith Hemaspaandra, and Lane Hemaspaandra, looks at elections and the well-known issue of manipulating elections by giving untruthful representations of one's preferences.  The authors survey approaches that use computational complexity to protect elections:

This article is a nontechnical introduction to a startling approach to protecting elections: using computational complexity as a shield. This approach seeks to make the task of whoever is trying to affect the election computationally prohibitive. To better understand the cases in which such protection cannot be achieved, researchers in this area also spend much of their time working for the Dark Side: trying to build polynomial-time algorithms to attack election systems.
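As a toy illustration of what "computationally prohibitive" means here (my own example, not one from the survey): whether a single strategic voter can make a favored candidate win can always be decided by brute force over every possible insincere ballot, as in the sketch below for the Borda rule.  For some systems, such as single transferable vote, Bartholdi and Orlin showed that this manipulation problem is NP-hard, so an efficient shortcut around this kind of search is unlikely.

```python
from itertools import permutations

CANDIDATES = ("a", "b", "c", "d")

def borda_winner(ballots):
    # Each ballot ranks all candidates; the candidate in position p gets m-1-p points
    scores = {c: 0 for c in CANDIDATES}
    for ballot in ballots:
        for pos, cand in enumerate(ballot):
            scores[cand] += len(CANDIDATES) - 1 - pos
    return max(scores, key=lambda c: (scores[c], c))  # deterministic tie-break

# Sincere ballots of the other voters (made-up preferences)
honest_ballots = [
    ("b", "c", "a", "d"),
    ("c", "b", "d", "a"),
    ("b", "d", "c", "a"),
]
favorite = "c"

# Brute force: try every ballot the manipulator could cast
successful = [order for order in permutations(CANDIDATES)
              if borda_winner(honest_ballots + [order]) == favorite]

print("winner without the manipulator:", borda_winner(honest_ballots))
print(f"insincere ballots that elect {favorite}:", successful)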

I have a particular fondness for this issue, since the 28 year old me (and his coauthors) wrote about this:

This complexity-based approach to protecting elections was pioneered in a stunning set of papers, about two decades ago, by Bartholdi, Orlin, Tovey, and Trick.  The intellectual fire they lit smoldered for quite a while, but in recent years has burst into open flame.

Wow!  Now that is a paragraph to send off to my Dean!  I have written about how this approach was ignored for a long while, but is now pretty popular.  It is great to see this coverage in an outlet like CACM.

The article goes on to survey this area, and has lots of nice things to say about those 20 year old papers (“seminal” (twice!), “striking insight”).  And it makes a great case for all the work that has been done since, and issues going forward.  Thanks Piotr, Edith, and Lane:  I enjoyed the survey very much.  And my Dad is visiting, and he too loved reading about what my coauthors and I had done.

From the 50 year old me to the 28 year old me:  “That was pretty good work!”  And, if I can speak for the 28 year old me:  “Thanks!  What are you doing now?” 50 year old me: “Hmmm….. can I tell you about sports scheduling? Blogging? Or perhaps I should introduce you to Alexander?”.  28 year old me: “Wasn’t I the one who met Ilona, with whom you have Alexander?”  50 year old me: “OK, we’ll split that one!”

Different phases, different measures.  But I am impressed with the 28 year old me every time I run across him.

Keto Diets and Linear Programming

In today’s New York Times Magazine, there is an article on how a ketogenic (“keto”) diet can be used to control epileptic seizures in kids.  The diet is quite something.  Referring to his son’s diet, the author Fred Vogelstein writes:

His breakfast eggs are mixed with heavy cream and served with bacon. A typical lunch is full-fat Greek yogurt mixed with coconut oil. Dinner is hot dogs, bacon, macadamia nuts and cheese. We figure that in an average week, Sam consumes a quart and a third of heavy cream, nearly a stick and a half of butter, 13 teaspoons of coconut oil, 20 slices of bacon and 9 eggs. Sam’s diet is just shy of 90 percent fat.

The diet contains almost no carbs, making it reminiscent of an Atkins diet during a particularly anti-carb phase.  This diet has the effect of greatly reducing the number of seizures in many kids, including those who have not responded to any of the available drugs.  In Sam's case, the seizures went from hundreds per day to perhaps a half dozen.

While a diet of bacon and ice cream seems like heaven, it is actually very hard to do.

For Sam’s diet to be effective, he must eat a certain number of calories every day with specific ratios of fat, protein and carbohydrates. These are not back-of-the-envelope calculations, but ratios that have to be hit exactly at every meal. If Sam wants a snack after school, he gets 18 grams of bacon (about two slices), 14 grams of macadamia nuts (about seven nuts) and 18 grams of apple (less than an eighth).

You can’t just throw together everything without carbs and hope that things work:

To jump through these arithmetic hoops, Evelyn, who gave up her career to take on the now full-time job of feeding Sam, plans meals on the kitchen computer using a Web-based program called KetoCalculator. It is hard to imagine how to administer keto without it. A meal for Sam might have eight ingredients. Mathematically, there are potentially millions of combinations — a bit more of this; a bit less of that — that gets you to a 400-calorie meal and a 3-to-1 ratio.

I don't have access to the Calculator (you need to be referred to it by a doctor), but it would be interesting to know if the site goes beyond just being a calculator.  Putting together ingredients to meet dietary requirements is a classical problem for linear programming.  Rather than just checking whether a set of ingredients meets the requirements, a linear-programming approach would choose the quantities so that the requirements are met while optimizing some objective (like maximizing use of ingredients you have not eaten in the last week, or maximizing use of some food you have a craving for).  No hit-or-miss calculating:  the combination would be guaranteed to meet the dietary requirements.
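Here is a minimal sketch of what such a linear program might look like, using scipy and a handful of made-up per-gram nutrient numbers (purely illustrative, not dietary advice, and certainly not how KetoCalculator actually works): hit 400 calories and a 3-to-1 ratio of fat to protein-plus-carbs by weight, while maximizing the amount of apple in the meal.

```python
import numpy as np
from scipy.optimize import linprog

# Ingredients and rough macronutrient content per gram (illustrative numbers only)
foods   = ["heavy cream", "bacon", "macadamia nuts", "apple"]
fat     = np.array([0.36, 0.42, 0.76, 0.002])
protein = np.array([0.02, 0.37, 0.08, 0.003])
carbs   = np.array([0.03, 0.01, 0.14, 0.140])

calories = 9 * fat + 4 * (protein + carbs)   # kcal per gram of each food

# Requirements: a 400-calorie meal with a 3:1 ratio of fat to protein+carbs by weight
A_eq = np.vstack([calories,
                  fat - 3 * (protein + carbs)])
b_eq = np.array([400.0, 0.0])

# Objective: maximize grams of apple (linprog minimizes, so negate)
c = np.array([0.0, 0.0, 0.0, -1.0])

# Per-ingredient limits in grams (again, just illustrative)
bounds = [(0, 150), (0, 30), (0, 40), (0, 60)]

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
if res.success:
    for name, grams in zip(foods, res.x):
        print(f"{name:15s} {grams:6.1f} g")
else:
    print("No combination of these ingredients meets the requirements.")
```

With an integer-programming twist you could even restrict quantities to whole slices and whole nuts, which is presumably what a real tool would need to do.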

You can check out the NEOS server to experiment with putting together a traditional diet.  I would love to experiment with a site that creates keto diets.  Imagine being able to turn an illegal, but palatable, meal into a legal one by adding 3 macadamia nuts (and having the linear program suggest that automatically).

In the past, I have used the diet problem for laughs, but this seems to be a real application for operations research that could be very useful (if it is not already being used).

Yet More Tag Clouds

@polybot on twitter pointed to Tagxedo, a site that creates displays of words, kinda like Wordle.  There are a zillion choices (and even a presentation on 101 things you can do with Tagxedo).  Here are a few I generated while Alexander did homework and I watched football on a rainy Sunday.

What is operations research?  (based on the words in this blog)

What am I tweeting about?

Who am I? (from my home page; it is not obvious, but the background is a picture of me)

Here are words from the biographies of my twitter followers:

I don’t know if they mean anything, but they are fun to play with!

Some Older but Perhaps Useful Notes on Networks and Matchings

More than 20 years ago, I was a postdoc at the Institut für Ökonometrie und Operations Research at the University of Bonn, an Institute headed by Prof. Bernhard Korte (who now heads Bonn’s Research Institute for Discrete Mathematics).  It was a fantastic year.  Also visiting that year were people like Bill Cunningham, Bill Cook, Andras Sebo, and many others.  I learned a lot that year, and many aspects of what I learned showed up in my future research.  I was fortunate that I had a position at Carnegie Mellon but was allowed to delay the start by a year so I did not have to worry about job hunting during that time.

I loved living in Bonn.  Though I spoke no German at the beginning and very little by the end, I had a great time there, bicycling around and exploring the various areas of the town.  It helped that I met Ilona, who would go on to be my wife, so we could share those explorations (and she could handle all the German things).

During the year, I taught a Ph.D. seminar on networks and matchings.  In the course of the seminar, I put together some notes on the topic, amounting to about 80 pages of material.  I recently came across the files for those notes, and am very impressed with the younger me:  I seemed to have a lot of time available during that period.  The notes are in an interesting format.  All the results (which are mainly proofs of correctness or algorithmic analyses) are broken into digestible pieces.  So, rather than asking "What is the complexity of this algorithm?", the notes break the question down into a number of steps leading up to the result.  I was trying to recreate the thought process that went into the proofs, in the hope that this would make it easier for students to follow and emulate.  It seemed to inspire at least some students:  two students (Thomas and Helga, whom I would see periodically in the following years) brought together my notes and presented me with a bound copy at the end of the course, a copy I still keep proudly on my shelves.

I am not sure of the value of 20 year old notes.  Some basic algorithms haven't changed in that time of course:  Dijkstra's shortest path, Edmonds and Karp's max flow algorithm, and so on.  Other approaches have been supplanted in the intervening time, and certainly books like Ahuja, Magnanti, and Orlin's Network Flows cover more material in more depth.  But having found my notes, I would not like to see them lost again, so here they are in case anyone finds them useful.  The way the notes were designed to be used, students would get a version without the boxed answers and would try to work through the problems on their own.  If I find the TeX source, I'll see if I can add a student version:  this is the instructor's version.

Notes on Network Flows and Matchings by Michael Trick

Entry on “INFORMS TweetUp”

I posted on the INFORMS blog about the "INFORMS TweetUp":

There are lots of ways to get out information on the INFORMS conference.  This blog is one of them, and it has been great to see the variety of views of the conference.  Of course, with more than 4300 participants, there will be lots of variety in how the conference is seen.

For an even more immediate view of responses to the conference, be sure to track the "#informs2010" tag on Twitter.  A bunch of us who tweet and use that tag had an impromptu get-together this afternoon.  Look for more tweets during the armadillo racing tonight!

Here are the twitter ids (I think! Corrections welcome): @mlesz1 (@informs2010), @dianam, @wjcook, @johnangelis, @miketrick, @polybot, @SDamask

Bus Bunching talk at INFORMS

New post at the INFORMS site: “Avoiding Bunches of Buses”:

While the INFORMS conference is big enough that you can stay within a narrow research area, it is also a great place to get inspiration outside your focus.  I hopped in to see a talk given by Don Eisenstein on work he has done with John Bartholdi on "scheduling" buses.  Don and I were in graduate school together, and we each had John as our advisor, but we work in completely different areas.

Don talked about a system they have put in place to handle buses, trolleys, and other transportation systems that have multiple vehicles going over the same routes.  I am sure we have all had the frustration of waiting a long time for a bus, only to have three buses (all running the same route) appear in quick succession.  This phenomenon is common enough to form the title of a popular mathematics book.  Upon reflection, this situation is not mysterious.  If buses are scheduled on a route at, say, 10 minute intervals, then any delay to one bus (a slow-boarding passenger, a traffic jam, and so on) will cause that bus to run even later.  Once the bus is delayed by two minutes, more passengers will arrive in the now-12-minute gap, slowing it down further.  Meanwhile, the bus following the delayed bus will go faster than normal due to a dearth of passengers for it.  Very quickly, a small increase in the gap will turn into a large gap ahead of the delayed bus and a small gap behind it.

This bunching behavior is very common, and very difficult to schedule around.  In fact, as buses try to keep to the schedule, drivers may resort to dangerous or illegal driving, particularly if drivers are evaluated by their ability to keep to a schedule.  This situation is made worse if a bus suffers a mechanical failure, leading to a large gap in the schedule until the bus can be replaced.

Overall, trying to keep a bus system running smoothly is a very difficult problem.  Lots of people have tried to create better, more robust schedules, but such systems are often complicated and difficult to implement.

John and Don propose a very simple method for handling such a system.  The idea is to have a small number of checkpoints in the system (in the example they chose, there was a checkpoint at each end of the route, with the bus going back and forth along the route).  When a bus hits a checkpoint, the system checks how far behind the next bus is.  If the next bus is expected to hit the checkpoint in 10 minutes, say, then the current bus waits a fixed fraction of that 10 minutes (say .6 times the following time, or six minutes in this case) and then departs.  There is a variant in which a bus waits at least, say, 3 minutes after the preceding bus hit the checkpoint.  That is the entire system!  Ignore schedules, and simply wait a fixed fraction of the time until the next bus arrives.

This simple system has the amazing property that it will self-organize into a nicely spread-out arrangement of buses, and will reorganize itself in the face of bus delays (or speed-ups).  If a bus goes out of operation, nothing special needs to be done:  the system will automatically move the buses into an evenly spread-out pattern (albeit farther apart, since there are fewer buses).  Of course, it also gives up on a fixed schedule, but for systems where buses arrive often enough, the schedule is generally not relevant to passengers (and in the example they gave, it was only known to the drivers, not to the passengers).
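To convince myself the rule really does this, here is a toy, deterministic simulation of my reading of the rule (not Bartholdi and Eisenstein's actual model): four buses on a 40-minute loop with a single checkpoint, each waiting 0.6 times the predicted headway to the bus behind it before departing.  Starting from a badly bunched state, the headways settle on their own toward the even spacing of roughly 40/(4 − 0.6) ≈ 11.8 minutes.

```python
N_BUSES = 4
LOOP_TIME = 40.0   # minutes for one circuit of the route
ALPHA = 0.6        # fraction of the trailing headway to wait at the checkpoint

# Badly bunched initial departures from the checkpoint (in minutes)
departures = [0.0, 2.0, 4.0, 30.0]

for lap in range(1, 16):
    new_departures = []
    for i in range(N_BUSES):
        arrival = departures[i] + LOOP_TIME
        # The "next bus" is the one that will reach the checkpoint after this one
        if i + 1 < N_BUSES:
            trailing_departure = departures[i + 1]
        else:
            trailing_departure = new_departures[0]   # wraps around to the lead bus
        expected_trailing_arrival = trailing_departure + LOOP_TIME
        wait = ALPHA * max(0.0, expected_trailing_arrival - arrival)
        new_departures.append(arrival + wait)
    departures = new_departures
    headways = [departures[i + 1] - departures[i] for i in range(N_BUSES - 1)]
    print(f"lap {lap:2d}: headways {[round(h, 1) for h in headways]}")
```

Deleting one of the entries (a bus going out of service) just produces the same behavior with larger headways, which is exactly the robustness Don emphasized.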

This research follows their previous work on self-organizing manufacturing systems.  The thing I like very much about this entire research direction is how naturally it builds in robustness.  While simple, these self-organizing systems respond extremely well to changes in the environment, much better than many optimization approaches.

The presentation was very convincing, and Don and John promise a paper “in a few weeks”.  I look forward to reading it in more detail (and perhaps correcting any errors in interpretation I have of their work).

This was just one of the thousands of talks at this conference.  I was very glad I went outside of my normal research range to see this talk.

INFORMS Blog entry on “Overbooking, Revenue Management, and Data Mining”

I have an entry over on the INFORMS blog regarding overbooking of hotels.  Here it is, though I recommend following the INFORMS blog for the next few days:

Fellow blogger Guillaume Roels wrote that the hotel he reserved was overbooked, so he has been exiled to a remote location, and he bemoaned the lack of customer service in this transaction.  Something similar was obviously going on in my hotel, the Hilton (the main hotel for the conference).  Throughout check-in yesterday, the desk clerks were looking for volunteers to be exiled, offering various incentives ("Free transportation! A drink voucher! Big, big discounts, just for you!") for people to move.  They weren't getting any takers while I was there, so I fear the late check-ins were similarly sent off to the boondocks.

I bet the hotels got into a mess because they misestimated the number of people who would show up (or overestimated the "melt": people who cancel in the final week or two).  If they simply applied an average "no show" or "cancel in the last week" rate, they would have been off: I bet conference participants do both at a much lower rate.  After all, the vast majority of us have preregistered for the conference, so late cancellation means forfeiting some or all of the conference registration fee.  We have great incentives to figure out early whether we are going to be here or not.  And, perhaps people in OR or other analytic fields tend not to cancel, or cancel earlier, due to the organized, steel-trap-like minds we all have!  We know what we are doing, so we don't cancel in the last week.

Of course, whether or not that is true is an empirical question, and one that can best be answered by data mining methods.  Over the course of drinks last night, a senior researcher for a large business analytics firm pointed out the disconnect we have in our field between data mining and optimization.  Often (though not always), these are seen as two separate phases of the "operations research process".  Instead, there is a need for much better integration between these approaches.  Data mining should be constantly in use predicting cancellations and melt, driving the revenue management optimization approaches.
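As a toy sketch of that integration (my own illustration, with made-up numbers): suppose data mining of past conferences yields a show-up probability for this particular crowd.  A short newsvendor-style calculation then turns that estimate directly into an overbooking decision, trading the cost of walking a guest against the revenue of a filled room.

```python
from math import comb

CAPACITY = 300         # rooms available
ROOM_REVENUE = 150.0   # revenue from a filled room
WALK_COST = 400.0      # cost of relocating ("walking") a bumped guest
P_SHOW = 0.97          # predicted show-up probability (pretend this came from the data mining)

def expected_profit(bookings):
    # Number of guests who show up is binomial(bookings, P_SHOW)
    profit = 0.0
    for shows in range(bookings + 1):
        prob = comb(bookings, shows) * P_SHOW**shows * (1 - P_SHOW)**(bookings - shows)
        filled = min(shows, CAPACITY)
        walked = max(shows - CAPACITY, 0)
        profit += prob * (ROOM_REVENUE * filled - WALK_COST * walked)
    return profit

best = max(range(CAPACITY, CAPACITY + 31), key=expected_profit)
print(f"accept {best} bookings for {CAPACITY} rooms "
      f"(expected profit {expected_profit(best):,.0f})")
```

A better estimate of melt shifts the answer immediately, which is the sense in which the prediction should be driving the optimization.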

For those who were bumped by the hotels last night, you have my sympathies.  Perhaps during your rides into the conference, you can plan how to integrate data mining and revenue management better in order to let hotels avoid these issues in the future.

Time for another INFORMS Conference

It is that time of the year again:  it is the Annual INFORMS Conference, being held for the first time in Austin, Texas.  Should be a fun time, and it is certainly warmer there than it is here in Pittsburgh.

INFORMS has a conference blog going, so I'll be posting there for a bit (with a copy over here).  I've also put a feed in the sidebar here that includes both the blog and #informs2010 tweets.

I hope to see many of you in Austin.  I’ll be at the 8AM session that Laura McLay has put together on Social Networks and Operations Research, so perhaps that is a good meetup session!