Hey Buddy, Can I Sell You an NP Hard Problem?

In keeping with issues of economics and computational power, there is a very neat paper out of Princeton by Arora, Barak, Brunnermeier, and Ge entitled “Computational Complexity and Information Asymmetry in Financial Products“.  Can you embed an NP-hard problem into the pricing problem for a financial instrument?  As the authors point out, the answer is clearly “yes” and the issue is really how naturally you can do it.  For instance, I could have an instrument that pays $1000 if the answer to a particular, large traveling salesman problem is less than 1 million miles, and $0 otherwise.  Pricing such an instrument involves solving an NP-complete problem, but no one would argue that this implies anything about real financial instruments.  The authors give another, similarly artificial, example:

Consider for example a derivative whose contract contains a 1000 digit integer n and has a nonzero payoff if the unemployment rate next January, when rounded to the nearest integer, is the last digit of a factor of n. A relatively unsophisticated seller can generate such a derivative together with a fairly accurate estimate of its yield (to the extent that unemployment rate is predictable), yet even Goldman Sachs would have no idea what to pay for it. This example shows both the difficulty of pricing arbitrary derivatives and the possible increase in asymmetry of information via derivatives.
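To see why the asymmetry is so stark, here is a toy sketch of the seller’s side of such a contract (my own illustration, not code from the paper). It uses sympy and 30-digit primes so that it runs instantly; the construction in the quote uses a 1000-digit n, at which scale the buyer’s factoring step is hopeless while the seller’s bookkeeping stays trivial.

```python
# Toy version of the quoted contract (illustration only; sympy and the
# 30-digit primes are my choices, not the paper's).
from sympy import randprime

# Seller's side: cheap.
p = randprime(10**29, 10**30)   # secret prime factor
q = randprime(10**29, 10**30)   # secret prime factor
n = p * q                       # the integer written into the contract

# The factors of n are 1, p, q and n, so the seller knows exactly which
# last digits of a factor can trigger the payoff, and can combine that
# with an unemployment forecast to estimate the yield.
payoff_digits = sorted({1, p % 10, q % 10, n % 10})
print("unemployment digits that pay off:", payoff_digits)

# Buyer's side: pricing from n alone means recovering the factors, e.g.
#   from sympy import factorint; factorint(n)
# which is easy at 30 digits and out of the question at 1000.
```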

Nobody is actually interested in embedding an NP-Hard problem in a financial instrument in that way.  If the pricing is not clear, and there is obvious information asymmetry, buyers will simply assume they are getting ripped off and walk away (see adverse selection, below).

This paper does something much more interesting.  It develops a system where the firm offering financial assets seems to divide multiple assets up fairly but actually does so in a biased way.  In order to distinguish this tampering from non-tampering, an outside analyst has to solve a densest subgraph problem (an NP-hard problem).

In finance and economics, there is an issue called adverse selection.  Why am I hesitant to buy a used car?  People will only sell their lemons.  Why should we be cautious in hiring faculty who work at other schools?  The other school knows more about them, and will choose not to compete if they want to get rid of them.  Why should I be cautious when a bank chooses 10 mortgages to resell out of their portfolio of 10,000?  They know where the lemons are and will choose to dump them given the chance.

What the authors are saying is that even when a company sells all of their assets in groups in an apparently random way, it is possible to hide manipulation.  So adverse selection can occur in a way that is hard to detect.  Maybe the buyers will assume no adverse selection and we get to dump our lemons!  You can read the paper (or this blog posting from Andrew Appel) for some more information.

I have somewhat mixed feelings about this result (though I have only lived with it for a few hours:  my views may take more time to settle).  On one hand, I really think that people in economics and finance need to be more aware of computational limits in their models.  On the other hand, it is not clear that NP-hardness is a particularly good defense here.  The authors have a somewhat breathless gloss on what NP-hardness means:

Computational complexity studies intractable problems, those that require more computational resources than can be provided by the fastest computers on earth put together.

Ummm… OK.  Sanjeev Arora is one of the world’s experts in complexity.  He is a Gödel Prize winner and a Fellow of the ACM.  He even, most tellingly, has a Wikipedia entry.  I still don’t think this is the world’s best definition of intractable in the NP-hardness sense.  In particular, if a group put out 1000 groupings of financial instruments, and I needed to solve the densest subgraph problem on the resulting instance, I would work very hard at getting an integer program, constraint program, dynamic program, or other program to actually solve the instance (particularly if someone is willing to pay me millions to do so).  If the group then responded with 10,000 groupings, I would simply declare that they are tampering and invoke whatever level of adverse selection correction you like (including just refusing to have anything to do with them).  Intractable does not mean unsolvable, and not every instance size needs more computing than “the fastest computers on earth put together”.

NP-hardness talks about worst case behavior as problem size grows.  Those of us in operations research spend a lot of time solving NP-hard problems like routing, timetabling, and many other problems because we really want solutions to instances of a particular size and with particular structure.  Bill Cook will solve practically any instance of the NP-hard Traveling Salesman Problem that you like (particularly if the financial incentives are right) as long as you keep it to no more than 100,000 cities or so.  I’d be happy to help a financial firm solve densest subgraph instances if it really mattered to them, NP-hardness be damned!
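For concreteness, here is the sort of integer program I have in mind, sketched with PuLP on a small random graph. This is my own generic densest-k-subgraph formulation, not the model from the Arora et al. paper: pick k vertices, and count an edge only when both of its endpoints are picked.

```python
# A small densest-k-subgraph integer program (a sketch, not the paper's model):
# choose exactly k vertices and maximize the number of induced edges.
import random
import pulp

random.seed(1)
V = range(30)
E = [(u, v) for u in V for v in V if u < v and random.random() < 0.2]
k = 8  # size of the vertex subset we are hunting for

prob = pulp.LpProblem("densest_k_subgraph", pulp.LpMaximize)
x = {v: pulp.LpVariable(f"x_{v}", cat="Binary") for v in V}                      # vertex chosen?
y = {e: pulp.LpVariable(f"y_{e[0]}_{e[1]}", lowBound=0, upBound=1) for e in E}   # edge induced?

prob += pulp.lpSum(y[e] for e in E)            # maximize induced edges
prob += pulp.lpSum(x[v] for v in V) == k       # exactly k vertices
for (u, v) in E:
    prob += y[(u, v)] <= x[u]                  # an edge counts only if both
    prob += y[(u, v)] <= x[v]                  # endpoints are in the subset

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("vertices:", [v for v in V if x[v].value() > 0.5])
print("induced edges:", int(pulp.value(prob.objective)))
```

A thirty-vertex toy like this is solved before you can blink; the interesting empirical question is how far up the instance sizes a modern solver gets before worst-case complexity actually bites.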

Of course, if this paper takes off, there might be real demand for operations researchers in looking at NP-Hard problems for the financial industry.  And that would be great for our field!

Thanks to my finance colleague Bryan Routledge for pointing out the paper, and for providing the excellent (IMHO) title to this entry.

P=NP in the New York Times

The New York Times has a nice article on the recent Communications of the ACM article on P=NP (article at author Lance Fortnow’s site). I had read Fortnow’s article last month, and really liked it. In fact, I am surprised I didn’t blog it: I guess I was too busy worrying about false proofs of P=NP (or not).

The Times article does a good job of summarizing the problem in a non-technical way:

The challenge, in its simplest, but not most understandable phrasing, is to determine whether or not P equals NP. P stands for the class of problems that can be solved in polynomial time, or reasonably quickly. NP stands for the class of problems that can be verified in polynomial time — quickly. If it can be shown that P=NP, then it is possible that the world will be a very different place.

The main thrust of the New York Times article is simply how popular the paper is:

The editors of the journal said they were tickled to have a hit article on their hands.

“I don’t think we’ve ever had an article that started off with this kind of a bang,” said Moshe Y. Vardi, a Rice University computer scientist who is editor in chief of the journal. “Our e-mail blast went out and thousands of people felt they had to click.”

The author of the article, Lance Fortnow, a computer scientist at Northwestern University McCormick School of Engineering, initially told Dr. Vardi that the article would be a short one. “Still open,” he wrote, was his first reaction to writing about the state of the work on solving the problem.

But the article does point to one possible new direction for attacking the problem:

There remains a glimmer of hope, he noted. An esoteric branch of mathematics known as algebraic geometry may offer an avenue to prove or disprove the problem, which was first outlined by Stephen A. Cook, a University of Toronto computer scientist, in 1971.

That prospect feels a bit intimidating. As Dr. Vardi said, “It’s a bit scary because we have to start learning a very difficult mathematical field.”

The Times did a good job on this one. Readable and accurate: good work John Markoff! And thanks Ted for the pointer.

Note added: As a reddit commentator noted, “algebraic geometry” is hardly “an esoteric branch of mathematics”, so perhaps there is room for improvement in the article.

Graph Coloring Benchmark System

For many years, I have wanted to put together a system for keeping track of results in solving instances of various combinatorial optimization problems.  This started back in the early 1990s when I was the coordinator for the DIMACS Challenge on Cliques, Coloring, and Satisfiability.  This was pre-web, so everything was done via ftp and email.  During that Challenge, David Johnson and I collected a large number of instances suitable for finding cliques and graph coloring (I also collected a number of satisfiability instances).   Lots of people tried a number of methods on these instances, and they became the “DIMACS Instances”, and have appeared in many, many papers (the resulting DIMACS volume is by far my most referenced work).

In 2002, David and I joined with Anuj Mehrotra to run some further workshops with an emphasis on graph coloring.  In 2008, a special issue of Discrete Applied Mathematics came out with ten or so papers from those workshops.

Throughout all this, I was not satisfied with the state of information.  What were the best solutions around?  Results were scattered throughout the literature, sometimes in hard-to-find places.  Worse, there seemed to be some inconsistencies (feasible solutions with values better than lower bounds) and no clear way of determining where the fault lay.  It was not enough for me to collect instances; I had to collect solutions, both values and colorings.  And I had to be much more organized about tracking results.  I was particularly interested in this since I do that for the Traveling Tournament Problem, and I think that has greatly increased interest in the problem.

But life is busy, and I never quite got around to putting the system together.  When I spent a year in New Zealand, I did some work on it, but moved on to other things.  I couldn’t get a doctoral student interested in it, so things kinda lay fallow.

A few days ago, the file I had on this churned to the top of the stack, and I refound the code I was working on.  And, while I do have some really pressing things to do (both editorial and survey article writing), I have spent some time advancing the code.  At this point, I think I can let the world see it, though it is a work in progress.  I have grand plans for this, so I have called it ROIS: Registry for Optimization Instances and Solutions. Right now, all it has is the results from some graph coloring papers, along with the instances.  As I move along, I will include finding cliques in graphs, the traveling tournament problem, and who knows what else.  Now that the basic code is there, it is extremely easy to add in new problems.

There is a lot more to do.  One key aspect that is not yet implemented is the ability to upload solutions and have them checked for correctness automatically.  This clearly is a key component of the system and about the only way of keeping “wrong” results out.  It is unclear how to handle lower bounds in the same way due to the lack of a succinct certificate, but even handling one side is better than nothing.  I also need to add lots of things for sorting columns in different ways, but that is all pretty trivial (a good project for lazy mornings).
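For the coloring side, the checker itself is the easy part. Here is a minimal sketch of what it might look like: instances in the standard DIMACS .col format, with the solution-file layout (one vertex and its color per line) being just an assumption for illustration, not the actual ROIS format.

```python
# Minimal sketch of an automatic solution checker for graph coloring.
# Instance format: DIMACS .col ("e u v" edge lines). Solution format
# ("vertex color" per line) is assumed for illustration.

def read_dimacs_edges(path):
    """Return the edge list of a DIMACS .col instance."""
    edges = []
    with open(path) as f:
        for line in f:
            if line.startswith("e "):
                _, u, v = line.split()
                edges.append((int(u), int(v)))
    return edges

def read_coloring(path):
    """Return {vertex: color} from lines of the form 'vertex color'."""
    coloring = {}
    with open(path) as f:
        for line in f:
            if line.strip():
                v, c = line.split()
                coloring[int(v)] = int(c)
    return coloring

def check_coloring(edges, coloring, claimed_colors):
    """Verify that the coloring is proper and uses no more colors than claimed."""
    for u, v in edges:
        if u not in coloring or v not in coloring:
            return False, f"edge ({u},{v}) has an uncolored endpoint"
        if coloring[u] == coloring[v]:
            return False, f"edge ({u},{v}): both endpoints have color {coloring[u]}"
    used = len(set(coloring.values()))
    if used > claimed_colors:
        return False, f"{used} colors used but only {claimed_colors} claimed"
    return True, f"valid coloring with {used} colors"

# Example (hypothetical solution file for the DIMACS instance myciel3.col):
#   ok, msg = check_coloring(read_dimacs_edges("myciel3.col"),
#                            read_coloring("myciel3.sol"), 4)
```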

I’m entering in some of the published work now.  Any comments you have on this would be very much appreciated.

Three City Traveling Salesman Problem Solved by Bacteria!

A Guardian (London) article has the provocative title “Bacteria make computers look like pocket calculators” (thanks Hamish for the pointer). They report on how bacteria can be used to search through feasible solutions to the traveling salesman problem to find optimal solutions:

The research, published today in the Journal of Biological Engineering, proves that bacteria can be used to solve a puzzle known as the Hamiltonian Path Problem. Imagine you want to tour the 10 biggest cities in the UK, starting in London (number 1) and finishing in Bristol (number 10). The solution to the Hamiltonian Path Problem is the shortest possible route you can take.

This simple problem is surprisingly difficult to solve. There are over 3.5 million possible routes to choose from, and a regular computer must try them out one at a time to find the shortest. Alternatively, a computer made from millions of bacteria can look at every route simultaneously. The biological world also has other advantages. As time goes by, a bacterial computer will actually increase in power as the bacteria reproduce.

Programming such a computer is no easy task, however. The researchers coded a simplified version of the problem, using just three cities, by modifying the DNA of Escherichia coli bacteria. The cities were represented by a combination of genes causing the bacteria to glow red or green, and the possible routes between the cities were explored by the random shuffling of DNA. Bacteria producing the correct answer glowed both colours, turning them yellow.

Where to begin? Does that article really talk about the 3-city TSP?

The belief that NP-completeness means that you have to search through every possibility fundamentally misunderstands fifty years of research in computational combinatorial optimization. The “three city” TSP was solved essentially as soon as it was defined. Even the 10 city TSP was solved long before people wanted to look through 3.5 million choices.
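To put numbers on that: complete enumeration, the very approach the article treats as hopeless, knocks off a 10-city instance in seconds on a laptop. A throwaway sketch on made-up random coordinates:

```python
# Brute-force enumeration of a random 10-city TSP: even the naive approach
# is a non-event at this size.
import itertools, math, random, time

random.seed(0)
n = 10
pts = [(random.random(), random.random()) for _ in range(n)]
d = [[math.dist(pts[i], pts[j]) for j in range(n)] for i in range(n)]

t0 = time.time()
best_len, best_tour = min(
    (sum(d[t[i]][t[(i + 1) % n]] for i in range(n)), t)
    for t in ((0,) + p for p in itertools.permutations(range(1, n)))  # fix city 0
)
print(f"checked {math.factorial(n - 1):,} tours in {time.time() - t0:.1f}s; "
      f"best length {best_len:.3f}")
```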

Even given the premise that bacteria can look through all possibilities, where does that leave us? Computationally, I really think that 300 cities is a good lower bound for exact work on the TSP (see Concorde, and the discussion given in a previous post). If someone has a bacterial technique that might work on 10 cities, how will it work on a system with 10^600 times the number of solutions? Will you have 10^600 bacteria? For those confused by orders of magnitude, the human body has about 10^14 cells (perhaps) and the universe has perhaps 10^80 atoms. I am glad that Bill Cook and his coauthors did not need to find 10^520 other universes to work with when solving 300 city TSPs, let alone the number needed for solving the 85,900 city problem they have solved (roughly 10^386,500 possible tours).
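Those magnitudes are easy to check without ever writing the numbers down; here is a quick back-of-the-envelope sketch using the log-gamma function.

```python
# A symmetric TSP on n cities has (n-1)!/2 distinct tours; log-gamma gives
# the size of that count without forming it.
from math import lgamma, log

def log10_tours(n):
    """log10 of (n-1)!/2, the number of distinct tours on n cities."""
    return (lgamma(n) - log(2)) / log(10)

for n in (10, 300, 85_900):
    print(f"{n:>6} cities: about 10^{log10_tours(n):.0f} tours")
# 10 cities: ~10^5; 300 cities: ~10^612; 85,900 cities: ~10^386,519 --
# no conceivable head count of bacteria comes close.
```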

Now, maybe the Guardian is not the best place to look for scientific detail. Going to the original paper, however, does not make one very confident:

The serial approach that most silicon computer algorithms use is not well suited for solving NP-complete problems because the number of potential solutions that must be evaluated grows combinatorially with the size of the problem. For example, a Hamiltonian Path Problem for a directed graph on ten nodes may require as many as 10! = 3,628,800 directed paths to be evaluated. A static number of computer processors would require time proportional to this number to solve the problem. Doubling the number of nodes to 20 would increase the possible number of directed paths to 20! = 2.43 x 10^18, increasing the computational time by 12 orders of magnitude. Improvement in computational capability could come from parallel processing and an increase in the number of processors working on a problem. Significant breakthroughs in this regard may be possible with the development of biological computing, because the number of processors grows through cell division.

OK, biologists, tell us about 10^600 parallelism. The “serial approach that most silicon computer algorithms use is not well suited for solving NP-complete problems” handles 300 cities pretty easily.

Don’t get me wrong. I think it is really cool that bacteria can solve traveling salesman problems. I think it would be equally cool if you could put together a Lego system that can solve 10 city traveling salesman problems. I have an issue, though, when the researchers believe they are really saying anything about solving NP-hard problems. Talk about relaxations and heuristic solutions, and then I am fascinated. Talk about comparisons to complete enumeration, and I have much less interest.

Contrary to the Guardian title, I think computers are still looking like computers, and biological systems have a long way to go to compete with calculators, at least as far as the TSP goes.

P=NP …

… or not.

About 20 years ago, I attended a summer program at RUTCOR (the OR center at Rutgers) called, I believe, ARIDAM (Advanced Research Institute in Discrete Applied Mathematics?  Something like that).  It was a great few weeks:  I learned a lot and met a number of people for the first time who I know to this day.  If my memory holds, it was at that conference that we heard about work by a University of Guelph professor named Ted Swart that purported to prove that P=NP.  In his paper, Swart gave an n^5-variable (or n^7-variable) linear programming formulation of the Hamiltonian cycle problem, along with a complicated case-by-case proof that this linear program had a solution if and only if the graph had a Hamiltonian cycle.  I remember staying up two nights in a row working through this paper with people like Harvey Greenberg.  At the end, we identified a hole in the proof.  Swart changed things around a bit, and I lost track of where things went:  I recall thinking that it was pretty clear that adding a subscript could get around any hole we found, but that any constant number of subscripts would leave ever-more-complicated holes in the proof.  I left it there.

If I had been brilliant, I might have asked the question whether you could prove that this type of formulation cannot work.  I’m not that brilliant, but Mihalis Yannakakis is, and he proved shortly thereafter that any symmetric formulation for this problem requires exponential size.

Since then, it is stunning how often proofs that P=NP (or P not equal to NP) crop up.  Gerhard Woeginger does the field a tremendous service by collecting pointers to papers that try to prove the question one way or the other.   The rate of production for these sorts of results seems to be growing, with 10 papers in 2008 and 2009.  I very much like Gerhard’s obsessive neutrality in this.  It is always “Professor X proves P=NP” when I would be tempted to write “Professor X claims to prove P=NP”, or even “Professor X has typed some stuff which only he thinks proves P=NP”.

Having spent hours twenty years ago with one of these types of proofs, I don’t feel an obligation to work through more of them.  There are always people out there who do work through things, and find the flaws.   The Geomblog has a nice parody of how the process goes, and that does match up nicely with my experience 20 years ago.  And this post is not an invitation to send me your proof of P=NP or P!=NP!

Despite the “crankiness” that comes from the P(!)=NP papers, I do think good comes out of this work.  For instance, Swart’s paper led to Yannakakis’ very significant result.  And reviewing these papers (once or twice, anyway) is great training for doing referee work in our field.

Thanks to @hakmem whose tweet got me looking at these sites again.

The Perils of “Statistical Significance”

As someone who teaches data mining, which I see as part of operations research, I often talk about what sort of results are worth changing decisions over.  A statistically significant difference is not the same as a difference worth changing decisions over.  For instance, knowing that a rare event is 3 times more likely to occur under certain circumstances might be statistically significant, but is not “significant” in the broader sense if your optimal decision doesn’t change.  In fact, with a large enough sample, you can get “statistical significance” on very small differences, differences that are far too small for you to change decisions over.  “Statistically different” might be necessary (even that is problematical) but is by no means sufficient when it comes to decision making.
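To illustrate with made-up numbers: a 0.2-percentage-point difference in a response rate, far too small to change most decisions, sails past the usual 5% threshold once the samples get big enough. A small sketch using a standard two-sample z-test:

```python
# Made-up numbers: the same tiny difference goes from "nothing" to
# "statistically significant" purely because n grows.
from math import sqrt, erfc

def two_prop_pvalue(p1, p2, n):
    """Two-sided p-value of a classical two-sample z-test, equal group sizes n."""
    pooled = (p1 + p2) / 2
    se = sqrt(pooled * (1 - pooled) * 2 / n)
    return erfc(abs(p1 - p2) / se / sqrt(2))

for n in (10_000, 100_000, 2_000_000):
    print(f"n = {n:>9,} per group: p = {two_prop_pvalue(0.502, 0.500, n):.4f}")
# Only the sample size changed; the (practically irrelevant) difference never did.
```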

Finding statistically significant differences is a tricky business.  American Scientist in its July-August, 2009 issue has a devastating article by Andrew Gelman and David Weakliem regarding the research of Satoshi Kanazawa of the London School of Economics.  I highly recommend the article (available at one of the authors’ sites, along with a discussion of the article,  and I definitely recommend buying the magazine, or subscribing:  it is my favorite science magazine) as a lesson for what happens when you get the statistics wrong.  You can check out the whole article, but perhaps you can get the message from the following graph (from the American Scientist article):

[Figure: Gelman and Weakliem’s reanalysis of the Kanazawa data, from the American Scientist article]

The issue is whether attractive parents tend to have more daughters or sons.  The dots represent the data:  parents have been rated on a scale of 1 (ugly) to 5 (beautiful) and the y-axis is the fraction of their children who are girls.  There are 2972 respondents  in this data.  Based on the Gelman/Weakliem discussion, Kanazawa (in the respected journal Journal of Theoretical Biology) concluded that, yes, attractive parents have more daughters.  I have not read the Kanazawa article, but the title “Beautiful Parents Have More Daughters” doesn’t leave a lot of wiggle room (Journal of Theoretical Biology, 244: 133-140 (2007)).

Now, looking at the data suggests certain problems with that conclusion.  In particular, it seems unreasonable on its face.  With the ugliest group having 50-50 daughters/sons, it is really going to be hard to find a trend here.  But if you group 1-4 together and compare it to 5, then you can get statistical significance.  Yet this is statistically significant only if you ignore the possibility of grouping 1 versus 2-5, 1-2 versus 3-5, and 1-3 versus 4-5.  Since all of these could result in a paper with the title “Beautiful Parents Have More Daughters”, you really should include those in your test of statistical significance.  Or, better yet, you could just look at that data and say “I do not trust any test of statistical significance that shows significance in this data”.  And, I think you would be right.  The curved lines of Gelman/Weakliem in the diagram above are the results of a better test on the whole data (and suggest there is no statistically significant difference).
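A small simulation makes the grouping point concrete. The group sizes below are invented, and the test is a simple “does this group differ from 50%?” check rather than Gelman and Weakliem’s analysis, but the message survives: with no real effect at all, reporting whichever split of the 1-5 scale looks best produces “significant” findings well above the nominal 5% rate.

```python
# Toy simulation of the grouping problem: there is no real effect anywhere,
# yet picking the best-looking split of the 1-5 scale "finds" one far more
# often than 5%. Group sizes are invented.
import random
from math import sqrt, erfc

random.seed(0)
group_sizes = [300, 600, 900, 700, 472]   # attractiveness 1..5, ~2972 people total

def pvalue_vs_half(girls, n):
    """Two-sided p-value for 'fraction of girls differs from 50%'."""
    z = (girls / n - 0.5) / sqrt(0.25 / n)
    return erfc(abs(z) / sqrt(2))

reps = 5_000
fixed_split, best_of_four = 0, 0
for _ in range(reps):
    girls = [sum(random.random() < 0.5 for _ in range(n)) for n in group_sizes]
    pvals = []
    for cut in range(1, 5):               # splits 1|2-5, 1-2|3-5, 1-3|4-5, 1-4|5
        g, n = sum(girls[cut:]), sum(group_sizes[cut:])
        pvals.append(pvalue_vs_half(g, n))
    fixed_split += pvals[-1] < 0.05       # commit to "1-4 versus 5" in advance
    best_of_four += min(pvals) < 0.05     # pick the split after seeing the data
print(f"false positive rate, fixed split:  {fixed_split / reps:.1%}")
print(f"false positive rate, best of four: {best_of_four / reps:.1%}")
```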

The American Scientist article makes a much stronger argument regarding this research.

At the recent EURO conference, I attended a talk on an aspect of sports scheduling where the author put up a graph and said, roughly, “I have not yet done a statistical test, but it doesn’t look to be a big effect”.  I (impolitely) blurted out “I wouldn’t trust any statistical test that said this was a statistically significant effect”.  And I think I would be right.

Computational Sustainability

Carla Gomes from Cornell visited here a few weeks ago.  I have known Carla for a decade or so, and she has been one of the people who I have found very useful to talk to when trying to figure out the world of constraint programming.

Carla gets involved in lots of things.  She (along with others, of course, though she is the Lead PI) received a huge grant from NSF to form an Institute for Computational Sustainability, which I think is a wonderful term for an interesting topic.  The vision statement for this activity is quite provocative:

Computer scientists can — and should — play a key role in increasing the efficiency and effectiveness of the way we manage and allocate our natural resources, while enriching and transforming Computer Science.

Now, I consider Carla to be in Operations Research, even though she calls herself a computer scientist, and the topics that the Institute addresses have a strong operations research feel. One example: how can you most effectively create wildlife corridors to connect populations of endangered animals? If that is not an OR problem, I don’t know what is!

The Institute is holding its first conference in a few weeks, with registration open until May 22. The conference has an extremely broad group of speakers and I suspect most of the conference will be spent trying to understand each other’s terminology, skills, and interests. Looks like it will be a fun week!

The Importance of Stupidity in (Operations) Research

A colleague of mine (thanks Laurie, I think!) sent me a copy of the paper “The Importance of Stupidity in Scientific Research” by Martin Schwartz, published in the Journal of Cell Science in 2008. My colleague swears I should not take offense, and no offense was taken. I think the article is brilliant.

One of the most difficult transitions to make is to change from being a student to a researcher, a transition that practically defines the doctoral program. Most researchers were good students (at least) in their field: without success as a student, it is hard to get the enthusiasm necessary to get to the researcher transition. But many excellent students don’t make the leap to researcher, and many of the best researchers were no better than good students.

In the article, Schwartz describes facing a problem as a doctoral student. He sought help from the finest minds around him, and found that no one knew the solution to his problem.

I remember the day when Henry Taube (who won the Nobel Prize two years later) told me he didn’t know how to solve the problem I was having in his area. I was a third-year graduate student and I figured that Taube knew about 1000 times more than I did (conservative estimate). If he didn’t have the answer, nobody did.

That’s when it hit me: nobody did. That’s why it was a research problem. And being my research problem, it was up to me to solve. Once I faced that fact, I solved the problem in a couple of days. (It wasn’t really very hard; I just had to try a few things.) The crucial lesson was that the scope of things I didn’t know wasn’t merely vast; it was, for all practical purposes, infinite. That realization, instead of being discouraging, was liberating. If our ignorance is infinite, the only possible course of action is to muddle through as best we can.

In short, research happens when we are stupid, but productively so.

Productive stupidity means being ignorant by choice. Focusing on important questions puts us in the awkward position of being ignorant. One of the beautiful things about science is that it allows us to bumble along, getting it wrong time after time, and feel perfectly fine as long as we learn something each time. No doubt, this can be difficult for students who are accustomed to getting the answers right. No doubt, reasonable levels of confidence and emotional resilience help, but I think scientific education might do more to ease what is a very big transition: from learning what other people once discovered to making your own discoveries. The more comfortable we become with being stupid, the deeper we will wade into the unknown and the more likely we are to make big discoveries.

So, today, when my wife asks what I did today, I will say “I was being stupid”, and I’ll feel very good about it.

Closed Loop Supply Chains

There is a new paper on the OR Forum by Dan Guide and Luk Van Wassenhove that looks at the research trajectory of “Closed Loop Supply Chains”.  Closed loop supply chains are supply chains where there is at least as much interest in getting things from the customer to the supplier as vice versa.  Sometimes the drive for this is environmental (think European electronics laws to try to reduce metals in the refuse system) and sometimes it is economic (think of a printer manufacturer getting back used cartridges to try to cut down on the refill market, or firms that restore used items for further sale).  Luk and Dan’s paper is a nice, personal view of the research that has gone on in recent years.

For about eight years (1997-2005), I headed up the Carnegie Bosch Institute.  Part of what we did was sponsor conferences and workshops on emerging topics in international management.  One of our success stories was early support for closed loop supply chains (or reverse logistics).  I am really pleased to see how the field has developed.

Solving real problems in Norway and Ireland

A couple of weeks ago, I was in Europe and visited two different (yet similar) research groups. The first was in Oslo, Norway, where I visited the applied math group at SINTEF. I think the best US analogy to SINTEF is the RAND Corporation, a name with cold-war connotations, but still very active in providing research and analysis on public policy issues. The group I visited uses operations research approaches to solve a dizzying array of problems, most based in Norway. Some examples of what they do include clearing transactions on the Norwegian stock exchange, nurse rostering, forestry management, transportation scheduling and much more. They also provide schedules for the Norwegian Football League, so they wanted to chat with me on my sports scheduling experiences. In addition to the technical aspects, we talked about the challenges of working with sports leagues. Frustration with sports leagues seems to be a world-wide phenomenon.

Visiting the new 4C facilities
I then moved on to one of my favorite university-based organizations, the Cork Constraint Computation Center (4C) at University College Cork. This center was started by Gene Freuder (one of the key founders of constraint programming) when Ireland made a very big investment in the field. 4C has grown to have more than 50 faculty, postdocs, researchers, and graduate students, so it is a very big operation. I sit on its scientific advisory board, which is great fun, since Gene has brought together some really interesting and accomplished people. During the visit, we saw demos by a number of the students, including work on natural gas purchasing, telecommunications network design, and water resource operation. 4C has spent its lifetime so far in a building by itself, but the University is building a new math and computer science building, so 4C will move into that (in the picture I am flanked by David Waltz of Columbia and Barry O’Sullivan of 4C). We did a tour of the building (hence the hardhats, something not normally needed in operations research), and the space looks fantastic.

I was struck by the similarities between 4C and SINTEF, even though one is an academic institute and one is a nonprofit research center. Both are heavily engaged with problems of practical interest and both worry greatly about funding. Since I teach in a business school, funding is not much of a worry for me (at least until the MBA students stop coming), but I envy the real problems these groups work on, and the validation they get through their interactions with companies. I get that in my own work with sports leagues, but I saw a dozen problems I wish I was working on during my visits.

On a personal note, I enjoyed Oslo very much (and I always enjoy Cork). The weather was terrible, and the prices very high (even though the US dollar had gone up quite a bit). But the food was great (even the Lutefisk, helped by the huge amount of bacon served with it) and the people were great to talk to. Thanks to my hosts Tomas Nordlander (SINTEF) and Gene Freuder (4C) for having me out.