Need for Parallelism

The operations research world is crying out for better parallel approaches to our problems. Note that for $10,000, you can buy an NVIDIA Tesla Personal Supercomputer, with speeds up to 4 teraflops (which, unless I am reading the graph wrong, easily puts it among the top 500 supercomputers on the Top 500 list). The catch? You get the speed through hundreds of processors (processors designed for graphics calculations, but easily programmed for other tasks). I can see it now: “Hello ILOG? I’d like 960 licenses for parallel CPLEX! … That will be how much??” If one key part of our field is linear and integer programming, we really need to get on the bandwagon here and show how our field can take advantage of machines like this.

Of course, there is work being done, particularly by the open source wizards associated with COIN-OR, but this looks like an incredibly rich and important research area. What kind of speedup can we get with 960 processors? With 500? 100? 2? Has anyone done any OR on this sort of machine?
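
As a back-of-the-envelope answer to the speedup question, here is a small Python sketch of Amdahl’s law. The serial fractions are made up purely for illustration; nothing here is measured on an actual solver.

    # Ideal Amdahl's-law speedups: the only limit modeled is the (assumed)
    # fraction of the algorithm that stays serial.
    def amdahl_speedup(p, serial_fraction):
        """Speedup on p processors when serial_fraction of the work cannot be parallelized."""
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

    if __name__ == "__main__":
        for s in (0.01, 0.05, 0.25):                  # hypothetical serial fractions
            row = ", ".join("p=%4d: %6.1fx" % (p, amdahl_speedup(p, s))
                            for p in (2, 100, 500, 960))
            print("serial fraction %2.0f%% -> %s" % (100 * s, row))

Even with only 1% of the work stuck in serial code, 960 processors buy a factor of about 90, not 960, which is exactly why how we divide up the work matters so much.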

18 thoughts on “Need for Parallelism”

  1. Hi Prof. Trick,

    Interesting post!

    I was wondering if there has been any theoretical work done on a general parallel LP
    solver. In other words, what’s the current
    bottleneck: the theory or the implementation?

    Best,
    Raghavan

  2. There is some work in parallelizing LP solvers. Some conventional wisdom in this area includes:

    – In general, the opportunities to parallelize the simplex method are somewhat limited. For wide, short problems (many variables, relatively few constraints), primal pricing or the dual ratio test are the most obvious places (see the pricing sketch at the end of this comment). For other steps, the vectors tend to be relatively short and sparse.

    It is possible that the tradeoff between sparsity (which inhibits pipelining) and vector speed needs to be re-assessed on GPGPUs[1], but I would conjecture that major gains are not in the cards.

    – For interior-point methods, the prospects are much better. The major work in each iteration is a Cholesky factorization, which offers opportunities for parallelism even in the case of sparse matrices (see the block-factorization sketch at the end of this comment). Because the sparsity structure of the matrices doesn’t change from iteration to iteration, it pays to do significant analysis up front.

    There is some tradeoff between minimizing fill-in during the factorization and making the factorization suitable for parallelizing. I haven’t reviewed the literature recently, so I don’t know exactly where we stand on this. But I would guess that a lot is problem-dependent, so implementations have to be adaptive. There are a number of algorithms for performing the up-front analysis. Mostly, I think they have been focused in the past more on sparsity than on parallelism and vectorization.

    Another issue with GPGPUs is that, until the latest generation, they have been available only in single precision (which is fine for graphics). In LP, we tend to rely on double precision to keep things numerically accurate. Most LP implementations would probably run into some numerical difficulties if forced to run in single precision.

    The GPGPU architecture seems to me at first glance to be even less well suited to integer programming because the parallelism is much less homogeneous. But I need to learn more about it.

    BTW, the entry level on the fall ’08 list is about 12.7 Tflop/s sustained, or about 15-16 Tflop/s peak. I’m guessing that the Tesla spec is theoretical peak, so you’re about a factor of four shy of making the list. Still not bad for $10K, though.

    [1] GPGPU = general-purpose graphics processing unit.
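
    Since pricing is the one place where data parallelism in the simplex method is obvious, here is a minimal Python sketch of parallel Dantzig pricing: blocks of nonbasic columns are priced in separate worker processes and the most negative reduced cost wins. The dense random data, the block count, and the use of a process pool are all stand-ins for illustration, not a real implementation.

        # Parallel pricing sketch: reduced costs d = c - y'A computed block by block
        # in worker processes; the serial part (computing y from the basis) is faked.
        import numpy as np
        from multiprocessing import Pool

        def price_block(args):
            """Return (most negative reduced cost, global column index) for one block."""
            lo, A_block, c_block, y = args
            d = c_block - y @ A_block            # reduced costs for this block
            j = int(np.argmin(d))
            return float(d[j]), lo + j

        if __name__ == "__main__":
            m, n, nblocks = 200, 20000, 8
            rng = np.random.default_rng(0)
            A = rng.standard_normal((m, n))      # made-up constraint matrix (dense)
            c = rng.standard_normal(n)           # made-up objective coefficients
            y = rng.standard_normal(m)           # stand-in for the simplex multipliers

            bounds = np.linspace(0, n, nblocks + 1, dtype=int)
            tasks = [(int(lo), A[:, lo:hi], c[lo:hi], y)
                     for lo, hi in zip(bounds[:-1], bounds[1:])]
            with Pool(processes=nblocks) as pool:
                reduced_cost, col = min(pool.map(price_block, tasks))
            print("entering column %d, reduced cost %.4f" % (col, reduced_cost))

    In a real code the matrix would of course be shared rather than shipped to the workers, which is part of why the gains are limited.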
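
    On the interior-point side, the simplest way to see where the Cholesky parallelism comes from: when the elimination ordering exposes independent blocks (separate subtrees of the elimination tree), those blocks can be factored concurrently. The block-diagonal structure and block sizes below are artificial, chosen only to keep the sketch self-contained.

        # Block-factorization sketch: independent diagonal blocks of an SPD matrix
        # (think separate subtrees of the elimination tree) factored concurrently.
        import numpy as np
        from multiprocessing import Pool

        def spd_block(seed, k=300):
            """Build a random k-by-k symmetric positive definite block (synthetic)."""
            rng = np.random.default_rng(seed)
            M = rng.standard_normal((k, k))
            return M @ M.T + k * np.eye(k)

        def factor(block):
            return np.linalg.cholesky(block)     # dense Cholesky of one block

        if __name__ == "__main__":
            blocks = [spd_block(seed) for seed in range(8)]   # 8 independent blocks
            with Pool(processes=4) as pool:
                factors = pool.map(factor, blocks)            # factored in parallel
            err = np.linalg.norm(factors[0] @ factors[0].T - blocks[0])
            print("factored %d blocks; residual on block 0: %.2e" % (len(factors), err))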

  3. The Top 500 list is determined by scores on a double-precision benchmark (solving a linear system of equations a la LINPACK). The GPGPU systems discussed here aren’t really competitive at this, first because they are much slower at double-precision arithmetic than at single-precision arithmetic, and also because they can’t sustain anywhere near the peak flops while doing a real computation such as solving a system of equations.

    The single/double precision problem is simply a consequence of the fact that single precision is what is normally used for graphics computations on GPGPUs, so that’s all that is optimized in the implementation. Extending the implementations of these architectures so that they perform well in double precision isn’t a hard technical problem; it’s more a question of whether there’s a reasonable market for the product. (A tiny illustration of the precision issue follows at the end of this comment.)

    The problem of programming these machines to solve interesting problems while getting a good percentage of the theoretical maximum flops is a very real research challenge.
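
    To make the single/double precision point concrete, here is a tiny Python sketch: solve a deliberately ill-conditioned system in float32 and float64 and compare the relative errors. The Hilbert matrix is used only because it is a classic badly conditioned example, not because it looks like an LP basis.

        # Why LP codes lean on double precision: on an ill-conditioned system the
        # float32 solve is typically orders of magnitude less accurate than float64.
        import numpy as np

        def hilbert(n):
            """The n-by-n Hilbert matrix: a classic, badly conditioned example."""
            i = np.arange(1, n + 1)
            return 1.0 / (i[:, None] + i[None, :] - 1.0)

        if __name__ == "__main__":
            n = 10
            A = hilbert(n)
            x_true = np.ones(n)
            b = A @ x_true
            for dtype in (np.float32, np.float64):
                x = np.linalg.solve(A.astype(dtype), b.astype(dtype))
                rel_err = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
                print("%-8s relative error: %.1e" % (np.dtype(dtype).name, rel_err))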

  4. @Raghavan-

    I don’t have references handy; I will have to hunt some down, which might take a while. Mine (quite old now) are:

    M. J. Saltzman, “Implementation of an Interior Point LP Algorithm on a Shared-Memory Vector Multiprocessor,” in O. Balci, R. Sharda, and S. Zenios (eds.), Computer Science and Operations Research: New Developments in Their Interfaces, Pergamon Press, 1992, 87–101.

    M. J. Saltzman, R. Subramanian, and R. E. Marsten, “Implementing an Interior Point LP Algorithm on a
    Supercomputer,” in R. Sharda, B. L. Golden, E. Wasil, O. Balci and W. Stewart (eds.), Impacts of Recent Computer Advances on Operations Research, North Holland, 1989, 158–168.

    @Brian-

    Last year, nVidia techs at Supercomputing told me that a double-precision version of the Tesla chip was on its way. It doesn’t look like the current chips have hardware DP (or at least they don’t have a double-wide bus), though (933 GFlops Single Precision and 78 GFlops Double Precision).

    Even if the chip were competitive at LINPACK in single precision, LP developers probably wouldn’t take advantage of it.

  5. Some real-life numbers for large-scale discrete optimization from my old job in the airline industry.

    Problem type: parallel airline crew and aircraft scheduling algorithms built around set-partitioning LPs in the early 2000s (sparse problems, roughly a million columns and 100K rows, lots of degeneracy).

    Hardware: essentially large SMP machines (they are probably using something different now).

    Implementation: POSIX threads. The simplest implementation worked the best.

    Solvers (speedups relative to serial performance):

    CPLEX Barrier: about 2-6X (average 4X); better than simplex for sure.
    Deflected subgradient methods (in-house): about 2-5X (average 3X); this does most of the grunt work.
    CPLEX MIP: the more processors, the merrier. This can produce non-determinism from the user’s point of view, which users detest (same data, rerun, slightly different results); that is sometimes a necessary evil of parallelism, unless you hit the global optimum every time.

    Summary: as the machines got better, with faster memory, I noticed that these speedups came down, which suggests that a method that divides up the work at a higher level is needed to justify buying the latest and greatest parallel hardware.

    Decomposition methods, where the subproblems can be solved independently, become more attractive compared to solving a single, large, tight LP relaxation (or maybe just do both); a toy sketch follows below.
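
    A toy version of that decomposition pattern, with Python’s multiprocessing standing in for real parallel hardware: each subproblem is a small, independent 0/1 knapsack solved by dynamic programming, and a process pool just farms them out. The data are random and the combination step is trivial here; in a real decomposition the coordination (multipliers, columns, cuts) is where the hard work is.

        # Independent subproblems solved in parallel; all data are made up.
        import random
        from multiprocessing import Pool

        def knapsack(args):
            """Optimal value of one 0/1 knapsack subproblem via dynamic programming."""
            items, capacity = args               # items = list of (value, weight)
            best = [0] * (capacity + 1)
            for value, weight in items:
                for c in range(capacity, weight - 1, -1):
                    best[c] = max(best[c], best[c - weight] + value)
            return best[capacity]

        def random_subproblem(seed, n_items=200, capacity=500):
            rng = random.Random(seed)
            items = [(rng.randint(1, 100), rng.randint(1, 50)) for _ in range(n_items)]
            return items, capacity

        if __name__ == "__main__":
            subproblems = [random_subproblem(seed) for seed in range(16)]
            with Pool(processes=4) as pool:
                values = pool.map(knapsack, subproblems)   # solved independently
            print("subproblem objectives:", values)
            print("total across subproblems:", sum(values))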

  6. Interesting topic. I’m particularly encouraged by the open source parallel computing framework; it could be the next Linux for all network computing. This could be big.

    Btw, I checked the NVIDIA Tesla; it says “…and powered by up to 960 parallel processing cores”. It looks like you can’t actually get that for $10K; that’s just the sales language there, unless I read something wrong…

    Nope, it looks like a system with four GPUs with 240 processor cores each, for $10K from at least one reseller.

    I can’t find information about the organization of the connecting fabric in the Tesla (probably just haven’t looked enough), but a common organization for GPGPUs is to group the cores into banks, each of which can be pipelined. Once again, that’s great for graphics, but it is not the most flexible for sparse linear algebra or tree search.

    I did find that the Tesla has 64-bit ALUs, so there is double-precision arithmetic in hardware. Not sure yet what accounts for the order of magnitude difference in throughput between single and double precision.

  8. Shiva’s experience is fairly representative. Barrier gives good speed-ups for low numbers of processors.

    Parallel MIP is particularly interesting: the ability to parallelize search into non-contiguous areas of the B&B tree may not only turn up more diverse solutions (useful for certain heuristics), it may also avoid fruitless search by identifying better integer-feasible solutions more quickly. The net gains can be superlinear (see the little experiment at the end of this comment). In that sense, exploiting parallelism makes a lot more sense for MIP than for LP.

    Conversely, near-infeasible problems may perform essentially the same irrespective of the number of processors.
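
    A small single-threaded illustration of the incumbent effect: a toy knapsack branch and bound run twice, once with no starting incumbent and once seeded with a near-optimal value (standing in for a solution found by another processor searching a different part of the tree). The data are random, the bound is just the fractional relaxation, and the exact node counts are meaningless, but the drop between the two runs is the point.

        # Toy knapsack B&B: a good incumbent (e.g. found elsewhere in a parallel
        # search) lets the bound prune far more of the tree. All data are made up.
        import random

        def solve(values, weights, capacity, incumbent=0):
            """Depth-first B&B with a fractional bound; returns (best value, nodes explored)."""
            order = sorted(range(len(values)), key=lambda i: values[i] / weights[i], reverse=True)
            v = [values[i] for i in order]
            w = [weights[i] for i in order]

            def bound(k, cap):
                # Value of the fractional relaxation over the remaining items k..n-1.
                total = 0.0
                for i in range(k, len(v)):
                    if w[i] <= cap:
                        cap -= w[i]
                        total += v[i]
                    else:
                        return total + v[i] * cap / w[i]
                return total

            best, nodes = incumbent, 0
            stack = [(0, capacity, 0)]           # (next item, remaining capacity, value so far)
            while stack:
                k, cap, val = stack.pop()
                nodes += 1
                if val > best:
                    best = val
                if k == len(v) or val + bound(k, cap) <= best:
                    continue                     # leaf, or pruned by the bound
                stack.append((k + 1, cap, val))                    # skip item k
                if w[k] <= cap:
                    stack.append((k + 1, cap - w[k], val + v[k]))  # take item k
            return best, nodes

        if __name__ == "__main__":
            rng = random.Random(1)
            values = [rng.randint(10, 100) for _ in range(40)]
            weights = [rng.randint(5, 50) for _ in range(40)]
            capacity = sum(weights) // 3

            cold_best, cold_nodes = solve(values, weights, capacity)
            warm_best, warm_nodes = solve(values, weights, capacity,
                                          incumbent=int(0.98 * cold_best))
            print("no incumbent  : best %d, %d nodes" % (cold_best, cold_nodes))
            print("warm incumbent: best %d, %d nodes" % (warm_best, warm_nodes))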

  9. I have always been fascinated by GPUs’ capabilities in linear algebra. Alas, no one in our department works on computational optimization.

  10. There is also a lot of interesting work on parallel metaheuristics for combinatorial problems; see for example:
    * M. Mezmaz, N. Melab, E-G. Talbi, “Combining metaheuristics and exact methods for solving exactly multi-objective problems on the Grid”, Journal of Mathematical Modelling and Algorithms, Vol. 6, No. 3, pp. 393-409, 2007.
    * I. Zunino, N. Melab, E-G. Talbi, “A Grid-enabled framework for exact optimization algorithms”, HPCS 2007 (High Performance Computing and Simulation), Prague, Czech Republic, June 2007.

    And their tools:
    * S. Cahon, N. Melab, E-G. Talbi, “Building with ParadisEO reusable parallel and distributed evolutionary algorithms”, Parallel Computing, Vol. 30, No. 5-6, pp. 677-697, June 2004.
    * R. Bolze, F. Cappello, E. Caron, M. Dayde, F. Desprez, Y. Jegou, P. Primet, E. Jeannot, S. Lanteri, M. Leduc, N. Melab, G. Mornet, R. Namyst, B. Quetier, O. Richard, E-G. Talbi, I. Touche, “GRID’5000: A large scale and highly reconfigurable Grid”, International Journal of High Performance Computing Applications (IJHPCA), Vol. 20, No. 4, pp. 481-494, 2006.

  11. “Can we parallelize the lp solver with openmp and optimize it significantly? Are there any examples as such?”

    Some modest gains are possible; see my comment #2 above. But the heaviest work in the simplex method is inherently serial or nearly so: basis factorization updates, FTRAN/BTRAN (triangular solves). And short, sparse vectors don’t lend themselves to parallelization.

    OpenMP (shared-memory SMP) is much lower overhead than message passing across a network, of course, but it isn’t *that* cheap. One still needs to do a reasonably significant amount of work in each thread to make the thread overhead pay off (see the rough timing sketch at the end of this comment).

    I think I recall Eckstein played around with a parallel dense simplex solver on the Connection Machine 5 in the early 90s, but I don’t think it led to any great re-thinking at the time.
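
    To put a rough number on the overhead point, here is a small Python timing sketch (multiprocessing rather than OpenMP, and a synthetic workload, so only the qualitative gap is meaningful): the same trivially parallel loop is dispatched once as one tiny task per element and once as a handful of big chunks.

        # Per-task overhead dominates when tasks are tiny; chunking amortizes it.
        # Absolute times depend entirely on the machine.
        import math
        import time
        from multiprocessing import Pool

        def tiny_task(x):
            return math.sqrt(x)                  # far too little work per dispatch

        def chunked_task(xs):
            return [math.sqrt(x) for x in xs]    # same work, one dispatch per chunk

        if __name__ == "__main__":
            data = list(range(50_000))
            nchunks = 8
            chunks = [data[i::nchunks] for i in range(nchunks)]

            with Pool(processes=4) as pool:
                t0 = time.perf_counter()
                pool.map(tiny_task, data, chunksize=1)   # one dispatch per element
                t1 = time.perf_counter()
                pool.map(chunked_task, chunks)           # one dispatch per chunk
                t2 = time.perf_counter()

            print("per-element dispatch: %.3f s" % (t1 - t0))
            print("chunked dispatch    : %.3f s" % (t2 - t1))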
