I just ordered a new computer to replace this one. mat.tepper.cmu.edu has done a great job over the last five years, but it is showing its age. The operating system is woefully out of date, causing ever-increasing security problems. Of course, if it gets out of date enough, the machine becomes “secure by obsolescence,” which is how a buddy of mine (who runs HP-UX 9 or so, circa 1995) describes his security measures.
I ended up going a little crazy on the new machine, ordering a dual quad-core system (that is, eight cores). It is amazing how these multi-core systems have taken over. It also means something important for us operations research people: parallelism is important. I have often been down on the COIN-OR people for all the bells and whistles they put in their code (I just want a branch-and-cut code: I don’t want to dig through all the stuff about “computing on multiple heterogeneous machines” and so on). But much of their code is already set up for multiple machines (and multiple processors on one machine), and that is going to be important. Multicore is of little use if the software doesn’t take advantage of it, and expecting the operating system to automatically parallelize a complicated optimization program is currently a pipe dream. As this article points out, for a lot of software, more cores do no good whatsoever. So hurray for the COIN-OR people for thinking ahead on this.
It is interesting to compare the speed of this system with the listing of supercomputers given by Top500 Supercomputers. I can’t get a clear reading on the speed of the machine I am about to buy, but it may be in the 40-80 Gflop range (can this be? Correct me if I am wrong). Based on the historical graph, such a machine would have been the fastest supercomputer in 1993 and would still have made the top 500 as late as 2000. Now it is under US$7000.
You didn’t specify whether you’re running AMD or Intel processors. You also didn’t specify the exact model and/or CPU clock.
For the Intel processors, in theory each core can complete four floating point operations per clock cycle. At 2.5 GHz (about the middle of the clock range), that’s 10 billion operations per second per core, or 80 billion operations per second across the eight cores. In practice, you’ll find that you don’t get anywhere near this performance, both because the processor has to do something other than floating point operations and because these systems don’t have the memory bandwidth to keep the processor cores busy.
Modern computer architectures are so complicated that very few programmers understand them well enough to get really good performance. If at all possible, you should make use of library routines (such as BLAS and LAPACK for linear algebra, FFTW for FFTs, etc.) that have been optimized for your machine instead of recoding these functions yourself.
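To make that concrete, here is a rough sketch of what calling a tuned BLAS routine for a matrix-matrix multiply might look like in C (the CBLAS interface is standard, but the surrounding function and array names are made up for illustration):

#include <cblas.h>   /* C interface to a tuned BLAS (ATLAS, MKL, etc.) */

/* Compute C = A*B for n x n row-major matrices.  The single dgemm call
   replaces a hand-written triple loop and is usually far faster, because
   the library has been tuned for the machine's caches and vector units. */
void matmul(int n, const double *A, const double *B, double *C)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A, n,    /* alpha and A with its leading dimension */
                     B, n,    /* B with its leading dimension */
                0.0, C, n);   /* beta and C with its leading dimension */
}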
It’s also worthwhile to distinguish between shared memory parallel computing (what you do with multi-core processors from AMD and Intel) and distributed memory parallel computing (what you do with a Beowulf cluster, although the individual nodes in the cluster are likely to be shared memory machines). It’s much easier to parallelize code for a shared memory architecture using the OpenMP extensions to C or Fortran than it is to parallelize code for a distributed memory architecture by passing messages from process to process using MPI.
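For example, here is roughly all it takes to parallelize a simple loop with OpenMP on a shared memory machine (a minimal sketch, compiled with something like gcc -fopenmp, and assuming the loop iterations are independent):

#include <omp.h>

/* y = y + a*x for vectors of length n.  Each thread handles its own chunk
   of the index range; no explicit communication is needed because every
   core sees the same memory. */
void scaled_add(int n, double a, const double *x, double *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}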
They are 2.33 GHz Intel Xeons, so I guess I give up about 10%.
There is a code I would like to write that could use all 8 processors. It is all integer arithmetic, with no chance to use library routines. Any suggestions for where to look for training on how to program at least passably well for this machine? The algorithm can be naturally parallelized in a pretty straightforward way that partitions the data and doesn’t rely on communication between processes.
OpenMP is great for parallel code, as long as the code has a very simple structure. As soon as one gets to more complicated structures, for example with dynamic load balancing issues, the model starts to break down.
Which programming model you should use depends mostly on the kind of program you want to write. OpenMP-style parallelism, demand-driven computation, explicit threads, and message passing all have their virtues. Of course, sometimes not all the possibilities are available, but in this case I think you can choose quite freely.
That computer is barely adequate for surfing the web and writing the occasional word *cough*openoffice*cough* document 😉
I have to agree with Mikael that OpenMP is best for situations in which you’re operating in parallel over sections of large matrices as is typical in numerical linear algebra on dense matrices. In this situation you typically just want to parallelize a few hot spots in the code, almost all of which correspond to deeply (at least two and typically three or more levels deep) nested loops. OpenMP is not nearly so good a choice for something like parallelizing a branch and bound code.
Parallelizing code that does number crunching on large arrays using OpenMP is something that I have experience with and find reasonably easy to do. Parallelizing a branch and bound code using MPI on a distributed memory cluster is something that I’ve tried but failed to do well. I found it to be unbearably difficult, so I’ve decided to avoid working on distributed memory architectures (except for some embarrassingly parallel Monte Carlo simulations that I did at Sandia).
Many folks who come from scientific computing backgrounds seem to have the same strong aversion to writing message passing programs for distributed memory machines. Of course, if you want to apply thousands of processors to your problem, your only choice is a distributed memory architecture. Thus the preference for languages like HPF or for higher level libraries like ScaLAPACK.
You certainly could write message passing code (using MPI) and run it on your shared memory machine, and you’d get a performance benefit from the shared memory over running the same code on a Beowulf cluster (the message passing has much lower latency). However, you’d pay a big price in programming complexity that might not be necessary.
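To see where that complexity comes from, compare the OpenMP loop sketched earlier with a minimal MPI version of a similar computation (a sketch only): even a trivial program has to start the MPI runtime, find its rank, carve up the index range itself, and explicitly combine the partial results at the end.

#include <mpi.h>
#include <stdio.h>

/* Each process computes a partial sum over its own slice of the index
   range, and the pieces are combined with an explicit reduction.  None
   of this bookkeeping is needed in the OpenMP version. */
int main(int argc, char **argv)
{
    int rank, size, n = 1000000;
    double local = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = rank; i < n; i += size)      /* this process's share of the work */
        local += 1.0 / (i + 1);

    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %f\n", total);

    MPI_Finalize();
    return 0;
}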
If your code with lots of integer arithmetic is spending most of its time iterating over arrays of integers and you don’t care whether your code can ever run on a distributed memory cluster, then OpenMP should be just fine for you.
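Something as simple as this would already keep all eight cores busy (a sketch, assuming the work on each array entry is independent; the divisibility test is just a stand-in for whatever your code actually computes):

#include <omp.h>

/* Integer-only example: count the entries of a that pass some test.
   The reduction clause gives each thread a private counter and adds
   them together at the end. */
long count_hits(int n, const int *a)
{
    long count = 0;
    #pragma omp parallel for reduction(+:count)
    for (int i = 0; i < n; i++)
        if (a[i] % 7 == 0)
            count++;
    return count;
}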
On the other hand, if your code must eventually run on a big cluster, or if your code is spending most of its time chasing pointers through linked lists then OpenMP is less likely to be a good choice.
It would certainly be interesting to go further into this discussion. Could you tell us more about the code that you want to parallelize? What problem does it solve? What kinds of algorithms does it use?
The code is for finding small binary voting trees (see this paper). I am trying to confirm that each of about 250 million voting rules occurs, or at least to generate as many of them as I can (symmetry allows me to knock out some of these). To do this, I generate binary trees in order of size, deleting all those that are “equivalent” (result in the same voting rule). Given all non-equivalent trees of size <= K-1, I get the trees of size K with the following combine-and-check procedure: combine each pair of smaller trees T1 and T2 into a candidate tree T3 of size K, check whether the voting rule that T3 produces has already been found, and if not, keep T3.
The trees of size K are stored in linked lists. The lookup to determine whether the voting rule of T3 has already been found is done with a hash table.
My intention was to assign different ranges of s to the different processors (I know the number of trees of size s, so I can do the balancing myself).
Given the ease of the combine and check steps, a shared memory model (rather than a message passing model) seemed best to me.
This seems like something that would be appropriate to parallelize with OpenMP. Be careful how you handle the “add T3 to size K” step, though, so you don’t get any bugs: these list insertions should be done with exclusive access to the size K linked list.
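Here is a rough sketch of that pattern; the types and routines in it (Tree, Rule, combine, rule_of, rule_seen, record_rule, push_tree) are hypothetical stand-ins for your actual code, and the exact loop bounds depend on how you count tree size:

/* Sketch only: all the types and routines below are stand-ins. */
void generate_size_K(int K)
{
    #pragma omp parallel for schedule(static)   /* splits the range of s across threads */
    for (int s = 1; s < K; s++) {               /* one split size per iteration */
        for (Tree *T1 = trees[s]; T1 != NULL; T1 = T1->next)
            for (Tree *T2 = trees[K - 1 - s]; T2 != NULL; T2 = T2->next) {
                Tree *T3 = combine(T1, T2);     /* candidate tree of size K */
                Rule  r  = rule_of(T3);         /* the voting rule it produces */
                #pragma omp critical            /* one thread at a time in here */
                {
                    if (!rule_seen(table, r)) {
                        record_rule(table, r);      /* remember the new rule */
                        push_tree(&trees[K], T3);   /* add T3 to the size-K list */
                    }
                }
            }
    }
}

Since you already know how many trees there are of each size, you could replace schedule(static) with your own explicit assignment of ranges of s to threads, or try schedule(dynamic) and let the runtime balance the work.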
Thanks: I’ll give it a whirl.