So I wasted a valuable afternoon today because of ickiness in LAM/MPI 6.5.9.

I have the AVIDD-B cluster all to myself today to run dissertation performance results. Mainly, I’m comparing LAM 6.5.x performance to LAM 7.x performance. For what I’m doing, the two should be pretty much the same; the whole point of my dissertation is that I added a bunch of great abstractions to LAM without any performance penalty. I had the cluster about 2 weeks ago, too, for the same reason. But a bunch of my numbers got borked 2 weeks ago: the 6.5.9 numbers were way worse than they were supposed to be. Specifically, 7.x performed way better than 6.5.9 on gigabit ethernet.

I thought that it was just a simple missing [memcpy] optimization that we debuted in 7.x. So I added that optimization to my copy of 6.5.9 today and re-ran the results. Same crappy performance.


So I removed the optimization from my copy of 7.x, and the same great performance was still there; i.e., there was still a huge performance difference between 6.5.9 and 7.x. I spent several hours trying to figure out what the heck the difference was. I even roped Brian into it, but we couldn’t remember what optimization we had added that gave such a huge performance increase in 7.0.

While I was going through Changelogs, it hit me. The stupid “[-O]” option to 6.5.x’s [mpirun]: if you don’t explicitly tell LAM 6.5.9 not to do it, it’ll always put data on the network in big endian (“network”) order. This really sucks on Linux boxen, obviously. So you specify “[-O]” to [mpirun] and tell it not to do that. In 7.x, we handle this automatically. But I had totally forgotten about this option, and none of my 6.5.9 results were run with “[-O]”. Hence, all those results were showing the effects of 2x byte swapping: once on the sender to convert to network order, and once on the receiver to convert back.
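To make the “2x byte swapping” concrete, here’s a toy sketch of the two paths. This is not LAM’s actual code; the function names are made up for illustration, and it just assumes a sender that converts integers to big endian before they hit the wire and a receiver that converts them back, versus a homogeneous path that ships the bytes untouched.

```python
import sys
from array import array


def send_network_order(values):
    """Hypothetical sender: convert host-order ints to big endian for the wire."""
    buf = array("i", values)
    swaps = 0
    if sys.byteorder == "little":
        buf.byteswap()  # swap #1: host order -> network order
        swaps = 1
    return buf.tobytes(), swaps


def recv_network_order(wire):
    """Hypothetical receiver: convert big endian wire data back to host order."""
    buf = array("i")
    buf.frombytes(wire)
    swaps = 0
    if sys.byteorder == "little":
        buf.byteswap()  # swap #2: network order -> host order
        swaps = 1
    return buf.tolist(), swaps


def send_homogeneous(values):
    """Hypothetical sender when all nodes share a byte order: no conversion."""
    return array("i", values).tobytes(), 0


def recv_homogeneous(wire):
    """Hypothetical receiver for the homogeneous case: no conversion."""
    buf = array("i")
    buf.frombytes(wire)
    return buf.tolist(), 0


if __name__ == "__main__":
    data = [1, 2, 0x01020304]
    wire, s1 = send_network_order(data)
    out, s2 = recv_network_order(wire)
    print("network order path:", out, "swaps per message:", s1 + s2)
    wire, s3 = send_homogeneous(data)
    out, s4 = recv_homogeneous(wire)
    print("homogeneous path:  ", out, "swaps per message:", s3 + s4)
```

Both paths deliver the same data; on a little endian machine the first one just pays for two full passes over every message to get there, which is exactly the kind of overhead that shows up in a bandwidth benchmark.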


And I knew about this option. It’s bitten me before. And I’ve scolded users to use it. I wasted several valuable hours on the cluster figuring this out (and it’s why my results weren’t right two weeks ago). Grrr…


