gm RPI. I've found a case that repeatably hangs the state machine, so I'm diving into it to find out why.
Wow -- I forgot how twisted this code is.
Actually, it's not _twisted_, it's more like _complex_. It's actually laid out fairly well and broken up into a million pieces, and *THANK GOD* I left many really long comments to explain what the heck is going on in the code.
Can you say: Holy special cases, Batman!
gm RPI that I've been chasing all morning...
* A peer process sends a message to me
* I have not yet posted a receive for this message
* The envelope is received, and I notice that it doesn't match any of the posted receives. So it's marked as unexpected, and placed in the proper area.
* The message is "short", meaning that there's still another message coming -- right behind this one -- that holds the actual content of the message
* But the underlying message transport (gm) indicates that there's no message here yet, so we return into the main MPI progression, and back up to the user program
* The user program then posts the matching receive
* Down in the RPI, the receive is matched with the half-received unexpected message.
* I set things up so that the next received gm message from this process will go directly into the user buffer
* ...but the request was still marked "ACTIVE", so it followed the normal (expected... as opposed to "unexpected") message progression, which meant posting it to the "pending receive" queues, even though it was already halfway finished.
So I had previously done about half the Right Thing -- I recognized that the short body was still pending, and tried to set up to receive into it. But I left a little bookkeeping undone, so Badness occurred sometime (randomly) later.
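For the curious, here is a toy model of that sequence. It is emphatically not the gm RPI code: every name in it (ToyRpi, Request, BODY_PENDING, the fixed flag) is made up for illustration, and the real state machine has far more cases. It only shows the shape of the bug: a receive that matches a half-received unexpected message has to leave the ACTIVE state, or it also lands on the pending receive queue.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Optional


class ReqState(Enum):
    ACTIVE = auto()        # posted, nothing matched yet
    BODY_PENDING = auto()  # envelope matched, body still in flight
    DONE = auto()


@dataclass
class Unexpected:
    src: int
    tag: int
    body: Optional[bytes] = None  # None: the short body has not arrived yet


@dataclass
class Request:
    src: int
    tag: int
    state: ReqState = ReqState.ACTIVE
    buffer: Optional[bytes] = None


@dataclass
class ToyRpi:
    unexpected: list = field(default_factory=list)
    pending_recvs: list = field(default_factory=list)
    direct: dict = field(default_factory=dict)  # src -> request awaiting a body

    def envelope_arrived(self, src, tag):
        # Expected case: a posted receive already matches this envelope.
        for req in self.pending_recvs:
            if (req.src, req.tag) == (src, tag):
                self.pending_recvs.remove(req)
                req.state = ReqState.BODY_PENDING
                self.direct[src] = req
                return
        # Otherwise mark it unexpected and park it.
        self.unexpected.append(Unexpected(src, tag))

    def body_arrived(self, src, data):
        req = self.direct.pop(src, None)
        if req is not None:
            # Straight into the user buffer.
            req.buffer, req.state = data, ReqState.DONE
            return
        for msg in self.unexpected:
            if msg.src == src and msg.body is None:
                msg.body = data
                return

    def post_recv(self, req, fixed=True):
        for msg in self.unexpected:
            if (msg.src, msg.tag) == (req.src, req.tag):
                self.unexpected.remove(msg)
                if msg.body is not None:
                    req.buffer, req.state = msg.body, ReqState.DONE
                    return
                # Half-received: the next body from src goes to this request.
                self.direct[req.src] = req
                if fixed:
                    req.state = ReqState.BODY_PENDING  # the missing bookkeeping
                    return
                break  # buggy path: the request is still ACTIVE
        # Normal (expected) progression.
        self.pending_recvs.append(req)


rpi = ToyRpi()
rpi.envelope_arrived(src=1, tag=7)  # envelope shows up, no receive posted
req = Request(src=1, tag=7)
rpi.post_recv(req, fixed=False)     # user posts the matching receive
rpi.body_arrived(1, b"payload")
print(req.state, req in rpi.pending_recvs)
```

With fixed=False the request completes but is still sitting on the pending receive queue, which is the kind of stale entry that causes trouble sometime later. With fixed=True the queue stays empty.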
Wow.
As I said last night, "Holy special cases, Batman!"
/home, but already had a few files under /home). Elapsed time: about 10 minutes.
configure script is over 1MB long.
'nuff said.
"Suppose the prophecy IS true. Suppose tommorow the war'll be over. And this MPI program runs error free. Isnt that worth fighting for? Isnt it worth DYING for?"
...After that I can do the rlogin with out providing a password, but when I run a progran it failed. The progran make a: mpirun hostfile.host /localhome/marc2001/bin/marc and I get a Connection Confused.
94 ECONNCONFUSED Connection confused. No connection could be made, or maybe it was. We're not certain. The situation may or may not have resulted from the client application attempting to eat Jello(tm) brand gelatin desert while utilizing a pogo-stick.
LAM_CONFIGURE_HOST="`hostname | head -n 1`"
So I wasted a valuable afternoon today because of ickyness in LAM/MPI 6.5.9.
I have the AVIDD-B cluster all to myself to run dissertation performance results today. Mainly, I’m comparing LAM 6.5.x performance to LAM 7.x performance. For what I’m doing the two results should be pretty much the same: the whole point of my dissertation is that I added a bunch of great abstractions into LAM but without any performance penalty. I had it about 2 weeks ago, too, for the same reason. But a bunch of my numbers got borked 2 weeks ago: namely, the 6.5.9 numbers were way worse than they were supposed to be. Specifically: 7.x performed way better than 6.5.9 on gigabit ethernet.
I thought that it was just a simple missing [memcpy] optimization that we debuted in 7.x. So I added that optimization in my copy of 6.5.9 today and re-ran the results. Same crappy performance.
Wha…?
So I removed the optimization from my copy of 7.x, and the same great performance was there. i.e., there was still a huge performance difference between 6.5.9 and 7.x. I spent several hours trying to figure out what the heck the difference was. I even roped Brian into it — we couldn’t remember what optimization we had added that gave such a huge performance increase in 7.0.
While I was going through Changelogs, it hit me: the stupid “[-O]” option to 6.5.x’s [mpirun]. If you don’t explicitly tell LAM 6.5.9 not to do it, it’ll always put data on the network in big endian (“network”) order. This really sucks on Linux boxen, obviously. So you specify “[-O]” to [mpirun] and tell it not to do that. In 7.x, we handle this automatically. Specifically, I had totally forgotten about this option, and none of my 6.5.9 results were run with “[-O]”. Hence, all those results were showing the effects of 2x byte swapping.
ARRRGGGGHHHH!!!!
And I knew about this option. It’s bitten me before. And I’ve scolded users to use it. I wasted several valuable hours on the cluster figuring this out (and it’s why my results weren’t right two weeks ago). Grrr…
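To make the "2x byte swapping" concrete, here is a small sketch using Python's struct module. It is not LAM's data conversion code; the helper names are invented for illustration. It packs the same integers in big endian ("network") order and in the machine's native order. On a little endian box the two wire images differ, so the sender swaps every value going out and the receiver swaps it right back coming in, even though both ends agreed on the format all along.

```python
import struct
import sys


def to_network(values):
    """Pack 32-bit ints in big endian ("network") order."""
    return struct.pack(f"!{len(values)}i", *values)


def from_network(data):
    """Unpack 32-bit ints that arrived in big endian order."""
    return list(struct.unpack(f"!{len(data) // 4}i", data))


def to_native(values):
    """Pack 32-bit ints in this machine's own byte order (no swap)."""
    return struct.pack(f"={len(values)}i", *values)


def from_native(data):
    """Unpack 32-bit ints that are already in this machine's byte order."""
    return list(struct.unpack(f"={len(data) // 4}i", data))


values = [1, 2, 0x01020304]
wire_net = to_network(values)  # sender converts to network order
wire_nat = to_native(values)   # sender just copies the bytes

# Both paths deliver the same values to the receiver.
assert from_network(wire_net) == values
assert from_native(wire_nat) == values

# True on little endian hosts: the network-order path swapped every value,
# and the receiver has to swap each one back.
swapped = wire_net != wire_nat
print(sys.byteorder, "swapped:", swapped)
```

Between identical machines the native path is a plain copy, which is why forgetting to turn the conversion off skews a benchmark so badly.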
Let the record show that I owe Brian 3 beers of his choice (size/quantity and brand unspecified) for, at a moment’s notice, dropping everything and figuring out how to use gcov (in parallel, no less) and calculate how much coverage we have in the LAM test suite.
We [finally] managed to get LAM/MPI v7.1.2 out the door. It includes about 1.5 years worth of bug fixes and updates. The reason it took so long is that we have been spending 99% of our time on Open MPI.
The release announcement is here.
Since I’m heading to Cisco (starting tomorrow), this is my last official action in LAM/MPI. I’ve been working with LAM/MPI for about 8 years: my first commit was 23 Oct, 1998, my last commit was minutes ago, moving my name from the “current” section to the “previous” section in the AUTHORS file:
-------------------------------------------------------------------
Author: jsquyres
Date: 2006-03-12 11:18:28 -0500 (Sun, 12 Mar 2006)
New Revision: 10311

Modified:
   trunk/AUTHORS
Log:
It is now time to say goodbye.

Modified: trunk/AUTHORS
===================================================================
--- trunk/AUTHORS 2006-03-12 16:18:16 UTC (rev 10310)
+++ trunk/AUTHORS 2006-03-12 16:18:28 UTC (rev 10311)
@@ -11,7 +11,6 @@
 Indiana University:
 - Brian Barrett (brbarret)
 - Andrew Lumsdaine (lums)
- - Jeff Squyres (jsquyres)
 - With special thanks to Josh Hursey and Andrew Friedley.
@@ -19,6 +18,7 @@
 ----------------
 Indiana University:
+ - Jeff Squyres (jsquyres)
 - Anju Kambadur (pkambadu)
 - Vishal Sahay (vsahay)
 - Nihar Sanghvi (nsanghvi)
-------------------------------------------------------------------
So long, and thanks for all the fish…
Too funny. I omitted a few e-mail addresses for privacy reasons, even the return address of the person who wrote this, even though they probably don’t deserve the protection:
From: <omitted>@domainsbyproxy.com
Date: October 1, 2006 2:43:35 PM EDT
To: <omitted>
Subject: FWD: FWD: FWD: FWD: Offer on your domain, lam-mpi.org. Please respond
[lam-mpi.org@domainsbyproxy.com] [lam-mpi.org@domainsbyproxy.com]
[lam-mpi.org@domainsbyproxy.com] [lam-mpi.org@domainsbyproxy.com]
Reply-To: <omitted>
Hello,
I am interested in purchasing your domain name, lam-mpi.org. I am only
interested in the domain and not any web site which may be associated with it.
My offer for this domain is $366.00. This is a genuine offer and it may be
negotiable depending on the domain's traffic/visitors.
I kindly request that you to respond to this e-mail so that I know you have
received it.
Thank You,
D.R.T.
Woo hoo! Congrats to the Indiana University team for putting out LAM/MPI v7.1.3 (this was the project that I was the head of while I was at IU). Here’s the release announcement:
It is likely that this is the last release of LAM/MPI ever.
This page contains an archive of all entries posted to JeffJournal in the LAM/MPI category. They are listed from oldest to newest.