November 2000 Archives

November 1, 2000

But it my leadership that got you in that dress

This is prep-week for SC2000 -- so most entries are likely to be technical. Deal.

The boys from HP have done it again -- they found a rather gaping hole in my IMPI implementation in LAM. Doh!

Quick explanation:

  • When a "long" message is sent across IMPI boundaries (where "long" messages are defined as longer than an agreed-upon number of bytes), it is broken up into 2 or more packets, where packets have a previously-agreed-upon maximum length. The first packet of a long message is sent "eagerly" (i.e., right away), and is marked as "first of a long" -- it is called a DATASYNC packet. When the receiver gets a DATASYNC, it allocates enough space for the whole message, does some other bookkeeping, and then sends back a SYNCACK telling the sender "go ahead and send the rest of the packets; I'm ready."

  • Messages in MPI are identified by the communicator that they are on (essentially, a unique communications space) and the tag that they use (a user-specified integer that distinguishes between messages). Messages that have the same source, destination, communicator, and tag have (for purposes of this discussion) the same signature -- meaning that MPI treats multiple messages with these same characteristics as interchangeable candidates when matching a receive.

  • When you send a message in MPI, you have to receive with the same signature. Hence, the signature of a send and a receive must agree.

  • Note that the signature has nothing to do with the contents of the message. Two messages with the same signature may contain completely different data, and even be completely different sizes. More to the point -- the message signature is a user-specified set of attributes, so it's up to the user to assign meanings to them; MPI just provides a flexible way to distinguish between different messages with the signature mechanism.

  • MPI has a message ordering guarantee for single-threaded, non-wildcard operations: two messages sent with the same signature must be matched by the receiver in the order that they were sent. That is, if I send message A with signature Z, and then immediately send message B, also with signature Z, right behind it, A must be received before B. If you think about it, it's pretty intuitive, actually.

For a longer discussion of MPI, see the MPI Forum web site. For a longer discussion on IMPI, see the IMPI web site.

Something that I hadn't thought about before was that long and short messages have different protocols. Take the following example:

  • Send long message A with signature Z.
  • Send short message B with signature Z.

Given what was discussed above, only the first packet of A will actually be sent to the receiver, whereas the entirety of message B will be sent to the receiver (because it's shorter than the length of one packet). However, A has to wait for an ACK and then the rest of the packets from the sender before it can be fully delivered to the receiver.

My implementation of IMPI didn't take this into account at all -- it just served up messages as soon as they became available. It didn't take into account the fact that long messages may be "in progress" and a short message may sneak in before the long message completed, and thereby violate the MPI message ordering guarantee. Doh!

Hence, I had to spend the majority of today drawing diagrams and flow charts, and then implementing a "gate" at the delivery end of IMPI such that it watches for long and short messages, and has a somewhat-complicated state machine to only let messages through when long messages are not already in progress. If a long message is in progress, the just-received message (even if it's the first of a long, itself) is queued up. When the long at the head of the queue finishes (i.e., we sent the ACK back to the sender, and the sender sent us the rest of the packets in the message), the rest of the queue can progress until either the queue drains or the first of a long is encountered (then we have to send the ACK back to the sender, wait for the rest of the packets, etc.).
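A minimal sketch of that gate logic, for illustration only. It is not the LAM code: the names (struct gate, gate_arrive, gate_long_done), the fixed-size queue, and the single global gate are all assumptions, and the actual sending of the SYNCACK is reduced to a comment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define GATE_CAP 16                  /* hypothetical fixed capacity */

enum msg_kind { MSG_SHORT, MSG_LONG_FIRST };

struct msg {
    int id;
    enum msg_kind kind;
};

struct gate {
    struct msg queue[GATE_CAP];      /* FIFO of arrivals held behind a long */
    size_t head, count;
    int long_in_progress;            /* nonzero while a long awaits its tail */
    int long_id;                     /* id of the long currently in progress */
    int delivered[GATE_CAP];         /* ids in the order handed up to MPI */
    size_t ndelivered;
};

void gate_init(struct gate *g)
{
    memset(g, 0, sizeof(*g));
}

static void deliver(struct gate *g, int id)
{
    assert(g->ndelivered < GATE_CAP);
    g->delivered[g->ndelivered++] = id;
}

/* Let one message through the gate. A short is delivered at once; the
   first packet of a long starts the handshake (the SYNCACK would be
   sent here) and closes the gate until that long completes. */
static void admit(struct gate *g, struct msg m)
{
    if (m.kind == MSG_LONG_FIRST) {
        g->long_in_progress = 1;
        g->long_id = m.id;
    } else {
        deliver(g, m.id);
    }
}

/* Called for every short message or first-of-a-long that arrives. */
void gate_arrive(struct gate *g, struct msg m)
{
    if (g->long_in_progress) {
        assert(g->count < GATE_CAP);
        g->queue[(g->head + g->count) % GATE_CAP] = m;
        g->count++;
    } else {
        admit(g, m);
    }
}

/* Called when the last packet of the in-progress long has arrived. */
void gate_long_done(struct gate *g)
{
    assert(g->long_in_progress);
    deliver(g, g->long_id);
    g->long_in_progress = 0;
    /* Drain until the queue empties or another long closes the gate. */
    while (g->count > 0 && !g->long_in_progress) {
        struct msg m = g->queue[g->head];
        g->head = (g->head + 1) % GATE_CAP;
        g->count--;
        admit(g, m);
    }
}
```

With a long A followed by a short B, gate_arrive() queues B, and only gate_long_done() hands A and then B upward, which preserves the ordering guarantee. A real implementation would probably keep one such gate per sender rather than a single global one.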

Not a simple undertaking.

After all this, I got it working with HP's IMPI implementation. I found a bunch of memory leaks in our proxy agent (the "impid") and fixed up all of those. There's still a ton of "blocks in use" when the impid quits, but those are all from the internals of Solaris and there's nothing that I can do about them. :-\

After fixing those, I released LAM 6.4a6 to HP and MPI Software Technology (the two MPI vendors that we have to demo this stuff with next week at SC2000).

I love bcheck. I can't imagine how I programmed before I discovered it. Go RTFM if you don't know what it is.

One thing that bugs the crap out of me, though, is our implementation of what is called the IMPI server. The IMPI server is basically used as a rendezvous point at the beginning of a run. All the MPI implementations meet there, have some coffee, exchange some meta information, and then go off and shake their booty.

Needless to say, this all contains lots of socket code. The server allows you to specify what port it sits on to listen for the MPI implementations to meet at, or take a randomly assigned OS port. It's frequently convenient to use a fixed port for repeated runs, so that you can just do !! (or up-arrow) in the server and client windows and not have to change the port number in the various command line arguments.

However, sometimes when one tries to fire up the server again, it complains that the socket is "already in use", and you can't reclaim it for several minutes while the OS times out. Result: you have to go change the port number in all the command line parameters, which is a pain.

The thing is, I don't know why it says that the port is "already in use" -- I don't know the conditions that lead up to this. Indeed, take something like sendmail or apache -- it can always fire up on the correct port (25, 80, respectively) no matter what state it was previously shut down in. This suggests that it's not a client action that guarantees that the port will be open, but a server action. But I'll be damned if I know what it is. :-(

If anyone has any insight here (and is still reading this :-), please enlighten me...
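For what it's worth, the server-side action that usually matters for this symptom is the SO_REUSEADDR socket option, set on the listening socket before bind(). Without it, bind() can fail with "address already in use" while connections from the previous run are still sitting in TIME_WAIT. Below is a minimal sketch using the standard BSD sockets calls. The function name open_listener and the backlog of 5 are illustrative assumptions, and none of this is taken from the actual IMPI server code.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Open a TCP listening socket on the given port (0 = let the OS pick).
   Returns the socket descriptor, or -1 on failure. */
int open_listener(unsigned short port)
{
    struct sockaddr_in addr;
    int on = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    /* Must be set before bind(): allows rebinding the port even though
       old connections on it are still in TIME_WAIT. */
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) < 0) {
        close(fd);
        return -1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
        listen(fd, 5) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

That would fit the observation about sendmail and apache: the option is something the server sets on its own socket, not something the clients do.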

November 8, 2000

Would you like a mouse pad?

Been a busy week.

We're all here at SC'2000.

Things were not going well at the beginning.

They appear to be going well now.

The schmoozometer has been set on ultra.

Lots of people are lovin' LAM.

Must go.

Miles to code before I sleep.

November 10, 2000

Would you like a mouse pad?

There were some rocky parts, but I think we had a good SC2000 overall.

This is an epic journal entry. Cope.


Some of us met in the lab where we gawked at the LAM and LSC shirts that Jeremy picked up on Saturday night. They rocked. The nd.edu network went out around 11:45 (note: this is important for later).

Long flights, a three hour layover in Midway (what a crazy place), arrival in Dallas. Lummy met us at the Dallas airport. ATA lost Arun's luggage, but we waited there for a while anyway. Got in, had dinner (which was Much Fun), and started some slides in Arun/Brian's room.

The hotel has high speed internet access, but nd.edu was down. Luckily, nd.edu's vBNS link was still up, so we could get in via Berkeley or Argonne. So life was still ok -- we could still get to our e-mail and do some work. No biggie.


We got to the exhibition floor somewhere around 9am. We appraised the situation, said hi to all the good IU and Purdue folks, and started to get our stuff together. The commodity link to nd.edu was still down, so we started downloading LAM and XMPI via Argonne.

Then.. BAM!!!

nd.edu's vBNS link went down.

And stayed down.

Life sucked.

As Arun said in his journal, epics have been written about less. We cobbled together [mostly] working versions of LAM and XMPI from backup and working copies at Berkeley and IU and our laptops. Ugh. We were all cursing Ameritech (supposedly the cause of nd.edu's outage, but I still blame the OIT).

It was a race against time to get all our stuff downloaded, assembled from the various repositories around the country, couple it together with some missing software from ftp.gnu.org, battle a shaky SciNET (the network on the SC2000 show floor -- it kept going in and out), and get it all working.

The deadline was 7:30pm -- my IMPI demo. We finally got enough downloaded, and I met with the HP people. We were further confounded by the fact that the union folks made us clear the aisles in order to lay all the carpet between all the booths. Hence, I couldn't travel to the HP booth to coordinate with CQ (HP's IMPI guy -- his real name is Asian, and probably unpronounceable to us Americans, so he goes by "CQ"). I finally got over there around 4pm, and we did some testing.

After a bit of futzing, we got it up and running with HP as the master and displaying on his machine (we had to download and install ssh because they didn't have it, and the IU demo machines didn't have telnet (yeah!)). But it all worked out.

After some more battling (battling low battery power, shaky SciNET connections, and pesky sales droids), we got it to work properly with the IU booth machines as the master. Whoo hoo!!

We also converted Matt from Purdue from MPICH to LAM. We reduced the complexity of his Makefiles dramatically, and showed him the goodness of lamboot, mpirun, etc. He said, "I'm a convert!". Another happy customer.

MPI Software Technology (MST), however, wasn't quite as lucky. :-( They didn't bring the right kind of fiber connectors to get on SciNET, and then the local Fry's was out of the right kind. Their IMPI implementation was not quite finished, either. I managed to download a recent copy (nd.edu came back up that evening) of LAM's IMPI distribution tarball. I downloaded a copy to their LAN and helped him get it up and running (they previously had some problems trying to install LAM, but I don't quite know why...). Rossen thanked me, and started debugging.

So my demo went off at 7:30pm and it seemed to go off well. I had a varying size of crowd watching. I was a bit annoyed, though, because literally at the last minute, I got switched to the other ImmersaDesk, and nothing was set up right. It took a good 10 minutes to get it set up right just so that I could bring up my slides. It was somewhat embarrassing because the NIST folks (the people who funded our IMPI work) were standing there waiting for me to start talking. But it eventually turned out ok.

We gave out a surprising number of LAM key chains (they were quite popular!). We walked around a bit and saw a few people, and it was generally pretty good.

We left there, dropped our stuff off back at our room, and went to the Beowulf Bash (which was conveniently in our hotel). It was pretty cool; when we got there, they were announcing that more beer was coming imminently (and it did :-). We chatted with Dave from Myricom (an ND grad) and swapped ND stories. I also chatted with Doug from Paralogic, Don from Scyld, and Dan from Scyld.

Dan chatted with all of us for a while -- they do some really cool stuff in Scyld for their clusters. They have an rfork() call that forks things onto nodes (and an associated rkill()), and do process migration all over the place. They directly load the BIOS to boot linux in 3 seconds, and then get everything else from the cluster master. I don't know all the details, but it sounds good.

I also chatted with Dan about the parallel MP3 encoder that I wrote a while ago (he downloaded it and was amazed that he downloaded something from a .edu site -- particularly the LAM/MPI site -- and he ran ./configure / make with his MPICH distribution, and it just worked). He also wanted to talk about a parallel ogg vorbis encoder, and wants to write a paper about it for Linux Journal (I think it was LJ -- can't recall offhand). This could be really cool. I think we might do it.

I sent Dan an e-mail later saying, "let's do it -- how do you want to proceed?" We'll see what happens. Also, Scyld is interested in LAM -- to do so, we would probably need to ditch the lamd. In such a case, Scyld would have to provide some services like process management (which I think they already do), an out-of-band messaging channel (which might be harder), potentially trace gathering, and name/value publishing. We'll see how this all works out.

After all the schmoozing, Brian and Arun and I had cigars downstairs and had a good chat about all kinds of things. Rock on.


Saw some MPI papers in the morning. Two were about one-sided implementations. The third was about... er... something. One guy presented results with LAM. Whoo hoo!!

We schmoozed all day. We officially ran out of key chains. We got several t-shirts from several companies, including a really nice button down shirt from Veridian (the PBS folks).

We talked to all kinds of people -- so many that I actually don't remember everything that happened that day. It was good. I do remember chatting with the Myricom folks quite a bit, though, and chatting with the PBS folks, NIST people, and HP.

I stopped by to see how MST was doing with IMPI. They were still having some problems, but I didn't have time to debug with him. I came back later and helped some more -- turns out that he wasn't zeroing out the upper 12 bytes in the IPv6 address, so LAM wasn't able to find a match in the source address. Hence, dropped packets. This turned into goodness; the MST/LAM ping pong tests started working.
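To illustrate why unzeroed bytes break the match, here is a small sketch. It assumes a 16-byte address field with the IPv4 address in the last 4 bytes and the first 12 bytes required to be zero, plus a receiver that compares whole fields with memcmp(). The names pack_ipv4 and addr_match are hypothetical; neither LAM's nor MST's real code is shown here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ADDR_LEN 16   /* IPv6-sized address field */

/* Store an IPv4 address in a 16-byte field: zero the leading 12 bytes,
   then copy the 4 address bytes into the tail (assumed layout). */
void pack_ipv4(uint8_t field[ADDR_LEN], const uint8_t ipv4[4])
{
    assert(field != NULL && ipv4 != NULL);
    memset(field, 0, ADDR_LEN - 4);
    memcpy(field + ADDR_LEN - 4, ipv4, 4);
}

/* Whole-field comparison: any stray byte in the padding is a mismatch. */
int addr_match(const uint8_t a[ADDR_LEN], const uint8_t b[ADDR_LEN])
{
    return memcmp(a, b, ADDR_LEN) == 0;
}
```

If the sender fills in only the last 4 bytes and leaves whatever happened to be in the buffer in the first 12, addr_match() fails even though the IPv4 part is identical, and the packet cannot be matched to its source.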

Dinner was with the Research@Indiana folks at Fish: An Upscale Seafood Restaurant. All us ND students sat together (except George, who sat with Jesus, 'cause they got there a bit after us). Our conversation was mostly about the GPL, licenses, etc. It was pretty good, all around. A good time was had by all, and the food was excellent.


Got in a bit early to set up the LAM and XMPI demos. We had some real problems. :-( We uncovered some bugs in XMPI at literally the last minute, so I canceled the XMPI demo, and we did just the LAM demo. We actually had some problems there, too -- we had problems making a user MPI program fail in a controllable way (we wanted to show the usefulness of running a LAM/MPI program under a debugger). But we finally got it, and it worked out ok.

However, we did have major problems with the Sun Workshop debugger -- we just couldn't get it to run. gdb didn't work, either. We had 4 UltraSPARC 10 machines to run down here, but they weren't quite set up the way that we were expecting. In particular, we asked for tcsh to be our default shells. But after some painful processes of elimination, we proved that the tcsh that was installed on those Suns was broken -- it caused gdb to fail, and it sometimes caused logins to hang and tcsh CPU usage to go to around 95% or so. VERY annoying, and very difficult to track down -- how often do you actually suspect the shell itself? No, you assume other things are wrong (like your dot files, the OS, etc.). But switching to csh fixed everything. I've never seen anything like it before.

But we didn't figure this out before the LAM demo, so we actually ran on nd.edu machines and used gdb (firing up the Workshop debugger just took way too much time). The demo and talk actually went well, though.

I talked with a whole bunch of people throughout the rest of the day -- we wandered the floor some more, talked to some ASCI people, Tony and company at MST, the Compaq sales guys, etc., etc. During my "booth duty" time, I chatted with lots of people about LAM/MPI and ND (including some people whose sons/daughters are currently at ND), and particularly with a guy from Sweden about LAM who mentioned that he wanted the ability to checkpoint LAM/MPI processes so that he could take his nodes down and do maintenance on his cluster. And then when he's done, restart the process and keep the MPI job going. I initially said no, you can't do it because of the "socket problem" (i.e., you can't checkpoint sockets -- more info below), but then I started thinking about it, especially with respect to the Condor checkpoint library (very cool stuff). We chatted about this for a while, and I ended up putting it in the background because other things were going on.

Spent a bit more time with Rossen and his IMPI. I don't recall what the exact error was, but we found it and fixed it, and after Rossen worked out the rest of the details, it later worked with LAM/MPI in the pmandel code. Woo hoo!

Spent a good amount of time debugging XMPI and LAM's demo (and figured out the tcsh/csh issues). After figuring out the csh problem, LAM pretty much fell in line right away. Brian and I spent the rest of the afternoon debugging XMPI and stayed after everyone left. We fixed up most everything and fixed up some nagging bugs.

Renzo called in the middle of this and we setup stuff for the BC game at ND this weekend. He's in Vegas this weekend, so no family dinner with Lynzo and the chunky monkey. Bonk! :-(

One of the problems was actually an error in Sun Workshop 5.0's <fstream> implementation. VERY ANNOYING. It turns out that using getline(fstream&, string&) to read in a blank line makes eof() start returning true. ARRGGGHHH!!!

Once we figured this out, Brian and I left for dinner (around 8pm). We passed the Myrinet folks, and chatted with them for a while (lots of laughs -- we share the same exact feelings about writing software, users, distributing software, etc., etc.). They recommended an Italian restaurant for dinner.

Brian and I headed out for dinner, and I brought up the checkpoint/restart problem with Condor's library. We talked about this for a while (we were in one of those cool Italian restaurants with paper tablecloths, so we could draw on it with the provided crayons, etc. Very handy!). A good dinner, with good food. We caught a cab back.


More LAM pimping. Had more good chats with Myricom/Bob Feldman; seems like we could have quite a future there. Near the end of the day, talked to InfiniBand people about using their stuff as a high speed fabric for LAM. Had a look at some other booths; talked to the NPACI people, who had some REXEC people, and shared some info about LAM (since REXEC has some common elements with LAM).

Went over to the RealWorldComputing booth; they have some cool stuff, including SCore MPI. Meant to look at that last year, but...

Then we talked to a few linux integrators, pimping LAM. One hadn't heard of LAM (bastards!), but the other was Linux Networks. "Hey Jeff... we talked last year" was the greeting. Amazing. And apparently, Dog and Brian had been there about 5 minutes previously. But we had a nice chat and he gave us t-shirts.

Then the expo was over. We cleared our stuff out of the Research@Indiana booth and went back to the hotel. As we were getting on the shuttle bus, I said to Arun, "hey... some Swedish guy came up to me yesterday and gave me a great idea about checkpointing MPI jobs in LAM..." and then I stepped on the bus. I heard behind me, "Hey... you're the LAM guys! We've been meaning to find you!"

Turns out that the Condor grad students were standing right behind us and heard me mention checkpointing and noticed who we were. It further turns out that they've been having similar ideas -- wanting a checkpointable/migratable MPI. So we chatted on the bus, and then chatted some more in the bar before they had to catch a cab back to the airport. REALLY cool stuff, and we think we can do it. There's some delicious complications, but the fact of the matter is: no one else can do this, and it would be truly fantastic if we could do it.

Condor wants a checkpointable MPI and one that they can schedule/migrate around in Condor, and we want a checkpoint/restartable MPI. This could be the start of a really, really cool collaboration. I'll jot down the notes that are in my head in a technical journal entry after this. I'm still brimming over with goodness about this; I actually think we can make it all work (and get a bunch of papers, become famous, and take over the world). How cool is that?

We then met everyone else from the LSC and went to dinner at the Spaghetti Warehouse in the West End. Good food, and good conversation -- a good time was had by all.

And now I'm back here, typing it all in so that I don't forget it.

Now on to the technical journal entry about Condor/LAM...

So all in all, it was a good SC2000.

Yes, I would like a mouse pad.

I forgot to mention that I am Mouse Pad Pimp Daddy. We came to Dallas with 900 LAM mouse pads (300 C, 300 Fortran, and 300 C++). WE HAVE NONE LEFT!!! I think that I personally handed out about 700 of them.

Rich from the OIT told me that I could be a used car salesman.

November 13, 2000

On diatribes and dianetics

A good weekend.

I haven't finished typing up my technical thoughts on LAM/Condor yet; that's forthcoming.

This weekend was good -- I got back to SBN on Friday evening and briefly stopped at Ed-n-Suzanne's for a most excellent tuna sandwich. I then met up with Renzo and we ended up going to Senior Bar, where we ran into lots more people, Stina, Jason [current 'bone section leader], lil'Putt, Jill B., Jason B., Deli, Catherine K., etc. It was a good time. We then hiked to my office to get the parking pass.

The next morning, I was blading to where Renzo and Schleggue were parked when I ran into Jill B. again. During the conversation, my phone rang; it was Renzo, asking where the hell I was. Oops! I was now very late. But I eventually got over there, and Schleggue, Renzo, and I had some good conversation before we ended up heading over to the Putt tailgater.

More fun was had there by all. Tracy eventually joined us (she drove up that morning), and we all headed into the game. We smuggled Renzo and Schleggue into the student section, which was cool. Mike N., Brian B., his fiance Dana, Jeremy S., and Katie M. joined us as well. It was a fun game; a few nervous moments, but we ended up stomping on the hated BC Eagles, so the day ended well.

Thinking that we were smart, we ordered Papa John's right after the game from the stands on the rationale that it would take forever to get the pizza and we'd be at Oak Hill long before it arrived. Indeed, the PJ person told me that it would be 60-90 minutes before the pizza came.

We ran into Vernon by the car, and invited him along. Jason Brost left a message on Schleggue's voice mail (apparently the #$@%@#$% wireless circuits get very busy in SBN during football games, and many calls don't get through, so they get switched to voicemail) indicating that he might drop by. So we decided that we didn't order enough pizzas. I called PJ back (it was 30 minutes after our first order at this point) to see if we could add another pizza to the order. The PJ person told me that the delivery guy had already left. DOH!!

So Tracy and I got out of the car (which was stymied in a long line of cars waiting to exit the Hesburgh library parking lot area) and started jogging to Oak Hill (me), and to PJ itself (Tracy). I didn't beat the pizza guy, but he went to the wrong address anyway, so he ended up coming back not long after I got there (which was a good bit before Renzo et al. arrived in the car). Good exercise to jog from the Hesburgh parking lot to Oak Hill, but God, I despise running...

Tracy and I went to mass at the Basilica the next day, but it was so crowded that we had to stand in the vestibule for the whole mass. After a brief trip to the Grotto, Tracy headed home, and I went to a SC'2000 roundup meeting at Lummy's. We chatted about LAM, SC2000, and future directions. Looks like Jeremiah, Ron, and Brian will eventually be joining the LAM Team. Woo hoo!!

Ron also mentioned an ANSI-izer tool that we could use on the LAM source code. Mmm.... I've been wanting to do that for quite a while. Since there are 900+ source files in LAM/MPI, the standing rule has been to ANSI-ize each file whenever you edit it (it's just too much to go through and do them all at once). But having a tool to do it would be fabulous...

Ron also mentioned the LXR, which we might use to create an annotated, self-referencing hyperlinked version of the LAM source code. That too, would be quite cool. Lummy's big on web-enabled groupware things, so we're probably going to explore a few of those for the LSC as well.

I drove home, took care of a bunch of emails and things that popped into my head while I was driving, and then watched the X files with dinner.

Now on to finishing that technical discussion of LAM/Condor...

November 14, 2000


Ok, so I didn't spend much (any) time on the Condor/LAM stuff yesterday. I spent most of the day finishing up the Password Storage and Retrieval system (PSR) originally written by Dale Southard. We use it with our batch queueing system (PBS) to get AFS tokens when jobs are submitted, and to automatically refresh tokens before they expire so that AFS authentication lasts throughout the entire submitted job.

It's pretty cool stuff -- it uses public/private keys for storing the user's password and whatnot. I've made it fully automake-ized, cleaned it up a bunch, added it to CVS, fixed a few bugs, ensured that it works with both Transarc's proprietary development AFS libraries and the krb4 freeware AFS libraries, and updated the patch to the OpenPBS source code (it's dynamically generated now, too). I finished early this morning and sent it off to Dale for review, and to Bob at PBS so that he can give the patch a once-over.

Hopefully -- that will be it, and I'll be able to release it and get it out of my hair.

Today will be spent answering 3 old LAM emails and working on the LAM/Condor description:

  • Keith from Citifinancial: he has discovered that when in fault tolerant mode, if you mpirun before the lamd's have discovered that one of the other lamd's is down, mpirun will get the wrong information and sit forever trying to spawn a job on a node where the lamd is gone. Hence, deadlock. Need to fix this.

  • Dave from GE: wants to get the native signal/error handler fired when LAM intercepts a SIGSEGV, SIGBUS, SIGFPE. Seems like a reasonable request; need to work with him a little more to get the details right.

  • Patricia from Dec: thinks that she has found a problem with MPI_Intercomm_merge in LAM. Need to check this out; I think she sent a sample program that shows the error.

Off to work...

November 16, 2000

Winter is the finest 7 months of the year in Wisconsin

Been cleaning up LAM code for the past 48 hours. Trying to make it compile with a C++ compiler. You have no idea how painful it is.

And just when I thought I had a handle on it (I got liblam.a and libmpi.a and a bunch of supporting apps to compile cleanly), I moved into the lamd tree.

Oh, pain, pain, pain!

I'm in function pointer hell.

The original Llamas did everything in the pre-ANSI way, which was to simply declare a function pointer with the right return type, but with no arguments in the argument list. I guess this works...?

Part of the problem is that many of the lamd functions are supposed to return function pointers [effectively] to themselves. More to the point, they have to return pointers to functions that have the same signature as themselves. That is, function A has to return a pointer to function A (or a function that has the same signature as A).

After dinking around with this for quite a while, I sat back and thought about it, and it turns out that C/C++ can't do this legally. i.e., you can't declare a function that returns a pointer to a function with the same signature. It's a recursive problem -- trying to do so changes the return parameter type, which then changes the function signature, which then changes the return parameter type... etc., etc.

A more concrete example:

ret_type func_name(arg_list);

The goal is to have a function signature (call it func_sig) that encompasses all of that. However, ret_type would then have to be "pointer to func_sig", and func_sig itself contains ret_type, so the definition never bottoms out. Hence, C/C++ is unable to describe this abstraction directly.

This is actually very interesting (to me, at least), because I've never run across something that C/C++ just couldn't do because of its language specification. Sure, there are tons of things that C/C++ is not good at, but I can't recall ever running across something that it just couldn't do because of its language.
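For what it's worth, the usual way around this (not something from the LAM tree) is to wrap the function pointer in a struct. A struct can be forward-declared by name, so the function can return the struct, which in turn holds a pointer to a function of that same type. A minimal sketch, with all names (struct state, state_fn, step, run) made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

struct state;                                      /* forward declaration */
typedef struct state (*state_fn)(int *remaining);  /* returns the wrapper */
struct state {
    state_fn next;                                 /* NULL means "stop" */
};

/* A function that hands back a pointer to itself (wrapped in the
   struct) until the countdown reaches zero. */
struct state step(int *remaining)
{
    struct state s;

    --*remaining;
    s.next = (*remaining > 0) ? step : NULL;
    return s;
}

/* Drive the chain from a starting function; returns the call count. */
int run(state_fn start, int steps)
{
    int calls = 0;
    struct state s;

    assert(steps > 0);
    s.next = start;
    while (s.next != NULL) {
        s = s.next(&steps);
        ++calls;
    }
    return calls;
}
```

The struct adds one level of naming, which is what breaks the infinite regress: the return type is "struct state", not the function type itself. The same code should also compile as C++.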

Anyway, getting tired -- off to bed before I screw up the LAM tree...

November 17, 2000

Extra thrifty lima beans

New version of Mojonation came out a few days ago. I noticed this because I suspected a memory leak in Mojonation because my router would become increasingly slow (although I never checked its memory usage... doh!) and swapping activity would become much more pronounced (I have a loud disk drive in that machine :-).

So I restarted mojonation today, and it told me that there was a new version available on the web site. Among other things, it fixed a memory leak. :-) We'll see how this bad boy performs now...

Additionally, Lummy sent around a hot tip about Linux's hdparm which allows you to tweak the performance of your IDE hard drives. I tweaked a bit on my laptop and got a good amount of speedup. Same for my router -- tweaked a bit and got some improvement (from about 4.something MB/sec to 6.something MB/sec). On my desktop machine, the performance increase was dramatic. I went from 4.83 MB/sec to 25.50 MB/sec! That rocks!

Per request, I created web archives for our LSC staff internal mailing list today. Some pearls of wisdom have been mailed across the list (C++ tricks, location of Friday lunch order files, etc.) and been lost. Web archives fix that.

I also made it a real mailing list instead of a sendmail alias. GNU mailman ROCKS.

I forgot to mention in the journal that a few days ago (or was it last night? Time has no meaning...), I formally released the Password Storage and Retrieval system (PSR) that allows OpenPBS jobs to run with AFS authentication. I also pinged the Condor guys about it (today) since I seem to recall that Dale said something about how they were interested in it. But I could be hallucinating.

Speaking of Condor, I mailed off the huge technical entry about LAM/Condor (curses -- it just occurs to me that I set the category incorrectly on last night's journal entry!) to the Condor folks. Erik says that he'll read it this weekend in depth and discuss it with the other Condors next week.

I wonder if they refer to themselves as Condors as we refer to ourselves as uber-auth^H^H^H^H^H^H^H^HLlamas.

Off to do some LAM debugging, and then more dissertation writing. Gotta get a skeleton together at the very least.

Got to Hell, Costas

The Moog rocks.

I found Arun's The Moog Cookbook in my laptop as a leftover from SC2000. So I had to rip it into MP3s and have been enjoying it all day on my surround sound speakers. It's no "Slut", but it's not bad.

And of course, I'm gonna have to buy the damned CD now. Damn morals... arrghh...

Had a dentist appointment this morning. He tells me that all four of my wisdom teeth are gonna have to come out, as well as one more that's as rotten as a skunk roadkill in Alabama in the middle of July. And baby, that's rotten.

Spent the majority of the rest of the day finishing typing up my notes on Condor/LAM. I'll send those in a separate journal entry.

I did spend a little time looking into anti-virus software for my church. What a scam. You basically have to subscribe to anti-virus software these days -- pay a yearly fee for the privilege of continuing to get anti-virus updates. On the one hand, I can see how the company is continuing to provide a service, and that service should be paid for. But on the other hand, it's more like a tax -- if you run in the Windoze or Mac world, you need to have anti-virus software. Hence, you will have to pay whatever they charge. And it's not like there's tons of competition in the anti-virus world: there's essentially two companies, and their prices after 2 years of subscriptions are essentially a wash.

Don't let me get started on a rant here, but have you noticed how the whole security industry is founded upon the mistrustful nature of humans? Remember ARPANET? (of course, few of us "young 'uns" actually remember the ARPANET, but we've all read about it) There was no security -- everyone just trusted each other. There were no passwords, no secured protocols, no encryption. It just worked.

Such a system is inconceivable these days -- releasing the 'net to the rest of the world has brought out the worst in humans. Online scams, cracking, stealing of information, viruses -- it's all now commonplace and people almost expect it. Or, even worse, they have the attitude, "I don't have any important information -- no one would bother to hack into my system..." But that's a whole different topic; I digress.

So to combat this, the whole virtual security industry sprang up pretty much overnight. It's probably a multi-billion dollar business. And it can't even offer any guarantees. And it's all because humans suck, morally speaking. Especially the high-school punks who break in just for the sport of it, and don't realize that each of their pranks actually costs thousands of dollars. These kids don't even realize that what they are doing is wrong. It doesn't matter how easy it is -- it's still wrong. Just because I know that the Smiths leave their front door unlocked during the day doesn't mean that I actually walk into their house and start poking around.

And viruses. What the hell is the point of that? They're not directed attacks. They are potentially wide-spread attacks with massive collateral damage to innocent people who did nothing wrong other than open an e-mail attachment. Why? What could the virus writer possibly derive from that? Some kind of sick, twisted joy at the fact that their virus brought down hundreds of mail servers (e.g., Melissa), or wiped out thousands of hard drives around the world? My dad's hardware store got hit with a virus recently. It instantly went out across his Windoze network and infested 3 workstations. Luckily, the virus was fairly benign -- it only whacked all his .jpg and .gif files. But it could have been much, much worse. And that computer network is his livelihood -- if all that data goes away, he's screwed. All because some high school kid thought it would be fun.

I'm grossly stereotyping here, sure. So sue me, but I'm mad.

This may seem to be a bit of a stretch, but bear with me... I talked to a guy in GE Medical Systems one day -- he was a manager in their product development section. I told him that I was a computer scientist. He said he loved to get newly graduated comp sci majors working for him. He said that without fail, within the first month or so of all new comp sci hires, he would take them down to a hospital and show them real patients whose lives depended upon the software that they wrote. A bug, a simple seg fault, an overflowed buffer, a bad logic test, and someone will die.

So the things that we do on computers (as computer scientists) we tend to imagine all stays "in the computer", and it can be hard to realize that what we do actually affects real life. But it does. The medical systems example is rather extreme, but I even went off in a previous journal entry about how LAM/MPI is used in people's daily lives, and the things that LAM/MPI is used for are in even more people's daily lives. Indeed, my favorite example of one project that uses LAM/MPI is the US Naval Surface Warfare center (SWAPAR). They use some of the MPI-2 dynamic process management features of LAM/MPI to simulate large scale naval battles, and use that to help shape navy tactics and policy.

So what we do is real. It matters. And it matters when that punk releases a virus that goes off and destroys a few thousand random hard drives. It matters a lot to the people whose hard drives it crashed. And it offends me that others in my profession do these kinds of things.

But to end this very random and wandering diatribe on a positive note, the next time you're sitting in a movie theater watching some naval battle and some "military smart" friend tries to explain the actual tactics to you, just nod sagely, touch your nose, and say, "Yes, I know. I wrote the book that wrote the book. I am an uber-author. I am the alpha to this omega. I am a Llama."

Migrating racks of LAM

I've got a bunch of things that I want to put down about the possibility of making LAM/MPI checkpoint/restartable. I'll break it into multiple parts:

  • Some LAM terminology
  • The "checkpointing sockets" problem
  • Possibilities
  • lamd problems
  • Possibilities with Condor
  • Checkpointing without Condor
  • Making this portable
  • Other problems

Some LAM terminology

Since others will be reading this text, I'm going to throw in some LAM definitions that I'll be re-using throughout the text below:

  • lamd: The lamd is the LAM daemon that is run on every host in a "normal" LAM run-time environment. It provides several services to running LAM/MPI jobs, such as process control, an out-of-band messaging channel, key=value global publishing, a scoping mechanism, etc.

  • C2C: An acronym for "client-to-client", meaning that MPI communication goes directly from the source process to the destination process. This is usually via TCP sockets, but can also be via shmem or GM (myrinet), or whatever other network connects the MPI ranks.

  • nsend() / nrecv(): the function calls in the LAM/MPI implementation that are used for the out-of-band messaging channel. That is, MPI ranks can use nsend() and nrecv() to send messages to each other. These messages go from the source rank to the local lamd, then to the remote lamd, and then to the destination rank. Hence, the out-of-band messaging channel goes through the lamd, not through C2C channels.

  • LAM universe: one instance of the LAM/MPI run-time environment. That is, the LAM run-time environment is typically instantiated with the lamboot command and a file specifying a list of hosts. The LAM universe then exists among that set of hosts.
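To make the out-of-band path concrete, here is a minimal toy model of the hop sequence described above (source rank, local lamd, remote lamd, destination rank). It is only a sketch: the `ToyLamd` class and the `toy_nsend()` / `toy_nrecv()` helpers are invented for illustration and do not reflect the real signatures of nsend() / nrecv().

```python
from collections import defaultdict, deque

class ToyLamd:
    """Hypothetical stand-in for one lamd; not LAM's real data structures."""
    def __init__(self, host):
        self.host = host
        self.peers = {}                      # host name -> ToyLamd
        self.mailboxes = defaultdict(deque)  # local rank -> queued payloads

    def route(self, dest_host, dest_rank, payload, hops):
        hops.append(self.host)               # record every daemon the message visits
        if dest_host == self.host:
            self.mailboxes[dest_rank].append(payload)
        else:
            self.peers[dest_host].route(dest_host, dest_rank, payload, hops)

def toy_nsend(local_lamd, dest_host, dest_rank, payload):
    """Send via the local daemon; returns the list of daemons traversed."""
    hops = []
    local_lamd.route(dest_host, dest_rank, payload, hops)
    return hops

def toy_nrecv(local_lamd, rank):
    """Pop the oldest message queued for this rank at its local daemon."""
    return local_lamd.mailboxes[rank].popleft()

node_a, node_b = ToyLamd("nodeA"), ToyLamd("nodeB")
node_a.peers["nodeB"] = node_b
node_b.peers["nodeA"] = node_a
hops = toy_nsend(node_a, "nodeB", 1, "hello")
```

The point of the sketch is just that no rank ever talks to another rank directly on this channel; every message passes through both daemons.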

Here's a few assumptions that we make because of the LAM/MPI environment:

  • LAM/MPI is completely user-level. All processes belong to the user -- nothing runs as root. That is, each user has their own set of lamd's and user MPI programs.

  • LAM/MPI currently cannot "overlap" universes except in batch systems. By "overlap", I mean have multiple, different LAM universes of the same user on the same machine. i.e., while a user can run as many MPI programs as they want in a single LAM/MPI universe (and even have them share the same machines safely without interfering with each other), you cannot have multiple LAM/MPI universes on the same machine without a special exception. It will be trivial to make LAM be able to overlap universes in a Condor environment, but I felt that I should mention this.

The "checkpointing sockets" problem

So the Condor project has a library that can checkpoint a running program and start it up again at a later point. It can even migrate it to a different machine. That is, it serializes the entire image of the process (stacks, heap, program, data, etc., etc.) and dumps it into a file (or socket, apparently). The astute reader will recognize that things like open files will present a problem in this scheme -- particularly in the case of migration. i.e., if a process has an open file and it migrates to a new node, what happens with read() and write() calls in the process to that open file on the new node?

The answer is that the library leaves a "proxy" agent (I think their terminology for it is a "shadow process") back on the original node. So read() and write() calls on the new node are proxied back to the original node where the real operation takes place, and the result is piped back to the new node where the program is running.
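As a rough illustration of the proxy idea (not Condor's actual implementation, which I have not read), the sketch below keeps the real file on the "original node" and forwards each call from the migrated side. `Shadow` and `RemoteFile` are hypothetical names, and an in-memory `io.StringIO` stands in for the open file.

```python
import io

class Shadow:
    """Stays on the original node and owns the real open file."""
    def __init__(self, fileobj):
        self._file = fileobj

    def handle(self, op, arg):
        if op == "read":
            return self._file.read(arg)
        if op == "write":
            return self._file.write(arg)
        raise ValueError(f"unsupported operation: {op}")

class RemoteFile:
    """What the migrated process sees; every call is forwarded to the shadow."""
    def __init__(self, shadow):
        self._shadow = shadow
        self.forwarded = 0   # counts round trips back to the original node

    def read(self, nbytes):
        self.forwarded += 1
        return self._shadow.handle("read", nbytes)

    def write(self, data):
        self.forwarded += 1
        return self._shadow.handle("write", data)

origin_file = io.StringIO("abcdef")
migrated_view = RemoteFile(Shadow(origin_file))
first_chunk = migrated_view.read(3)
```

Every call costs one round trip to the original node, which is exactly the overhead that makes this approach unattractive for latency-sensitive sockets.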

This is all fine and good for most system calls -- i.e., intercept all system calls, shuttle them back to the proxy agent, and then pipe the results back -- but it doesn't work for sockets. More to the point, it could work with sockets (at least I think it could), but then performance on the sockets will suck, and that is unfortunately important to us in MPI-land (i.e., latency would rise dramatically, and there could be potential bandwidth issues as well, depending on the proxy implementation). Hence, we have "the socket problem".

The solution is to close all sockets before allowing an MPI job to be checkpointed, and then re-establish them after the job has been restarted. Multiple problems arise from this, though. The MPI job will presumably still know where its sibling ranks were located (and could therefore reestablish sockets to them), but zero or more ranks may have moved -- so trying to establish sockets to the old addresses may not work anymore. LAM needs to become aware of which ranks moved and where they moved to.

This is particularly problematic with LAM's shared memory/TCP scheme. i.e., if rank X migrates, it needs to re-figure out if rank Y is on the same machine or not. Specifically, it needs to re-initialize its entire connection table and either [re]connect its sockets, or [re]setup shared memory to communicate with Y. Even more generally than the TCP/shmem problem, this is definitely going to change the RPI somehow.
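A minimal sketch of the "re-figure out who is local" step, assuming the rank is simply handed a rank-to-host map after the restart. The function name and the map format are hypothetical; a real RPI would carry addresses, ports, and so on.

```python
def build_connection_table(my_rank, locations):
    """Pick a transport per peer: shmem if on my host, TCP otherwise.

    locations maps rank -> host name (an assumed, simplified format).
    """
    my_host = locations[my_rank]
    table = {}
    for rank, host in locations.items():
        if rank == my_rank:
            continue
        table[rank] = "shmem" if host == my_host else "tcp"
    return table

# Rank 0 starts on n1 next to rank 1, then migrates to n2 next to rank 2.
before = build_connection_table(0, {0: "n1", 1: "n1", 2: "n2"})
after = build_connection_table(0, {0: "n2", 1: "n1", 2: "n2"})
```

Note that every entry for rank 0 flips after the move, which is why rebuilding the whole table is simpler than patching it.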

There are other issues as well -- how do we start up a LAM job under Condor? LAM currently uses a separate daemon process (the lamd) for a bunch of additional services, such as process control (fork/kill), an out-of-band message channel, and a global database for arbitrary key=value pairs (for MPI-2 MPI_PUBLISH). I guess it also functions as a scope mechanism as well -- providing a "universe" for a single user.

Possibilities


For efficiency reasons, we may want to checkpoint/migrate only some ranks -- not all of them. Hence, there are two kinds of ranks: a rank that will get checkpointed (and possibly migrated), and a rank that will not. It seems to make sense to notify the entire parallel application (i.e., all ranks) when even one rank is checkpointed with intent to exit (e.g., because it will be migrated). So there are even two types of checkpoints: (a) one to just save the process's state (i.e., checkpoint the entire parallel application just for save/backup purposes), and (b) one to migrate one or more of the ranks to a different node.

We'll discuss (b) first (checkpointing for the purpose of migrating), because it lays the groundwork for (a).

Checkpointing for migration: the checkpointed rank

So it seems that LAM needs to take some actions before it allows itself to be checkpointed, and then immediately after it restores from a checkpoint. So if a LAM job can get some signal when it wants to be checkpointed (possibly via nrecv() from the local unix named socket, which we currently implement with SIGUSR2 so that the MPI process knows to go check the socket), a signal handler can be fired, read the message, realize that it wants to be checkpointed, flush and close down and invalidate all its communication channels (including the local unix socket to the lamd [or lamd-like underlying services], sockets, GM ports, shmem, etc.), and then checkpoint itself. This will require at least one new RPI function so that we can keep the RPI abstraction clean and apply this to all of our RPIs -- close/invalidate procs (with the assumption that no new communication will happen before we re-invoke _rpi_c2c_addprocs() to re-add all the communication channels again).

The Condor guys tell me that there is a checkpoint_and_exit() function that, when called, dumps the state of the program out to a file (or a socket), and then exits. Very handy! When the process is restored, it just returns from this function. Ultra cool!

So after returning from this function, an MPI rank must obtain the [potentially new] locations of its sibling ranks. I'm thinking that this will come from an nrecv() from the underlying infrastructure (i.e., Condor) -- it will get an array of information saying where everything is (how to do different RPI's? GM ports vs. TCP addresses/ports, for example? Might have to re-init those as well; re-look for open GM ports, etc.).

That is, the run-time system that potentially moved the ranks in the first place will know precisely where all the ranks are, so it can provide the location information to each rank. Once this information is provided to each rank, the ranks can effectively re-do some of the stuff that they did during startup (contact their local "lamd", establish C2C communications with the other ranks by calling _rpi_c2c_addprocs(), etc. I'll explain why "lamd" is in quotes later).

Specifically, the sequence of events on a single MPI rank will be something like the following:

  • Receive SIGUSR2.
  • nrecv() a message indicating three things:

    • One or more MPI ranks is going to migrate.
    • Whether this rank needs to checkpoint.
    • Whether this rank is going to migrate.

  • Flush all C2C and local "lamd" communications.
  • Close down all C2C connections.
  • Close down connection to the local "lamd".
  • If this rank is to checkpoint:

    • If this rank is to migrate, call checkpoint_and_exit(). The steps below will commence when the rank has been migrated and starts up again, and returns from checkpoint_and_exit().
    • If this rank is not going to migrate, call checkpoint().

  • Re-establish a local socket with the local "lamd".
  • nrecv() a message with new location information on all MPI ranks.
  • Repeatedly invoke _rpi_c2c_addprocs() (and whatever else is necessary, perhaps _rpi_c2c_init()?) to re-establish C2C communication channels.
  • Return from SIGUSR2 handler and continue processing in user code as if nothing had happened.
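The sequence above can be sketched as a small function that just returns the ordered actions for one rank. This is a planning aid, not an implementation: the step names are labels for the bullets above, and the `notice` dictionary and location map are assumed formats, not anything LAM or Condor defines.

```python
def checkpoint_handler_steps(my_rank, notice, new_locations):
    """Return the ordered actions one rank would take after SIGUSR2.

    notice: {"checkpoint": bool, "migrate": bool} (hypothetical message format)
    new_locations: rank -> host, as delivered after the restart (assumed)
    """
    steps = ["flush_c2c_and_lamd", "close_c2c", "close_lamd"]
    if notice["checkpoint"]:
        # A migrating rank exits here and resumes from this point on its new node.
        steps.append("checkpoint_and_exit" if notice["migrate"] else "checkpoint")
    steps.append("reconnect_lamd")
    steps.append("nrecv_locations")
    for rank in sorted(new_locations):
        if rank != my_rank:
            steps.append(f"addprocs:{rank}@{new_locations[rank]}")
    steps.append("return_to_user_code")
    return steps

migrating = checkpoint_handler_steps(
    0, {"checkpoint": True, "migrate": True}, {0: "n3", 1: "n1"})
bystander = checkpoint_handler_steps(
    1, {"checkpoint": False, "migrate": False}, {0: "n3", 1: "n1"})
```

Both kinds of rank share everything except the checkpoint call itself, which is what makes a single code path plausible.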

I think that's essentially it. There's a bunch of details in there, of course, particularly in the re-initializing C2C connections bit, but that should all be resolvable with some clear and potentially clever re-entrant C2C init code. Hence, when we go through this checkpoint/migrate phase and re-establish C2C communications, we essentially re-initialize the C2C subsystem -- do the exact same thing as when we do it the first time. That would probably be the cleanest approach.

Checkpointing for migration: the non-checkpointed ranks

Upon further thought, I guess there is little difference between checkpointed ranks and non-checkpointed ranks. There could be a slight optimization in that it is really only necessary to send new location information for ranks that have migrated -- the old location information is sufficient for any rank that has not migrated. However, it may make it easier in terms of less complexity to only have one code path -- just receive all new location information.

However, the question does arise -- when one MPI rank out of a parallel job is migrated, what happens to the other ranks while the rank is in process of moving? There are two approaches:

  • Make the other ranks freeze and wait until the migrating ranks have been restored and C2C communications have been re-established. This certainly makes implementation of the MPI side easier -- the non-migrating ranks can just sit blocking on the nrecv() waiting for new location information. The underlying "lamd" can just delay sending the new location information until the migrating ranks have been restored.

  • Allow the other ranks to continue in the user program while the MPI rank(s) in question migrate. They would have to freeze at the first blocking communication involving the rank(s) that are being migrated. Any non-blocking communication can continue (e.g., Isend, Send_init, etc.), but would have to be "suspended", indicating that they just get put in a queue, and will only be attempted when the destination rank(s) are actually restored from migration and C2C communication has been restored to them.

    This will add complexity to the MPI implementation, and it slightly changes the scheme presented above -- the non-migrating ranks will have to delay the second part of the scheme (i.e., starting with the nrecv() to get the new location information) until they get a second signal indicating that one or more of the migrating ranks are now ready.

    This could get arbitrarily complicated -- take the case where N ranks migrate. What if they get restored at different times? i.e., if one rank gets restored much earlier than the rest -- does the underlying "lamd" signal the other ranks in the job with just the new location information for that one rank? Or does it wait for all N ranks to be restored before signaling everyone? The coarse-grain approach is clearly easier; the question is what actually happens most of the time: does Condor (and others) piecemeal restore migrated processes, or all at once?
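For the second approach, here is a minimal sketch of what "suspending" non-blocking sends might look like, assuming ranks are restored one at a time (the fine-grained case). `SendQueue` and its methods are hypothetical names; a real implementation would live inside the RPI's request handling.

```python
from collections import deque

class SendQueue:
    """Toy model: sends to migrating ranks are parked until the rank is restored."""
    def __init__(self, migrating):
        self.migrating = set(migrating)
        self.suspended = deque()   # (dest, msg) pairs waiting for a restore
        self.delivered = []        # stands in for "actually put on the wire"

    def isend(self, dest, msg):
        if dest in self.migrating:
            self.suspended.append((dest, msg))
        else:
            self.delivered.append((dest, msg))

    def restored(self, rank):
        """Called when one migrated rank is reachable again."""
        self.migrating.discard(rank)
        still_waiting = deque()
        for dest, msg in self.suspended:
            if dest in self.migrating:
                still_waiting.append((dest, msg))
            else:
                self.delivered.append((dest, msg))   # original send order is kept
        self.suspended = still_waiting

q = SendQueue(migrating={1, 2})
q.isend(0, "a")
q.isend(1, "b")
q.isend(2, "c")
q.isend(1, "d")
q.restored(1)
```

Draining the parked sends in their original order matters, since MPI's ordering guarantee still has to hold for messages with the same signature.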

So this raises some interesting questions:

  • With the "easy" model of making all MPI ranks wait until all migrated processes are restored, is there really much of a difference in migrating one rank versus migrating all ranks? After all, they all block waiting for the one migrated rank to be restored, which hurts particularly if that one rank can't be restored immediately. For example, the MPI rank that was migrated was running on an idle workstation that suddenly became non-idle, forcing the MPI rank to migrate. But say that there are no more idle workstations available, so this MPI rank must wait in limbo for a while for another machine to become idle. But during this time, the entire rest of the MPI application must also wait. What happens to the accounting records during this time? Are Condor users "charged" with the time that the rest of their MPI ranks are blocking?

  • There is also the argument that most MPI programs tend to operate at least in some kind of lock-step. i.e., the MPI ranks are at least loosely synchronized (e.g., per iteration). So even if the non-migrating ranks are allowed to continue, they'll eventually block anyway because they'll try to communicate with a rank that is in process of migrating (or, by the domino effect, try to communicate with a rank who is blocking trying to communicate with a rank that is in process of migrating, etc.), which could potentially (and usually!) eventually cause the whole MPI application to block anyway. More to the point: is there anything gained by allowing non-migrating MPI ranks to continue while one or more MPI ranks are in process of migrating? My gut feeling says no.

Hence, it may make sense to really only migrate the entire MPI process at once, or only migrate ranks when it is known that they can be placed immediately. This may not be possible, so it may be easiest to just make all MPI ranks block until migrated ranks are restored and C2C communication is restored. The accounting issue still needs to be addressed, though.

However, I have very little experience in the dynamic process migration area -- I'm curious as to what the Condor folks have to say about these ideas and questions.

Checkpointing for saving state (no migration)

For checkpoints that do not involve migration -- i.e., checkpointing just for the purpose of saving state -- it may or may not be necessary to close all communications channels. On the one hand, no rank is migrating, so it would seem silly to close and re-establish communications with the exact same location information. On the other hand, if we want to re-start the checkpointed process later, the re-started process will return from the checkpoint() (notice -- not checkpoint_and_exit()) function. If we re-start the process on an entirely different set of nodes (e.g., a PBS or Condor job is checkpointed and then later fails because someone powers off a node, so we restart the job in a later PBS/Condor job -- the ranks will be on entirely different machines and have a different topology), we will need to re-learn the location knowledge and re-establish C2C channels.

Using this argument, it's probably better to treat a backup/save checkpoint (even with no migration involved) as a checkpoint with all ranks migrating (per the procedures shown in the previous section), so that all ranks close all communications channels and then receive new location information from the underlying system (lamd/Condor) and then re-establish all communication channels.

This would allow the most flexibility for re-starting a job. That is, even if the job does get restarted from a set of migration files, it doesn't matter if it is on the same set of nodes or not -- it will re-establish all C2C communication channels and continue from where it left off.

lamd problems

The lamd is really helpful in standalone environments. But does it really make sense in a Condor (or other run-time system) environment? We mainly use the lamd for the following kinds of services:

  • Process control (startup, shutdown, abort)
  • Out-of-band messaging
  • key=value publishing
  • File transfer (mainly for non-uniform filesystems)
  • Scoping mechanism

Normally, each MPI rank is associated with a single lamd that is located on the same machine. They communicate through a named unix socket. When the lamd sends a message to an MPI rank, it pushes a message down the socket and then tweaks the process with SIGUSR2.

Note that there may be multiple MPI ranks per lamd -- it is common to run multiple MPI ranks on a single machine. In this case, they all share a common lamd (although the MPI ranks don't know or care that they are sharing a lamd).

It should also be noted that the out-of-band messaging can also be the primary message channel for an MPI job. That is, C2C communications aren't necessarily set up. It's a run-time flag to mpirun -- the user can specify to use the lamd for all communication instead of C2C. Although this imposes extra hops on all messages (even MPI_Send / MPI_Recv messages), it can provide true asynchronicity for non-blocking messages. That is, LAM/MPI is single threaded, so it can only make progress on messages while it is inside of LAM/MPI function calls. In the "lamd" mode, once a message is given to the lamd, the lamd is a separate process, so it can make progress on the message independently of the main thread of control in the user program. While this may seem counterintuitive and incur too much extra overhead, several LAM users who rely on non-blocking message passing have told us that they can get significant speedup using this mode as opposed to C2C.
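To illustrate why handing messages to a separate agent gives independent progress, here is a small analogy. It uses a Python thread where the lamd is really a separate process, and the queue hand-off is only a stand-in for the local socket; none of this is LAM code.

```python
import queue
import threading

outbox = queue.Queue()   # stands in for the socket to the local lamd
delivered = []           # what the "lamd" has pushed onward

def progress_agent():
    """Plays the lamd: drains the outbox without help from the user program."""
    while True:
        msg = outbox.get()
        if msg is None:   # sentinel: shut down
            break
        delivered.append(msg)

agent = threading.Thread(target=progress_agent)
agent.start()

for i in range(3):
    outbox.put(f"msg{i}")   # the hand-off returns immediately

total = sum(range(1000))     # user computation, no messaging calls involved

outbox.put(None)
agent.join()
```

The user code never re-enters the messaging layer after the hand-off, yet all three messages still get moved along.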

So LAM's normal model is that each MPI rank has a single lamd that it is associated with. This may be problematic with Condor (or any other run-time system) for multiple reasons:

  • If the MPI rank ever migrates off a given machine, the lamd will also have to be migrated with it. Hence, both processes will need to be treated as a single process by Condor, which I assume would create some special exceptions in the Condor code. This is not attractive.

  • Even worse, if multiple MPI ranks are sharing a single lamd, if one of those MPI ranks migrates and the others do not, what happens to the lamd? It would seem that we need to create a new one on the machine where the MPI rank migrates to, and then have the network of lamd's reorient themselves to include the new lamd. Or, if the MPI rank migrates to a node that already has a lamd, it can just join that lamd, and no new lamd is necessary. But this would seem quite complex to implement!

Hence, it would seem desirable to be able to ditch the lamd when running in some other run-time environment (such as Condor).

Possibilities with Condor

The upshot of our short conversation with the Condor folks is that a LAM/MPI program will need to interact with their "starter" somehow, or have a custom LAM/MPI starter written that knows things about MPI programs.

My first impression (and admittedly, I don't know much about how Condor works) is that the least-cost solution here would be to have a custom LAM/MPI "starter" that can mimic the lamd services. It would seem that Condor must already provide most of what we need; the starter can simply provide a translation between what LAM/MPI expects and the native Condor underlying services. Hence, the majority of LAM/MPI wouldn't need to change -- it just opens up a local unix socket to what it thinks is the lamd, but in reality it's a Condor "starter" (or whatever).

More specifically, some of LAM's calls such as nsend(), nrecv(), rploadgo(), rpdoom(), etc., can probably translate to Condor semantics without too much trouble. So if Condor can open a socket and effectively have an nrecv() implemented locally, it can receive local packets from MPI ranks, and then process and interpret them.
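A rough sketch of the translation idea follows. It assumes, purely for illustration, that an rploadgo() request boils down to "launch this program" and an rpdoom() request to "kill this process"; the request dictionaries, `FakeBackend`, and `make_starter()` are all hypothetical and say nothing about Condor's real starter interface.

```python
class FakeBackend:
    """Stand-in for the native run-time services (hypothetical)."""
    def __init__(self):
        self.running = {}     # pid -> program name
        self.next_pid = 100

    def spawn(self, program):
        pid = self.next_pid
        self.next_pid += 1
        self.running[pid] = program
        return pid

    def kill(self, pid):
        return self.running.pop(pid, None) is not None

def make_starter(backend):
    """Build a dispatcher that maps lamd-style requests onto backend calls."""
    handlers = {
        "rploadgo": lambda req: backend.spawn(req["program"]),
        "rpdoom": lambda req: backend.kill(req["pid"]),
    }

    def starter(request):
        handler = handlers.get(request["type"])
        if handler is None:
            raise ValueError(f"unsupported request: {request['type']}")
        return handler(request)

    return starter

backend = FakeBackend()
starter = make_starter(backend)
pid = starter({"type": "rploadgo", "program": "a.out"})
killed = starter({"type": "rpdoom", "pid": pid})
```

The MPI side only ever sees the request/response shapes, so it cannot tell whether a lamd or something else is answering.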

Admittedly, this would put more of a burden on the Condor folks, but I think we could help out a bit as well. :-)

Checkpointing without Condor

In a non-Condor environment, it would still be highly desirable to be able to checkpoint. Can we do this without the rest of Condor? I would assume that we could make it so. I think that the key for doing this outside of Condor would be a new pseudo-daemon in the lamd to handle these kinds of things -- to furnish the new location data, for example. We'll probably also need a command like rempirun to restart a checkpointed job. Possible scenarios include:

  • A separate LAM executable (mpicheckpoint) that can checkpoint a running MPI program to a set of rank files. The checkpointing will follow the same scheme as outlined above. A run-time flag can specify whether the job should stop or continue after the checkpoint. It might also be desirable to provide a LAM-specific API call for this as well (MPIL_Checkpoint(char* directory, int stop_flag) or something). Note: we're not talking about migrating here; see below.

  • A separate LAM executable (rempirun) can take a set of rank files from mpicheckpoint and restart the job on an arbitrary set of nodes. Note that this would not have to happen in the same LAM universe -- it could happen much later, for example, after the LAM universe that the original job was running in has been destroyed and a new one takes its place. Some extra condor-checkpoint-library bootstrapping is probably necessary to restart the job, but after that, it just uses the lamd to get the new location data, etc., just like it would in a Condor environment.

  • A separate LAM executable (lammoverank) can migrate one or more ranks to different nodes within the current LAM universe. This can work exactly the same way as it does in Condor. As mentioned above, this will require an extra pseudo-daemon in the lamd to know where ranks are moving and provide new location data to all the ranks.
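The mpicheckpoint / rempirun pairing could look roughly like the toy below: one file per rank goes into a directory, and a later restart reads them back and assigns each rank to whatever hosts are available now. The JSON "rank files", the file naming, and the round-robin placement are all assumptions for the sketch; real rank files would be full process images from the checkpoint library.

```python
import json
import os
import tempfile

def checkpoint_job(states, directory):
    """Write one (toy) rank file per rank into the directory."""
    for rank, state in states.items():
        with open(os.path.join(directory, f"rank{rank}.json"), "w") as fh:
            json.dump(state, fh)

def restart_job(directory, hosts):
    """Read the rank files back and place each rank on a possibly new host."""
    job = {}
    for name in os.listdir(directory):
        if not (name.startswith("rank") and name.endswith(".json")):
            continue
        rank = int(name[len("rank"):-len(".json")])
        with open(os.path.join(directory, name)) as fh:
            state = json.load(fh)
        # Round-robin placement; the original hosts are never consulted.
        job[rank] = {"host": hosts[rank % len(hosts)], "state": state}
    return job

with tempfile.TemporaryDirectory() as ckpt_dir:
    checkpoint_job({0: {"iter": 7}, 1: {"iter": 7}}, ckpt_dir)
    restarted = restart_job(ckpt_dir, ["newA", "newB"])
```

Nothing in the restart path refers to the old universe, which is the property that lets the restart happen much later on different nodes.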

Making this portable

There is desire to run LAM/MPI in other run-time environments (as alluded to in comments above) in addition to Condor. Scyld is an obvious target, since they have their own set of process control stuff (bproc) and whatnot. Scyld might be a bit more challenging because they seem to only support process control, not the other services that we need. Someone (Jeremiah?) suggested that we might be able to get away with one lamd somewhere in the system; I'm not quite sure that this would work, but it will definitely take a) further thought on the issue, and b) investigation of bproc and the rest of the Scyld infrastructure.

PBS is another obvious target (as well as any other batch schedulers). It would be nice to ditch the lamd in a batch environment, and rely on the batch system's underlying services for process control (the benefits are obvious, not the least of which is job accounting and guaranteed cleanup, a notorious problem for non-native support in batch schedulers), but the out-of-band messaging and global publishing still need to happen as well. PBS's TM can do the process control and can do the global publishing too (IIRC), but I don't think it provides any kind of out-of-band messaging. That will require more thought... Our initial ideas about PBS/TM (from a while ago) didn't include ditching the lamd, but perhaps this is a bit more natural extension of making this whole concept portable (i.e., replacing the lamd with underlying services, when available).

Or will a "one lamd" idea work here, too? Not sure how such an idea will work, but it's worth thinking about.

The real trick, however, will be to do this in a run-time-decidable way. That is, it would be nice, at run time to decide which underlying service to use -- native lamd, Condor, PBS/TM, Scyld, etc. That is, a user can take the same executable (assuming that their LAM was compiled for support for all of them) between all systems without having to recompile/relink. That would be nice, but not an absolutely necessary goal.
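One way the run-time decision might be structured is sketched below. `PBS_JOBID` is an environment variable that PBS sets inside a job; everything else here (the `LAM_RUNTIME` override variable, the backend names, `pick_backend()` itself) is made up for the example.

```python
def pick_backend(env, compiled_in):
    """Choose an underlying service at run time.

    env: mapping of environment variables (e.g. os.environ)
    compiled_in: set of backend names this build supports (assumed names)
    """
    requested = env.get("LAM_RUNTIME")   # hypothetical user override
    if requested:
        if requested not in compiled_in:
            raise ValueError(f"backend not compiled in: {requested}")
        return requested
    if "PBS_JOBID" in env and "pbs_tm" in compiled_in:
        return "pbs_tm"                  # we appear to be inside a PBS job
    return "lamd"                        # fall back to the native daemon

inside_pbs = pick_backend({"PBS_JOBID": "42.server"}, {"lamd", "pbs_tm"})
standalone = pick_backend({}, {"lamd", "pbs_tm"})
```

The same executable gives different answers depending only on where it is launched, with no recompile or relink.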

Upon a moment's reflection, from the proposed schemes above, the difference between native lamd and Condor would not be known to the MPI process -- if Condor truly emulates the lamd, there's no need to know. Whether or not the LAM has been compiled with checkpoint/migrate support is an entirely different issue (because I assume we'll need to get some Condor headers/libraries and some #if code for the checkpoint/migrate LAM code).

In order to make this workable for PBS/TM and/or Scyld (i.e., to keep the abstraction level clean), we'll have to implement lamd services in the lower levels of PBS/TM and Scyld as well. Hmm. I guess we'll have to cross the line into the root-level services earlier than we thought!

For PBS/TM, all the TM stuff is in one file, so extending that should be easy. But to do true messaging, it may take a bit more -- we may have to do some actual hacking in the MOM itself. It could be as simple as adapting the lamd's to fit in the MOM. We'll have to see. As for Scyld, I have no idea. :-)

Other problems

  • Voluntary vs. involuntary checkpointing. Is there much of an issue here? Probably not -- I don't see why involuntary checkpointing can't work just like voluntary checkpointing.

  • How about open files and whatnot? Particularly after a migration? Condor can proxy this stuff back to the original node, but does this make sense in a batch situation? What if we don't own those nodes anymore? This might be ok for Condor, but what about PBS / Scyld? It would seem bad for PBS. :-(

  • Are we trying to solve the "node goes down" problem? i.e., involuntary checkpoint at timed intervals (to files, not sockets...?), and if a node crashes at some point, we can rempirun the set of checkpoint files (which would seem highly desirable). But what about open files, etc.? If the node crashes, there's no Condor proxy to take the request back to on the original node ('cause it's down). So does checkpointing with the Condor library solve the "node goes down" problem? Or perhaps only in a limited scope (i.e., your open files won't be preserved)...? Granted, anything outside of the MPI API is outside the scope of what we need to worry about, but this does seem to be a "real world" concern that would be good to take care of. Even if it just means setting open file descriptors to -1 or NULL upon restoration of the process so that the job can know that the files are closed or something.

  • So what happens to lamboot and lamhalt under Condor? Do they effectively become no-ops (we can't ditch them, because users will still invoke them)? And then mpirun talks to various Condor services (for example) to do the things that the lamd would have done? One of the current functions of mpirun is to serve as a rendezvous point for the ranks so that they can all become aware of each other. Does this still need to be? It would seem that it would need to be changed somehow -- since the migration problem changes all the location information anyway, Condor itself must provide a way to get this information, potentially making mpirun's rendezvous point irrelevant.

  • Does this (running under Condor, PBS/TM, or Scyld) make sense with the MPI_COMM_CONNECT and MPI_COMM_ACCEPT models? i.e., how does a Condor job get more nodes? Or how do multiple Condor jobs join together? In vanilla LAM, only jobs in a single universe can join together. Will this be true in Condor (etc.)? More to the point:

    • What would it mean to allow multiple LAM universes together? What about the obvious security concerns with this?

    • How will a universe be defined in Condor? Will you have to (for example) ask for M nodes and start M different jobs and have them CONNECT / ACCEPT to each other?

    • If this is the case (still only connect within a single universe), is CONNECT / ACCEPT useful within a Condor context?

    • The same question applies to SPAWN -- does the user have to request a maximum number of nodes ahead of time? Or, when SPAWN is invoked, does this have to allocate nodes from Condor dynamically and then spawn on them? This scheme would seem attractive, but it may cause the MPI application to hang while waiting for nodes to become available?

    • In a dynamic environment like Condor, is dynamic processing useful at all, given that a SPAWN may have to block waiting for the underlying system to make nodes available? Does the whole MPI application (or, at least the ranks who invoke SPAWN) have to block waiting for this to happen? (no one has answered this yet -- it's not even defined in the MPI standard)
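To make the rendezvous question above a bit more concrete, here is a minimal sketch of what a rendezvous point does: every rank checks in with its location, and once all of them have, each can be handed the full table. This is not LAM's actual mpirun code; the `Rendezvous` class, its methods, and the host names and ports are all made up for illustration.

```python
# Hypothetical sketch of a rendezvous point -- NOT LAM's actual mpirun code.
# All names (Rendezvous, register, table) and addresses are illustrative.
class Rendezvous:
    """Collects each rank's location, then hands out the full table."""

    def __init__(self, nranks):
        self.nranks = nranks
        self.locations = {}  # rank -> (host, port)

    def register(self, rank, host, port):
        """Record (or, after a migration, overwrite) where a rank lives."""
        if not 0 <= rank < self.nranks:
            raise ValueError(f"rank {rank} out of range")
        self.locations[rank] = (host, port)

    def complete(self):
        """True once every rank has checked in."""
        return len(self.locations) == self.nranks

    def table(self):
        """Return a copy of the location table once all ranks are known."""
        if not self.complete():
            raise RuntimeError("not all ranks have registered yet")
        return dict(self.locations)


rv = Rendezvous(nranks=3)
rv.register(0, "node-a", 5000)
rv.register(1, "node-b", 5001)
rv.register(2, "node-c", 5002)
before = rv.table()

# Simulate rank 1 migrating: the copy of the table that the other ranks
# already hold is now stale unless something pushes the new location out.
rv.register(1, "node-d", 6001)
after = rv.table()
```

The stale `before` table is the whole problem in miniature: after a migration, whoever actually knows the current locations (presumably Condor) has to be the source of truth, which is why a one-shot rendezvous at startup may stop being enough.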


So these are my initial thoughts. In spite of all the unanswered questions listed above, I believe that this can work. Some trips Wisconsin<-->South Bend and some teleconferencing and a ton of e-mail will likely be necessary. But this is ultra cool stuff, and will be immediately useful to lots of people in the real world. Plus, we'll get lots of papers out of it, become famous, and one or two people might get degrees out of it. :-)

November 19, 2000

17 days and a wakeup

We effectively stomped on Rutgers yesterday. Woo hoo!!

We looked a bit sloppy at times; their quarterback was quite good, actually, although he was a bit too hasty and kept taking high-risk passes. So we kept intercepting them. :-) Aside from a few nervous points, it was a fun game to watch. Go Irish!

Spent some of yesterday playing with modules in LSC's AFS space. I preliminarily made up modules for PBS, LAM, MPICH, Workshop, and Forte6. We will probably make up modules for all the GNU stuff (although they'll be broken up into several modules -- the compilers and auto* and libtool, Gnome, and the rest of the GNU stuff, or somesuchlikethat). Lummy wants to go a bit hog wild and have our own copies of latex, X, etc. We'll see -- we've been trying to have a higher bandwidth discussion about this for a few days and keep missing each other.

This all precipitated because I'm genuinely worried about having all the GNU file utilities first in our path rather than the Solaris ones. If I want to work in Linux, I'll work in Linux. If I want to work in Solaris, I want to work in Solaris -- not Linux. I've been burned a couple of times by having the GNU stuff first in my path (ar, ranlib, make, etc.) rather than the Solaris stuff, and I don't want that to be. It just scares me, 'cause we'll end up coding for GNU-specificisms without even knowing it. And that will suck (that's one of my pet peeves: people who code for GNU-specific extensions and say, "just use gcc" everywhere. They don't understand what they are saying. Although I have personally discussed this with many people, I'll put it here in my journal to get it on the record: take the Alpha processor, for example. When you switch from Tru64 to Linux, you lose at least 10% of the performance [there are hard numbers to prove this]. And when you switch from custom compilers to gcc you lose at least another 10% of performance [I'm speaking of high-performance applications, of course]. gcc just doesn't have the punch on all platforms. Portability is only half the story).

Anyhoo, we're going to split it up somehow. The exact mechanism remains to be seen. Modules are pretty nice, actually, and surprisingly easy to set up and maintain. We've been meaning to do this for quite a long time, and we really should have done it a while ago.

Saw the movie "Bounce" with Ben Affleck and Gwyneth Paltrow last night with Janna and Tracy. Yes, it was a concession to the ladies (who wanted to see it). I'll give it a sympathy, but that doesn't really rate the quality of the movie because it's just not my kind of movie. So if you want an honest rating, go see it yourself.

Today will be spent putting together a real skeleton for my dissertation. I've started this a few times, but really need to carry through and actually put all the .tex into one place and start shaping it up to be a real dissertation.

Off to write... whoo hoo!

November 21, 2000

Who needs green beans?

Dentistry, while painful, is interesting.

Here are some interesting factoids that I learned this morning while having a cavity filled:

  • Dentists' drill tips are made of a diamond/metal carbide. They spin at many thousands of RPM, and when combined with a little spray of water, vaporize whatever they come into contact with.

  • The jaw nerves are split in half. So when they give you Novocain, it only numbs up half of your jaw/face. Right now, the right half of my chin all the way up to (and including!) my right ear is numb.

  • Modern cavity fills are multiple layered: I forget the name of the first one, then a "primer" layer, and then a bonding agent. The bonding agent (IIRC) is light activated -- so they have a "light gun" that shines a many-watt highly-intense light on the tooth to make the bonding agent cure. There's an orange shield around the nozzle so that the dentist can watch/direct the light without being blinded.

  • It's difficult to talk when half of your tongue is numb.

  • We have nerves in our teeth only for the sake of knowing when something is wrong. i.e., the nerves in our teeth only serve as warning indicators. Sharks do not have nerves in their teeth. Godgineer must have figured that since sharks lose teeth all the time (and promptly grow new ones to replace the lost ones), it would be less efficient to put the warning indicators in there. Since we humans only get two sets of teeth, having the "failure alert system" was a good engineering decision.

  • It feels really, really weird to drink something and only feel it on half of your tongue.

  • Dentist drills can go at different speeds, not only for the different types of work that they do, but also because it is possible to resonate within the jaw and within specific teeth. Hence, if a patient starts resonating with a given drill, the dentist can switch to a drill with a different set of harmonics. (No, I'm not making this up; it happened to me this morning!)

My sister is hosting the big Squyres Clan Thanksgiving Dinner this year; just about everyone in the family will be there. She came up with the bright idea early yesterday afternoon to rent a PlayStation "for the boys", and called my brother-in-law at work to go rent one (apparently his work is literally right across the street from Blockbuster). So he popped across the street and found a PS. But wait.. it wasn't a PS... it was a PlayStation2!! They apparently only have one, and someone had returned it literally 5 minutes previously. So Rob rented it along with several games and took it home to hook it up.

He didn't go back to work.

It should be much fun!

I've been playing with modules in the LSC AFS space. I have them pretty much stable and working now. There are two distinct sets of modules: ones that are cross-platform (e.g., LAM, MPICH), and several more that are platform-specific (e.g., we only have SSL/pine compiled for sparc-sun-solaris2.6 and sparc-sun-solaris2.7). Loading the lsc module loads a default set for a given architecture -- the default cross-platform ones and then a platform-specific lsc module that loads any platform-specific modules that we have for that platform.
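Here's a rough sketch of that two-level lookup, just to show the shape of it. The real thing lives in modulefiles, not Python; the function name `resolve_lsc`, the table layout, and the lowercase module names are all assumptions made for illustration.

```python
# Hypothetical sketch of the "lsc" default-set logic. The real setup uses
# modulefiles; the names and table layout here are illustrative assumptions.
CROSS_PLATFORM_DEFAULTS = ["lam", "mpich"]

# Only platforms that actually have platform-specific builds are listed.
PLATFORM_SPECIFIC = {
    "sparc-sun-solaris2.6": ["ssl", "pine"],
    "sparc-sun-solaris2.7": ["ssl", "pine"],
}


def resolve_lsc(arch):
    """Return the ordered list of modules that loading 'lsc' would pull in."""
    # Cross-platform defaults always come first.
    modules = list(CROSS_PLATFORM_DEFAULTS)
    # Then whatever the platform-specific lsc module adds (possibly nothing).
    modules.extend(PLATFORM_SPECIFIC.get(arch, []))
    return modules


solaris_set = resolve_lsc("sparc-sun-solaris2.6")
other_set = resolve_lsc("some-other-arch")  # made-up architecture string
```

An architecture with no entry in the table simply gets the cross-platform set, which is the behavior you'd want when a new platform shows up before anything has been compiled for it.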

All in all, it's pretty neat stuff. Kinda annoying, though, since aliases aren't inherited by child shells. So you have to go through some extra hoops and hurdles to make that work right.

It's also kind of annoying that the IRIX machines on campus have their own modules, but use a much older version of the modules package. Hence, in order to interoperate -- and yes, this is counter-intuitive -- we have to use the older modules version, not the newer version. Go figure (using the newer module version with the older modules causes seg faults, but using the older module version with the newer modules works fine). So that causes some extra hoops and hurdles as well. Ugh. It would be nice if there was one uniform version of module stuff across all of campus.

But they certainly do make it easy to switch between versions of things, and make maintaining packages easier because each package has its own discrete module.
