
Would you like a mouse pad?

There were some rocky parts, but I think we had a good SC2000 overall.

This is an epic journal entry. Cope.


Sunday

Some of us met in the lab where we gawked at the LAM and LSC shirts that Jeremy picked up on Saturday night. They rocked. The nd.edu network went out around 11:45 (note: this is important for later).

Long flights, a three-hour layover in Midway (what a crazy place), arrival in Dallas. Lummy met us at the Dallas airport. ATA lost Arun's luggage, but we waited there for a while anyway. Got in, had dinner (which was Much Fun), and started some slides in Arun/Brian's room.

The hotel has high speed internet access, but nd.edu was down. Luckily, nd.edu's vBNS link was still up, so we could get in via Berkeley or Argonne. So life was still ok -- we could still get to our e-mail and do some work. No biggie.


Monday

We got to the exhibition floor somewhere around 9am. We appraised the situation, said hi to all the good IU and Purdue folks, and started to get our stuff together. The commodity link to nd.edu was still down, so we started downloading LAM and XMPI via Argonne.

Then.. BAM!!!

nd.edu's vBNS link went down.

And stayed down.

Life sucked.

As Arun said in his journal, epics have been written about less. We cobbled together [mostly] working versions of LAM and XMPI from backups and working copies at Berkeley and IU and on our laptops. Ugh. We were all cursing Ameritech (supposedly the cause of the nd.edu outage, but I still blame the OIT).

It was a race against time to get all our stuff downloaded and assembled from the various repositories around the country, couple it together with some missing software from ftp.gnu.org, battle a shaky SciNET (the network on the SC2000 show floor -- it kept going in and out), and get it all working.

The deadline was 7:30pm -- my IMPI demo. We finally got enough downloaded, and I met with the HP people. We were further confounded by the fact that the union folks made us clear the aisles in order to lay all the carpet between all the booths. Hence, I couldn't travel to the HP booth to coordinate with CQ (HP's IMPI guy -- his real name is Asian, and probably unpronounceable to us Americans, so he goes by "CQ"). I finally got over there around 4pm, and we did some testing.

After a bit of futzing, we got it up and running with HP as the master and displaying on his machine (we had to download and install ssh because they didn't have it, and the IU demo machines didn't have telnet (yeah!)). But it all worked out.

After some more battling (battling low battery power, shaky SciNET connections, and pesky sales droids), we got it to work properly with the IU booth machines as the master. Whoo hoo!!

We also converted Matt from Purdue from MPICH to LAM. We reduced the complexity of his Makefiles dramatically, and showed him the goodness of lamboot, mpirun, etc. He said, "I'm a convert!". Another happy customer.

MPI Software Technology (MST), however, wasn't quite as lucky. :-( They didn't bring the right kind of fiber connectors to get on SciNET, and then the local Fry's was out of the right kind. Their IMPI implementation was not quite finished, either. I managed to download a recent copy (nd.edu came back up that evening) of LAM's IMPI distribution tarball. I put a copy on their LAN and helped Rossen get it up and running (they previously had some problems trying to install LAM, but I don't quite know why...). Rossen thanked me, and started debugging.

So my demo went off at 7:30pm and it seemed to go well. I had a crowd of varying size watching. I was a bit annoyed, though, because literally at the last minute, I got switched to the other ImmersaDesk, and nothing was set up right. It took a good 10 minutes to get it set up right just so that I could bring up my slides. It was somewhat embarrassing because the NIST folks (the people who funded our IMPI work) were standing there waiting for me to start talking. But it eventually turned out ok.

We gave out a surprising number of LAM key chains (they were quite popular!). We walked around a bit and saw a few people, and it was generally pretty good.

We left there, dropped our stuff off back at our room, and went to the Beowulf Bash (which was conveniently in our hotel). It was pretty cool; when we got there, they were announcing that more beer was coming imminently (and it did :-). We chatted with Dave from Myricom (an ND grad) and swapped ND stories. I also chatted with Doug from Paralogic, Don from Scyld, and Dan from Scyld.

Dan chatted with all of us for a while -- they do some really cool stuff in Scyld for their clusters. They have an rfork() call that forks things onto nodes (and an associated rkill()), and do process migration all over the place. They directly load the BIOS to boot Linux in 3 seconds, and then get everything else from the cluster master. I don't know all the details, but it sounds good.

I also chatted with Dan about the parallel MP3 encoder that I wrote a while ago (he downloaded it and was amazed that he had downloaded something from a .edu site -- particularly the LAM/MPI site -- ran ./configure and make with his MPICH distribution, and it just worked). He also wanted to talk about a parallel Ogg Vorbis encoder, and wants to write a paper about it for Linux Journal (I think it was LJ -- can't recall offhand). This could be really cool. I think we might do it.

I sent Dan an e-mail later saying, "let's do it -- how do you want to proceed?" We'll see what happens. Also, Scyld is interested in LAM -- to make that work, we would probably need to ditch the lamd. In such a case, Scyld would have to provide some services like process management (which I think they already do), an out-of-band messaging channel (which might be harder), potentially trace gathering, and name/value publishing. We'll see how this all works out.

After all the schmoozing, Brian and Arun and I had cigars downstairs and had a good chat about all kinds of things. Rock on.


Tuesday

Saw some MPI papers in the morning. Two were about one-sided implementations. The third was about... er... something. One guy presented results with LAM. Whoo hoo!!

We schmoozed all day. We officially ran out of key chains. We got several t-shirts from several companies, including a really nice button down shirt from Veridian (the PBS folks).

We talked to all kinds of people -- so many that I actually don't remember everything that happened that day. It was good. I do remember chatting with the Myricom folks quite a bit, though, and chatting with the PBS folks, NIST people, HP, etc.

I stopped by to see how MST was doing with IMPI. They were still having some problems, but I didn't have time to debug with Rossen. I came back later and helped some more -- turns out that he wasn't zeroing out the upper 12 bytes in the IPv6 address, so LAM wasn't able to find a match in the source address. Hence, dropped packets. This turned into goodness; the MST/LAM ping pong tests started working.
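For the curious, here is a minimal sketch of how that kind of bug bites. It is not the MST or LAM code; the names (HostId, pack_ipv4, same_host) are made up for illustration, and it assumes a 16-byte (IPv6-sized) address field with the IPv4 address in the last 4 bytes and the match done as a byte-for-byte compare. If the sender leaves junk in the other 12 bytes, the compare fails even though the "real" address is identical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical 16-byte host identifier, sized to hold an IPv6 address.
using HostId = std::array<std::uint8_t, 16>;

// Correct packing: zero all 16 bytes, then copy the IPv4 address into
// the last 4 (the byte placement is an assumption for this sketch).
HostId pack_ipv4(const std::uint8_t ipv4[4]) {
    HostId id;
    id.fill(0);
    std::memcpy(id.data() + 12, ipv4, 4);
    return id;
}

// Buggy packing: the upper 12 bytes are never cleared. The `junk`
// parameter stands in for whatever happened to be in memory.
HostId pack_ipv4_unzeroed(const std::uint8_t ipv4[4], std::uint8_t junk) {
    HostId id;
    id.fill(junk);
    std::memcpy(id.data() + 12, ipv4, 4);
    return id;
}

// The receiver matches a packet's source against a known host by
// comparing all 16 bytes, so stray junk means "no match".
bool same_host(const HostId& a, const HostId& b) {
    return std::memcmp(a.data(), b.data(), a.size()) == 0;
}
```

With the same IPv4 address, pack_ipv4 on both sides matches, while pack_ipv4_unzeroed on one side does not (unless the junk happens to be zero), which is the sort of mismatch that shows up as dropped packets.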

Dinner was with the Research@Indiana folks at Fish: An Upscale Seafood Restaurant. All us ND students sat together (except George, who sat with Jesus, 'cause they got there a bit after us). Our conversation was mostly about the GPL, licenses, etc. It was pretty good, all around. A good time was had by all, and the food was excellent.


Wednesday

Got in a bit early to set up the LAM and XMPI demos. We had some real problems. :-( We uncovered some bugs in XMPI at literally the last minute, so I canceled the XMPI demo, and we did just the LAM demo. We actually had some problems there, too -- we had problems making a user MPI program fail in a controllable way (we wanted to show the usefulness of running a LAM/MPI program under a debugger). But we finally got it, and it worked out ok.

However, we did have major problems with the Sun Workshop debugger -- we just couldn't get it to run. gdb didn't work, either. We had 4 UltraSPARC 10 machines to run down here, but they weren't quite set up the way that we were expecting. In particular, we asked for tcsh to be our default shell. But after some painful processes of elimination, we proved that the tcsh that was installed on those Suns was broken -- it caused gdb to fail, and it sometimes caused logins to hang with tcsh CPU usage going up to around 95% or so. VERY annoying, and very difficult to track down -- how often do you actually suspect the shell itself? No, you assume other things are wrong (like your dot files, the OS, etc.). But switching to csh fixed everything. I've never seen anything like it before.

But we didn't figure this out before the LAM demo, so we actually ran on nd.edu machines and used gdb (firing up the Workshop debugger took way too much time). The demo and talk actually went well, though.

I talked with a whole bunch of people throughout the rest of the day -- we wandered the floor some more, talked to some ASCI people, Tony and company at MST, the Compaq sales guys, etc., etc. During my "booth duty" time, I chatted with lots of people about LAM/MPI and ND (including some people whose sons/daughters are currently at ND), and particularly with a guy from Sweden who mentioned that he wanted the ability to checkpoint LAM/MPI processes so that he could take his nodes down and do maintenance on his cluster, and then, when he's done, restart the processes and keep the MPI job going. I initially said no, you can't do it because of the "socket problem" (i.e., you can't checkpoint sockets -- more info below), but then I started thinking about it, especially with respect to the Condor checkpoint library (very cool stuff). We chatted about this for a while, and I ended up putting it in the background because other things were going on.

Spent a bit more time with Rossen and his IMPI. I don't recall what the exact error was, but we found it and fixed it, and after Rossen worked out the rest of the details, it later worked with LAM/MPI in the pmandel code. Woo hoo!

Spent a good amount of time debugging XMPI and LAM's demo (and figured out the tcsh/csh issues). After figuring out the csh problem, LAM pretty much fell in line right away. Brian and I spent the rest of the afternoon debugging XMPI and stayed after everyone left. We fixed up most everything and fixed up some nagging bugs.

Renzo called in the middle of this and we set up stuff for the BC game at ND this weekend. He's in Vegas this weekend, so no family dinner with Lynzo and the chunky monkey. Bonk! :-(

One of the problems was actually an error in Sun Workshop 5.0's <fstream> implementation. VERY ANNOYING. It turns out that using getline(fstream&, string&) to read in a blank line makes eof() start returning true, even though there is more of the file left to read. ARRGGGHHH!!!
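To make the expected behavior concrete, here is a small sketch (not the XMPI code; read_all_lines is a made-up helper name) of reading a stream line by line on a standard-conforming library. A blank line in the middle is just an empty string, and the loop only stops when getline() itself fails at the real end of input. The loop tests getline()'s own result instead of polling eof(), which is the usual idiom; I'm not claiming this would have dodged the Workshop 5.0 bug, only showing what "correct" looks like.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Read every line from an input stream, keeping blank lines as
// empty strings.
std::vector<std::string> read_all_lines(std::istream& in) {
    std::vector<std::string> lines;
    std::string line;
    // getline() returns the stream, which converts to false only once
    // a read actually fails (end of input or an error), so a blank
    // line in the middle does not end the loop.
    while (std::getline(in, line)) {
        lines.push_back(line);
    }
    return lines;
}
```

The helper takes a std::istream&, so it works the same way on a std::ifstream opened on a real file or on a std::istringstream used for a quick check.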

Once we figured this out, Brian and I left for dinner (around 8pm). We passed the Myrinet folks, and chatted with them for a while (lots of laughs -- we share the same exact feelings about writing software, users, distributing software, etc., etc.). They recommended an Italian restaurant for dinner.

Brian and I headed out for dinner, and I brought up the checkpoint/restart problem with Condor's library. We talked about this for a while (we were in one of those cool Italian restaurants with paper tablecloths, so we could draw on it with the provided crayons, etc. Very handy!). A good dinner, with good food. We caught a cab back.


Thursday

More LAM pimping. Had more good chats with Myricom/Bob Feldman; seems like we could have quite a future there. Near the end of the day, talked to the InfiniBand people about using their stuff as a high speed fabric for LAM. Had a look at some other booths; talked to the NPACI people, who had some REXEC people, and shared some info about LAM (since REXEC has some common elements with LAM).

Went over to the RealWorldComputing booth; they have some cool stuff, including SCore MPI. Meant to look at that last year, but...

Then we talked to a few Linux integrators, pimping LAM. One hadn't heard of LAM (bastards!), but the other was Linux Networks. "Hey Jeff... we talked last year" was the greeting. Amazing. And apparently, Dog and Brian had been there about 5 minutes previously. But we had a nice chat and he gave us t-shirts.

Then the expo was over. We cleared our stuff out of the Research@Indiana booth and went back to the hotel. As we were getting on the shuttle bus, I said to Arun, "hey... some Swedish guy came up to me yesterday and gave me a great idea about checkpointing MPI jobs in LAM..." and then I stepped on the bus. I heard behind me, "Hey... you're the LAM guys! We've been meaning to find you!"

Turns out that the Condor grad students were standing right behind us and heard me mention checkpointing and noticed who we were. It further turns out that they've been having similar ideas -- wanting a checkpointable/migratable MPI. So we chatted on the bus, and then chatted some more in the bar before they had to catch a cab back to the airport. REALLY cool stuff, and we think we can do it. There are some delicious complications, but the fact of the matter is: no one else can do this, and it would be truly fantastic if we could do it.

Condor wants a checkpointable MPI and one that they can schedule/migrate around in Condor, and we want a checkpoint/restartable MPI. This could be the start of a really, really cool collaboration. I'll jot down the notes that are in my head in a technical journal entry after this. I'm still brimming over with goodness about this; I actually think we can make it all work (and get a bunch of papers, become famous, and take over the world). How cool is that?

We then met everyone else from the LSC and went to dinner at the Spaghetti Warehouse in the West End. Good food, and good conversation -- a good time was had by all.

And now I'm back here, typing it all in so that I don't forget it.

Now on to the technical journal entry about Condor/LAM...


So all in all, it was a good SC2000.

Comments (1)

bc:

Want to know more about scinet’s mini plants in mobile containers. where exactly they are in existence.


About

This page contains a single entry from the blog posted on November 10, 2000 2:06 AM.
