
Fuzzy ethernet

Some food for thought.

PBS is just plain sucking. It's unfortunately been flaky ever since we upgraded it. :-( I did find a bug in our AFS/PBS shepherding code a few days ago that resulted in tokens being allowed to expire during PBS jobs that ran longer than the lifetime of your initial token (which I think defaulted to 10 hours, regardless of what your real default is), but that was our fault, not PBS's.
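To make the token bug concrete, here is a minimal sketch of the arithmetic involved. It is not our actual shepherding code and makes no AFS or PBS calls. The function name, the one-hour safety margin, and the assumption that every renewal yields a fresh token of the same lifetime are all hypothetical. Only the 10-hour figure comes from the description above.

```python
def renewal_times(job_hours, token_hours=10.0, margin_hours=1.0):
    """Return the hours (measured from job start) at which a token
    should be renewed so that it never lapses while the job runs.

    Hypothetical model: each renewal yields a fresh token that lasts
    token_hours, and we renew margin_hours before the current expiry.
    """
    if not 0 <= margin_hours < token_hours:
        raise ValueError("margin must be non-negative and shorter than the token lifetime")

    times = []
    expires = token_hours            # the initial token runs out here
    while expires < job_hours:       # job outlives the current token
        renew_at = expires - margin_hours
        times.append(renew_at)
        expires = renew_at + token_hours
    return times


if __name__ == "__main__":
    # A 25-hour job with 10-hour tokens needs two renewals.
    print(renewal_times(25))
```

A job shorter than the token lifetime gets an empty list, which is the only case the buggy code handled correctly.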

Yesterday, there was one job that was "stuck" in the queue and wouldn't die. The job was long done and gone, but PBS thinks that it's in an illegal state, and won't let it leave the queue. Hence, the node that that job was on wasn't released. Today, there are many more jobs like that (but those jobs are still running). I have no idea what the problem is, and I'm kinda annoyed.

We asked again for PBSPro (i.e., the commercial version) -- we first asked about 3-4 weeks ago -- and the PBS guys replied that it was taking them longer than they thought to set up their online store (even though PBSPro is free for educational users). :-( I'm kinda hoping that PBSPro will fix some of this flakiness that we've been seeing. :-(


Rusty from Argonne was here yesterday. His talk was good; I'd seen most of the material before, but it was good stuff anyway. We had good chats with him about optimizing MPI collectives (there are some really cool algorithms for this out there..), the future of LAM and MPICH, MPICH's Abstract Device Interface (version 3), my threaded booter (I gave him a copy of it, too), MPICH's mpd, etc. We had dinner at the Lumsdaine Grill, because Someone forgot to get a babysitter so that we could go to the LaSalle Grill. Ah well -- it was a good home-cooked meal, so I shouldn't complain. :-)

I downloaded the ADI-3 document, and it's huge! Compared to the spartan RPI (request progression interface) approach in LAM, ADI-3 is gargantuan.

I just noticed a post on the Beowulf list -- someone posted LAM vs. MPI/Pro (a commercial MPI) vs. MPICH results. The TCP numbers are clearly in LAM's favor. This, obviously, is because LAM rocks. However, MPI/Pro and MPICH have VIA results (which are obviously better than TCP results)... we need a VIA device... You can see the results for yourself. LAM ROCKS!!!

I've been working on IMPI stuff this week. I got the IMPI attributes on communicators working (i.e., on MPI_COMM_WORLD -- since we don't do anything other than MPI_COMM_WORLD yet, we don't have to maintain these attributes on other communicators, which would take some additional bookkeeping, because relative rank order can change, etc., etc.). I also got MPI_Bcast working in fairly short order.
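For anyone wondering what a broadcast looks like under the hood, here is a small sketch of a binomial-tree schedule, one of the textbook ways to fan data out from a root in about log2(N) rounds. This is purely illustrative: it is not the algorithm LAM or the IMPI standard actually uses for MPI_Bcast, it does no real message passing, and the function name is my own invention for this example.

```python
def binomial_bcast_schedule(size, root=0):
    """Return a list of rounds for a binomial-tree broadcast.

    Each round is a list of (sender, receiver) rank pairs. Ranks are
    handled relative to the root, so relative rank 0 is the root itself.
    """
    rounds = []
    mask = 1
    while mask < size:
        pairs = []
        # Relative ranks 0..mask-1 already hold the data at this point;
        # each one forwards it to the rank that is `mask` positions away.
        for rel in range(mask):
            peer = rel + mask
            if peer < size:
                pairs.append(((rel + root) % size, (peer + root) % size))
        rounds.append(pairs)
        mask <<= 1                   # holders double every round
    return rounds


if __name__ == "__main__":
    for i, pairs in enumerate(binomial_bcast_schedule(8)):
        print("round", i, pairs)
```

The number of ranks holding the data doubles each round, which is why the tree beats having the root send to everyone one at a time once there are more than a handful of ranks.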

I noticed a good number of typos and one inconsistency in the IMPI standard. Hence, I am proud to say that I am personally responsible for every item in the IMPI errata document. Well, ok, I only helped discover the first one (an issue with the protocol hiwater/ackmark values), but I still had a hand in it.

This is all for the SC'2000 IMPI demo with HP and MPI/Pro -- we're going to run a GUI Mandelbrot program across all three MPI implementations. Should be pretty cool, actually. We had our second teleconference today, and things appear to be going well. We plan to test the stuff across the internet next week. HP and MPI/Pro have been using LAM to test their IMPI implementations. I gave them instructions for CVS access today, so that they can get the MPI_Bcast and color stuff.

I just can't help it -- LAM ROCKS!


Seriously, though, it is very cool to be working on a project that matters. That is, LAM is probably only used by a few thousand people around the world (at most), but there are many devoted fans who use it every day. Indeed, many people's software relies on ours to function properly -- much real-world stuff depends on what I do in LAM. It's very cool.

The level of responsibility can be a bit scary at times (indeed, I remember the first time that I noticed a .mil site downloading LAM; I told Lummy about it, and he just smiled and said, "sleep tight!"). Real world stuff uses my code. Hence, if I fuck up, Bad Things can happen. For example, I know for a fact that companies like GE and Exxon use LAM/MPI.

But isn't this the level of responsibility that a good engineer should embrace? I think so. Being Careful about what you do is not just a state of mind, it is a way of life.


Saw a talk from Vince's advisor today about link-time optimizations. Interesting stuff. Similar to things that are available in Solaris (e.g., -O5, where multiple runs generate profiling feedback data that speed up subsequent runs), but it was neat to hear how it works. He was using it in conjunction with MPICH, so I set him straight in his ways -- since they're using TCP/IP, if they really want asynchronous message passing, they should use LAM since we can do it (via the lamd mode, which has its own tradeoffs -- the asynchronous message passing mode isn't free, so to speak).

He sounded intrigued, and said that he would get the latest version of LAM and give it a whirl. And so we progress, one user at a time, towards world domination...


Well, ND's network is going to start shutting down for maintenance in about 15 minutes, so I'm outta here. Next journal entry will be from home.


About

This page contains a single entry from the blog posted on October 13, 2000 8:17 AM.
