
April 2001 Archives

April 2, 2001

LAM/MPI 6.5 released

As Renzo said (for a completely different occasion, mind you), "FINALLY!".

The LAM Team of the Laboratory for Scientific Computing at the University of Notre Dame is pleased to announce the release of LAM/MPI version 6.5.

The software package can be downloaded from LAM/MPI's new web site (please update your bookmarks and links accordingly):


LAM/MPI is a portable, open source implementation of the Message Passing Interface (MPI) standard. It contains a full implementation of the MPI-1 standard and much of the MPI-2 standard.

LAM/MPI's features include:

  • Persistent run-time environment for fast user program startup and
    guaranteed termination and cleanup of resources

  • High-performance message passing, including combined shared
    memory/TCP message passing engines

  • Extensive debugging tools

  • Much of the MPI-2 standard, including:

    • support for basic one-sided functionality

    • full implementation of dynamic processes

    • handle conversion between Fortran and C

    • new attribute access functions on communicators, datatypes, and
      windows

    • new datatypes

    • support for many parallel I/O features

    • C++ bindings for MPI-1 functions

New features in LAM/MPI 6.5 include:

  • Man pages for all MPI-1 and MPI-2 functions

  • Made LAM aware of PBS so that the same user can have multiple LAM
    universes on the same host simultaneously

  • New SMP-aware "mpirun" command line syntax and boot schema syntax
    for "lamboot"

  • Added finer grain control via an MPI_Info key for MPI_COMM_SPAWN

  • Shared library support on all platforms that GNU libtool supports

  • Revamped the build process; now uses GNU automake

  • Full support for VPATH builds

  • The lamboot and recon commands are noticeably faster

  • New "lamhalt" command to quickly shut down the LAM run time environment

  • New "lamnodes" command to retrieve hostnames from nX and cX nomenclature

  • Added MPI_ALLOC_MEM and MPI_FREE_MEM, mainly in anticipation of
    Myrinet and VIA support

  • Added "-s" option to lamboot so that "rsh somenode lamboot -s
    hostfile" allows rsh to terminate

  • Expanded many error messages to be more descriptive and generally
    more helpful

  • Updated MPI 2 C++ bindings distribution

  • Updated and patched ROMIO distribution

  • Added syslog messages (for debugging) to the run-time environment on
    remote nodes

  • Bug fixes
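To make the SMP-aware "mpirun"/"lamboot" item above concrete, here is a sketch of a boot schema and a typical session. This is reconstructed from memory of 6.5-era syntax, not copied from the release; the hostnames are placeholders and the exact keywords may differ:

```
# hostfile ("boot schema") -- hostnames are made up
node0.example.com cpu=2
node1.example.com cpu=2

# Typical session, from a shell:
#   lamboot hostfile          # start the LAM run-time environment
#   mpirun C ./my_program     # "C" = one process per CPU (4 here)
#   lamnodes                  # map nX/cX names back to hostnames
#   lamhalt                   # shut the run-time environment down
```

As I recall, the nX node names that "lamnodes" resolves come from the order of the hosts in this file.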

LAM/MPI supports portions of the Interoperable MPI (IMPI) standard in a separate distribution -- the 6.4 series (also downloadable from the LAM/MPI web site). It is expected that the IMPI extensions will eventually merge into the 6.5 series.

XMPI 2.2 will not work with LAM/MPI 6.5. A new version of XMPI will be released soon that will include support for LAM 6.5.

The full source code for LAM/MPI is available for download. Linux RPMs for all three of LAM's message passing engines (pure TCP, combined TCP/shared memory with spin locks, and combined TCP/shared memory with semaphores) are also available.

All downloads are available from LAM/MPI's new web site (please update your bookmarks and links accordingly):


A list of mirrors of this site is available at:


The web site also contains details for CVS access to the LAM/MPI source tree, FAQs, MPI and LAM/MPI tutorials, and lots of other information.

With this release, the addresses of LAM/MPI's two mailing lists have changed to reflect the new web site:

  • General user's mailing list

    This list is for questions, comments, suggestions, patches, and generally anything related to LAM/MPI (to control spam, you must be a subscriber to post to the list). Web archives of the list, as well as individual and digest subscriptions, are available. See the following URL for more information:


  • LAM/MPI announcement list

    This is a low-volume list that the LAM Team uses to announce new versions of LAM/MPI, important updates, etc. Public posts are not allowed. Web archives of the list, as well as individual and digest subscriptions, are available. See the following URL for more information:


Make today a LAM/MPI day!

I'm off to astound the world with more feats of adequcocity

Some quickies:

  • Spent the entire weekend (and I do mean entire weekend) doing taxes and old Army paperwork. My taxes seem to be a bit complicated this year 'cause it seems that I did an improper IRA conversion to a Roth IRA, which I now have to undo. Ugh! The Army paperwork is mostly really old stuff that I should have done a long time ago. I have to send all that stuff to Georgia first, and then they'll sign stuff and send it on to the proper destinations. Oops. :-(

  • Tracy took care of doing a final walk-through of our old apartment (a.k.a., "the hell hole"), and it's now finally out of our possession. Woo hoo! We sold 2 of our old room air conditioner units, and still have the Big Monster one left -- we left it in the apartment, and the owners said that they would have the new renters call us to work out a deal.

  • I'm going to get a fax machine for home. It's been really, really annoying having to have Tracy fax stuff for me, and the frequency has increased lately -- it's gone beyond "one or two faxes in a while", so I really should be doing them at home and not on the company bill.

  • The ND women's basketball team rocked last night and won the national championship. That's just so cool! Many congrats to them (like any of them will ever see this :-) -- they did themselves and ND proud.

  • LAM now seems to pass all tests. I'm going through a checklist before releasing. Brian and I are synching up later today, so it's quite possible that we're finally going to release LAM 6.5 today. Woo hoo!!!

  • Finally saw Gladiator this weekend. Not a bad flick. I give it 10 minutes. Probably would have given it more, but it was way too hyped up for me.

  • xmms crashed earlier today, but much earlier than it usually does. There were only 481 copies running (out of 555 processes total). <shrug>

  • Just to clarify (some have asked) -- I was not affected by the Northpoint DSL shutdown. Telocity is my ISP, and they use many "routing" companies (including Northpoint). The "routing" company around here is Covad, so I wasn't affected. My recent DSL woes have been caused by increased solar flare activity (a catchall to blame random occurrences on).

That's it for now.

There are currently 34 xmms processes running out of a total of 112 (30%).

Why's your mom taking the SATs with you?

Forgot to mention -- I had to change my clocks for daylight savings time for the first time in many years. I now live in a part of the world that actually participates in daylight savings time, so I have to switch my clocks twice a year.

What a strange experience.

April 4, 2001

Matthew just told me to go fetch his lunch

The week started off well.

We released LAM 6.5 and 6.4a7. I included the press release about this in a prior journal entry. There was much rejoicing.

But then I checked my e-mail this morning and saw that some guy claimed that he couldn't start parallel jobs under Linux with LAM 6.5. And then 3 or 4 others chimed in saying the same thing. Nooo!!!!!

Needless to say, this is a software developer's nightmare: discover a bug immediately after a big release. It took a while, but I tracked the bug down to a faulty test in LAM's configure script. And it only seemed to affect Linux (it had to do with pseudo-tty behavior). Arrrgghhh!!!

Someone else also found a legitimate bug in the C++ bindings. It's amusing because the C++ bug has been there for quite some time, but it just happened to be found on the heels of the Linux pty Big Bug.

So I released 6.5.1 and 6.4a8 this afternoon.

Hopefully, things will be ok now.

No, Trond from Red Hat just e-mailed me and says that all the tests are failing on his Linux 2.4.2 machine. Arrgggghhh!!!! In all fairness, we've never tested on 2.4.2, so I'm kinda hoping it's just some kind of stupid difference between Linux 2.2.x and 2.4.x. He's going to give us access to his machines tomorrow to give it a whirl. We'll have to see.

Ugh. All of this made today pretty crappy.

I finally had my windshield replaced the other day. It cracked itself quite a while ago after a particularly cold evening. I just came out one morning and there was a 2 foot crack across my windshield. It clearly wasn't impact damage of any kind; it just appeared there. So I assumed it was thermal damage.

Anyway, now that I have a garage, I finally called USAA to start a claim on my windshield. It didn't cost me a dime, and they had a guy out here the very next day to replace it.

I watched him do it -- it was fairly interesting. Lots and lots of sealant to keep those windshields in place, and keep water out. The guy told me that Saturns were probably his least favorite windshields to replace (this is all this guy does -- replace glass in cars; he's been doing it for 12 years, so I would guess that he pretty much knows what he's talking about) because they have a larger curve than most, and it makes it a bit more difficult to get the new windshield in, etc.

The main sealant that he used to hold in the new windshield was some caulk-like stuff that he put around the frame before he put in the new windshield. When he was all done, he told me to wait about 2 hours before driving because the caulk would need time to cure. The windshield would still stay in place if I needed to drive, 'cause there's other clips and strips and various insidious devices holding it in place, but apparently (and I didn't know this beforehand) the purpose of windshields is not only to keep wind and rain and whatnot out of the car, but also to keep passengers in the car in the event of a collision. And if I drove before the caulk cured, in the event of a collision, the windshield could pop out.

So that's an interesting engineering issue -- making caulk-like adhesive and a plate of glass that is strong enough to hold up to several hundred pounds of humans and other loose objects in the car, assumedly all moving with a very large momentum. Woof!
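Some back-of-the-envelope numbers for that point about momentum. Every figure here is assumed purely for illustration (a 90 kg passenger, 50 km/h, a 0.1 second stop), not from any real crash data:

```python
# Rough, illustrative numbers (all assumed): one 90 kg passenger
# moving at 50 km/h, stopped by the windshield over about 0.1 s.
mass_kg = 90.0
speed_m_s = 50 / 3.6           # 50 km/h is about 13.9 m/s
stop_time_s = 0.1

momentum = mass_kg * speed_m_s           # p = m * v, in kg*m/s
avg_force_n = momentum / stop_time_s     # F = dp / dt (impulse)

print(round(avg_force_n))                # 12500 N
```

That average force works out to roughly 1.3 tonnes-force on the glass and its adhesive, which would explain why the caulk needs its full cure time.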

I've had to get my DoD "secret" clearance updated. Apparently, my last background investigation was done in 1990, and they're only good for 10 years. So my original clearance has expired. I got a packet in the mail for my reinvestigation. I had to download some questionnaire program (Windoze only, of course) that asked a zillion questions about my history.

One of the things that it asked was all of my addresses for the past 10 years. After some thinking about this, I was surprised to discover that I have lived at 16 different addresses over the past 10 years (including my current address). Wow. No wonder I hate moving!

I also had to be fingerprinted. This seems kinda weird, especially since I've been fingerprinted before (when I entered ROTC). A person's fingerprints never change over their life -- they expand a bit, but my understanding is that the unique characteristics of the whorls and whatnot stay the same, albeit they typically grow in size as your hands get larger. So why did they need them again? Who knows...

I went to my local police station and was surprised to find out that they only do fingerprinting on the third Thursday of every month (no joke). I know that the ND security department does it if you just walk in (Brian had it done for his DoE clearance about 2-3 weeks ago). However, my local police department gave me the number to some adjoining precincts, so I called them, and one of them does it every day.

I went today and had it done. The officer who took my fingerprints says that they do about 10-15 a week, for all kinds of different agencies. DoD (Department of Defense), DoE (Department of Energy), FBI, various and sundry banking and trading firms, etc., etc.

I stopped at the mall to get Tracy's birthday present (next week -- I know I'm safe, 'cause she never reads my journal :-). I parked at one side and had to walk clear across the mall -- through various department stores and whatnot -- to get what I was looking for. How annoying. And there's all the chatty folks in the aisles in the mall with the mini store displays selling cell phones and sunglasses and watches and portable walrus scrubbers, each one of them feeling the need to ask you if you want their particular product as you walk by.

No, I don't want a portable walrus scrubber. I'm just trying to walk by.

Don't get me wrong -- the mall is the quintessential symbol of American capitalism (Suzanne and Rich -- don't you dare start quoting facts at me here, I'm on a roll), but with all those stores all selling essentially the same items (are clothes from the Gap really much different than Tommy Hilfiger clothes?), I just have to ask myself: why?

How did I possibly enjoy going to the mall when I was a teenager? Oh, wait -- I didn't.

Maybe I'm just one of those people who likes to go get what they want and not have to bother with 16 billion choices. Maybe I was just in a pissy mood because someone found a real bug in LAM earlier today (this was most likely it). Oh well. End of topic...

My aunt gave me the e-mail addresses of my cousins Pat and Chris the other day. I mailed them, but they haven't replied yet, the little weasels. I'm sure they've seen my mail -- they're the ones who were almost sold into slavery to pay off the excessive AOL telephone bill last month, so I'm sure that they're online all the time...

Tracy's parents are visiting her grandmother in Illinois this week. They're stopping by this weekend to visit and to see the house.

Must go; have been doing LAM stuff all day, and no dissertation work. Ugh!

There are 449 xmms's running on queeg, out of 552 processes total -- 81%.

April 5, 2001

This is unbelievable

I am flabbergasted.

As I mentioned in a previous journal entry or two, I am in the process of filling out a background check for the re-upping of my "Secret" DoD clearance. This is a periodic and normal thing.

I downloaded the software from their web site (which had strong encryption export control warnings all over it), and filled out all the questions. At the end, it spits out a .zdb file.

Here's the part that astounds me: they tell me to e-mail this file to them.
They claim that this file is encrypted and it's safe to e-mail (I even spoke to two different people on the phone who claimed that the .zdb file is encrypted). However, there are some major flaws with this claim:

  • The very same web site that allowed me to download the "user" version of the software also had the "security manager" version of the software. This version decrypts .zdb files. So just anyone in the world can download the decrypting software and compromise my .zdb file.

  • I copied my .zdb file to my linux box and ran "file" on it. It said that it was a ZIP file. No way... Yes way. I extracted all the files in it and was horrified to see my social security number in plain text in multiple files. Some parts of the files actually did appear to be encrypted, but if just anyone can download the security manager version of the software, what does that matter?
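The check is easy to reproduce in miniature. This sketch (the file name and contents are made up; nothing here is from the real .zdb) builds a plain ZIP the way the questionnaire tool apparently did, and shows both things `file` and I saw -- the ZIP magic bytes and readable plaintext members:

```python
import io
import zipfile

# Build a stand-in ".zdb": a plain ZIP with a plaintext member.
# (File name and contents are invented for illustration.)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("applicant.txt", "SSN: 000-00-0000")

data = buf.getvalue()
print(data[:2])          # b'PK' -- the ZIP magic that `file` keys on

# "Decrypting" it is just unzipping it.
with zipfile.ZipFile(io.BytesIO(data)) as z:
    print(z.read("applicant.txt").decode())
```

If the .zdb really were encrypted, `file` would report unrecognizable data rather than a ZIP archive, and the members would not decode to readable text.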

The best part of this is the two agencies who are running this show. Their names are: "Defense Security Service" and "Security and Counterintelligence Management Office". You would think that with names like this, they would have a clue about data security.

And they wonder why people are talking about a digital Pearl Harbor...

April 8, 2001


It's been a few days since my last entry.

In brief:

  • Tracy's parents were here for the weekend. Partly to visit, partly to see the house, and partly so that Tracy could go shopping w/ her mom.

  • Tracy's dad and I did some house things, like install a new programmable thermostat and put pellets around the foundation of the house to repel ants and mites. The joys of being a homeowner.

  • I think Tracy and I set a record for going out to dinner for 4 days in a row. Woof. Thursday, we were both in snitty moods and didn't feel like cooking (that was the day I found out that my background check stuff wasn't encrypted). Friday was dinner w/ Janna at an English Pub a few minutes up the road from us (fish-n-chips, of course -- it's Lent!). Saturday was with Tracy's parents because Tracy didn't want to cook for her mother. :-) Today was with Tracy's parent's again because Tracy's birthday is tomorrow.

  • Tracy's parents leave tomorrow morning. It was a quick visit.

  • There's been lots of good discussion on the llamas list about All Manner of Things LAM. Good advice and tips and whatnot from the llamas.

  • Trond from Red Hat is still running into weird LAM failures in some esoteric circumstances. We haven't been able to duplicate his issues.

  • I'm heading up to ND for this upcoming week tomorrow morning.

  • Been getting ready for my meeting with my dissertation committee this week (no, I'm not stressed...). I'll be spending some more quality time in the library this week, and will spend time writing writing writing...

  • Got some blinds for my office instead of these sheets that are hanging in the windows now. I'll hang them when I get back from ND.

  • (can't remember if I mentioned this in a previous journal entry or not) My new linksys switch came last week. I'd been waiting for quite a while, and kept checking on the status of it at amazon.com. The weird thing is that it said "Delivered", but I never saw it at my door. Hmm. So on Wednesday or Thursday, I called UPS and punched in the tracking number. They said it had been delivered on Monday... to the back door. Doh! Sure enough, it was sitting at my back door. It had been sitting there for 2-3 days. Gotta remember to check for those back door deliveries, I suppose. :-)

Gotta run now. There are 947 xmms instances running on queeg -- about 93% of all processes.

April 14, 2001

Strange... the Garelli 5000 had exactly the same problem


I'm behind on journal entries. Let's catch up:

  • I was at ND for most of this past week. The main goal was to have a synchronization meeting with my Ph.D. committee (which was on Wednesday). I drove to ND on Monday morning (Tracy's parents and I left at about the same time). Spent a good amount of time up at ND refining my presentation for my committee meeting and rehearsing with Lummy. All in all, things went pretty well, and my committee was pleased with my work. They made a few suggestions and clarifications which change a few things that I had planned, but they're not too big a deal. Rusty drove over from Argonne for the day, and it is always good to talk with him.

  • About 20-40 miles out of Louisville, I realized that I had forgotten my rollerblades. Doh!! I guess I'll be walking to ND all week...

  • Only Brendan, Brian, and I went to wings. It seems that BW3's has deleted our RLYBAD account on the trivia game. This is truly the end of an era -- we have had the RLYBAD account for years, and now it's gone. RLYBAD is dead -- long live RLYBAD!

  • OO Stamtish was fun. I only stayed for an hour or so. The new crew is working at Sr. Bar, so the opportunities for free stuff are now severely limited.

  • Went out to dinner w/ Dog on Wednesday night (which started with me stopping by his office and chatting around 6 or 7pm, and, an hour later, I said, "hey, let's go get some dinner"). Dog is good people. Also had some good conversations with Curt. Curt is also good people.

  • Chatted with Rich about his work -- he was frantically trying to finish his Ph.D. proposal by this Thursday. He's working on multithreaded message passing systems; we talked a bunch about how LAM works and whatnot. He seems to have three main choices to do his work:

    • Use Sun/MPI, since it's thread hot. But ND still hasn't installed it (even though they've owned it for at least several months)
    • Use LAM/MPI, but it has the major drawback that it's not thread hot, so Rich would have to make it thread hot. Not a minor task.
    • Write his own message passing system from scratch.

    We talked a bit about LAM/MPI and how it worked, and some general message passing things (who knew that it would ever be so incredibly hard to get bytes from point A to point B? It's much harder than one would think...) I tossed the idea out that he could use LAM without using MPI -- I pointed him at all the Trollius man pages and whatnot, and explained how he could get the use of the daemons without using our MPI layer, etc., etc. Who knows -- that might prove to be a workable solution for him.

  • LAM meeting on Thursday was good; Brian has had some success with Scyld. Although it's not quite what we want it to be yet, it does work. We'll probably make it a bit more slick before releasing it. Arun had done a little more on the Myrinet mop-up, but not much. After the meeting, I helped him add some environment variables which will allow the user to specify (at run time) the tiny and short message boundary sizes. This is an important tuning knob, and it turned out to be a little tougher to implement than we thought because we were using compile-time constants to size some static arrays. But we worked around it and it seemed to work; Arun's going to finish the testing this week.

  • Arun has made his decision to fade away from the LAM group, mainly since he will be staying at ND when we go to IU, and his future involvement with LAM is probably going to be pretty limited (if at all). He'll continue to answer LAM mail through the end of this semester, and get Myrinet out the door, but that will more or less be it. Sadness. :-(

  • I had one more meeting w/ Lummy on Thursday before I drove home. We talked about my committee meeting from the day before, clarified a few things, and set a few directions. I've got to write code code code to get the final polished version of my "manager worker" code out (although we decided that, strictly speaking, "manager worker" is not what this program does, so I have to come up with a better name, such as "distributed multithreaded data parallel framework" or something).

  • Drove home Thursday afternoon / evening. Sheesh, gas is expensive!

  • Went to Epiphany on Friday morning to have a look at some of their e-mail woes (I just switched a small "test" group of them over to Outlook Express with the DSL-provided e-mail). OE seems to keep freezing up on them, which is pretty surprising and disappointing (it's the most recent version of OE on Win 95 and 98 machines). I think what's happening is that when OE launches, it launches two windows -- the main window and a separate "checking your mail..." window. It even shows up as two items on the 'doze task bar. The "checking your mail..." window then prompts for their password.

    However, sometimes OE puts the main window on top of the "checking your mail..."/password window. And therefore the user doesn't see it. So they start using OE, even though the "cym..."/password window is still there and waiting. OE is configured to check for their mail every 10 minutes. It seems that if they either manually click on "send/receive" or if the 10 minute timeout expires and OE tries to check for mail, it gets confused because the "cym..."/password windows are already open, and hangs. Weird. And lame.

    It seems to have a simple workaround -- always put in the password right away, even if it goes to the back (i.e., bring it to the front and put in your password). We could check the "save your password" box so that the issue never comes up, but I'm not a big fan of that --
    I prefer users to have to think about security once in a while. Plus, it means that anyone could walk up to their computer and access their e-mail. This is probably not a big deal in a Church staff environment, but there are enough random people walking through the offices in a given day that it is something to consider. I hope we don't have to do that, but we'll see.

  • I got home around 2pm (did a bunch of other maintenance, too, since most people had taken Good Friday off), and spent the entire rest of the day doing taxes with Tracy. Ugh. The federal stuff was essentially done (just one or two minor corrections), but the state stuff was extremely confusing. I bought the Indiana and Kentucky programs for TurboTax. Indiana was quite good -- it did all the Right Things for our tax situation (although it did have the annoying "feature" that if you started going through the interview and navigated to somewhere else in the middle of the interview, you couldn't re-start the interview where you left off -- you essentially had to ditch that data and start the interview again. Very annoying). The Kentucky program, however, sucked. Kentucky allows four filing statuses: single, married filing separately on this one form, married filing jointly, and married filing separately on different forms. Because of our particular tax situation, we needed to do the last option. But Ttax didn't have that option. After grappling with this for several hours (combined with finally figuring out that we needed to use that last option), we finally just looked up the relevant forms in Ttax and filled them out manually.

    Figure this one out: Tracy, who had income in Kentucky last year, filled out a 2 page return with a single additional schedule for itemized deductions. A total of 4 pages. Me, who had no income in Kentucky last year, and who was not even a resident of Kentucky last year, had to fill in a 2 page tax return combined with about 10 pages of stuff from my federal tax return -- all so that I could say that my tax owed in Kentucky was zero. Gotta love taxes...
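The Outlook Express hang described a few items up can be sketched as a toy locking problem. To be clear, this is only my guess at what OE does internally; the names and structure here are invented:

```python
import threading
import time

check_lock = threading.Lock()          # "a mail check is in progress"
password_entered = threading.Event()   # never set: the prompt is hidden

def first_check():
    # The first check grabs the lock, then waits on a password prompt
    # that's sitting behind the main window, so it never completes.
    with check_lock:
        password_entered.wait(timeout=2.0)

t = threading.Thread(target=first_check)
t.start()
time.sleep(0.2)                        # let the first check take the lock

# The 10-minute timer fires a second check; it can't get the lock,
# so from the user's point of view the whole program is frozen.
second_check_hung = not check_lock.acquire(timeout=0.2)
print(second_check_hung)               # True
t.join()
```

The workaround in the entry above (answer the hidden prompt right away) amounts to setting the event before the second check ever fires.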

That's it for now. I have a separate journal entry brewing about the whole Chinese/American plane collision thing.

There are 139 xmms processes running on queeg right now out of 211 total (66%). When I came home from ND on Thursday night, xmms was frozen, so I had to kill and restart it.

An editorial

A few words about this whole American plane colliding with the Chinese plane thing... I am not a diplomat. I am not a statesman. I am not wise in political ways. These are just my thoughts; they have no correlation to any official positions that I hold, nor are they related -- in any way -- to any of my employers. These are also not well-studied conclusions; they are just my personal thoughts.

From our point of view, it seems that the Chinese pilot was clearly the aggressor. Our pilots claim that the plane was flying straight and level on autopilot and the Chinese pilot approached at high speeds, multiple times (getting as close as 3 feet from the left wing at one point). The third time, the Chinese pilot apparently misjudged his approach, which resulted in the crash.

This appears to be consistent with the fact that the Chinese plane is much more maneuverable than the American plane. Indeed, I know that if I were the pilot of the American plane, I'd be flying straight and level on computer autopilot for two reasons:

  • specifically so that I could claim that I was not the aggressor.

  • since any foreign pilot would be a variable (regardless of their degree of aggression), an unarmed plane only has one defense --
    be completely predictable and hope that the other plane doesn't hit you.

This only makes sense.

It would also be incredibly stupid for the plane to have been in Chinese airspace. While I certainly have no knowledge of that plane's specific mission, I find it hard to believe that such an electronically noisy plane (and therefore easily observable by the Chinese) would have intentionally ventured into Chinese airspace during their mission (i.e., before the collision) without permission when we are not at war with them and with no means of defense. So I find it hard to believe that they were not in international airspace. Did they skirt the border? Perhaps. But were they in Chinese airspace? I doubt it.

Is this really what happened? I would tend to think so. I know some US military pilots, and I'm pretty sure that their reactions would be pretty much what I said above.

But was it really the case? It certainly makes no sense for the American plane to intentionally swerve into the Chinese plane. But did the American plane unintentionally swerve into the Chinese plane? If so, the Chinese plane:

  • clearly must have been too close to the American plane for accepted safety limits (i.e., the Chinese pilot had no time to react to prevent the collision), or

  • was far enough away (i.e., should have had time to react), but the pilot was so inattentive that he didn't notice the American plane lumbering towards him

Either way, the Chinese pilot would share at least some of the fault.

So what really happened? It's hard to say, and I wonder if the public will ever really know what happened. There are multiple factors which influence any situation:

  • Take 10 people who were all direct eyewitnesses at the scene of an accident, and you'll still get multiple different versions of the story.

  • We (the public) accept pretty much whatever the media says, even though the media distorts just about every story reported.

  • And let's not forget that it is possible that the American government is covering up the details of the "real" story. As much as my patriotism doesn't want to acknowledge this fact, it certainly could be the case -- the scientist in me has to concede that point.

So what really happened? I don't know for sure, but I'm inclined to believe some form of what the American pilots claimed. Are all the details exactly right? Perhaps not. But what they say generally makes sense, whereas the Chinese version doesn't.

As for the Chinese accusations about how the American plane landed without permission: technically speaking, they are correct -- the American plane had no permission. But they also had no choice. I do believe the American pilots' claim that they had broadcast mayday multiple times and did a 270 degree rotation around the field -- the international signal for "in distress and not in touch with the tower". Why the Chinese authorities didn't acknowledge these signals is something that they have not answered.

I initially shared my fellow citizens' outrage at the Chinese keeping the plane.

But then someone reminded me of the fact that we did essentially the same thing a few years ago when a Russian pilot defected and landed a MiG in Japan. We examined that plane thoroughly before we sent it back to Russia -- in crates. So it's hard to fault China for doing what it did (in terms of keeping the plane). It is incredibly advanced and secret technology, and it literally fell into their possession.

Granted -- the situation is slightly different than what we did (someone gave it to us rather than a forced landing), but the larger picture is the same: an advanced piece of technology came into their possession that the owners did not intend to happen, and the owners want it back.

As for the technology itself, the crew has said that they were able to destroy all the sensitive stuff in the plane before the Chinese boarded. I'm quite sure that all their non-physical codes were destroyed (computer records, access codes, etc.), as well as any code books and whatnot. Indeed, even if they hadn't, I'm quite sure that as soon as the plane announced its intention to land in China, Pacific Command started the process of changing all relevant codes. This is standard procedure -- even in the event of a possible compromise, all codes must be changed immediately. So I'm not concerned there.

But as for the machines that ran the plane, and the specific crypto devices and other kinds of secret technology on the plane -- I have no idea whether those kinds of things have self-destruct mechanisms that can be activated in the event of capture. I hope so. I'm sure the crew did their absolute best to render any technology in the plane unusable and unstudyable by the Chinese. They are all experts in their respective fields, and are intimately familiar with the machines that they fly with. We have to trust them. And I do; they apparently had at least 10-15 minutes while still in the air to potentially start the destruction process (although it's not clear that they could start the process until they landed; the plane was pretty badly crippled), and they apparently had about 15 minutes on the ground to complete these procedures.

So I have to concede here -- this is probably not outlandish for the Chinese to do. Especially when you look at the fact that we've done just about exactly the same thing. That plane is now just a pawn in the big chess game of foreign diplomacy, whether we like it or not.

That being said, I'm making a big assumption here: that the Chinese pilot did not intentionally hit the American plane so as to force it to land in China. This is a possibility, so I have to mention it, but I would think that the tensions between our countries were not strong enough (before the incident) to precipitate such an action. Indeed, it would be extremely difficult, if not impossible, to plan and execute such a maneuver and guarantee that the American plane would still be able to land (i.e., that it wouldn't be destroyed). Besides, there are other ways of forcing a plane to land than a mid-air collision. So I don't think that China did this on purpose -- it doesn't make sense.

Even more importantly than the plane is the crew. I think that this is what most Americans (myself included) are most inflamed about.

Keeping the crew was quite stupid (that's a gut reaction there). Yes, I can see China's political reasons for keeping them (and I do think it was political more than anything else -- there's no military reason to keep them), but that doesn't stop me from being angry about it. They needed to silence the American version of the story while their own version was spoon-fed to their public (their media is under even tighter control than ours), keep bargaining leverage in the situation, save face while claiming to wait for an apology from the U.S., and delay as long as possible so that they could keep their scientists and technicians working on the plane.

It was also in their best interests to keep the crew safe and relatively comfortable until they were returned. Think about it: if they had harmed any of the crew and didn't eventually have them killed (or otherwise kept from communicating with American officials), the American story would come out eventually, which would have been a political disaster for China (particularly with the pending trade deals and UN stuff). The crew had to eventually be returned and in perfect health with no mistreatment.

Sure, the crew were questioned. That is to be expected. I'm also sure that the Chinese officials knew that they would get little to no new information from our crew, because I'm quite sure that unless the crew were drugged (or otherwise coerced, but as discussed above, physical violence was not an option), they wouldn't voluntarily give any sensitive information away.

They gambled that they could hold the American crew for quite a while before the American government would take a hard line. And they were right. Will there be any repercussions? Maybe. But certainly fewer than if they had injured/killed any of the crew. It's hard to see how a new president would be able to take a hard stance and have direct retributions against a foreign power, particularly when that president was insistent upon negotiation for the release of the crew.

That being said, I have to admit that I'm extremely happy that no military action was taken. It would have been a more-stupid-than-normal reason to go to war. Don't get me wrong --
the crew is very important -- you never leave a crewmember behind. But going to war over the fate of 24 people is just not good statistics. The old adage, "the needs of the many outweigh the needs of the few," is highly relevant here; while I'm extremely happy that the crew is home safe, I think that they too (being members of the military) would have understood if the process had taken longer and/or gotten ugly. Military members assume a certain risk when defending American freedom; we all know this and acknowledge it when we do our jobs. While everyone is happy that it didn't come to that in this case, it certainly could have.

All that said and done, I welcome home the crew of our plane. Thanks for defending our country. It's said too infrequently, particularly when your everyday job can end with you being held by a hostile foreign power. Thanks for keeping us free.

April 16, 2001

Oh the irony


I was just informed via e-mail that I won the ND SGI Award for Computational Sciences and Visualization in the College of Engineering at Notre Dame.

It comes with a nice prize check, at least some of which I'll be spending on wireless networking for my home. I'll get the award at the graduate student awards banquet in May (this is my second time --
I won a GSU Lifetime Achievement Award a few years ago).

How cool is that?


April 18, 2001

C'mon Dave -- tap waits for no man

'tis the season for quickies.

  • My niece apparently loved the Barbie VW bug that Tracy and I sent her for her birthday (in all fairness, Tracy found and picked it out -- I only gave final approval).

  • Progress is being made on my dissertation code. My committee wanted to see a more hierarchical structure for the "manager worker" (gotta think of a better name), which I have been spending the last three days on. It's actually complicated just to launch the thing --
    you have to provide a map file (using inilib, of course). The startup protocols are getting complicated. It's interesting, 'cause I've never thought of using MPI for "startup protocols" before, but that's exactly what I'm doing -- spawning a bunch of sub-tasks, and then exchanging a bunch of startup meta information to setup the structure of the main computation. Kinda cool.

  • The more hierarchical structure will actually make the IMPI tests easier, I think.

  • I ordered some wireless networking stuff with my prize money. I got a Linksys WAP and an Orinoco silver PCMCIA NIC, both of which are backordered. Sigh.

  • Since I still had more prize money, I decided to get a DVD player as well (shh... haven't told Tracy yet; I know she doesn't read this, so I'm safe). We've been talking about a DVD player for quite a while, and could never quite rationalize buying one. Now that I had some "free money", it seemed to make sense. Outpost canceled my first order because they decided to stop carrying that model, but they still do have a refurbished version of the same model (for the same price). Go figure. So I ordered that one; it should be here Friday.

  • Tracy dropped and broke her Palm Pilot last week, so we had to order a new one for her. Got it at http://www.staples.com/, and used a coupon that we found at http://www.amazing-bargains.com/ and got it darn cheap without paying shipping. Sometimes, I just love the internet. It was actually supposed to be here today, so I guess it will show up tomorrow.

  • I helped Tracy with some Excel wizardry last night that apparently saved many hours of tedious work for her (it was a non-trivial formula that did some lookups and cross-referencing, handled errors, and generally could make a nice short-order lunch if you needed it to). Yay me! :-)

  • Darrell tells me that I need a "Previous" button on my journal so that it will go to the previous [time increment] in the web version of the jjc. It's on the to-do list, but probably won't be any time soon.

  • Still don't have a lawn mower, so I had to pay someone to cut the lawn again. Ugh. Not that I'm looking forward to mowing my lawn, but I am looking forward to not paying someone to do it. Err... well, maybe I'm looking forward to not having to pay someone to do it (let's not exclude the possibility of Jeff sometimes getting lazy and choosing to pay someone to mow the lawn).

  • Apparently, in my editorial about the whole "spy plane" incident, I said "Japanese" instead of "Chinese" at least once. Doh. I'm a stupid American. I do know the difference, and I'm quite sure which two countries were involved in this incident. I'm just a stupid typer, apparently. :-(

  • Back to the dentist tomorrow, hopefully for the last time in quite a while. Woof.

  • Thunder over Louisville is this weekend; a big airshow followed by the nation's largest fireworks show -- supposedly even bigger than the D.C. 4th of July show. It's over the Ohio River. My old Apache unit usually puts in an appearance at the show as well (although I've heard their part is usually quite lame -- I have to say I'm not surprised :-).

  • I added a quickie feature to my MP3 web caster -- I can enqueue my entire audio collection at once into xmms. I have a lot of music. I enqueued my entire "alternative" section and found that I have 1806 MP3s. Putting it on "random play" makes for quite a nice mix, and I never have to think about what to play next -- for about a week (or until xmms crashes, which usually comes first).

Gotta run. There are currently only 21 xmms processes running on queeg (out of 93), because it crashed earlier today, when 948 xmms processes out of 1018 (93%) were running.

April 20, 2001

He's weirder than a five dollar bill

Tracy's new palm pilot came yesterday.

It's the first in a series of toys coming from my latest internet shopping spree. Well, actually, Tracy bought her new pilot because she broke the old one, but I got to play with it first. ;-)

I ordered a cool book that Amazon recommended to me the other day (ok, sometimes I really am a sucker for marketing... At least I went to http://www.bestbookbuys.com/ (I think Dog recommended this site to me a long time ago -- it rocks) and didn't get the book from Amazon! I got it about $10 cheaper somewhere else): Exceptional C++: 47 Engineering Puzzles, Programming Problems, and Solutions. Looks kinda interesting. That should probably come next week sometime.

The DVD player should arrive today as well.

Mandrake 8.0 was released yesterday. Brian managed to find a really fast mirror and we finally managed to download both CD images to nd.edu. It took quite a while, but I finally downloaded them to squyres.com and made CDs from them.

I'll probably eventually install 8.0 on my laptop. Installing it on my desktop will probably wait for a little while; when I can interrupt my work for at least a full day and not worry about it (perhaps after I hand in my dissertation...). Although I really would like to move up to KDE 2.1.1 (I'm on KDE 1.something now), and I would like to see Gnome's Evolution...

In other Linux distro news, it seems that LAM 6.5.1 made the cutoff for RedHat 7.1. Woo hoo!

Another entry is coming about my dissertation code. Those who aren't tech-heads can ignore it, as it will be quite geek-filled with details.

April 21, 2001

Gazizza, Bill

The exact topic of my dissertation has changed several times.

Here's what I presented to my committee last week, with their comments applied, as well as with information from my coding it up (particularly with their changes). Those who aren't geek-minded can probably ignore the rest of this message.

Fair warning: this is a pretty long journal entry!

Background and Algorithmic Overview

The idea is to have a "fully generalized manager-worker framework". However, the end result is that it's not quite the manager-worker model -- it's more like a "fully generalized, threaded distributed work farm". I started with a model for any [serial] kind of computation that looked something like this (it won't render right in pine -- deal -- go look at the web version):

    |=======|   |===========|   |========|
    | Input |-->| Calculate |-->| Output |
    |=======|   |===========|   |========| 

If you throw some queues in there, you can duplicate and therefore parallelize the Calculate step (keep the Input and Output steps serial, because a) they'll probably take very little time, and b) any part that can be parallelized can be thrown into the Calculate step):

    |=======|   |===|   |===========|   |===|   |========|
    | Input |-->| Q |-->| Calculate |-->| Q |-->| Output |
    |=======|   |===|   |===========|   |===|   |========|
                  |     |===========|     |
                  |====>| Calculate |====>|
                  |     |===========|     |
                 ...         ...         ...
                  |     |===========|     |
                  |====>| Calculate |====>|
                        |===========|

That's pretty standard stuff, actually. That's essentially the manager-worker model.

So what I'm doing is two things: extending this model to include threads (a still relatively unexplored area with MPI, particularly since the 2 major freeware MPI implementations have little multithreading support) and making a distributed scatter/gather scheme.

The goal here is to present a framework (i.e., a library) to the user such that they only have to supply the Input, Calculate, and Output steps. Yes, they do have to be aware of the parallelism, but only so much so that they can make their problem decomposable. The framework takes care of all the bookkeeping. Hence, the user essentially writes three functions (actually, 3 classes, each with a virtual run()-like function, and functions to pack/unpack their input and output data. As briefly mentioned in previous journal entries, I ended up using C++ templates heavily so that the type safety would all work out).

The target audience is people who want parallelism but don't really care how it works. That would be most engineers and scientists -- even some computer scientists! Most of these kinds of users just want their results, and want them faster -- they don't care how it works. After all, that's our job (as computer scientists), right?

Back to the description...

From the above picture, if we're only running on one machine (say, a 4-way SMP), the Calculate boxes (instances) will be individual threads. The Input and Output instances will be threads, too. By default, there will be one Calculate thread per CPU -- the Input and Output threads will be "extra" and cause some thrashage of CPU scheduling, but not very much -- particularly when the Calculate step is large enough to run for a while.

Note that the two queues do not have threads running in them --
those queues are just data structures with some intelligent accessor functions. The Input, Calculate, and Output threads access the queues and become a thread active in the queue. But there are no separate threads running the queues themselves.

Using threads is nice because it avoids the whole issue of extraneous memory copying and allows message passing latency hiding (even with a single-threaded MPI implementation). If we used the same model with pure MPI instead of threads -- i.e., where Input, each of the Calculate instances, and the Output were all separate MPI ranks on the same node -- we'd be doing sends and receives between each of the instances (the queues would possibly be located in the Input and Output ranks), which would invoke at least one memory copy (and probably more). If the input data and output data are large, this could add up to be a non-trivial portion of the wall clock execution time. Using threads within a single process, pointers to input/output data can just be passed between the Input, Calculate, and Output blocks. i.e., pass by reference instead of by value. Therefore, it makes no difference how large (or small) the input and output data is.

Extending this model to cover multiple nodes, let's throw in a definition first. The node on which the Input and Output are run is called the "Server". While it would certainly be possible to run the Input and Output phases on different nodes, this model will assume that they are on the same node, just for simplicity. It is [probably] not difficult to separate them, but this work doesn't focus on that issue. Hence, there's only one Server in this model, regardless of how many nodes are involved in the computation.

To extend this model to include multiple nodes, we add a special kind of Calculate instance to the diagram from above -- a "Calculate Relay":

    |=======|   |===|   |===========|   |===|   |========|
    | Input |==>| Q |==>| Calculate |==>| Q |==>| Output |
    |=======|   |===|   |===========|   |===|   |========|
                  |     |===========|     |
                  |====>| Calculate |====>|
                  |     |===========|     |
                 ...         ...         ...
                  |     |===========|     |
                  |====>| Calculate |====>|
                  |     |===========|     |
                  |====>| RelayCalc |====>|
                        |===========|

This RelayCalc instance has the MPI smarts to send input data to, and receive output data from, remote nodes. Notice that it just dequeues input data and enqueues output data just like the other Calculate instances. Hence, the Input and Output instances do not need to know anything special about remote access.

Also note that there will be a thread running the RelayCalc instance. One could conceivably model the relays in the queues, but this would entail having 2 relays, and would cause some issues with non-thread safe MPI implementations (although these issues arise elsewhere, anyway), and it would destroy the idea of not having threads running in the queues. While threads are nice and lightweight, we don't need to have extraneous threads running where we don't need them. Not only are they not free (in terms of resources), they do add complexity (e.g., what would threads running in the queues do?).

The RelayCalc fits in the same category as Input and Output --
it's an "extra" thread, but it is not expected to take many CPU cycles (particularly when the Calculate phase is non-trivial).

Note that there is only one RelayCalc instance, regardless of how many nodes it is relaying to. This greatly simplifies the relaying with a single-threaded MPI implementation -- indeed, to have N instances of RelayCalc to relay to N remote nodes would mean that a global lock would have to be used to allow only one RelayCalc instance in MPI at any time. This would mean that all the RelayCalc instances would have to poll with functions such as MPI_TEST. And this would involve continually locking, testing, and unlocking between all the RelayCalc instances, which would certainly keep one or more CPUs busy doing message passing rather than working in the Calculate instances, which is not desirable.

Hence, there's only one RelayCalc instance that can do blocking MPI_WAITANY calls to check for messages from any of the nodes that it is expecting output data from (and checking for completion of sent messages -- see below). This will probably serialize message passing in the server, but that is to be expected with a single-threaded MPI implementation anyway.

Indeed, even if the MPI implementation were multi-threaded, there will frequently be fewer network interfaces than remote nodes (typically only one), so the network messages will likely be at least somewhat serialized anyway. The best that a multi-threaded MPI implementation could do would be to pipeline messages to different destinations across the available NICs, but that's within the MPI implementation, and not within the user's (i.e., the framework's) control. Indeed, a quality single-threaded MPI implementation can pipeline messages anyway (if non-blocking sends are used). So there's actually little gain (and potentially a lot of CPU cycles to lose) in having multiple RelayCalc instances when using a single-threaded MPI implementation -- the same end result of having multiple RelayCalc instances with a true multi-threaded MPI implementation can be achieved with a carefully coded single RelayCalc instance using non-blocking sends and receives with a single-threaded MPI implementation.

(There's a lot of the finer details that I didn't cover in the previous two paragraphs; those are currently left as an exercise to the reader. :-) Read my dissertation for the full scoop)

So now let's look at a typical non-server node in the model:

    |=========|   |===|   |===========|   |===|   |==========|
    | RelayIn |==>| Q |==>| Calculate |==>| Q |==>| RelayOut |
    |=========|   |===|   |===========|   |===|   |==========|
                    |     |===========|     |
                    |====>| Calculate |====>|
                    |     |===========|     |
                   ...         ...         ...
                    |     |===========|     |
                    |====>| Calculate |====>|
                          |===========|

A few interesting notes here:

  • The Input and Output instances have been replaced by RelayIn and RelayOut instances, respectively.

  • As far as the Calculate instances are concerned, the model is the same -- they dequeue input, process, and enqueue output.

The RelayIn and RelayOut instances are the MPI entry and exit points -- input data is relayed to the RelayIn instance from the RelayCalc instance on the Server, and output data is relayed back to the RelayCalc instance by RelayOut. This is why the user has to supply not only the Input, Calculate, and Output instances, but also methods to pack and unpack their input and output data -- the framework will call them automatically to send and receive the data between nodes.

But again, in terms of the Calculate phase -- nothing is different. It operates exactly as it does on the server node. The framework has just added some magic that moves the input and output data around transparently.

There are now two threads vying for control of MPI. Since we only have a single-threaded MPI implementation, we cannot have both of them making MPI calls simultaneously. The following algorithm allows both threads to "share" access to MPI in a fair manner.

In the beginning of the run, the RelayIn instance has control of MPI because we expect to receive some number of messages to seed the input queue. After those messages have been received, control of MPI is given to the RelayOut. The RelayOut will block while dequeuing output data from the output queue (since the Calculate threads started acting on the data as soon as it was put in the input queue), and then return the output data to the Server. Control is then given back to the RelayIn in order to receive more input data.

That is, the message passing happens at specific times:

  • Messages will only be received at the beginning of the run, or after messages have been sent back to the Server

  • Messages will only be sent after messages have been received and the Calculate threads have converted them to output data

Specifically, incoming and outgoing messages will occur at different (and easily categorizable) points in time. Indeed, outgoing messages will [eventually] trigger new incoming messages, and vice versa. So the simple "handoff" model of switching control of MPI between the RelayIn and RelayOut instances works nicely.

A big performance-determining factor in MPI codes can be latency hiding, particularly in high-latency networks such as 10/100Mbps ethernet. An advantage of this model is that even with a single-threaded MPI, progress can be made on message passing calls while actual calculation work is being done in other threads. This pipelined model can hide most of the latency caused by message passing.

That is, the RelayIn thread can request more input data before the Calculate threads will require it. Hence, when the Calculate threads finish one set of data, the next set is already available --
they don't have to wait for new data to arrive.

A possible method to do this is to initially send twice the amount of expected work to each node. That is, if there are N Calculate threads on a given node, send 2N input data packets. The Calculate threads will dequeue the first N input data packets, and eventually enqueue them in the output. The next N input data packets will immediately be available for the Calculate threads to dequeue and start working on.

Meanwhile, the RelayOut thread will return the output data and the RelayIn thread will [effectively] request N more input data packets. When the N input data packets arrive, they will be queued in the input queue for eventual dequeuing by the Calculate threads. This occurs while the Calculate threads are working -- the message passing latency is hidden from them.

This scheme works as long as the Calculate phase takes longer than the time necessary to send output data back to the Server and receive N new input data packets. If the Calculate phase is short, the RelayIn can initially request more than 2N input data packets, and/or be sure to use non-blocking communication to request new input data blocks so that requests back to the Server can be pipelined.

To improve the scalability of the system by removing some of the bottlenecks in the scattering/gathering, non-server nodes can also have a RelayCalc instance:

    |=========|   |===|   |===========|   |===|   |==========|
    | RelayIn |==>| Q |==>| Calculate |==>| Q |==>| RelayOut |
    |=========|   |===|   |===========|   |===|   |==========|
                    |     |===========|     |
                    |====>| Calculate |====>|
                    |     |===========|     |
                   ...         ...         ...
                    |     |===========|     |
                    |====>| Calculate |====>|
                    |     |===========|     |
                    |====>| RelayCalc |====>|
                          |===========|
                           (optional)

This RelayCalc instance will relay input data to additional remote nodes, and gather the output data from them, just like the RelayCalc on the Server node.

The implication of having a RelayCalc step is that we can have arbitrary trees of input and output. That is, the Server is not the only node that can scatter input out to, and gather output data from, remote nodes -- arbitrary trees can be created to mimic network topology (for example). Consider the following tree:

                     the_big_cheese (0)
              !===========!============!
              !                        !
         child_a0 (1)             child_a1 (3)
              !                 !=====!=====!
              !                 !           !
         child_b0 (2)      child_c0 (4)  child_c1 (5)

the_big_cheese is the Server. It has two children, child_a0 and child_a1. child_a0 has only one child, but child_a1 has two children. The numbers in parentheses represent the MPI rank numbers (with respect to MPI_COMM_WORLD). Note that there is no restriction to a maximum of two children -- this is just an example. Each node also has one or more Calculate instances. So the end result can be a large, distributed farm of compute nodes.

This refines some of the previous discussion: the various Relay instances (In, Calc, Out) will actually not necessarily talk to the Server -- they'll talk to their parent, child, and parent, respectively. In some cases, the parent will be the Server. In other cases, the parent will be just another relay.

The RelayIn will now need to request enough input data packets to keep not only its local Calculate threads busy, but also all of its children. This is accomplished by doing a tree reduction during the startup of the framework that counts the total number of Calculate instances in each subtree. This allows a RelayIn to know how many Calculate instances it needs to service. The RelayIn can then use an appropriate formula/algorithm to keep its input buffer full (as described above) before any of the local Calculate instances or the RelayCalc instance needs data.

The astute reader will realize that there are now three threads vying for control of MPI. Therefore, the simple handoff protocol discussed above will not work (although the handoff protocol is still applicable for "leaf" nodes, where there is no RelayCalc instance). To make matters worse, both RelayIn and RelayCalc will potentially need to be blocking in receive calls, waiting for messages to asynchronously arrive. RelayIn will only receive messages at discrete times (as discussed above), but the frequency at which RelayCalc can receive messages is determined by the node's children, and therefore could effectively be at any time. That is, since RelayIn/RelayOut will be driven not only by the actions of its children nodes, but also by its local Calculate threads, the times at which RelayCalc will need to receive messages is not always related to when RelayIn/RelayOut will need to communicate.

Specifically, there will be times when RelayCalc needs to receive a message that is independent of what RelayIn and RelayOut are doing.

It would be easiest if RelayCalc could just block in its MPI_WAITANY while waiting on a return message from any of its children. But this would disallow any new outgoing messages from RelayOut (and therefore any new incoming messages from RelayIn). The implication is that nodes would have to wait for a message from any of their children before they could send the output data from their local Calculate threads back to their parent, and therefore before they could request any new input data.

This can be disastrous if a node's children are slower than it is. In this case, a fast node could potentially drain its entire input queue and be blocked waiting for results from any of its children before being able to ask for more input data from its parent. Even worse, this effect can daisy-chain such that slow nodes could cause the same problem in multiple [faster] parent nodes; the fast parents could all get trapped waiting for results from the one slow child.

These questions are addressed further in the following section.

This tree design will help eliminate the bottleneck of having a single Server that has to communicate with N nodes (especially as N grows large) -- the problems of serializing the message passing could easily dwarf the CPU cycles given to the Calculate instances. That is, the communication could become more costly than the Calculation.

But just having a tree structure for scattering/gathering is not sufficient. Indeed, if a leaf node (e.g., child_c0) sends an output block back to its parent, and its parent immediately sends it to its parent, etc., all the way back up to the Server, this would effectively be no different than if all N nodes were connected to the Server directly -- the Server will get a message for every single output data block. This model would then only add hops to input and output data rather than increase scalability.

Instead, the various Relay instances will gather multiple messages into a single message (or a single group of messages) before sending them up the tree. For example, a RelayOut instance can wait for output data from each of its Calculate instances before sending them back to its parent. The RelayOut instance will send all of its N messages at once in a "burst" such that its parent RelayCalc instance will be able to process them all in a short period of time and then relinquish its CPU back to a Calculate instance. I'll refer to this group of messages as a "mega message", below.

Likewise, there will need to be some flow control on the messages from RelayCalc instances. It is desirable to group together multiple "mega messages" into a single "mega mega message" in order to send larger and larger messages as output data propagates up the tree, and therefore decrease the number and frequency of messages at upper levels in the tree. Hence, the mega messages that are received by a RelayCalc must be grouped together, possibly in conjunction with the output data from the local Calculate instances, before sending to the node's parent.

But how to do this? Does the RelayCalc just wait for mega messages from all of its children before enqueuing them all to the output? It would seem simpler to just enqueue the mega messages as they come in, and when the RelayOut sees "enough" messages, it can pass a mega message of its own (potentially larger than any of the individual mega messages that it received) to its parent.

One definition for "enough" messages could be N mega messages (where N is the number of children for this node), and/or M output data enqueues (where M is the number of Calculate instances on this node). This may also be a problem-dependent value -- for example, if the Calculate process is short, "enough" may be a relatively small value.

This scheme will probably work well in a homogeneous world. But what if the node and its children are heterogeneous? What if some nodes are more powerful or faster than others, or if the network connections to some of the children differ? For example, what if some of the children nodes are connected via IMPI, where network communication to them is almost guaranteed to be slower than network communication to local MPI ranks?

The heterogeneity effect implies the problem discussed in the previous section -- that slow Calculate instances can cause parent nodes to block with no more work to do, and not be able to obtain any more work because RelayCalc has not returned from waiting for results from a child.

Another alternative is to use the "separate MPI thread" approach (where all threads needing access to MPI communicate with a separate MPI thread via simple event queues), and have the separate MPI thread use all non-blocking communication. But instead of using a blocking MPI_WAIT approach, use the non-blocking MPI_TEST polling approach. The problem with this, as discussed previously, is that it could consume an undesirably significant number of CPU cycles, and therefore detract from the main computations in the Calculate instances. If polling only happened infrequently, perhaps using a backoff method (finitely bounded, of course), this might be acceptable.

Note that there will be one "incoming event queue" for the MPI thread where threads can place new events for the MPI thread to handle. But there will be multiple "return event queues" where the MPI thread places the result of the incoming events -- one for each thread that enqueues incoming events.

                  |=============================|   |========|
   |=============>| Shared incoming event queue |==>|        |
   |              |=============================|   |        |
   |                                                |        |
   |   |====|   |===============================|   |        |
   |<==| T1 |<==| Return event queue / thread 1 |==>|        |
   |   |====|   |===============================|   |  MPI   |
   |   |====|   |===============================|   | thread |
   |<==| T2 |<==| Return event queue / thread 2 |==>|        |
   |   |====|   |===============================|   |        |
  ...                                              ...      ...
   |   |====|   |===============================|   |        |
   |<==| TN |<==| Return event queue / thread N |==>|        |
       |====|   |===============================|   |========| 

The various threads that need to access MPI place events on the MPI thread's shared incoming event queue, and then block on (or poll) their respective return event queues to know when the event has finished. An event is a set of data necessary for an MPI send or receive.

The general idea is that the MPI thread will take events from its incoming queue and start the communication in a non-blocking manner. It will poll MPI periodically (the exact frequency of polling is discussed below) with MPI_TEST, and also check for new events on its incoming queue. As MPI indicates that events have finished, the MPI thread will place events on the relevant return event queue. The MPI thread never blocks in an MPI call; it must maintain a polling cycle of checking both the incoming event queue and MPI for event completions.

A special case, however, is that the MPI thread can block on the incoming event queue if there are no MPI events pending. This allows the MPI thread to "go to sleep" when there are no messages pending for MPI (although this will rarely happen).
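Here is one way the queue topology in the diagram could look in code -- a sketch only, using plain threads as a stand-in for MPI (all names are invented; the real implementation would use pthreads and actual MPI calls): worker threads submit events to one shared incoming queue, and each blocks on its own private return queue.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

struct Event { int owner; int payload; };

// A simple blocking queue; one instance is shared for incoming events,
// and each worker thread gets a private one for returned results.
template <typename T>
class BlockingQueue {
public:
    void push(T v) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    T pop() {                       // blocks until an item is available
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return !q_.empty(); });
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
};

// Run N workers against one service thread; return each worker's reply.
std::vector<int> run_demo(int num_workers) {
    BlockingQueue<Event> incoming;                          // shared
    std::vector<BlockingQueue<int>> returns(num_workers);   // per thread
    std::vector<int> results(num_workers);

    std::thread service([&] {                   // the "MPI thread"
        for (int i = 0; i < num_workers; ++i) {
            Event e = incoming.pop();           // blocks when idle
            returns[e.owner].push(e.payload * 2);   // "completed" event
        }
    });

    std::vector<std::thread> workers;
    for (int t = 0; t < num_workers; ++t)
        workers.emplace_back([&incoming, &returns, &results, t] {
            incoming.push(Event{t, t + 1});   // submit an event...
            results[t] = returns[t].pop();    // ...block on my own queue
        });

    for (auto& w : workers) w.join();
    service.join();
    return results;
}
```

The special case from the paragraph above shows up naturally here: `incoming.pop()` blocks when there is no work, which is exactly the "go to sleep when nothing is pending" behavior; the real MPI thread may only block like this when no MPI events are outstanding.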

The polling frequency is critical. It cannot be so high that it takes many cycles away from Calculate threads, nor can it be so low that input queues become drained or threads become otherwise blocked unduly while waiting for MPI events to complete. These conflicting goals seem to indicate that an adaptive polling frequency is necessary.

That is, it would seem that the polling frequency should be high when events are being completed, and should be low when no events are occurring. This would preserve the "bursty" model described above; when an event occurs, it is likely that more events will follow in rapid succession. When nothing is happening, it is likely that nothing will continue to happen for a [potentially long] period of time.

A backoff method meets these criteria: the sleep time between polls is initially small (perhaps even zero). Several loops are made with this small/zero value (probably with a thread yield call in each loop iteration, to allow other threads to wake up and generate/consume MPI events). If nothing "interesting" occurs in this time, gradually increase the sleep time. If something "interesting" does occur, set the sleep time back to the small/zero value to allow more rapid polling.

This allows the polling to occur rapidly when messages arrive or need to be sent, and slowly when no message passing is occurring (e.g., when the Calculate threads are running at full speed).

An obvious optimization to the polling model is to allow the MPI thread to loop until there are no new actions before going to sleep. Hence, if an event appears in the incoming queue, or MPI_TEST indicates that some communication has finished, the sleep time is reduced and the MPI thread polls again without sleeping.

None of this has been implemented yet; these are questions that I do not yet have definitive answers to. It is likely that my first implementation will be modeled on the backoff polling described above. We'll see how that works out.

There's a whole bit in here that I haven't really described about a primitive level of fault tolerance -- if a node disappears, all of the work that it (and all of its children) was doing will be lost and reallocated to other workers. That is, as long as one Calculate thread remains, the entire computation will [eventually] finish, but likely at a slower rate.

The gist of this is to set the error handler MPI_ERRORS_RETURN so that MPI will not abort if an error occurs (such as a node disappearing). There's some extra bookkeeping code in the RelayCalc that keeps track of which node had what work assigned to it, and will reassign it to another node (including its local Calculate threads) if a) the local Calculate threads become idle, and/or b) no remote Calculate threads remain alive.
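The bookkeeping half of this could look something like the following sketch (all names hypothetical; the error-handler half is just a call to MPI_Errhandler_set with MPI_ERRORS_RETURN and is not shown): remember which work units were handed to which child node, and when a node is declared dead, pull its units back so they can be re-dispatched to a surviving node or to the local Calculate threads.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Tracks which work units are outstanding at which child node, so that
// a dead node's work can be reclaimed and reassigned.
class WorkTracker {
public:
    void assign(int node, int work_unit) {
        outstanding_[node].push_back(work_unit);
    }

    // A node (and its whole subtree) died: reclaim everything it held.
    // The caller re-enqueues these units for the surviving workers.
    std::vector<int> reclaim(int dead_node) {
        std::vector<int> lost;
        std::map<int, std::vector<int>>::iterator it =
            outstanding_.find(dead_node);
        if (it != outstanding_.end()) {
            lost.swap(it->second);
            outstanding_.erase(it);
        }
        return lost;
    }

    // A work unit finished normally: forget about it.
    void complete(int node, int work_unit) {
        std::vector<int>& v = outstanding_[node];
        for (std::size_t i = 0; i < v.size(); ++i)
            if (v[i] == work_unit) { v.erase(v.begin() + i); break; }
    }

private:
    std::map<int, std::vector<int>> outstanding_;
};
```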

Just to clarify: I am not trying to be fault tolerant for programmer error. If you have broken code (e.g., a seg fault), this framework will not recover from that error. This framework will also not attempt to bring nodes back that previously failed; once nodes die, they -- and all their children -- are dead for the rest of the computation. Future work may make this better, but not this version. :-)

Probably a good way to describe this fault tolerant work is: if you have a program that will run correctly when all nodes are up, your program will run correctly as long as the Input and Output threads stay alive, and at least one Calculate thread stays alive.

Implementation Details

Woof. This journal entry is already much longer than I thought it would be (we're approaching 600 lines here!), so I'll be a little sparse on the implementation details, especially since it's all in a state of flux anyway.

The implementation is currently dependent upon LAM/MPI and will not run under MPICH. This is because I make use of the MPI_COMM_SPAWN function call to spawn off all the non-server ranks. The code could be adapted to allow for using mpirun to start the entire job, but that's a "feature" that I don't intend to add yet.

Unfortunately, the only way that I could think of to specify the tree used for the distribution was to use an additional configuration file. Attempting to specify the tree on the command line resulted in a clunky interface and arbitrarily long command lines. I used inilib to parse the file; it's a fairly simple, yet flexible format.
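For illustration, a hypothetical configuration file in this spirit (the actual inilib syntax and the key names in my file are surely different -- this is just to show the shape of the data): a server at the root, relays below it, and a per-node Calculate thread count, where -1 means "determine the number of CPUs at run time".

```ini
[node0]
children = node1 node2
calculate_threads = 0    ; server node: no local Calculate threads

[node1]
children = node3 node4
calculate_threads = -1   ; -1 = detect the CPU count at run time

[node2]
calculate_threads = 2
```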

The server starts up, parses the configuration file, determines how many total non-server nodes it needs, and spawns them. The non-server nodes are spawned with an additional "-child" command line argument so that they know that they do not need to read a configuration file. Instead, they receive their configuration information from their parent in the tree.

Sidenote: what's interesting to me about using MPI_COMM_SPAWN to start all the non-server nodes is that startup protocols have to be used. I'm used to writing startup protocols to coordinate between multiple processes for socket-level (and other IPC) programs, but I've never used MPI itself for startup meta information. It just seems weird. :-)

The server sends a sub-tree of the entire tree to each of its direct children. The sub-tree that it sends is the tree with that child as the root; hence, every child of the server learns about all of its children, but does not learn about its siblings or their children. The process repeats -- each child then sends a sub-tree to each of its direct children, until there are no more sub-trees to send.

One of the parameters in the configuration file is "how many Calculate threads to have on this node". If this value is -1 in the configuration, the value will be determined at run time by how many CPUs the node has. As such, after the sub-trees are distributed, a global sum reduction is initiated from the leaves back to the server. In this way, each relay will learn the total number of Calculate threads that it serves.
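The reduction itself is just a bottom-up sum over the tree. A toy sketch (the structure is invented for illustration, with any -1 values assumed to be already resolved to CPU counts):

```cpp
#include <cassert>
#include <vector>

// Each node knows its own Calculate thread count; reduce() fills in the
// total for the whole subtree rooted at each node, mimicking the
// leaves-to-server sum reduction.
struct TreeNode {
    int local_threads = 0;          // this node's Calculate thread count
    std::vector<TreeNode> children;
    int subtree_total = 0;          // filled in by reduce()
};

int reduce(TreeNode& n) {
    int total = n.local_threads;
    for (std::vector<TreeNode>::iterator c = n.children.begin();
         c != n.children.end(); ++c)
        total += reduce(*c);        // the child's "message" to its parent
    n.subtree_total = total;
    return total;
}
```

In the real framework each relay would receive these partial sums as messages from its children rather than by recursion, but the arithmetic is the same.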

After this, it's fairly straightforward (hah!) bookkeeping to distribute input data and gather output data (mainly as described in the algorithms discussed above). The queues actually contain quite a bit of intelligence (particularly the output queue), and are worthy of discussion. However, I'm pretty sure that I have discussed the queues in a previous journal entry (they've been implemented for quite some time now), so I won't repeat that discussion here.

Future Work

Here's a conglomerated list of items from the text above, as well as a few new items that would be desirable in future versions of this software:

  • When a node dies, try to bring it back.

  • When a node dies, allow all of its children to continue in the computation if possible. This may be as simple as the grandparent assuming control of all the children, or may entail something more complicated than a simple tree distribution (perhaps something more like a graph [perhaps a jmesh?]).

  • Allow the job to be started with a full MPI_COMM_WORLD --
    i.e., don't rely on MPI_COMM_SPAWN to start the non-server nodes.

  • Multiple, different Calculate thread kernels for different kinds of input.

  • "Chainable" Calculate threads, such that the output of one CalculateA instance can be given to the input of a CalculateB instance.

  • Allow the Input and Output instances to run on arbitrary (and potentially different) nodes.

  • Allow multiple Input and/or Output instances.

  • World peace.

Sir, we can't eliminate the line item for our oxygen supply

Today was the Thunder over Louisville festival.

It's a big air show and enormous fireworks display over the Ohio River between Indiana and Kentucky. Our house is about 15-20 miles from the river; as I was out in the garage building our new gas grill, I could see a bunch of the military planes fly by on their route to the show. I saw some tankers and some fast movers (have no idea which; I'm not a Zoomie). It was actually kinda cool.

I didn't see my old Apache unit on the TV coverage, but that doesn't mean they weren't there. They've apparently been in the show in years past (I still haven't made it to a show yet; perhaps next year). They had the B-1B Lancer close the military part of the airshow; the TV coverage simply didn't do it justice. That is one Big Friggen' Plane. The only time that I have seen one was at the Chicago air show two years ago, and it was enormous. Louder than a Humvee stuck in the mud, too (and that's loud).

We watched the fireworks on TV, too. It was pretty amazing and I'm pretty sure that the TV coverage didn't do it justice, either. They close the bridge over the Ohio River and shoot off fireworks from both there and barges in the river. By my watch, the show was about 25-26 minutes of nonstop fireworks. Amazing stuff.

And just think -- someone probably wrote software to design (and execute!) fireworks shows. You gotta take into account ignition time, launch time, height and flight time, explosion time, fade time, timing to the music, etc., etc. Think of the database of ordnance that drives that simulation. I think the field names alone would probably make airport security monitors shudder. Think you'll ever see that on Freshmeat? ;-)

Tracy and I tried out our new grill tonight and it works great. Needless to say, I was reading the cooking tips as I was cooking, so I did it entirely wrong, but since burgers and hot dogs are pretty hard to screw up, it all went well. Turns out we got the wrong grill cover, though (it's too small) -- I'll have to exchange that next week.

I promised Janna a good meal cooked off the grill next weekend for letting us use their 4runner to bring the grill from my local True Value hardware store (of course) to our home.

A helpful LAM user (Chris at Advanced Data Solutions -- they do oil field simulations and whatnot) has been encountering failures on his system. He sent some test code, but I still haven't been able to duplicate his problems. Hmm.

We've just changed the LAM model for handling signals in user code. LAM used to install signal handlers for SIGSEGV and the like during MPI_INIT. Now we don't install any (except for SIGUSR2, which we need for IPC kinds of things); the lamd just notices from the exit status that the child process died due to a signal and sends that back to mpirun. mpirun prints out a pretty error message and then kills the entire parallel app (this used to be done from the signal handler in the user code itself, which was usually reliable, but there were some problematic cases where it would go haywire). The code in mpirun has been bullet-proofed to make it much more robust, too. Unfortunately, we forgot to take MPI_COMM_SPAWN and friends into account -- this new plan doesn't handle them at all.

The problem here is that there is no mpirun waiting to receive status/death messages when a child process dies (RTF_WAIT is not set for MPI_COMM_SPAWN processes). Hence, if a child encounters a signal, no one will kill the rest of the parallel app. Hrmm.

The easiest solution would seem to be to turn the signal-handling code back on in MPI_INIT for spawned programs (it's still there -- we left it for backwards compatibility; you can manually activate it with the "-sigs" option to mpirun). This is somewhat inconsistent with mpirun'ed programs, but will users notice?

A better solution would seem to be an mpirund pseudo-daemon in the lamd that can "emulate" mpirun, just for these kinds of purposes. Heck, mpirun itself could utilize it so that the interface is the same everywhere. I certainly won't have time to get to this until I finish my dissertation, though...

I got notice that DFAS has deposited the first of four back payments into my checking account. It's the reimbursements for hotel and rental car for one of the two AT's that I did in Atlanta. While it's really just a reimbursement, it feels like free money. Must resist urge to spend it on house stuff...


I just noticed an annoying bug in jjc -- it evaluates underscores (i.e., emphasized HTML) before functions. So if you use the HREF function and have a URL with an underscore, jjc will change that into <em>, and complain that there's no ending tag. Grr... gotta change that order of evaluation...

April 24, 2001

I had a big history exam the next day and the only copy of the Federalist Papers that I had was *abridged*!

Some random haiku to sum up today.

ACM website
A vacuum cleaner sucks less
Highly ironic

My WAP came today
But still no ornico
Still hard line tethered

That's a lot of MP3s
Streaming jams all week

MP3s all day
xmms plays them well
Until it crashes

Is Motif that bad?
What if it sucked even more?
Then you have Lesstif

What rev have you got?
rpm -q lesstif
.89 is pain

qstat -sa
The hydra is full; damn CHEGs
I'll compile elsewhere

Just 4 syllables
It seems I know lots of these
I suck at haiku

There are 19 copies of xmms running, out of 96 total processes (19%).

Jimmy has fear? A thousand times no!

Yesterday, I sent the following e-mail to the xwrits author (xwrits is a program that periodically pops open a "take a break" window on my screen to give my wrists a rest) with the following suggestion for his excellent program:

Thanks for xwrits! I think it has saved me much pain... literally.

I have a suggestion.

I have found it convenient to schedule breaks with xwrits, not just for my wrists, but also for taking care of non-computer-related stuff (take out the trash, feed the cat, make sure that the laundered money showed up in my swiss bank account, etc.).

As such, it would be convenient to know (at least approximately) when the next xwrits break is coming.

So when one of my henchmen says, "Hey boss, we got a big player on table 6; you wanna fix a 'special' deck for the next shuffle?" I can say, "Yeah, sure -- I'll do it in about 7 minutes when I have to take a break, anyway."

Here's some random idea on how it could work -- any one of the following methods of showing the time left until the next break would be great (or something even better / easier to implement would be fine as well!):

  • a dockable toolbar app (for KDE or Gnome or whatever)

  • a small window that could sit in the corner of my desktop somewhere -- independent of the main xwrits popup window

  • a specific keystroke (in any context, which could be kinda hard) could bring up a window for 2-5 seconds before automatically closing

Such functionality would be most useful, if possible.


A helpful LAM user found a really obscure name clash with LAM's tputs() function and the termcap/ncurses tputs(). Kudos to Glenn!

I mentioned a few days ago a LAM user having a problem that I couldn't duplicate. It turns out that this might be due to Linux's implementation of TCP/IP -- it seems that after long periods of inactivity when there is data ready to be read on the receiver side, the sender will declare "timeout" and kill the socket. Doh! It's not clear if this is Linux's fault, or is part of the design of TCP/IP, or if Linux's timeout value is just short. Either way, it would be a real pain for us to put in proper heartbeats on the TCP RPI. Arf. I'm hoping that TCP/IP has a "keepalive" functionality that will do its own low-level heartbeats, but I'm not hopeful. Gotta check Stevens sometime soon about this...
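For what it's worth, TCP does have exactly this: the SO_KEEPALIVE socket option. It is off by default, and its default probe timer is famously long (classically two hours), so whether it helps here depends on the timeout behavior involved. Enabling it is a one-liner:

```cpp
#include <cassert>
#include <sys/socket.h>
#include <unistd.h>

// Turn on TCP-level keepalive probes for an already-created socket.
// (The probe interval itself is a kernel-wide default; per-socket
// tuning is platform specific and not shown here.)
int enable_keepalive(int sockfd) {
    int on = 1;
    return setsockopt(sockfd, SOL_SOCKET, SO_KEEPALIVE,
                      &on, sizeof(on));
}
```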

2 level decomposition of my dissertation code seems to work, but I think there's a minor memory leak in the RelayOut class. Should be easy to fix, but it's late and I'm tired.

804 xmms's running, out of 876 total, 92%.

April 25, 2001

I'm just not white like you Dave

It appears that the fan on the CPU in my router is dying.

All morning I was hearing weird whirrs and clicks and whatnot. I couldn't locate the source of the noise, so I assumed that it was outside. It only dawned on me to look in the closet where my router lives after an hour or two. I finally realized that the noise was coming from my router machine itself.

I popped the cover and after trying to figure out what was making noise (first candidates were the disks), I finally determined that the motherboard was vibrating to varying degrees, resulting in the rattling sound. It was pretty trippy until I realized that there is a moving part on the motherboard -- the CPU fan.

Flicking the fan with my finger resulted in realigning whatever was rattling and silencing the racket. Oh yeah, it also rebooted my router. :-\ <sigh>

Anyway, no matter how much I flicked the fan, it would always start making noise again a few minutes later. Hence, I'm assuming that the fan will die in fairly short order. Where does one get new CPU fans in Louisville, KY?

As I mentioned yesterday, I got my WAP from http://www.outpost.com/, but still have no wireless network card (it's on back order). I only hope that Outpost stays in business long enough for them to ship the card to me -- they had the best price on it, and I avoided shipping costs.

Did I mention that Tracy and I picked out a grandfather clock? It's a wedding present from my parents. We finally have a house to put it in. It's the McConnell clock from Howard Miller. If all goes well, it will be here in a few weeks. Cool.

I've been working on a paper for SC2001. We'll see if it gets finished in time for submission (extended abstracts are due this Friday). Ugh -- stress! The topic is Tucson -- the software framework that I detailed several days ago (several details of which have already changed :-). The name "Tucson" came from the fact that "it's on the road to Phoenix!" The name Phoenix, of course, refers to the ability of the framework to resurrect processes when they die (in a fault tolerant kind of way).

It's funny. Laugh!

Yes, you can groan now.

While modifying a figure for this paper, I just learned that xfig has a "freehand" mode in the line tool. I had no idea! It's rather amusing, actually, if you are in grid "snap to point" mode --
"freehand" is actually quite blocky. :-)

I'm reading a book by Arthur C. Clarke and Stephen Baxter that I randomly picked up in a bookstore a few weeks ago called The Light of Other Days. The premise is that technology is invented in the late 21st century that allows remote viewing of any location as well as any point in the past. The philosophical implications are staggering -- no privacy at all anymore.

I just ran across a great quote:

"This isn't the 1990's, Mary. Software development is a craft now."

One can only hope that software development gets much better than it currently is! :-)

Speaking of software sucking, it looks like Telocity's Atlanta routers are hosed again. Riddle me this: DNS works fine (which makes sense, because I'm using their DNS servers -- assumedly their internal network is fine; it's just the connection to the rest of the net that is suckin'). So I can look up IP addresses of anyone I want. I just can't traceroute to them. BUT -- I can telnet to port 80 of www.excite.com. Even though UDP and ICMP traceroute packets to www.excite.com die at the Atlanta Telocity routers.

Weird, man. Weird.

There are 129 copies of xmms running, out of 206 total processes (62%).

April 27, 2001

Soon the super karate monkey death car will park in my space.

I just learned something about my phone the hard way.

I called a local hardware store earlier this morning. Later, I called Lummy. Since I call Lummy not infrequently, I have him on speed dial, so I hit the speed dial button. After I hung up with Lummy (his voice mail, actually), I remembered something that I had forgotten to leave in my message for Lummy, so I hit "redial". It called the hardware store, not Lummy. Oops. That is, it redialed the last number that I had punched in, not the last number that I called.

I guess that pretty much makes sense, since I could just as well hit the "Lummy" button instead of "redial", and redial theoretically is better suited to remembering something that you don't necessarily have stored somewhere (i.e., the last number that you punched in). But it still surprised me.

These new-fangled telephones. I just don't get'em. But I heard that they have the internet on computers now. What will they think of next?

I saw an article about how major cell phone companies (Ericsson?) are delaying their rollout of "3G" phones (third generation) due to software glitches. I'm not surprised. I have a fairly simple Audiovox phone which is pretty handy, but it definitely has its share of what are assumedly software glitches. I've even had to "reboot" my phone at least once (take the battery out for several seconds).

It's fairly reliable, but I would imagine that the software inside is actually fairly complex, and therefore susceptible to the "software quality sucks" rule that seems to be the norm of today. :-(

Went out and plunked down a few hundred on a mower today. Woof. Also got a trimmer. On the way home, since I was out of books to read, I stopped at a local Books a Million and got two new Orson Scott Card books (I'm in the middle of the Homecoming series), a Fatboy Slim CD (ripping MP3s now...), and the Fight Club DVD.

Must spend the rest of the day on the SC paper...

There are 345 copies of xmms running, out of 425 total processes (81%).

April 29, 2001

The Secret of Management

Clean Fatboy Slim? Wha...?

I just noticed that I got the "Kiddie's Clean Version" of the Fatboy Slim CD that I just bought. What an outrage! Darn you, Tipper! I want all my golly-darn profanity and holy smokes swearing! And dang it all if the razzem frazzem songs don't not obscenely suggest that I should have my cake and eat it too, gumdangit!

...aw, bleep it.

In all honesty, I'm curious to know what the difference is. If anyone has the normal one, or has heard both versions, drop me a line and let me know what was changed. Thanks.

All in all, Halfway Between the Gutter and the Stars was somewhat disappointing. Not spazoidal enough. So on the way to the hardware store, I bought some more CDs. I got the new Poe (Haunted), another Fatboy Slim (On the Floor at the Boutique), and a compilation called Louder Than Ever -- Volume 1 (which may turn out to be a rap CD, which I didn't realize -- I was looking for random techno. My criteria for selecting the CD was that it was a compilation, I didn't recognize any of the bands, and it had words like "Da", "Club", and "remix" in the titles. The fact that one of the songs on this CD was humorously named "What U See Is What U Get" was just a bonus [what is a WUSIWUG?]. Regardless, it seems that my CD-selecting filter may need a bit of tweaking...).

The Boutique CD is pretty cool. Good and spazzy. I give it 10 minutes.

I haven't had a chance to listen to the Poe CD yet, but it's got that song Hey Pretty that they play on the radio. I have another Poe CD, and it's good slow stuff; I expect this to be similar.

Tonight was the "Secret of Management" episode of News Radio. It featured both Ft. Awesome and a ball pit. Coincidence? You decide.

Just got my orders for my Army duty this summer.

Arrgh... they're specifically not giving me a rental car! (I've had one the last two times I went down there) This could be a real drag, depending on where my hotel is. Last time, my hotel was fairly close, but the first time I went, my hotel was a good 15-20 minute drive away including a fair amount of highway driving.

This'll be my last tour down there in Atlanta; the place is closing down. Slowly (like all government entities -- when the decision is made to close an office, it takes months or years to actually do it). I'll have to find another unit after this. I've looked around a little, but nothing has come up yet. I'll have to resume my search, but as with most other things, probably not until post-dissertation...

We submitted our SC2001 paper with 26 seconds to spare. 26 seconds later and the server would have disallowed our submission.

And who said that Computer Science wasn't exciting?

It's now Sunday. Yesterday, I paid my dues and re-joined the Lawn Mowing Association of America (LMAA). It seems that over the past 11-12 years, my membership had lapsed.

I spent much money at the hardware store for various home things (hose, sprinkler, more blinds, a spreader, etc.). Woof. Tracy spent a good deal on internal furnishings, too. Now I understand why the US economy is so dependent upon the consumer.

Janna came over for dinner last night; we made steaks on the grill. It was much fun. Jim thought the wireless stuff was way cool (still waiting for my Orinoco card... it should be here sometime this week).

I put up some more windows blinds today, and we'll likely order the rest of our window stuff Real Soon Now. I might well make a "boring" category for journal entries that have to do with Home Stuff. That way, those who don't give a damn can just delete it without reading it. :-)

DSL was out most of yesterday, and about half of today. Woof. Telocity just can't seem to configure those routers in Atlanta properly. At least, that's what I'm assuming -- traceroutes stop in Atlanta for some reason. I can usually get to some sites (e.g., http://www.excite.com/), but not others (e.g., http://www.yahoo.com/ and http://www.nd.edu/). Arrgh!

There are 32 copies of xmms running, out of 109 total processes (29%).

About April 2001

This page contains all entries posted to JeffJournal in April 2001. They are listed from oldest to newest.
