LAM/MPI Archives

April 2, 2003

Towards 7.0

Lots of good stuff happening towards LAM 7.0:

* Sriram just committed a first version of the CR SSI
* Manish just committed a first version of mpiexec
* Shashwat may actually have TotalView queue debugging ready for 7.0 (but hey -- 7.1 will be just fine, too!)
* ROMIO extensions are proceeding nicely
* It looks like we actually beat LAM 6.5.x performance by just a little
* The overall code tree is getting more stable

Woo hoo!

April 11, 2003

<heavy_accent>You must die -- I alone am best!</heavy_accent>

I'm working on the gm RPI. I've found a case that repeatably hangs the state machine, so I'm diving into it to find out why. Wow -- I forgot how twisted this code is. Actually, it's not _twisted_, it's more like _complex_. It's actually laid out fairly well and broken up into a million pieces, and *THANK GOD* I left many really long comments to explain what the heck is going on in the code. Can you say: Holy special cases, Batman!

April 12, 2003

Half received unexpected messages and the women who love them, today on Oprah

By accident, I found some keystrokes in Mozilla that I've long since wanted: the ability to switch between tabs: C-PgUp and C-PgDn. I haven't found these documented anywhere. Yay!


Wow. Here's a special case in the gm RPI that I've been chasing all morning...

* A peer process sends a message to me
* I have not yet posted a receive for this message
* The envelope is received, and I notice that it doesn't match any of the posted receives. So it's marked as unexpected, and placed in the proper area.
* The message is "short", meaning that there's still another message coming -- right behind this one -- that holds the actual content of the message
* But the underlying message transport (gm) indicates that there's no message here yet, so we return into the main MPI progression, and back up to the user program
* The user program then posts the matching receive
* Down in the RPI, the receive is matched with the half-received unexpected message.
* I set things up such that the next received gm message from this process will go directly into the user buffer
* ...but the request was still marked "ACTIVE", so it followed the normal (expected... as opposed to "unexpected") message progression, which meant posting it to the "pending receive" queues, even though it was already halfway finished.

So I had previously done about half the Right Thing -- I recognized that the short body was still pending, and tried to set up to receive into it. But I left a little bookkeeping undone, so Badness occurred sometime (randomly) later.

Wow. As I said last night, "Holy special cases, Batman!"
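For the curious, the sequence above can be sketched as a toy model. This is emphatically not the gm RPI code (that is C, and far hairier); every name here (State, Request, Unexpected, Progress and its methods) is made up for illustration. It only shows the one piece of bookkeeping that matters: when a newly posted receive matches a half-received unexpected message, the request has to leave the "ACTIVE" state and stay off the pending receive queue.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Dict, List, Optional


class State(Enum):
    ACTIVE = auto()        # posted, no envelope matched yet
    BODY_PENDING = auto()  # envelope matched, body still in flight
    DONE = auto()


@dataclass
class Request:
    source: int
    tag: int
    state: State = State.ACTIVE
    buffer: Optional[bytes] = None


@dataclass
class Unexpected:
    source: int
    tag: int
    body: Optional[bytes] = None  # None means "half received"


@dataclass
class Progress:
    pending: List[Request] = field(default_factory=list)
    unexpected: List[Unexpected] = field(default_factory=list)
    direct: Dict[int, Request] = field(default_factory=dict)  # source -> request

    def on_envelope(self, source: int, tag: int) -> None:
        """An envelope arrived; match a posted receive or mark it unexpected."""
        for req in self.pending:
            if (req.source, req.tag) == (source, tag):
                self.pending.remove(req)
                req.state = State.BODY_PENDING
                self.direct[source] = req
                return
        self.unexpected.append(Unexpected(source, tag))

    def on_body(self, source: int, data: bytes) -> None:
        """The body of a short message arrived from `source`."""
        req = self.direct.pop(source, None)
        if req is not None:
            req.buffer, req.state = data, State.DONE
            return
        for msg in self.unexpected:
            if msg.source == source and msg.body is None:
                msg.body = data
                return

    def post_receive(self, req: Request) -> None:
        """The user posted a receive; check the unexpected area first."""
        for msg in self.unexpected:
            if (msg.source, msg.tag) == (req.source, req.tag):
                self.unexpected.remove(msg)
                if msg.body is None:
                    # Half received: route the body straight to the user buffer,
                    # and (the missed bookkeeping) take the request out of ACTIVE
                    # so it never lands on the pending queue.
                    self.direct[req.source] = req
                    req.state = State.BODY_PENDING
                else:
                    req.buffer, req.state = msg.body, State.DONE
                return
        self.pending.append(req)
```

Walking through the scenario: call on_envelope(1, 7), then post_receive(Request(1, 7)), then on_body(1, b"payload"). The request goes ACTIVE, BODY_PENDING, DONE without ever touching the pending queue.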

April 19, 2003

We're hemorrhaging space dollars

Well, I borked LAM 7.0b1 already. I left some uninitialized variables down in the smp coll SSI. Doh! Fortunately, there was at least one other Big error that borked b1 as well. So it wasn't *completely* my fault. ;-) Actually, I'm not worried at all. Finding bugs is what betas are for.

So here's something that surprised me today: I bought 2 hard drives today, one for Tracy to store her MP3s on (since she got a portable MP3 player for her birthday), and another for a linux machine. Guess which was easier to install? And I mean *significantly* easier! That's right -- linux. By far. The Windoze XP box involved multiple reboots, failed detections, unclear directions, and about 2 hours. The linux one involved running a single program (the Mandrake system configurator thingy), which noticed that I had a new disk and ran its GUI disk partitioner thing for me. It even moved all the relevant files over to the new partition for me (I mounted it on /home, but already had a few files under /home). Elapsed time: about 10 minutes.

Our Tivo updated itself to version 4.0 today. Woo hoo! It has a feature that I've long been waiting for -- grouping of recorded shows. So my wife's 6,000,000 _Mad About You_ episodes are all listed under a single folder. Even cooler than that: if you go in that folder, the shows are listed by their episode names, not just _Mad About You_ -- cool! I can also now plug my Tivo into my home LAN and have it update over the net rather than via phone. But I don't have any strong incentive to do that -- I'm not considering the Home Media Option (at least not yet). It's the little things in life. :-)

April 20, 2003

This LAM is your LAM, this LAM is my LAM...

LAM 7.0b2 failed a bunch of stuff right away, but I'm still not concerned -- they're mostly a bunch of build system issues that we have not tested in quite a while. This is what betas are for. Fixes are being applied quickly, and b3 will be out soon. The mini-llamas are doing a good job of testing; it's great to have so many people working on this. I anticipate having better test coverage than we've ever had before! Woo hoo!

April 27, 2003

Drunk ducks

Wow. I just noticed that LAM's configure script is over 1MB long. 'nuff said.

June 2, 2003

A rolling moss has no stones

The LAM release is taking a bit longer than expected. With some intense testing over the last week, we've found a bunch of little buglets and at least one big bug (and it bites). Still, I'm confident that we'll get it out the door "soon". The sooner, the better, actually, 'cause there's a whole bunch of stuff that is being held up waiting for CVS to become generally open again. I think the big bug has been solved, but I won't know until I can get on some machines that currently appear to be having problems (the support folks likely won't be in until tomorrow). Doh! But we're guests on the machines in question, so we really can't complain. There are other side issues that are making it take a little longer -- lawyer issues, press release issues, re-vamping the LAM/MPI web site, etc., etc. But the software comes first -- gotta have that done before the release can go forward. :-)

June 27, 2003

Funny user quotes

From a post on the LAM list:
"Suppose the prophecy IS true. Suppose tommorow the war'll be over. And this MPI program runs error free. Isnt that worth fighting for? Isnt it worth DYING for?"

Funny user quotes

From a post on the OSCAR user's list:
...After that I can do the rlogin with out providing a password, but when I run a progran it failed. The progran make a: mpirun hostfile.host /localhome/marc2001/bin/marc and I get a Connection Confused.

Funny user quotes

In response to my last entry, DK has provided the relevant man page entry:
94 ECONNCONFUSED Connection confused. No connection could be made, or maybe it was. We're not certain. The situation may or may not have resulted from the client application attempting to eat Jello(tm) brand gelatin desert while utilizing a pogo-stick.

July 1, 2003

LAM/MPI 7.0 has escaped!

The LAM Team from the Open Systems Lab at Indiana University is pleased to release LAM/MPI version 7.0. Representing over two years of development, version 7.0 includes significant new features for MPI applications programmers, parallel programming/MPI researchers, and system administrators. An abbreviated listing of these features includes:

* When used with the Berkeley Lab Checkpoint/Restart (BLCR) single-node checkpointer, parallel MPI jobs can be involuntarily checkpointed and restarted.
* Low-latency, high-bandwidth message passing on Myrinet networks.
* Extensive run-time tuning and underlying network selection (vs. compile-time selection and tuning).
* SMP-aware collectives. Based on the MagPIe algorithms, several of LAM's MPI collective functions have been optimized for use in networks of SMPs.
* Integration with PBS, BProc, and Globus. LAM now uses the native job starting mechanisms of each of these environments to launch jobs in parallel.
* A comprehensive User's Guide, providing a user-centric description of LAM/MPI, how to use all of its features, and how to tune MPI applications at run-time.
* An extensible component architecture such that MPI researchers can write small, self-contained components that "plug-in" to LAM/MPI.
* Support for the TotalView parallel debugger.

More information and documentation is available at the newly-redesigned LAM/MPI web site: http://www.lam-mpi.org/

Make today a LAM/MPI day!

September 26, 2003

Academic stagnation

I was greatly saddened yesterday. We are working on a research project that is similar in some ways to a well-publicized project from a different university. One of my students recently contacted the researchers of that other project, asking to see their code so that we could learn from it. The head researcher replied saying, "please have your professor contact us, and detail how you will be using our code." So I replied telling them that we explicitly would not be copying their code (although I didn't mention it, the reason we don't do that is because we have a traceable copyright history and are therefore extremely cautious about what code we accept into our tree) -- we only wanted to see the underlying vendor-provided APIs in action, as examples of their use that we could learn from (the vendor has told us that documentation is "lacking", at best). The head researcher did not reply to me until I pinged him again (almost a week later), essentially saying "no, we're not going to give our code to you." He mainly cited copyright concerns (apparently they have been burned before).

This saddens me on a fundamental level. We both work for universities (pretty big ones). The core values of a university are information sharing and distribution of knowledge. Yet we were explicitly denied this exact kind of information sharing -- sharing that would have fundamentally contributed to the general state of knowledge and advancement of research technologies. How exactly can this be reconciled -- when I explicitly stated that we would not be copying any of their code? Making it more general -- their code would have been a teaching tool for us. Indeed -- if the lack of sharing of research could somehow be justified, how can a university justify itself in not teaching a fellow academic? This is incomprehensible to me.

We'll proceed without their code. We'll write our own code, and it will be entirely unrelated to theirs. It'll be darn good code, too. I'm just fundamentally saddened that a group of fellow researchers snubbed us so directly, seemingly flying in the face of the spirit of open collaboration. To use a phrase that Rich loves: that's intellectually bankrupt. (yes, if you noticed, this was somewhat vague in exactly what the project was, who I contacted/was denied by, etc. That's intentional to protect the guilty)

I ordered a cradle for my Clie today. The cable that came with it works fine, but a cradle is just so much more convenient. I had to order it directly from Sony, and they're back-ordered. [sigh]

WOPR finally resolves in DNS! We had some problems with that -- we chose a new domain name to be the "admin" domain for the machine and registered it well over a week ago. Unfortunately, we all forgot that .org is the only TLD that still enforces a two-DNS-server rule (we only had one listed), so it refused to resolve anything. But that's all fixed now, and e-mail to and from that domain finally works. This allows us to keep moving forward to get WOPR ready for production...

Yahoo! has been announcing for the past several weeks that they were going to break compatibility on 25 Sep in order to fix some security problems with older clients. And as advertised, yesterday they did. gaim released a new version in the morning that supported the newest Yahoo! protocol. And it worked just fine. For a while. Last night, I got kicked off Yahoo! when I had a connectivity blip and found that I could not get back on Y!'s IM service (the exact error message from Y! varies). Doing a little web surfing, apparently everyone is having this problem (including other 3rd party IM clients: Trillian, Fire, etc.). Looks like Yahoo! did something a little more substantial after-the-fact. We'll have to see if gaim can continue to interoperate. :-\

November 12, 2003

In a world of compromises, I'll give you 17%

This just seems wrong:

bq. LAM_CONFIGURE_HOST="`hostname | head -n 1`"

March 15, 2004

"dash oh": bite me

So I wasted a valuable afternoon today because of ickiness in LAM/MPI 6.5.9.

I have the AVIDD-B cluster all to myself to run dissertation performance results today. Mainly, I’m comparing LAM 6.5.x performance to LAM 7.x performance. For what I’m doing the two results should be pretty much the same -- the whole point of my dissertation is that I added a bunch of great abstractions into LAM but without any performance penalty. I had it about 2 weeks ago, too, for the same reason. But a bunch of my numbers got borked 2 weeks ago -- namely the 6.5.9 numbers were way worse than they were supposed to be. Specifically: 7.x performed way better than 6.5.9 on gigabit ethernet.

I thought that it was just a simple missing [memcpy] optimization that we debuted in 7.x. So I added that optimization in my copy of 6.5.9 today and re-ran the results. Same crappy performance.

Wha…?

So I removed the optimization from my copy of 7.x, and the same great performance was there. i.e., there was still a huge performance difference between 6.5.9 and 7.x. I spent several hours trying to figure out what the heck the difference was. I even roped Brian into it — we couldn’t remember what optimization we had added that gave such a huge performance increase in 7.0.

While I was going through Changelogs, it hit me. The stupid “[-O]” option to 6.5.x’s [mpirun] -- if you don’t explicitly tell LAM 6.5.9 to not do it, it’ll always put data on the network in big endian (“network”) order. This really sucks on Linux boxen, obviously. So you specify “[-O]” to [mpirun] and tell it not to do that. In 7.x, we handle this automatically. Specifically, I had totally forgotten about this option, and none of my 6.5.9 results were run with “[-O]”. Hence, all those results were showing the effects of 2x byte swapping.

ARRRGGGGHHHH!!!!

And I knew about this option. It’s bitten me before. And I’ve scolded users to use it. I wasted several valuable hours on the cluster figuring this out (and it’s why my results weren’t right two weeks ago). Grrr…
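To make the "2x byte swapping" concrete, here is a tiny sketch of the two wire formats. It is not LAM code and says nothing about how LAM implements this internally; the four function names are made up, and it just uses Python's struct module to contrast packing integers in big endian ("network") order with packing them in the host's native order. On a little endian box, the first path reorders bytes once on the send side and once again on the receive side; the second path never reorders anything.

```python
import struct
import sys


def send_network_order(values):
    """Pack 32-bit ints in big endian ("network") order."""
    return struct.pack(f">{len(values)}i", *values)


def recv_network_order(payload):
    """Unpack 32-bit ints that arrived in big endian order."""
    count = len(payload) // 4
    return list(struct.unpack(f">{count}i", payload))


def send_native(values):
    """Pack 32-bit ints in the host's own byte order (no swapping)."""
    return struct.pack(f"={len(values)}i", *values)


def recv_native(payload):
    """Unpack 32-bit ints that arrived in the host's own byte order."""
    count = len(payload) // 4
    return list(struct.unpack(f"={count}i", payload))


if __name__ == "__main__":
    data = [1, 2, 3]
    wire_net = send_network_order(data)
    wire_native = send_native(data)
    print("host byte order:", sys.byteorder)
    print("network order bytes:", wire_net.hex())
    print("native order bytes: ", wire_native.hex())
    # Both paths deliver the same values; only the work done differs.
    assert recv_network_order(wire_net) == recv_native(wire_native) == data
```

Between two machines of the same endianness, both paths deliver identical data, which is exactly why the extra swapping is pure overhead on a homogeneous Linux cluster.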

March 25, 2004

Props! Huge props!

Let the record show that I owe Brian 3 beers of his choice (size/quantity and brand unspecified) for, at a moment’s notice, dropping everything and figuring out how to use gcov (in parallel, no less) and calculate how much coverage we have in the LAM test suite.

March 12, 2006

LAM/MPI v7.1.2 released

We [finally] managed to get LAM/MPI v7.1.2 out the door. It includes about 1.5 years worth of bug fixes and updates. The reason it took so long is that we have been spending 99% of our time on Open MPI.

The release announcement is here.

Since I’m heading to Cisco (starting tomorrow), this is my last official action in LAM/MPI. I’ve been working with LAM/MPI for about 8 years: my first commit was 23 Oct, 1998, my last commit was minutes ago, moving my name from the “current” section to the “previous” section in the AUTHORS file:

-------------------------------------------------------------------
Author: jsquyres
Date: 2006-03-12 11:18:28 -0500 (Sun, 12 Mar 2006)
New Revision: 10311

Modified:
   trunk/AUTHORS
Log:
It is now time to say goodbye.


Modified: trunk/AUTHORS
===================================================================
--- trunk/AUTHORS	2006-03-12 16:18:16 UTC (rev 10310)
+++ trunk/AUTHORS	2006-03-12 16:18:28 UTC (rev 10311)
@@ -11,7 +11,6 @@
  Indiana University:
   - Brian Barrett (brbarret)
   - Andrew Lumsdaine (lums)
-  - Jeff Squyres (jsquyres)
   - With special thanks to Josh Hursey and Andrew Friedley.


@@ -19,6 +18,7 @@
 ----------------

  Indiana University:
+  - Jeff Squyres (jsquyres)
   - Anju Kambadur (pkambadu)
   - Vishal Sahay (vsahay)
   - Nihar Sanghvi (nsanghvi)
-------------------------------------------------------------------

So long, and thanks for all the fish…

October 1, 2006

Looks like I picked the wrong week to quit smoking

Too funny. I omitted a few e-mail addresses for privacy reasons, even the return address of the person who wrote this, even though they probably don’t deserve the protection:


From: <omitted>@domainsbyproxy.com
Date: October 1, 2006 2:43:35 PM EDT
To: <omitted>
Subject: FWD: FWD: FWD: FWD: Offer on your domain, lam-mpi.org. Please respond
    [lam-mpi.org@domainsbyproxy.com] [lam-mpi.org@domainsbyproxy.com]
    [lam-mpi.org@domainsbyproxy.com] [lam-mpi.org@domainsbyproxy.com]
Reply-To: <omitted>
Hello, 
I am interested in purchasing your domain name, lam-mpi.org. I am only 
interested in the domain and not any web site which may be associated with it. 
My offer for this domain is $366.00. This is a genuine offer and it may be 
negotiable depending on the domain's traffic/visitors. 
I kindly request that you to respond to this e-mail so that I know you have 
received it. 
Thank You,
D.R.T.

February 14, 2007

LAM 7.1.3 escaped!

Woo hoo! Congrats to the Indiana University team for putting out LAM/MPI v7.1.3 (this was the project that I was the head of while I was at IU). Here’s the release announcement:

It is likely that this is the last release of LAM/MPI ever.

About LAM/MPI

This page contains an archive of all entries posted to JeffJournal in the LAM/MPI category. They are listed from oldest to newest.
