« I love Kung Fu movies... | Main | But it my leadership that got you in that dress »

There's no private property the LSC!

Many days, no journal entry. The usually nemesis is at fault: traveling.

I've been up here at ND for the latter half of this week. Mainly for SC2000 coordination (the freebie mouse pads arrived way early. Yay!) and other miscellaneous tasks. I also made my famous "hockey puck" chocolate chip cookies this week for the efforts of the Engineering Graphics department (ok, mainly because Joanne from EG said that I owed them cookies for their efforts). For the uninitiated, it is widely known that I make the World's Best Chocolate Chip cookies. They're roughly the size and shape of hockey pucks (hence, the name); none of these twice-the-diameter-of-a-nickle and paper thin kinds of chocolate chip cookies for me. Hell no. Soft and chewy in the middle with a 1lb pound bag of chocolate chips in the mix just "so that there should be enough". One of these cookies can serve as a meal. A double batch made 12 cookies this week.

Anyway, that all went well, and we finished up our virtual posters for SC2000. I had to use some evil powerpoint animations in them, but they'll be ok. We still need a result graph from LAM/myrinet (more on this below) for the slides, but everything else is finished.

Sidenote: Myrinet is a proprietary network that runs at gigabit speeds. i.e., orders of magnitude faster than 100Mbps ethernet. You can run TCP/IP over Myrinet -- they provide a driver for it -- but it's at a significant cost in performance over "native" communication over the Myrinet hardware. "Native" communication is provided though a library called "gm". Hence, we're adding a "native gm driver" to LAM to utilize this ultra-fast communications over Myrinet in LAM directly, rather than relying on TCP/IP over Myrinet. This is what Arun has been working on since the beginning of the summer. We want to have [at least] a beta of this stuff working to show off at SC2000.

Arun and I tried to make a result graph for LAM/gm -- just a basic one showing "TCP over Myrinet is good, but gm over myrinet is better!", but unexpectedly got bad seg faults and couldn't produce anything. This generated the rest of my Friday evening, and most of Saturday morning.

Before all we could launch into extensive debugging, though, we had to otherwise finish up the slides. Got some good slides for LAM/gm (Arun), XMPI (Brian), and IMPI (me). After everyone else had left, Lummy wandered in (while I was still working on the slides; perhaps 6:30pm or so). Had a long chat with him about the future of LAM and whatnot. It was especially interesting with the prospects of MPICH's going through an entire re-write (with the focus on their ADI-3 work now -- already a 70+ page document!). MPICH 1.2.1 is probably pretty near the end of the line for that code base; MPICH 2.0 will probably have some elements stolen from MPICH 1.2.x, but will likely be mostly from scratch. This is really cool stuff, actually.

I spent the rest of the night upgrading the version of GM that we had. We reported what appeared to be a bug in gm to Bob -- one of the authors (a very helpful guy, actually), and he said, "you're using a really old version of gm -- you should upgrade and see if the problem just goes away)". Ugh. How embarrassing! Turns out we were using gm-1.1.3, and the latest is gm-1.2.3. Oops.

myri.com is apparently connected to the world through a 300 baud modem; it took about an hour to download the 1.2.3 tarball (only a few megs). It took a few tries to get it installed properly -- we have really old Myrinet hardware (probably a few generations behind current stuff). Myrinet utilizes a kernel module in Solaris, so you have to take some care to build and install it properly. And compiling on the Solaris 2.5.1 140Mhz machines is just painfully slow. Ugh.

So I finally got everything up and running around 11pm or midnight. I ran some test programs, and finally decided that everything was working properly. Then I ran a simple test program through bcheck. Badness. Lots of "read from uninitialized" errors from within libgm itself. Crap!!

After a lot of source diving in libgm, I determined that the problem was a buffer that was supposedly being initialized by an ioctl() call into the gm kernel module. The upper libgm was providing the buffer and expecting the lower kernel module to fill it in. It took a lot of hacking around and source tracing in workshop to absolutely verify that the lower kernel module was, indeed, filling that buffer properly, but it remained a mystery to me as to why bcheck would think that the buffer was uninitialized. Worse yet, sometimes bcheck reported that everything was fine -- no read-from-uninitialized. Hmm.

Hesisenbugs suck.

It didn't occur to me until Saturday morning that bcheck couldn't possibly know that the buffer was filled -- bcheck only monitors the process under debug; it doesn't monitor the kernel module at all. So it makes perfect since that while the buffer is initialized by the kernel module, bcheck simply has no knowledge of it, hence, it reports it as uninitialized when upper libgm reads from it. Although this doesn't explain why bcheck sometimes reported that all was well, I'm 99% sure that this is what is happening. Bob later confirmed my suspicions, too.

Hence, I [effectively] added a memset() to the upper libgm code, and bcheck finally only reported Truly Bad Things --
similar to what we had to do in LAM when we know that uninitialized buffers are ok ("when you optimize code, all coding guidelines and rules are out the window, and painfully splattered on the ground below").

I then set about trying to debug a simpler example than NetPIPE (which a de facto MPI latency/bandwidth benchmark program) -- the program that we were trying to use to get some result graphs for LAM/gm. I made a simple "hello world" ping pong MPI program, and tried to debug that. Arun came in around 11am or so, and we set about stepping through the internals of the gm progression engine inside LAM. Not for the meek.

It's good that Arun came, 'cause he wrote the stuff, and I wasn't completely familiar with it (indeed, I had only seen the internals once before -- when we had a code review about a month or so ago). So his explanations and rationale were quite helpful. We finally tracked down a repeatable kind of error in the simple ping-pong program, but then had to leave for the football game.


ND vs. Air Force. Wow. A real nail-biter, there. I can't believe that we won. It's horrible to say, but our offense really did not look good at all during the game. We had one decent drive, and it was full of 3rd and longs. The rest of our points were off lucky Big plays and the like. :-( Granted, I was in the stadium and didn't have the benefit of instant replay and the like, but it didn't look pretty from the student section.

Our defense was kinda shaky, too. We had some great stops a few times -- held them to 3 points at least twice, for example, a blocked field goal (which put us in overtime -- and later gave us the game), etc. But they were able to throw all over us all day. Our pass defense was just not good.

But in overtime (!) we managed to win the game. Air Force went first and we held them to 3 points. We then came back and got a touchdown, putting the final score at 34-31. Amazing. It's our first overtime victory -- we were previously 0 for 3 in OT.

Some other random points about the game:

  • Great flyby from 3 F-16s (or F-18s...?) during halftime. Well timed, and it was lead by some 1LT who graduated from ND in '97.

  • There was some woman behind us who was clearly visiting some friends here at ND. Whenever she opened her mouth, stupid came out. Some memorable quotes:

    • (during the band's halftime tribute to the military, where there were various military people on the field with the band, the American flag and the flags of all four services were flying on the field) "Is this some kind of Halloween thing?"

    • "I just love that Leprechaun guy! I just wanna scoop him up and hug him!"

    • "So they're not really downs, are they? They're attempts at downs, right? So why does everyone call them downs?"

  • Saw Tony and some other JeffJournal fans after the game. Felt kinda silly, because I didn't recognize Tony right away (it's the beard! I swear it!) -- duh. But then later, I realized that I really hadn't seen Tony since last spring, and I felt [somewhat] better. :-)

  • Tracy and I went to see Pay it Forward afterwards. Not a bad flick. Not quite as complicated and intricate as I had hoped, but still not bad. so I think I give it an official vote of "sympathy".


This morning... back in the lab, and I think I've narrowed the problem down in LAM/gm to an unexpected receive. A pointer is not getting reset properly in the gm progression engine, and when an unexpected receive (definition below) comes in, a linked list is attempted to be searched for a request that is no longer valid (and has actually already been freed). Hence, sometimes it works, and sometimes it doesn't.

I suspect that this is just an error from the "translation" of the TCP engine to gm. i.e., we literally copied the TCP progression engine and gm-ized it; I suspect that this bug is just an error in the gm-iziation process. Hopefully, this will be the last Big Bug...

An "unexpected message" is one of the Big Concepts for MPI implementors. It is possible that a user does a send from one rank before doing a receive on the target rank. Hence, the message may actually arrive at the target before the necessary bookkeeping has occurred to setup to receive that message. Hence, when the target gets such a message, it files the message in the "unexpected" queue. When the matching receive is finally posted, it first checks the unexpected queue to see if the message has actually already arrived before going to actually check the message passing hardware for the message. There's a lot more to it than that, of course, but that's the gist of it.

Hence, when LAM/gm receives an unexpected message, it's checking the list of outstanding receives improperly.

I'm off to go squash this fucker, then a visit to Chez Lummy, then back to Looieville. Rock on!

Post a comment

(If you haven't left a comment here before, you may need to be approved by the site owner before your comment will appear. Until then, it won't appear on the entry. Thanks for waiting.)

About

This page contains a single entry from the blog posted on October 29, 2000 12:40 PM.

The previous post in this blog was I love Kung Fu movies....

The next post in this blog is But it my leadership that got you in that dress.

Many more can be found on the main index page or by looking through the archives.

Powered by
Movable Type 3.34