Main

Technical Archives

July 14, 2000

Zoom

Spent the remainder of the day hacking minime socket classes (partially as a result of jjc -- because it wants to use no authentication and no encryption to connect up to the web server --
but it was due to happen anyway; better sooner than later). Had to make the authentication and encryption be "mo' separate" than they originally were. This actually led to the sockets being able to accept multiple kinds of encryption and/or authentication in an IMPI kind of way (amazing how this Ph.D. thing really ties in all work that you have previously done...). Also inspired by ssh. It goes like this...

Sidenote: This is lengthy not just because I'm telling people what I'm doing, it's lengthy so that I myself have a record of what I'm doing. Will be quite helpful when I actually go to write that dissertation! Also, I'm about to effectively go on vacation for about 1.5 weeks, and my short term... what was I talking about?

Assumptions:

  • Authentication is a different step than encryption. First you authenticate, then you setup encryption. A successful connection must pass both steps on both sides.

  • Connector and acceptor potentially have different lists of authentication/encryption methodologies. Each entity has a list of auth/enc methods that it will allow -- these lists are in priority order. i.e., "strongest" methods are listed first.

  • The acceptor governs the authentication/encryption choice (i.e., the "server"). As a side effect of this, if the acceptor has no authentication methods defined (for example), the authentication passes (even if the connector has some authentication methods defined).

  • Authentication/encryption methods are indexed by their string names (makes debugging easier, for one thing).

  • Different authentication methods can allow for different levels of security. For example, a "shutdown" authentication is assumedly only given to root -- so that root is the only user who can shut down a minime daemon.

The overall process is just about the same for authentication as it is for encryption, so I'll just describe the authentication.

  1. Connector makes a socket connection to the acceptor.

  2. Acceptor spawns a thread to handle this connection and goes back to sitting on accept(). The new thread (after some internal accounting -- a phrase that will never die) sends an integer count of the number of authenication methods that it has, followed by a '\n' delimited string of the names of authentication methods that it has available (if there are more than 0 methods available).

    Connector receives this count and [potentially] the list of possible names.

  3. If the count is zero, both sides rule that the authentication was successful. If (count > 0), the connector goes through the acceptor's list (in order) and finds if it has any of the same names in its list. If it does, it sends the integer index of the match back to the acceptor. If it does not, it sends back -1 and then hangs up, ruling that the authentication failed because common authentication method could not be found.

  4. The acceptor receives the integer. If it is -1, it hangs up and the thread handling that connection dies (similarly, if the connector hangs up without sending the -1, the acceptor can deduce that there was no match found). Otherwise, both the acceptor and connector call the accept() and connect() methods (respectively) on the selected Authentication object (sidenote: it only makes sense to have two entry points to the Authentication -- it's much easier to have a 2-way protocol where each side knows who they are so that you can have defined challengers and responders, etc.).

  5. The contents of the Authentication::accept() and Authentication::connect() are obviously protocol-dependent (since they now own the socket, they can do whatever the heck they want). Currently, these routines simply return a bool (remember, gentle reader, they were previously setup with whatever initialization/key/etc. information when they were initialized and attached to the Socket instance) indicating success or failure. The only condition that must be enforced is that accept() and connect() both return true or both return false. It would be a Bad Thing if they returned different answers, because then one party will think that it has passed authentication successfully while the other will hang up (upon which case the other party will detect this and also hang up, but it's Not the Right Thing to Do).

Once the authencitation process passes, we essentially repeat the procedure for setting up encryption. So there's multiple potential points of failure: agreeing on authentication, performing the authentication, agreeing on encryption, performing the encryption setup. If any of these steps fail, the whole connection fails and both sides hang up (assumedly with some nice error message saying what went wrong).

So this is the scheme. For all that text above, it's really not that complicated. Now I gotta go implement it. Code to write.

Woo hoo!(tm)

August 12, 2000

Louisville->SBN thoughts

Had some interesting thoughts about minime yesterday while driving up from Louisville:

  • Currently, there are two kinds of endpoints for conduits: unix sockets and TCP sockets. How about creating a third kind -- an internal endpoint. This would allow modules within a single minime to be able to talk to each other across the already-existing conduit abstraction.

  • The minime boot service should be a module. I think that this has been stated before, but the ramifications of this really only occurred to me yesterday. The core minimed will be really dumb, meaning that it only knows how to:

    • launch itself

    • listen for messages (on three different fronts)

    • make connections (on any of the three fronts), which will include authentication and/or encryption

    • receive messages on these connections and give them to the module that will handle them, or handle them itself (e.g., "shutdown" messages)

    • shut itself down in an orderly fashion

  • Astute readers will realize that the previous bullet precludes routing messages across multiple minimed's -- messages always go from one minimed to another, and no further. No problem -- have a "router" module in the minimed that handles this kind of message passing (for non-fully-connected networks, for example) to get from point A to point B -- a la nsend in LAM (although implemented completely differently).

  • Hence, we're looking at a small minimed with probably the following core set of modules:

    • network minime boot: ability to launch minimed's on other hosts and build the minime mesh. Currently only planned to execute once -- not add/delete from the mesh.

    • init.d minime boot: ability to launch minimed at boot up time and either look around to see who else is up and join a mesh (this will be hard) or have a pre-defined mesh (a la a config file, which is much easier) that indicates what other nodes should exist, and just join that mesg. This may be repeated multiple times to add/delete nodes as nodes go up and down. Although I'm defining this to be in the core, I'll likely add this part last.

    • message router: transparently move user-level messages from point A to point B through some route in the minimed mesh. This will be a higher level of messages that conduits/self-executing messages. High-priority messages should use a "c2c"-like routing (which may still involve multiple hops -- i.e., from gateway to gateway, such as A->gateway1->gateway2->B -- but not have to go all the way through the mesh), where low priority messages can go through the minime mesh (i.e., potentially more hops).

    • group operations: perform an operation (or set of operations?) on some group of nodes in the minime mesh, and collate the results in a scalable fashion (most likely using some kind of tree over the minime mesh).

    • uptime: report uptime and some other useful stats (first real useful tool and proof of concept). Will probably use the group operations module to get the uptime for all nodes in a cluster, for example.

    • fork/exec: run arbitrary user programs.

    Obvious security questions have to be addressed, such as only allowing specified authentications to do certain actions such as shut down, run arbitrary programs, etc., etc.

August 25, 2000

Surreality or Blueberries?

Got back out here to LBL, possibly for the last time. Lummy and I are out here for some design meetings about the BLD (Berkeley Lab Distribution). Lummy and I had an uneventful trip out here yesterday, and spent a good portion of the flight chatting about Things, including SC2000, t-shirts for the lab, various future directions of the LSC and some of its members, etc., etc. Good stuff.

We had some good arguments for a few hours about BLD today (Bill Saphir, Eric Roman, Paul Hargrove, and myself). We will continue arguing on Tuesday.

Sadly, though, Bill will be leaving LBL (going to a startup). Best of luck to him -- all hail Bill!

I saw Nathan's final presentation today on what he did with cfengine this summer. We heckled him a bit and asked him a few questions that he wasn't prepared for, but all in all, it was a good presentation -- I learned some things.

Contributed a few minor fixes to the vorbis-dev list today (and had a good typo in one of them -- nice).

Found a paper that was mentioned on HPC Wire today about ye old "6 degrees of separation" issues. These kinds of things are actually closely related to network topology and are highly relevant to my dissertation. It's good stuff. It's interesting to me that these kinds of studies were first started with a psychologist -- Millgram (I'm gonna get this wrong, but it should be somewhat close: the guy who did the pain response tests that had subjects pushing a button that supposedly shocked a "patient" [although it really didn't] and cause the "patient" enormous pain -- they tested how hard they could push a subject into delivering pain to the patient. Only 1 of Millgram's subjects refused to push the button. Peter Gabriel even has a song about this; Millgram's 37.). Millgram told several of his friends/colleagues to deliver a letter to some random person in the country -- someone that he was sure that they did not know. They could only send the letter to someone that they knew on a first name basis and ask them to pass it on in a similar manner in order to get it to its final recipient.

After many such trials, it turns out that the average number of what we now call "hops" was between 5 and 6. More research in this area has led to the now popular "6 degrees of separation" perception, and "the 6 degrees of Kevin Bacon" -- which, if you think about it, is intuitively obvious. Consider: if, by some handwaving, we can assume that most humans on the planet are connected by 6 degrees, then linking any actor to Kevin Bacon -- a vastly smaller domain than all humans on the planet -- seems obvious to be true.

Even though I suck at math, these kinds of things interest me. I read half of a book about this. I can't remember the name of it, but it was also written at Cornell, and the paper that I read today referred to their work several times. Good stuff.

Lummy and I will be heading back to our skanky hotel soon. Outta here.

August 31, 2000

It all started with parallel bladeenc...

Great quote today from HPC Wire:

"The truth is, great software comes from great programmers, not from a large number of people slaving away. Open source can be useful. It can speed things up. But it's not new, and it's not holy water."

-- William N. Joy,
chief scientist,
Sun Microsystems

There's really two separate thoughts in that quote, but they're both true. More people need to recognize this.


Spent the majority of the last 24 hours writing my talk for tomorrow's LSC lunch. The reason that I've spent 24+ hours on this as opposed to 2-3 is because it has turned out to be much more interesting than I originally thought. Lummy has also found it to be very interesting; it is even possible that this will end up in my dissertation.

So tomorrow's topic actually started many months ago. I was visiting here in Berkeley back in January of this year, and only had a few MP3s loaded on my laptop from my CDs. I was out here for quite a length of time (3 weeks or so), and I was getting sick of the same old MP3s every day. So I started looking at Lummy's CDs (he wanted to rip them, too). So I volunteered to help. :-)

The problem was that it was just too damn slow. The encoding took forever. So after a few nights of hacking, I got a preliminary version of the parallel bladeenc MP3 encoder working (see the "technical details" page on that site for all the nitty gritty of how parallel bladeenc works). This did vastly improve the MP3 encoding process; we could listen to songs faster this way (and before you ask, I only kept MP3s [that were generated from Lummy's CDs] that I already own the CD of). However, I never did get the parallel encoder Right -- it's just about right, but not quite right. Without going into Big Detail, suffice it to say that there's some MP3 framing issues that Jeremy Faller and I tried to figure out and gave up in light of the fact that we had no MP3 documentation.

This brings up an important point -- even though the parallelization of bladeenc was at the very top level, I had to dive down very deep into its tangled code in order to parallelize it. Indeed, there is an odd bit reservoir that has to be drained in order to make the whole scheme work. It took a long time to figure out, and like I said, there's still some MP3 framing issues that are unresolved. To end this already-lengthy explanation: you can't diff the results of parallel bladeenc with that of serial bladeenc. Bonk.

(Some of the following may have already been in a previous journal entry -- I don't remember -- so you'll cope if you've already read it)

So I basically let that go, and stumbled across the vorbis project (I think that I saw it on slashdot or something). Vorbis intrigued me for the following reasons:

  • Vorbis is a totally free encoder -- it does not have the same legal issues surrounding it as MP3 encoders do.

  • The vorbis algorithm supposedly has better sound quality than the MP3 algorithm. I won't argue this either way -- I don't know much about these kinds of things... math, ick -- but it does sound good. :-)

  • The vorbis stuff is all in a library --
    libvorbis.a. So writing an encoder is trivial. Indeed, they have a sample encoder in their distribution. So no diving into the source code to figure out how it works.

  • The vorbis library stuff specifically discretizes the steps in the encoding process; all the steps are fast, except one (hmm...).

  • There is an active vorbis development community. I got very little feedback from the bladeenc community. :-(

So with all this in mind, I posted to the vorbis development list asking "has anyone thought about parallelism?". After some informative discussions with the main developers, I started thinking about it (although not actively coding). After coming up with a suitable architecture, I spent a few hours one night and coded up a multi-threaded vorbis encoder. It seemed to work like a champ... but the output was not diffable with the output from the serial encoder. :-(

I queried the vorbis development list again, and found out that the library is not thread safe -- yet. There's apparently still one issue that makes it not thread safe, something that the developers eventually intend to fix. Oh well.

So then I started thinking about an MPI architecture for such a beast. Conceivably, it would be very similar to the architecture of parallel bladeenc. However, I would like to use threads if/when possible, so I started thinking about mixing MPI and threads (and not in a crass OpenMP/MPI kind of way), and how that would work, particularly in light of the fact that the 2 open source MPI implementations are not thread safe. Ugh.

However, since the vorbis stuff is neatly discretized, it really reduces down to a generic parallel master/slave problem (i.e., replace vorbis_analyze() with a generic calculate() function, and the framework is applicable to any master/slave problem). I realized this yesterday after I made an offhand remark to Lummy that I was thinking about talking about a potential parallel vorbis encoder for my LSC Friday talk.

I spent some time thinking about it (much scrap paper, test programs, etc.), and realize that this is actually quite a thorny issue. It's much more complicated (and interesting!) than one would initially think. Having a generalized master/slave solution for local and remote computation is something that no one has done yet. Yes, we all understand master/slave, but no one has published how to do this with threads and MPI. Indeed, if such a thing was coded up, it could certainly serve as a framework for any task farm kind of parallelism -- only a small number of interface functions would probably be required:

  • input / preprocess (on the master)
  • calculate (on the slaves)
  • postprocess / output (on the master)
  • registration for the marshalling/unmarshalling of data to be exchanged between the master and slave nodes

So I've spent about 24 hours of thinking and writing about this, and when I went to print it out for the first time, I was surprised to discover that I had written about 12-13 LaTeX fullpages about this topic (a lot of which is pseudocode). I've talked with Lummy about this; I only realized late this morning that this is an extension of the parallel image processing toolkit (PIPT) -- it's a few levels beyond what the PIPT is, but some of the ideas in it are definitely influenced by the PIPT. Sadly, I'll only be able to deliver some excerpt of this tomorrow, but I think Lummy and Bill and I might discuss this further this evening. Andy things that Jeremiah and Rich might follow up on this (if I don't) for GGCL and other things.

I'm not going to go into detail here about what I have already written (perhaps I'll fill in the details here later). I'm just struck by the irony that this all started by a personal pet project and may end up as a chapter in my dissertation (we'll see, but Lummy did mention it).

Threading, in general, can get quite complicated. Having multiple independent threads that don't interact with each other is easy --
anyone with a pthreads book (or man page) can do it. But having multiple threads that have to interact with each other and share data can get arbitrarily difficult. This whole scheme pretty much is in the latter domain -- threads need to share queues and data and whatnot; there's some thorny locking issues involved (which makes it so interesting). I haven't even solved all the issues yet -- I don't know how to give some slaves small amounts of work and give other slaves large amounts of work, for example (there's some convoluted locking issues involved).

Who knows -- this might go nowhere. But it seems pretty interesting, so I wanted to get it On The Record.

September 8, 2000

It Madonna calls, I'm not here

Spent much of yesterday thinking about the generalized manager/worker problem, and spent most of today re-writing my paper about it (I had a good quote to Tracy today, "if I were a theoretician, the paper would be done". But I'm not, and there were still a number of non-trivial practical issues that concerned me. So I re-wrote it). I solved the problem with the variable rate input/output in the input data unit queue -- a cool use of a condition variable and something I call a "reservation system" (which really amounts to two queues managed by mutex and condition variable... which are more or less the same thing anyway ;-). You get all the benefits of a variable rate I/O and blocking (instead of spinning). Here's the problem:

The input thread reads in chunks of input at a time and preprocesses them (working on the idea that preprocessing takes much less time than the "real" calculation). The preprocessed data units go into a data queue. The calculate entities (which can be local threads or remote threads that are serviced by MPI proxies) are removed from the input data unit queue (which will involve an MPI_SEND/MPI_RECV pair if the thread is remote from the manager) and processed. This is the step that takes the most time -- the actual calculation. When each unit is finished processing, it is placed in the output data unit queue (which, again, will involve MPI_SEND/MPI_RECV is the thread is remote). The output thread removes the output data unit from the queue, postprocesses it (again, taking orders of magnitude time less than the actual calculation), and then writes it out to the output datafile.

Believe it or not, this is not rocket science. Pretty standard stuff, actually. Here's the problem:

When we mix local and remote threads, however, he need to give different amounts of work to threads based upon their location -- this can help hide latency for remote threads, because we give them larger amounts of work. They request work less often, which directly translates to less messages, and therefore less latency (latency is bad, Bad, BAD!).

Since we're dealing with multiple threads that have to share the same data structures (i.e., the input data queue), that data structure must be locked in some fashion to prevent multiple threads from entering it simultaneously. That would also be BAD! So we lock it with a mutex -- again, pretty simple. The input thread locks the mutex, adds one or more input units. It's will probably be more than one, actually, for efficiency. Just like sending a smaller number of large messages is more efficient than sending a larger number of small messages -- even if the total number of data bytes is the same in both cases -- it is more efficient to lock something a fewer number of times for longer periods of time rather than a larger number of times, each for a short amount of time. The total lock time is the same in both cases, it's the overhead time (i.e., the time necessary to lock and unlock) that is different. But that's not the whole story -- there's concurrency as well! So it's not quite so simple; hopefully our worker threads will be busy enough with real calculations that their blocking time will be minimal in comparison.

And it's even more complicated than that. In order to save latency again (provided that the input units are somewhat small -- and yes, this is problem-specific), we want to only send one message to each remote node, not one message per input unit. So we have to request M input units per CPU per node. Hence, the proxy MPI thread must determine at the beginning of time how many CPUs each worker node has (no problem -- MPI_GATHER into num_cpus[]). For each remote node, the MPI proxy requests M*num_cpus[i] units, packs them into a single message, and sends them off. A similar thing happens with the output data on the way back --
the MPI proxy on the worker node packs all the output data into a single message and sends it back to the manager MPI proxy.

Anyway, back to the problem. So the input data unit queue is locked with a mutex. We want local threads to be able to retrieve X data units from the queue, but want remote data threads to retrieve Y data units. This is for several reasons:

  1. The latency reasons that were cited above; when Y > X, the remote threads do more work than the local threads to hide latency.

  2. When X and Y are relatively prime, given that one input unit translates to some T amount of time of computation, it can help offset the synchronizations necessary between local and remote threads. That is, it adds "jitter" to the scheduling, such that local and remote threads will be synchronizing at different times, which can reduce contention by preventing bottlenecks.

The question is how to do this? My previous scheme used a semaphore and mutex such that any thread (either a local thread or an MPI proxy thread) could remove single units at a time. This is no good, because retrieving a series of individual units may not guarantee their contiguousness -- another thread may slip in and grab a unit from the middle of your stream. I needed a way to atomically get an arbitrary number of units from the queue, and to be able to do it with blocking, not spinning (spinning == eating CPU cycles, and therefore taking them from someone else; blocking == allowing the thread to be swapped out until some external event wakes it up).

So here's what happens: the input works as before, putting in Q input units at a time (where Q >= 1). It then broadcasts to the condition variable (RTFM Tannenbaum). The calculate and MPI proxy threads are a bit more complicated. They get the mutex and check to see if the reservation queue is empty (this is different than the input data unit queue, but protected by the same mutex). If it is, it puts (threadID, num_requested_units) at the tail of the reservation queue. If it is actually at the head of the reservation queue, it checks to see if there are already enough units in the input data unit queue. If so, it takes them, removes itself from the reservation queue, and unlocks the mutex. If there aren't enough input data units, or this thread is not at the head of the reservation queue, it goes to sleep on the condition variable.

The input thread's broadcast to the condition variable wakes up all the calculate threads that are waiting on it (if any) and they all check to see if a) they are at the head of the reservation queue, and b) if there are enough input data units to service their request. If a thread finds that both a) and b) are true, it services itself, removes itself from the reservation queue, and broadcasts to the condition variable again to wake up the next thread in line. And so on.

Of course, it's not quite as clean as this -- there's other sticky issues like the input queue can be drained (i.e., the input is exhausted) but there's not enough units to fulfill a thread's request, etc. So the extra logic gets a bit corn-fuzing.

Interestingly, the output thread protection is much simpler -- the output thread comes in and takes as many contiguous output units off the output queue as possible. It gets more complicated when we can have an arbitrary number of input and output files being processed simultaneously; dequeuing from the output queue becomes an absolute nightmare. Consider just one issue: if we're processing B input files into B output files, all B input files will be read fairly quickly. Their data will be processed in order, and the processing will take some time. But we want the output files to be ready upon output of the final output unit in each file -- i.e., we have to close() the file. Hence, since only the input thread knows when the input has been exhausted, it has to submit some kind of sentinel value into the input data unit queue for a given output file that filters all the way through the pipeline to the output thread such that it says, "when you get all the output data units from this file, you can close it."

Pair that with the fact that the output data units will be coming in a random order from the calculation threads since they may all be operating at different speeds (indeed, since we're using MPI, the remote worker nodes may not be the same kind of machine as the manager machine). So some kind of order has be associated with input/output data (i.e., sequence numbers). The output thread has to re-order the output units into order, postprocess them, and then write them out to disk. And know when to close() the file since most POSIX systems work with write-on-close semantics.

Whew!


The paper is up to 18 pages now.

I'll probably spend more time re-writing the paper again. It seems that I get all the way through it and then realize one more critical threading/synchronization issue that throws the whole thing off, and makes me start the pipeline from scratch all the way back at the input stage. Still, it's all good, because it's way cool stuff. I've redesigned the pipeline from input to output 3-4 times now, each time better than before. :-)

I talked to Lummy about the paper today, and told him that I'd like to have something for him (and possibly Jeremy, Rich, and maybe even Jeremiah) to read in the next day or two.

September 23, 2000

Bring it on!!

The threaded version of the booter, indeed, seems to improve performance. Again, these are not on unloaded machines, so we can't say for 100% sure, but it certainly seems like it (I know pine will display this table badly; deal):

Number of nodes 2-way 3-way 4-way 5-way
32 0:37.5 0:29.1 0:28.3 0:21.2
147 1:01 0:48.5 0:55.4 0:43.4

(same conditions as before, AFS-cached, etc., etc.)

We have weirdness with the trinary and quad trees in the 147 again. :-( I'm still guessing that there are some strategically "bad" (i.e., heavily loaded) machines in the mix that are causing this. Indeed, it seems to "hang" on the last few nodes on the 4-way in the 147 tests. But again, the only real way to test this would be with a large number of unloaded nodes. :-\

Flying monkeys

Archiving some more test results...

Per Lummy's suggestion, I have compared lamboot vs. a serial ring-like boot of several different sizes to compare the two different topologies. My hypothesis was that they would be roughly equivalent -- the rsh latency would dominate any bookkeeping and efficiency of the two codes.

I used the threaded scaleboot version -- not that it mattered, 'cause there would only be one thread/child anyway. Here's the results:


Program Number of nodes
8 32 147
lamboot 0:23.1 3:18 15:xx
ring boot 0:22.6 3:15 15:06

I unfortunately forgot to run /bin/time on the biggest lamboot, so I could only go off the timestamps from my unix prompt. Doh...

Also, with all this big testing with lamboot, I am soooo glad that I wrote lamhalt (to replace wipe) -- it takes down a running LAM by simply sending messages to all the lamds, as opposed to doing a whole new set of rsh's to each machine to kill the daemons.

As Arun says, "'wipe' sounds silly and doesn't have the syllable 'lam' in it." lamhalt rocks.

I'm wearing velvet pants

Got my 10/100 switch yesterday. Woo hoo! Finally plugged it in this morning and got it operating (plug-n-play -- hah!); no more ethernet collisions -- yeah! All 3 of my linux boxen immediately switched to 100Mbps/full duplex, but the windoze box stayed at 10Mbps/half duplex. Figures.

Yeah, ok, I'm still behind a 1.5Mbps DSL line. So what. (actually, I'm streaming MP3s around here behind my firewall, so ethernet collisions were becoming a bit of a problem in terms of performance)


I got version one of my scalable booting working. It does an n-ary tree-based boot across a group of machines. Seems to work pretty well, but is not 100% bug-free yet. That is, it still hangs sometimes -- I think it is because it has done an rsh to a remote node and fails. Here's some preliminary results (units are expressed in seconds):

Number of nodes lamboot Binary tree Trinary tree Quad tree
16 1:02 0:12.8 0:09.8 0:06.4
32 3:07 0:46.5 0:29.3 0:27.5
147 14:06 1:28 1:00 1:07

Pretty good looking so far. Some notes...

  • All results were with the binary already AFS cached.

  • The 16 node tests were conducted on unloaded machines. The 32 and 147 node tests contained nodes that were in use, some of which were heavily loaded (shh!). So these numbers are not perfect. But they are a good ballpark.

  • The difference between 3 and 4 children is sometimes small. This can make sense -- consider the 32 node case. With 3 children each, the farthest leaf from the root will be 3 hops. With 4 children each, it is the same. Hence, with each of 3 and 4 children, we still have the same number of "timesteps".

  • Also, the algorithm is sub-optimal, particularly where there are heavily loaded hosts. I believe that this explains why 3 and 4 children on the 167 test results seem weird (it seems that some of the key parent nodes in the 4 node tests were heavily loaded -- I checked). This is not conclusive proof -- I would need a large number of unloaded machines to be able to test this theory. :-( See below.

I had a brief discussion last night with Lummy about this. I presented some timings of lamboot vs. the tree boots. He wants me to run a ring boot as well, and compare. I initially didn't see why he wanted me to do this -- indeed, I thought that it would be the same as lamboot (and I'm still not convinced that it's not the same -- the majority of time is dominated by rsh latency), but he made the good point that I don't have any numbers to back this theory up. As such, I don't know for a fact that they're not different. And I do agree, they are different topologies, so there could be a difference. They're different code bases, too, so subtle differences could mean a lot (although the scaleboot stuff derived from the inetexec.c that is central to LAM's lamboot). I'll code up the ring and see what happens...

The current implementation essentially works like this:

  1. Invoke the program with the -master switch and provide a hostfile.

  2. The program figured out that it is the master, and decides 1) that it has no parent, and b) reads in the hostfile.

  3. Switching into "parent" mode, it does what I call "multi-rsh" for its number of children (default is 2, but can be overridden on the command line). i.e., it fork/exec's rsh's into the background to the children's hosts. This is more complicated than it sounds...

    • The multi-rsh routine is given a list of username/hostname pairs, and a list of argv to execute on each.

    • First, you have to send a "echo $SHELL" command to the remote host to see what the user's shell is.

    • When that comes back, if they are running Bourne shell (and you'd be surprised at how many people do...), the Real argv (denoted by food) has to be surrounded with "( . ./profile foo )" so that their .profile will be executed, and paths will be setup, etc., etc. Goofy, but true.

    • Once this is determined, fork/exec the rsh with the real command to be executed.

    • Keep in mind that there are multiple rsh's fork/execed into the background simultaneously; they all have to be tracked by watching their stdout and stderr to determine where they are.

    • Additionally, when an "echo $SHELL" finishes, it has to be replaced with the real argv and re-launched.

    This results in one big-ol' state machine. It's somewhat hairy, but once I figured out some nice abstractions in C++, it worked out ok.

  4. After all the commands are executed, the parent waits for its children to contact it (we passed some command line parameters to each child indicating the parent's IP address and the port number that it was waiting on). This means sitting on an accept() N times, waiting for each of the N children to connect.

  5. As each child connects, give them a list of (M - N) / N username/hostname pairs to boot (where M == total number of hosts that this parent has to boot).

  6. The children go off and do their thing, potentially booting grandchildren.

  7. As each child finishes multi-rsh'ing its children (but before doing the accept()s to give its children work to do), it sends a number upstream to its parent indicating how many children were launched. These numbers all filter up to the root/master so that cumulative stats can be kept about how far along the boot is.

  8. The cycle is broken in two conditions (they're actually the same condition, but I call it two conditions here for ease of explanation):

    • A child is executed who has no children. When it contacts its parent to get a list of children to boot, it will receive "0" and therefore recognize that it is a leaf in the overall boot tree. It will then send a "-1" up to the parent and close the socket.

    • When a child has received "-1"'s from all of its children, it will send a "-1" up to its parent and close the socket.

    Hence, these "-1"s are propagated up the tree to the root/master, so that when everyone finishes booting, the master knows. It would be fairly easy to put in a fan out after this fan in and complete the barrier process so that the whole tree knows when it has booted, but it wasn't necessary for these tests.

There is a limitation to this approach: we have to wait for the multi-rsh to finish before we can give work to our children. Depending on the number of children used, and depending on the relative speed of the children of a given parent, this may involve some children waiting an a period of time before being given work. This is conjecture at this point, but 1) it seems reasonable, and 2) I hope to prove it with the following...

A single-threaded approach was actually fairly difficult. It involved some big select() statements, and a lot of lookups and extra bookkeeping. i.e., when select() returns, you have to scan down the list of available file descriptors, figure out which socket is ready, figure out where in the process that socket is, and then react. This created a lot of code (thank god for the STL and hash maps!). While the approach seemed to work, I think a multi-threaded approach will be much simpler in design.

With a multi-threaded design, we can have a thread for each rsh. It therefore only needs to monitor its own progress. We don't even need to have select() statements, because it's ok for each thread to block on read() statements, waiting for I/O from the remote process. I believe that the whole programming model will become significantly easier. And, as I mentioned above, there's a chance that there will be greater performance because each child will be able to go at its own speed and not be forced to wait for any of its siblings.

So I'm off to go implement the multi-threaded approach. I should be able to scavenge parts from lamboot and the scaleboot stuff...

September 28, 2000

Be the ball

By request, I did a pseudo-release of the jeffjournal client and server today to a limited audience. We'll see if it works out for them.

Dog just installed the patches for the Solaris Forte6 compiler today, and I gave it a whirl. Initial impressions:

  • It's slow.

  • They still didn't fix the linker bug. Compiling minime (which uses STL heavily) still gets all the same STL linker errors. <sigh>

  • They seem to have fixed much of the Memory Badness with using bcheck in C++, but I still get a fairly lengthy "blocks in use" report at the end of my run. <sigh> At least these are not potentiall fatal errors, though...

  • Jeremy claims that they don't have iterator_traits. I hope he's wrong...

  • Trippy message from running a multithreaded program through the new bcheck:
    RTC: Enabling Error Checking...
    RTC: Patching executable code.
    RTC: Done patching code.

October 3, 2000

San Demas High School football RULES!

I hit a problem the other day: how to tell when a machine is down? i.e., given a random IP name/address, how do you tell that it is down without a really lengthy timeout?

For example, try sshing to a machine that is down. Not one that doesn't have ssh enabled -- that rejection is almost immediate. And not to an IP that doesn't exist, or isn't reachable by you -- those rejections are also immediate. But ssh to a machine that is currently powered off, or not connected to the net.

It can take a long time to timeout.

A Solaris 7 machine takes almost 4 minutes (3:45) for a telnet to timeout to a host that isn't there. It takes 15 minutes for ssh to timeout (again, on Solaris). Quick testing showed that the majority of the 3:45 time was spent inside a single connect() call.

But a linux machine takes 3 seconds for telnet to timeout to a host that isn't there. What's it doing differently? How can it tell so quickly that the machine is not there?

Interestingly enough, the Solaris telnet reported "Connection timed out", whereas the Linux telnet reported "No route to host". So they're definitely doing something differently. Hmmm...

I ran my connect() test on both Solaris and Linux, and the results were identical to telnet -- Solaris sits for a long time on connect(), and then eventually times out. Linux only sits for a few seconds in connect() and then returns with a "no route to host" error.

Hmm. If connect() does not report the same error in the same way across multiple OS's, how do I do this? Indeed, Linux's behavior is great -- but what do I do on Solaris (and anyone else who doesn't return in 3 seconds)?


I got to thinking about the problem, and decided to look at some network and hacking tools. ping was my first stop. ping works in interesting ways. I didn't realize that it had its own protocol stack (like TCP and UDP). It works like this: you open an ICMP socket (you don't don't bind it to a port). From that socket, you send packets to the ping recipient. The ICMP stack on the other side will reply right back to you. Here's the catch: all ICMP replies come to a single point -- so if you have multiple ping programs running simultaneously, they'll see each other's ping replies (makes sense, if you think about it). Hence, you have to put some encoding in the payloads of the ping requests (which the remote ICMP stack will echo right back at you) to know which requests are yours and which you can discard.

Hence, here's a nice way that you can tell if a machine is up --
send it an ICMP packet. If you don't get one back in a relatively short timeout (probably even user-runtime-settable), rule it as "down". No problem.

Wait -- there's a catch. You have to run as root, 'cause the ICMP stuff is protected. Crap. We don't like setuid programs.

nmap was my next stop. They've got all kinds of goodies in there. SYN scans, FIN scans, etc., etc. They note, however, that many of these are not available to non-root users. Hence, they try the connect() thing as well when a non-root user tries to scan a machine. Again, Linux bails in 6 seconds saying "machine is not up" (this must be due to Linux's short connect() timeout). Solaris, however, takes much longer -- 1 minute. But it is significantly less than 3:45 that we saw in both telnet and the raw connect() call.

Some poking around in nmap revealed the following:

  • It's actually pretty small; only a dozen or so .c files. For something as full featured as nmap, I would have guessed that it would have been larger. Who knew?

  • It seems to be pretty well coded -- I could actually code the code pretty easily. They have good voodoo; color me impressed.

  • The non-root ping scan tries a connect(), but does it in a non-blocking way, and repeatedly uses select() to check if the connect() has finished yet. A neat trick --
    this allows them to set their own timeout (evidentially somewhere around a minute; I didn't bother checking what it actually was).

So I'm going to have to try this -- code up my own non-blocking connect() and put it in my threaded booter and give it a whirl. Too tired right now, though -- this will be tomorrow's activity.


"I'd actually like to see a non-blocking MPI_WAIT."
- Shane Herbert, MPI-2 Forum.

October 11, 2000

Fuzzy dice on a motorcycle

I've fallen behind on my journal entries. Cope.

Brief synopsis on dissertation stuff: got the threaded boot "fully" working. It still sometimes hangs in a very large boot (e.g., .helios scale boots, ~148 hosts or so) near the end. I suspect that someone drops out of the mesh before the boot finishes, but I haven't had a controlled failure yet to check the logs and see what is really going on. Additionally, sometimes a given node drops out when we do arbitrarily large numbers of children (e.g., at 12, foster.helios.nd.edu somehow decides that it doesn't want to boot. I don't know if this is an artifact of foster's parent screwing up [e.g., running out of file descriptors], or if foster itself somehow legitimately getting hosed. It's hard to tell, too, because all of these machines are actively being used when I do my tests :-).

I had to spend a good amount of time writing the jmesh algorithm into the code. I was using the Boost Graph Library, which was written by Rich Lee and Jeremy Siek here in the lab. However, as is always a danger with developing code, the APIs and concepts are continually changing, and the docs that I have (i.e., the book that Jeremy is writing on the BGL) is not consistent with what is available for public consumption at www.boost.org. Additionally, Jeremy's local CVS copy changes stuff even further. As a result, I spent a long time before I actually got it working. Arf.

However, I did come up with an iterative method to generate a list of edges in a jmesh that doesn't use any lookups at all
-- it just generates pairs of vertex numbers, and then we smush those into the constructor of a BGL graph. As such, it's considerably faster than the version that I wrote before -- the prior version would go to each vertex, check how many edges it had, determine if it needed more, etc., etc. This new version just does bookkeeping as it goes along with a small number of integer variables, and All is Well.

Now that I've finally got that working, I can add the stuff for all the nodes to make the connectivity implied in the jmesh, drop the boot tree connectivity, and then sit there waiting for commands. Not for now...


Unfortunately, however, Bill from NIST sent around an e-mail today saying that we'll be having a conference call about the SC'2000 IMPI demo on Friday morning. Doh!!! I haven't done butkis on IMPI yet. I've got to do the following:

  • Finish implementing the attribute color stuff per IMPI section 2.5
  • Implement MPI_BARRIER on IMPI communicators
  • Make LAM/MPI compliant with the IMPI errata
  • Get the pmandel demo code working with a few instances of LAM

It would be Really Good to get this all working by the call on Friday so that forward progress can be claimed...

Sidenote: it's really been quite a while since I've worked on IMPI. I am finding out how much I forget about how it works. Doh!

Saw a great quote in the IMPI code today:

Honk if you love silence.

I even remember putting that comment in there. Classic. :-)


Found two interesting python MPI projects:

I don't think that either project's main goal is formal python MPI bindings, but instead have some main "real" project that is [at least partly] in python, and they wanted to use MPI. I conversed with the sourceforge project author (at Lawrence Livermore); they're actively using it. I asked if there will ever be a formal release (all that's on sourceforge is CVS, not a real distribution). Haven't heard back yet.


Tony Hagale got my journal up and running. Woo hoo! Not entirely pain-free, though. Had to upgrade his C++ compiler, etc. He had some initial problems with quoting, as well. Not quite sure if that was a local configuration issue or a bug in my code ('cause it doesn't happen to me :-).


Started running MojoNation on squyres.com. Speaking from a distributed/crypto standpoint, that's some really cool shit!


Much work to do to get IMPI into shape. Miles to code before I sleep. Rusty will be here all day tomorrow; he's giving a talk on MPICH's daemon, and then Lummy and I are going to the LaSalle grill with him for dinner (yummy). Should be quite interesting.

February 19, 2001

Look Dave, no strings.

Ugh. I've spent the past few days fighting the return semantics of rsh and ssh.
In trying to make the tree-based booter industrial strength by putting it into LAM, I found out that not all rsh implementations are created equal. Grrrr...

It seems that some versions of rsh pretend to close stderr, but will in fact actually send things across it later. i.e., read() will return 0, but then will later return a positive number and have valid bytes in the buffer.

ARRGHHH!!

There's also some mysterious things happening that I don't fully understand yet (this only happens when you scale to above 25 nodes or so). So I finally decided that if rsh cannot be trusted, the whole framework in LAM for generic remote-launching is wrong. i.e., the whole issue is about determining if the remote program started successfully or not. How to do this in a programtic fashion?

It currently goes like this (and rsh can be replaced with ssh or whatever):

  1. Open two pipes
  2. fork() a child process
  3. Close the respective pipe ends in the parent and child processes
  4. Tie the pipes to the stdout and stderr in the child process
  5. The child exec() the rsh command command
  6. The parent watches the pipes:
    • If something comes across stderr, our heuristic says to abort
    • It something comes across stdout, buffer it
    • When stderr and stdout close, the child is done, quit the loop
  7. The parent calls waitpid() to wait for the child to die
  8. If the return status of the child is not 0, abort

If we incorrectly determine that a remote program failed to start (i.e., it actually did start, but the local node thinks it didn't), the remote program gets stranded, and is left running forever because no one will ever contact it again. Among other reasons why this is bad, this is anti-social behavior.

Plus, the code is complicated as well because of the statefull nature it has to maintain while checking multiple data sources in a non-blocking way. Ugh. And I didn't even mention how we have to check and see if the other side is running a Bourne or Korn shell...

The long and the short is that the remote agent (rsh, ssh, whatever) cannot be trusted to give reliable information. So the only thing to do is to disregard the information that it gives and determine if the remote program started correctly by a different means. One way to do that is to have the remote process call the spawning process back with a TCP socket.

If the remote process doesn't call back within a timeout period, the spawner can reason that it failed and give up on it. If the remote process starts up properly and is unable to contact the spawner (perhaps it took a long time to start, and the spawner has timed out already), it will just abort. This prevents orphaned remote processes.

Specifically, I'm looking at something like:

  1. Parent creates listening socket for the callback
  2. Parent launches a thread to wait for the callback on that socket
  3. Parent makes three pipes (for stdin|out|err)
  4. Parent fork()s a child
  5. Parent closes appropriate ends of the pipes
  6. Parent launches two threads to monitor the pipes
  7. Parent launches a thread to block on waitpid()
  8. Child closes appropriate ends of the pipes, ties the other ends to stdout|err
  9. Child exec()'s the remote agent
  10. Parent blocks on a queue

    • When either of the pipe threads wake up on a read, they buffer the data and put it in an event and queue it up for the parent
    • Closing either of the pipes is similar -- an event is queued up for the parent followed by the thread committing suicide
    • When waitpid() returns, the return status is queued up in an event for the parent, and the thread commits suicide
    • When the listening thread succeeds on accept(), it begins the authentication/connection protocol. Upon success, it queues up an event for the parent (including the open socket file descriptor) and commits suicide.

  11. When all the threads die, it means that the remote process has started up, the remote process has authentications and indicated that it wants to run, a socket is still open to the remote process, the remote agent is now dead, and all threads/processes have been reaped, so the parent can now continue.

In the previous scheme, the remote agent would launch the remote program. The remote program would immediately close stdin|out|err and then fork a child into the background as a user-level daemon, and then quit. This would allow the remote agent to finish normally (hah!). The child process would then continue on to do whatever it was launched to do.

In the new scheme, there is no need to have the remote agent finish until the callback to the spawner has completed and there is no more gain to having the remote agent process around anymore. i.e., in the previous (linear) scheme, it was necessary for the remote agent to quit before the next step would proceed (wait for a callback). In this scheme, they are independent events -- the remote agent quitting has little bearing on the callback since those are in different threads. Indeed, it may be advantageous to have the remote agent stick around until the callback occurs successfully to give one more way to abort the remote process if something goes wrong. That is, if something goes wrong and the callback gets mucked up, send a signal or some kind of message down the stdin pipe to the remote agent, which will get passed to the remote process that will cause the remote parent and child to abort.

Additionally, just like giving each remote process a thread to manage it, giving a thread to each of the stdout and stderr pipes eliminates the combined state machine and uses blocking reads. This makes the algorithm for monitoring the pipes much simpler. Hence, we can monitor the pipes, waitpid(), and the callback separately, and therefore greatly simplify the code (why didn't I think of this earlier?).

Jeff's law of non-blocking:

writing blocking algorithms is much simpler than writing non-blocking algorithms.

Jeff's law of blocking:

writing concurrent blocking algorithms introduce their own problems, but generally only in terms of infrastructure, and are typically problems that are already solved.

What's even cooler is that the remote process can startup, call back the spawner, and give a "I'm ready to go" message, or "things suck over here; I can't run so I'm going to abort" message. i.e., the remote process can decide whether it's going to run or not (e.g., check to see if the load is not too high) and send back a yay or nay to the spawner. Even cooler than that -- an integrated startup protocol allows for authentication instead of security through obscurity (security though obscurity isn't, for those of you who care!).

I'm currently in the middle of re-writing all this code (it takes time to setup the infrastructure and whatnot). The result should


xmms currently has 619 of 703 processes on queeg (88%).

February 21, 2001

I think Joe saw us in the movie theater last night

I've gotten an unexpected result from my thread booter.

When booting across the ND helios cluster of 161+ sparcs (some of which should fail, BTW -- at least 2-3 are down at any given time, and about 5-10 are a different version of Solaris than the rest such that there are shared library linker problems trying to run on them).

Even with about 10-20 nodes expected to fail, about 1/3 of them fail to boot properly on a regular basis. This is many more than expected.

The main reason is that the parent that is trying to boot them times out. i.e., if the child does not callback on the socket within N seconds, the parent decides that the remote boot must have failed (even if the boot does succeed at some point later, and the child does try to boot). The parent rules that that child is dead and moved on to the next.

The weird thing is that this was happening a large percentage of the time; much more than I expected. Worse than that, it was inconsistent -- I would get different results every time I did a helios-wide boot (even if they were only separated by only a few seconds). This is clearly not good enough.

One solution is to increase the timeout time (I was using timeout values of 5, 10, and 30 seconds -- the problem occurs with all the values). Increasing the timeout value to 120 seconds seems to make it work most of the time; most bootable helios machines actually boot properly. However, this significantly adds to the overall boot time because we now have to wait 2 minutes for each individual failure before moving on to the next child, which is undesirable.

So I think I need to change my booting algorithm yet again (this is the point of research, isn't it?).

  • Still keep the basic tree-based structure.

  • To overcome the problem with slow children, we need a system where the work of one child can be given to another, but need to keep this in a tree-based structure (vs. a monolithic server) so that we don't run out of resources. That is, some kind of first-come, first-serve basis, since we know that if a child requests work, it is ready to go. Faster children will naturally ask for more work.

  • Right now, each parent node receives a list of all the children that it is responsible for booting. It divides this list up into N sub-lists (where N is the number of immediate children that it will boot), spawns a thread for each, and gives each thread one of the sub lists. This needs to change.

  • Instead, spawn off N threads and give them each one child to boot. The parent thread keeps the rest of the list of nodes that it is ultimately responsible for booting.

  • If a child fails to boot by some kind of immediate failure (e.g., a ping to that child fails), the parent can kill that thread and launch a new thread and give it the next node from its master list.

  • When [if] a child actually boots successfully (which is defined by the grandchild opening a socket back to the child and saying, "Ok, I'm ready -- gimme some work"), it asks the parent for a subset of nodes from the parent's pool. The parent will give a list roughly of size total_size/N so that each descendant's subtree will be about the same size, which then child then passes on to the grandchild.

  • Aside from the parent keeping the list of children, this is more or less how it happens now.

  • Here's the new part: when a parent finishes (actually, when any node finishes -- whether it was originally a parent or a leaf), it sends an "all done -- got any more work?" message to its parent.

    • If the parent's parent has any more work, i.e., it has some nodes left in its pool because one or more of its children were slow to boot, it will give a subset list (of about the same size as it has given out to every other node who queried) to the querying child.

    • If the parent's parent doesn't have any more work, it passes the request to its parent, where the same procedure is repeated. If any work is eventually found, it is sent back down the tree to the original child who queried.

As such, with this scheme, it is possible for a grandchild (or some node even further down) to steal work from a slow child. This scheme can allow for long timeouts which may be necessary (particularly in an active workstation environment), but still allow for speed in the overall boot -- we just eliminate the blocking on slow children by potentially taking away their work from them.

A side-effect of this is that the overall tree may become lop-sided. But that doesn't really matter much, because the idea here is to parallelize in time, not in space. So if we have a slightly taller-than-optimal meta-tree at the end, it doesn't matter -- the meta tree is only for booting and will be discarded anyway.

It's good to be a gangsta

Some hard-learned C++ knowledge:

cin always uses file descriptor 0.
cout always uses file descriptor 1.
cerr always uses file descriptor 2.

This is particularly annoying where you close file descriptors 0, 1, and 2 and re-open them to something else, because cin/cout/cerr will still use those file descriptors, and will read/write to the new things that you opened!

I finally figured out that that was what was happening on my tree-based booters -- I had debugging cerr's that ended up writing down sockets, which caused havoc on the remote side, because it got unformatted and unexpected messages. Doh!!!

In hindsight, this completely makes sense (and may even be by design; I don't have an iostream book handy). Consider that cin/cout/cerr are not tied to the OS -- they have no way of knowing when file descriptors 0, 1, and 2 have been closed and reopened to something else. For example, cout's operator<<(...) assumedly eventually boils down to:

write(1, ...);

In which case, cout has had no indication that file descriptor has been closed and re-opened into something else.

Just a point of wisdom for readers out there... it caused me three days of grief.

April 21, 2001

Gazizza, Bill

The exact topic of my dissertation has changed several times.

Here's what I presented to my committee last week, with their comments applied, as well as with information from my coding it up (particularly with their changes). Those who aren't geek-minded can probably ignore the rest of this message.

Fair warning: this is a pretty long journal entry!


Background and Algorithmic Overview

The idea is to have a "fully generalized manager-worker framework". However, the end result is that it's not quite the manager-worker model -- it's more like a "fully generalized, threaded distributed work farm". I started with a model for any [serial] kind of computation that looked something like this (it won't render right in pine -- deal -- go look at the web version):

    |=======|   |===========|   |========|
    | Input |-->| Calculate |-->| Output |
    |=======|   |===========|   |========| 

If you throw some queues in there, you can duplicate and therefore parallelize the Calculate step (keep the Input and Output steps serial, because a) they'll probably take very little time, and b) any part that can be parallelized can be thrown into the Calculate step):

    |=======|   |===|   |===========|   |===|   |========|
    | Input |-->| Q |-->| Calculate |-->| Q |-->| Output |
    |=======|   |===|   |===========|   |===|   |========|
                  |     |===========|     |
                  |====>| Calculate |====>|                  |     |===========|     |
                 ...         ...         ...
                  |     |===========|     |
                  |====>| Calculate |====>|                        |===========| 

That's pretty standard stuff, actually. That's essentially the manager-worker model.

So what I'm doing is two things: extending this model to include threads (still relatively unexplored areas with MPI, particularly since the 2 major freeware MPI implementations have little multithreading support) and to make a distributed scatter/gather scheme.

The goal here is to present a framework (i.e., a library) to the user such that they only have to supply the Input, Calculate, and Output steps. Yes, they do have to be aware of the parallelism, but only so much so that they can make their problem decomposable. The framework takes care of all the bookkeeping. Hence, the user essentially writes three functions (actually, 3 classes, each with a virtual run()-like function, and functions to pack/unpack their input and output data. As briefly mentioned in previous journal entries, I ended up using C++ templates heavily so that the type safety would all work out).

The target audience is people who want parallelism but don't really care how it works. That would be most engineers and scientists -- even some computer scientists| Most of these kinds of users just want to run their results, and run them faster -- they don't care how it works. After all, that's our job (as computer scientists), right?


Back to the description...

From the above picture, if we're only running on one machine (say, a 4-way SMP), the Calculate boxes (instances) will be individual threads. The Input and Output instances will be threads, too. By default, there will be one Calculate thread per CPU -- the Input and Output threads will be "extra" and cause some thrashage of CPU scheduling, but not very much -- particularly when the Calculate step is large enough to run for a while.

Note that the two queues do not have threads running in them --
those queues are just data structures with some intelligent accessor functions. The Input, Calculate, and Output threads access the queues and become a thread active in the queue. But there are no separate threads running the queues themselves.

Using threads is nice because it avoids the whole issue of extraneous memory copying and allows message passing latency hiding (even with single-threaded MPI implementation). If we used the same model with pure MPI instead of threads -- i.e., where Input, each of the Calculate instances, and the Output were all separate MPI ranks on the same node, we'd be doing sends and receives between each of the instances (the queues would possibly be located in the Input and Output ranks), which would invoke at least one memory copy (and probably more). If the input data and output data are large, this could add up to be a non-trivial portion of the wall clock execution time. Using threads within a single process, pointers to input/output data can just be passed between the Input, Calculate, and Output blocks. i.e., pass by reference instead of by value. Therefore, it makes no difference how large (or small) the input and output data is.


Extending this model to cover multiple nodes, let's throw in a definition first. The node on which the Input and Output are run is called the "Server". While it would certainly be possible to run the Input and Output phases on different nodes, this model will assume that they are on the same node, just for simplicity. It is [probably] not difficult to separate them, but this work doesn't focus on that issue. Hence, there's only one server in this model, regardless of however many nodes are involved in the computation.

To extend this model to include multiple nodes, we add a special kind of Calculate instance to the diagram from above -- a "Calculate Relay":

    |=======|   |===|   |===========|   |===|   |========|
    | Input |==>| Q |==>| Calculate |==>| Q |==>| Output |
    |=======|   |===|   |===========|   |===|   |========|
                  |     |===========|     |
                  |====>| Calculate |====>|                  |     |===========|     |
                 ...         ...         ...
                  |     |===========|     |
                  |====>| Calculate |====>|                  |     |===========|     |
                  |     |===========|     |
                  |====>| RelayCalc |====>|                        |===========| 

This RelayCalc instance has the MPI smarts to send input data to, and receive output data from, remote nodes. Notice that it just dequeues input data and enqueues output data just like the other Calculate instances. Hence, the Input and Output instances do not need to know anything special about remote access.

Also note that there will be a thread running the RelayCalc instance. One could conceivably model the relays in the queues, but this would entail having 2 relays, and would cause some issues with non-thread safe MPI implementations (although these issues arise elsewhere, anyway), and it would destroy the idea of not having threads running in the queues. While threads are nice and lightweight, we don't need to have extraneous threads running where we don't need them. Not only are they not free (in terms of resources), they do add complexity (e.g., what would threads running in the queues do?).

The RelayCalc fits in the came category as Input and Output --
it's an "extra" thread, but it is not expected to take many CPU cycles (particularly when the Calculate phase is non-trivial).

Note that there is only one RelayCalc instance, regardless of how many nodes it is relaying to. This greatly simplifies the relaying with a single threaded MPI implementation -- indeed, to have N instances of RelayCalc to relay to N remote nodes would mean that a global lock would have to be used to only allow one RelayCalc instance in MPI at any time. This would mean that that all the RelayCalc instances would have to poll with functions such as MPI_TEST. And this would involve continually locking, testing, unlocking between all the RelayCalc instances, which would certainly keep one or more CPUs busy doing message passing rather than working in the Calculate instances, which is not desirable.

Hence, there's only one RelayCalc instance that can do blocking MPI_WAITANY calls to check for messages from any of the nodes that it is expecting output data from (and checking for completion of sent messages -- see below). This will probably serialize message passing in the server, but that is to be expected with a single-threaded MPI implementation anyway.

Indeed, even if the MPI implementation were multi-threaded, there will frequently be less network interfaces than remote nodes (typically only one), so the network messages will likely be at least somewhat serialized anyway. The best that a multi-threaded MPI implementation could do would be to pipeline messages to different destinations across the available NICs, but that's within the MPI implementation, and not within the user's (i.e., the framework's) control. Indeed, a quality single-threaded MPI implementation can pipeline messages anyway (if non-blocking sends are used). So there's actually little gain (and potentially a lot of CPU cycles to lose) in having multiple RelayCalc instances when using a single-threaded MPI implementation -- the same end result of having multiple RelayCalc instances with a true multi-threaded MPI implementation can be achieved with carefully coded single RelayCalc instance using non-blocking sends and receives with a single-threaded MPI implementation.

(There's a lot of the finer details that I didn't cover in the previous two paragraphs; those are currently left as an exercise to the reader. :-) Read my dissertation for the full scoop)


So now let's look at a typical non-server node in the model:

    |=========|   |===|   |===========|   |===|   |==========|
    | RelayIn |==>| Q |==>| Calculate |==>| Q |==>| RelayOut |
    |=========|   |===|   |===========|   |===|   |==========|
                    |     |===========|     |
                    |====>| Calculate |====>|                    |     |===========|     |
                   ...         ...         ...
                    |     |===========|     |
                    |====>| Calculate |====>|                          |===========| 

A few interesting notes here:

  • The Input and Output instances have been replaced by RelayIn and RelayOut instances, respectively.

  • As far as the Calculate instances are concerned, the model is the same -- it dequeues input, processes, and enqueues output.

The RelayIn and RelayOut instances are the MPI entry and exit points -- input data is relayed to the RelayIn instance from the RelayCalc instance on the Server, and output data is relayed back to the RelayCalc instance by RelayOut. This is why the user has to supply not only the Input, Calculate, and Output instances, but also methods to pack and unpack their the input and output data -- the framework will call them automatically to send and receive the data between nodes.

But again, in terms of the Calculate phase -- nothing is different. It operates exactly as it does on the server node. The framework has just added some magic that moves the input and output data around transparently.

There are now two threads vying for control of MPI. Since we only have a single-threaded MPI implementation, we cannot have both of them making MPI calls simultaneously. The following algorithm allows both threads to "share" access to MPI in a fair manner.

In the beginning of the run, the RelayIn instance has control of MPI because we expect to receive some number of messages to seed the input queue. After those messages have been received, control of MPI is given to the RelayOut. The RelayOut will block while dequeuing output data from the output queue (since the Calculate threads started acting on the data as soon as it was put in the input queue), and then return the output data to the Server. Control is then given back to the RelayIn in order to receive more input data.

That is, the message passing happens at specific times:

  • Messages will only be received at the beginning of the run, or after messages have been sent back to the Server

  • Messages will only be sent after messages have been received and the Calculate threads have converted them to output data

Specifically, incoming and outgoing messages will occur at different (and easily categorizable) points in time. Indeed, outgoing messages will [eventually] trigger new incoming messages, and vice versa. So the simple "handoff" model of switching control of MPI between the RelayIn and RelayOut instances works nicely.


A big performance-determining factor in MPI codes can be latency hiding. Particularly in high-latency networks such as 10/100Mbps ethernet. An advantage of this model is that even with a single threaded MPI, progress can be made on message passing calls which actual calculation work is being done in other threads. This pipelined model can hide most of the latency caused by message passing.

That is, the RelayIn thread can request more input data before the Calculate threads will require it. Hence, when the Calculate threads finish one set of data, the next set is already available --
they don't have to wait for new data to arrive.

A possible method to do this is to initially send twice the amount of expected work to each node. That is, if there are N Calculate threads on a given node, send 2N input data packets. The Calculate threads will dequeue the first N input data packets, and eventually enqueue them in the output. The next N input data packets will immediately be available for the Calculate threads to dequeue and start working on.

Meanwhile, the RelayOut thread will return the output data and the RelayIn thread will [effectively] request N more input data packets. When the N input data packets arrive, they will be queued in the input queue for eventual dequeuing by the Calculate threads. This occurs while the Calculate threads are working -- the message passing latency is hidden from them.

This scheme works as long as the Calculate phase takes longer than the time necessary to send output data back to the Server and receive N new input data packets. If the Calculate phase is short, the RelayIn can initially request more than 2N input data packets, and/or be sure to use non-blocking communication to request new input data blocks so that requests back to the Server can be pipelined.


To improve the scalability of the system by removing some of the bottlenecks in the scattering/gathering, non-server nodes can also have a RelayCalc instance:

    |=========|   |===|   |===========|   |===|   |==========|
    | RelayIn |==>| Q |==>| Calculate |==>| Q |==>| RelayOut |
    |=========|   |===|   |===========|   |===|   |==========|
                    |     |===========|     |
                    |====>| Calculate |====>|                    |     |===========|     |
                   ...         ...         ...
                    |     |===========|     |
                    |====>| Calculate |====>|                    |     |===========|     |
                    |     |===========|     |
                    |====>| RelayCalc |====>|                          | (optional)|
                          |===========| 

This RelayCalc instance will relay input data to additional remote nodes, and gather the output data from them, just like the RelayCalc on the Server node.

The implication of having a RelayCalc step is that we can have arbitrary trees of input and output. That is, the Server is not the only node who can scatter input out to, and gather output data from, remote nodes -- arbitrary trees can be created to mimic network topology (for example). Consider the following tree:

                      the_big_cheese (0)
!===========! !============!
! !
child_a0 (1) child_a1 (3)
! !=====! !=====!
child_b0 (2) ! !
child_c0 (4) child_c1 (5)

the_big_cheese is the Server. It has two children, child_a0 and child_a1. child_a0 has only one child, but child_a1 has two children. The numbers in parentheses represent the MPI rank numbers (with respect to MPI_COMM_WORLD). Note that there is no restriction to having a maximum of two children =- this is just an example. Each node also has one or more Calculate instances. So the end result can be a large, distributed farm of compute nodes.

This refines some of the previous discussion: the various Relay instances (In, Calc, Out) will actually not necessarily talk to the Server -- they'll talk to their parent, child, and parent, respectively. In some cases, the parent will be the Server. In other cases, the parent will be just another relay.

The RelayIn will now need to request enough input data packets to keep not only its local Calculate threads busy, but also all of its children. This is accomplished by doing a tree reduction during the startup of the framework that counts the total number of Calculate instances in each subtree. This will allow a RelayIn to know how many Calculate instances it needs to service. The RelayIn can then use an appropriate formula / algorithm can be used to keep its input buffer full (as described above) before any of the local Calculate instances or the RelayCalc instance needs data.


The astute reader will realize that there are now three threads vying for control of MPI. Therefore, the simple handoff protocol discussed above will not work (although the handoff protocol is still applicable for "leaf" nodes, where there is no RelayCalc instance). To make matters worse, both RelayIn and RelayCalc will potentially need to be blocking in receive calls, waiting for messages to asynchronously arrive. RelayIn will only receive messages at discrete times (as discussed above), but the frequency at which RelayCalc can receive messages is determined by the node's children, and therefore could effectively be at any time. That is, since RelayIn/RelayOut will be driven not only by the actions of its children nodes, but also by its local Calculate threads, the times at which RelayCalc will need to receive messages is not always related to when RelayIn/RelayOut will need to communicate.

Specifically, there will be times when RelayCalc needs to receive a message that will independent of what RelayIn and RelayOut are doing.

It would be easiest of RelayCalc could just block on its MPI_WAITANY while waiting on a return message from any of its children. But this would disallow any new outgoing messages from RelayOut (and therefore any new incoming messages from RelayIn). The implication is that nodes will have to wait for a message from their any of their children before they can send the output data from their local Calculate threads back to their parent, and therefore have to wait request any new input data.

This can be disastrous if a node's children are slower than it is. In this case, a fast node could potentially drain its entire input queue and be blocked waiting for results from any of its children before being able to ask for more input data from its parent. Even worse, this effect can daisy-chain such that slow nodes could cause the same problem in multiple [faster] parent nodes; the fast parents could all get trapped waiting for results from the one slow child.

These questions are addressed further in the following section.


This tree design will help eliminate the bottleneck of having a single Server that has to communicate with N nodes (especially as N grows large) -- the problems of serializing the message passing could easily dwarf the CPU cycles given to the Calculate instances. That is, the communication could become more costly than the Calculation.

But just having a tree structure for scattering/gathering is not sufficient. Indeed, if a leaf node (e.g., child_c0) sends an output block back to its parent, and its parent immediately sends it to its parent, etc., all the way back up to the Server, this would effectively be no different than if all N nodes were connected to the Server directly -- the Server will get a message for every single output data block. This model would then only add hops to input and output data rather than increase scalability.

Instead, the various Relay instances will gather multiple messages into a single message (or a single group of messages) before sending them up the tree. For example, a RelayOut instance can wait for output data from each of its Calculate instances before sending them back to its parent. The RelayOut instance will send all of its N messages at once in a "burst" such that its parent RelayCalc instance will be able to process them all in a short period of time and then relinquish its CPU back to a Calculate instance. I'll refer to this group of messages as a "mega message", below.

Likewise, there will need to be some flow control on the messages from RelayCalc instances. It is desirable to group together multiple "mega messages" into a single "mega mega message" in order to send larger and larger messages as output data propagates up the tree, and therefore decrease the number and frequency of messages at upper levels in the tree. Hence, the mega messages that are received by a RelayCalc must be grouped together, possibly in conjunction with the output data from the local Calculate instances, before sending to the node's parent.

But how to do this? Does the RelayCalc just wait for mega messages from all of its children before enqueuing them all to the output? It would seem simpler to just enqueue the mega messages as they come in, and when the RelayOut sees "enough" messages, it can pass a mega message of its own (potentially larger than any of the individual mega messages that it received) to its parent.

One definition for "enough" messages could be N mega messages (where N is the number of children for this node), and/or M output data enqueues (where M is the number of Calculate instances on this node). This may also be a problem-dependent value -- for example, if the Calculate process is short, "enough" messages may to be a relatively small value.

This scheme will probably work well in a homogeneous world. But what if the node and its children are heterogenous? What is some nodes are more powerful/faster than others, or if the network connections between some of the children are heterogeneous? For example, what if some of the children nodes are connected via IMPI, where network communication to them is almost guaranteed to be slower than network communication to local MPI ranks?

The heterogeneity effect implies the problem discussed in the previous section -- that slow Calculate instances can cause parent nodes to block with no more work to do, and not be able to obtain any more work because RelayCalc has not returned from waiting for results from a child.

Another alternative is to use the "separate MPI thread" approach (where all threads needing access to MPI use simple event queues to a separate MPI thread), and have the separate MPI thread use all non-blocking communication. But instead of using a blocking MPI_WAIT approach, use the non-blocking MPI_TEST polling approach. The problem with this, as discussed previously, is that this could incur a undesirably significant number of CPU cycles, and therefore detract from the main computations in the Calculate instances. If polling only happened infrequently, perhaps using a backoff method (finitely bounded, of course), this might be acceptable.

Note that there will be one "incoming event queue" for the MPI thread where threads can place new events for the MPI thread to handle. But there will be multiple "return event queues" where the MPI thread places the result of the incoming events -- one for each thread that enqueues incoming events.

                  |-============================|   |========|
   |=============>| Shared incoming event queue |==>|        |
   |              |=============================|   |        |
   |                                                |        |
   |   |====|   |===============================|   |        |
   |<==| T1 |<==| Return event queue / thread 1 |==>|        |
   |   |====|   |===============================|   |  MPI   |
   |   |====|   |===============================|   | thread |
   |<==| T2 |<==| Return event queue / thread 2 |==>|        |
   |   |====|   |===============================|   |        |
  ...                                              ...      ...
   |   |====|   |===============================|   |        |
   |<==| TN |<==| Return event queue / thread N |==>|        |
       |====|   |===============================|   |========| 

The various threads that need to access MPI place events on the MPI thread's shared incoming event queue, and then block on (or poll) their respective return event queues to know when the event has finished. An event is a set of data necessary for an MPI send or receive.

The general idea is that the MPI thread will take events from its incoming queue and start the communication in a non-blocking manner. It will poll MPI periodically (the exact frequency of polling is discussed below) with MPI_TEST, and also check for new events on its incoming queue. As MPI indicates that events have finished, the MPI thread will place events on the relevant return event queue. The MPI thread never blocks in an MPI call; it must maintain a polling cycle of checking both the incoming event queue and MPI for event completions.

A special case, however, is that the MPI thread can block on the incoming event queue if there are no MPI events pending. This allows the MPI thread to "go to sleep" when there are no messages pending for MPI (although this will rarely happen).

The polling frequency is critical. It cannot be so high that it takes many cycles away from Calculate threads, nor can be too low such that input queues become drained or threads become otherwise blocked unduly while waiting for MPI events to complete. These conflicting goals seem to indicate that an adaptive polling frequency is necessary.

That is, it would seem that the polling frequency should be high when events are being completed, and should be low when no events are occurring. This would preserve the "bursty" model described above; when an event occurs, it is likely that more events will follow in rapid succession. When nothing is happening, it is likely that nothing will continue to happen for a [potentially long] period of time.

A backoff method fits this criteria: the sleep time between polling is initially small (perhaps even zero). Several loops are made with this small/zero value (probably with a thread yield call in each loop iteration, to allow for other threads to wake up and generate/consume MPI events). If nothing "interesting" occurs in this time, gradually increase the sleep time value. If something "interesting" does occur in this time, set the sleep time value back to the small/zero value to allow more rapid polling.

This allows the polling to occur rapidly when messages arrive or need to be sent, and slowly when no message passing is occurring (e.g., when the Calculate threads are running at full speed).

An obvious optimization to the polling model is to allow the MPI thread to loop until there are no new actions before going to sleep. Hence, if a event appears in the incoming queue, or MPI_TEST indicates that some communication has finished, both the sleep time is reduced and the MPI thread will poll again without sleeping.

This stuff hasn't been implemented yet, these are questions that I do not yet have definitive answers to. It is likely that my first implementation will be modeled on the backoff polling described above. We'll see how that works out.


There's a whole bit in here that I haven't really described about a primitive level of fault tolerance -- if a node disappears, all of the work that it (and all of its children) was doing will be lost, and reallocated to other workers. That is, as long as one Calculate thread remains, the entire computation will [eventually] finish, but likely at a slower rate.

The gist of this is to set the error handler MPI_ERRORS_RETURN such that MPI will not abort if an error occurs (such as a node disappearing). There's some extra bookkeeping code in the RelayCalc that keeps track of what node had what work assigned to it, and will reassign it to another node (including its local Calculate threads, if a) the local Calculate threads become idle, and/or no remote Calculate threads remain alive).

Just to clarify: I am not trying to be fault tolerant for programmer error. If you have broken code (e.g., a seg fault), this framework will not recover from that error. This framework will also not attempt to bring nodes back that previously failed; once nodes die, they -- and all their children -- are dead for the rest of the computation. Future work may make this better, but not this version. :-)

Probably a good way to describe this fault tolerant work is: if you have a program that will run correctly when all nodes are up, your program will run correctly as long as the Input and Output threads stay alive, and at least one Calculate thread stays alive.


Implementation Details

Woof. This journal entry is already much longer than I thought it would be (we're approaching 600 lines here!), so I'll be a little sparse on the implementation details, especially since it's all in a state of flux anyway.

The implementation is currently dependent upon LAM/MPI and will not run under MPICH. This is because I make use of the MPI_COMM_SPAWN function call to spawn off all the non-server ranks. The code could be adapted to allow for using mpirun to start the entire job, but that's a "feature" that I don't intend to add yet.

Unfortunately, the only way that I could think to specify the tree used for the distribution was to specify an additional configuration file. Attempting to specify the tree on the command line resulted in a clunky interface, and arbitrarily long command lines. I used inilib to parse the file; it's a fairly simplistic, yet flexible format.

The server starts up, parses the configuration file, determines how many total non-server nodes it needs, and spawns them. The non-server nodes are spawned with an additional "-child" command line argument so that they know that they do not need to read a configuration file. Instead, they receive their configuration information from their parent in the tree.

Sidenote: what's interesting to me about using MPI_COMM_SPAWN to start all the non-server nodes is that startup protocols have to be used. I'm used to writing startup protocols to coordinate between multiple processes for socket-level (and other IPC) programs, but I've never used MPI itself for startup meta information. It just seems weird. :-)

The server sends a sub-tree of the entire tree to each of its direct children. The sub-tree that it sends is the tree with that child as the root; hence, every child of the server learns about all of its children, but does not learn about its siblings or their children. The process repeats -- each child then sends a sub-tree to each of its direct children, until there are no more sub-trees to send.

One of the parameters in the configuration file is "how many Calculate threads to have on this node". If this value is -1 in the configuration, the value will be determined at run time by how many CPUs the node has. As such, after the sub-trees are distributed, a global sum reduction is initiated from the leaves back to the server. In this way, each relay will learn the total number of Calculate threads that it serves.

After this, it's fairly straightforward (hah!) bookkeeping to distribute input data and gather output data (mainly as described in the algorithms discussed above). The queues actually contain quite a bit of intelligence (particularly the output queue), and are worthy of discussion. However, I'm pretty sure that I have discussed the queues in a previous journal entry (they've been implemented for quite some time now), so I won't repeat that discussion here.


Future Work

Here's a conglomerated list of items from the text above, as well as a few new items that would be desirable in future versions of this software:

  • When a node dies, try to bring it back.

  • When a node dies, allow all of its children to continue in the computation if possible. This may be as simple as the grandparent assuming control of all the children, or may entail something more complicated than a simple tree distribution (perhaps something more like a graph [perhaps a jmesh?]).

  • Allow the job to be started with a full MPI_COMM_WORLD --
    i.e., don't rely on MPI_COMM_SPAWN to start the non-server nodes.

  • Multiple, different Calculate thread kernels for different kinds of input.

  • "Chainable" Calculate threads, such that the output of one CalculateA instance can be given to the input of a CalculateB instance.

  • Allow the Input and Output instances to run on arbitrary (and potentially different) nodes.

  • Allows multiple Input and/or Output instances.

  • World peace.

May 4, 2001

Jimmy James: Macro Business Donkey Wrestler

My DSL is still down. This sucks.

That is, it has been up periodically, but only for about 30 minutes to 2 hours at a time. The real work that I have been able to do this week is negligible. Arrggh!!!

Telocity is still blaming Bell South for this, and they're probably right. The packets either end up in a router loop right outside my DSL modem or make it down to Atlanta (which is only 1 step further) and then die. That seems to be consistent with having my local phone provider just sucking horribly. :-(

All in all, my internet uptime this week is probably well under 25%. :-(


I beefed up my monitoring script -- it runs via cron every minute and checks my connection to DNS, Notre Dame, and Excite (for some reason, I can usually reach Excite, but just about everything else is unreachable). I had to re-write it in Perl because it was becoming to complicated for shell script.

I've never played with Perl's CPAN modules -- they're pretty cool. I was pleased to discover that they have a Ping module and several HTTP modules. The Ping module offers all three kinds: tcp, udp, and icmp. And you can do anything you want with the HTTP modules.

So I ICMP ping the Telocity DNS servers and ND, and HTTP get / from Excite (so that they don't think I'm trying to DoS them by pinging every minute... that would have to trip some kind of alarm, I'm sure! :-).

HAH! I think we just came back on the air -- 9:20am. We'll see how long this lasts....


Since I was off the air most of yesterday, I spent a little time reinstalling my laptop again. After trying to install yet another package and realizing that some component hadn't been installed, I said "screw it" and just reinstalled the whole thing, and selected "install everything". Not that that actually installs everything on the install CDs, but it does install most things that you need (but still not pine, curiously... I guess they want you to use Evolution or KMail. <shrug>).

Anyway, I got the laptop reinstalled and only had to manually install a handful of RPMs. I got everything working, and even managed to get the WEP going on my orinoco card at 11Mbps. It seems that linux distros don't do what the PCMCIA package recommends that they do (e.g., /etc/pcmcia/wireless.opts is not where the options go), but I managed to find where 'drake puts wireless options and to get it all going.

I saved instructions on what I did, because:

  • 'drake 8.0 doesn't come with the orinoco_cs driver module compiled, although it does come with the source code (which I thought was weird). It took a bit of futzing around and some helpful suggestions from Brian to get it compiled properly.

  • The default wvlan_cs driver that comes with 'drake 8.0 doesn't seem to support WEP.

  • Others have essentially the same laptop that I do, so if you want the instructions, let me know. I have no idea if RedHat uses the same location for the wireless options, but I'll bet that if it's not, it's very similar (/etc/sysconfig/network-scripts/ifcfg-ethX).

  • Soon enough I'll have a new laptop and need to repeat the procedure again...

As an experiment, I plugged the audio out of my laptop into the AUX input of my stereo, downstairs.

Yes, indeed -- soon I was streaming MP3s from the server upstairs to my laptop downstairs, and out through my stereo. How cool is that?!? I was pumping out Fatboy Slim at very loud volumes.

Even cooler -- I had forgotten that I was still streaming MP3s to my desktop upstairs. It seems that that tiny little pentium that I have working as my router and my MP3 server does pretty well. Let's hear it for old technology -- it can still be the work horse for all those "little" jobs that you don't want to have to buy a new, hefty (and expensive!) machine for!

Yummy!


May 25, 2001

'morning sir. Are you going to introduce me to your bi-atch?

Tuscon. a.k.a., "I never new that queues could be so complicated!"


DSL went out while I was typing this. <sigh>

Looking through my log, I see that connections to ND were really spotty yesterday (indeed, I felt that heavily as I was working), including a full 20 minute outage around 2pm, a 10 minute outage on Wednesday, an outage from 4am to 7pm on Saturday (although that may well have been my router getting hosed -- when the power blinked recently, the router froze until I rebooted it), fairly crappy connectivity last Monday-Wednesday, and some sustained outages on the previous Saturday....

Overall, it's not as bad as it sounds. Sustained outages (like the one I'm having right now...) aren't too often, and seem to usually be the fault of Bell South (packets dropping in Atlanta). Spotty connectivity does happen not infrequently, but ND might well be to blame for that, because their external router is so overwhelmed, and the internal network is, well, less than perfect (oh for the days when Shawn was running the network...). Indeed, it's quite possible that my connectivity to IU will be better than my connectivity to ND if I only have to worry about the periodic sustained outages and not spotty connectivity.

We'll see how it works out; I don't have accounts a IU yet, but that paperwork is crunching through the vast papermills... I haven't decided yet on how to change my e-mail address. I might wait until after my defense; haven't decided yet.



  • Wow; the MPI queue turned out to be quite complicated; I mostly
    worked out the model in one days, but spent all the next day
    working on the engine itself, and a few details (bugs) in the
    outer parts.

  • everything seems to work except some MPI_Cancels at the end (I
    think the requests are already dead), and it's slow. Gonna
    have to revamp the enqueueing/dequeueing so that you can
    enqueue/dequeue lots of things at once, not just one at a time
    (why didn't I learn my lesson the first time?)


  • enqueueing dequeuing a list at a time really helps
    (std::list<>::splice() is very handy -- O(1), baby!)

  • BRAINSTORM: don't enqueue a list (complicates matters greatly,
    especially w.r