I've got a bunch of things that I want to put down about a possibility about making LAM/MPI be checkpoint/restartable. I'll break it into multiple parts:
- Some LAM terminology
- The "checkpointing sockets" problem
- Possibilities
- lamd problems
- Possibilities with Condor
- Checkpointing without Condor
- Making this portable
- Other problems
Some LAM terminology
Since others will be reading this text, I'm going to throw in some LAM definitions that I'll be re-using throughout the text below:
-
lamd: The lamd is the LAM daemon that is run on every host in a "normal" LAM run-time environment. It provides several services to running LAM/MPI jobs, such as process control, an out-of-band messaging channel, key=value global publishing, a scoping mechanism, etc.
- C2C: An acronym for "client-to-client", meaning that MPI communication goes directly from the source process to the destination process. This is usually via TCP sockets, but can also be via shmem or GM (myrinet), or whatever other network connects to MPI ranks.
-
nsend() / nrecv(): the function calls in the LAM/MPI implementation that are used for the out-of-band messaging channel. That is, MPI ranks can use nsend() and nrecv() to send messages to each other. These messages go from the source rank to the local lamd, then to the remote lamd, and then to the destination rank. Hence, the out-of-band messaging channel goes through the lamd, not through C2C channels.
- LAM universe: one instance of the LAM/MPI run-time environment. That is, the LAM run-time environment is typically instantiated with the
lamboot command and a file specifying a list of hosts. The LAM universe then exists among that set of hosts.
Here's a few assumptions that we make because of the LAM/MPI environment:
- LAM/MPI is completely user-level. All processes belong to the user -- nothing runs as
root. That is, each user has their own set of lamd's and user MPI programs.
- LAM/MPI currently cannot "overlap" universes except in batch systems. By "overlap", I mean have multiple, different LAM universes of the same user on the same machine. i.e., while a user can run as many MPI programs as they want in a single LAM/MPI universe (and even have them share the same machines safely without interfering with each other), you cannot have multiple LAM/MPI universes on the same machine without a special exception. It will be trivial to make LAM be able to overlap universes in a Condor environment, but I felt that I should mention this.
The "checkpointing sockets" problem
So the Condor project has a library that can checkpoint a running program and start it up again at a later point. It can even migrate it to a different machine. That is, it serializes the entire image of the process (stacks, heap, program, data, etc., etc.) and dumps it into a file (or socket, apparently). The astute reader will recognize that things like open files will present a problem in this scheme -- particularly in the case of migration. i.e., if a process has an open file and it migrates to a new node, what happens with read() and write() calls in the process to that open file on the new node?
The answer is that the library leaves a "proxy" agent (I think their terminology for it is a "shadow process") back on the original node. So read() and write() calls on the new node are proxied back to the original node where the real operation takes place, and the result is piped back to the new node where the program is running.
This is all fine and good for most system calls -- i.e., intercept all system calls, shuttle them back to the proxy agent, and then pipe the results back -- but it doesn't work for sockets. More to the point, it could work with sockets (at least I think it could), but then performance on the sockets will suck, and that is unfortunately important to us in MPI-land (i.e., latency would rise dramatically, and there could be potential bandwidth issues as well, depending on the proxy implementation). Hence, we have "the socket problem".
The solution is to close all sockets before allowing an MPI job to be checkpointed, and then re-establish them after the job has been restarted. Multiple problems arise from this, though. The MPI job will assumedly still know where its sibling ranks were located (and could therefore reestablish sockets to them), but zero or more ranks may have moved -- so trying to establish sockets to the old addresses may not work anymore. LAM needs to become aware of which ranks moved and where they moved to.
This is particularly problematic with LAM's shared memory/TCP scheme. i.e., if rank X migrates, it needs to re-figured out if rank Y is on the same machine or not. Specifically, it needs to re-initialize its entire connection table and either [re]connect its sockets, or [re]setup shared memory to communicate with Y. Even more generally than the TCP/shmem problem, this is definitely going to change the RPI somehow.
There are other issues as well -- how do we start up a LAM job under Condor? LAM currently uses a separate daemon process (the lamd) for a bunch of additional services, such as process control (fork/kill), an out-of-band message channel, and a global database for arbitrary key=value pairs (for MPI-2 MPI_PUBLISH). I guess it also functions as a scope mechanism as well -- providing a "universe" for a single user.
Possibilities
For efficiency reasons, we may only want to only checkpoint/migrate some ranks -- not all of them. Hence, there are two kinds of ranks: a rank that will get checkpointed (and possibly migrated), and a rank that will not. It seems to make sense to notify the entire parallel application (i.e., all ranks) when even one rank is checkpointed with intent to exit (e.g., because it will be migrated). So there's even two types of checkpoints: (a) one to just save the process's state (i.e., checkpoint the entire parallel application just for save/backup purposes), (b) and one to migrate one or more of the ranks to a different node.
We'll discuss (b) first (checkpointing for the purpose of migrating), because it lays the groundwork for (a).
Checkpointing for migration: the checkpointed rank
So it seems that LAM needs to take some actions before it allows itself to be checkpointed, and them immediately after it restores from a checkpoint. So if a LAM job can get some signal when it wants to be checkpointed (possibly via nrecv() from the local unix named socket, which we currently implement with SIGUSR2 so that the MPI process knows to go check the socket), a signal handler can be fired, read the message, realize that it wants to be checkpointed, flush and close down and invalidate all its communication channels (including the local unix socket to the lamd [or lamd-like underlying services] sockets, GM ports, shmem, etc.), and then checkpoint itself. This will require at least one new RPI function so that we can keep the RPI abstraction clean and apply this to all of our RPIs --
close/invalidate procs (with the assumption that no new communication will happen before we re-invoke _rpi_c2c_addprocs() to re-add all the communication channels again).
The Condor guys tell me that there is a checkpoint_and_exit() function that, when called, dumps the state of the program out to a file (or a socket), and then exits. Very handy! When the process is restored, it just returns from this function. Ultra cool!
So after returning from this function, an MPI rank must obtain the [potentially new] locations of its sibling ranks. I'm thinking that this will come from an nrecv() from the underlying infrastructure (i.e., Condor) -- it will get an array of information saying where everything is (how to do different RPI's? GM ports vs. TCP addresses/ports, for example? Might have to re-init those as well; re-look for open GM ports, etc.).
That is, the run-time system that potentially moved the ranks in the first place will know precisely where all the ranks are, so it can provide the location information to each rank. Once this information is provided to each rank, the ranks can effectively re-do some of the stuff that they did during startup (contact their local "lamd", establish C2C communications with the other ranks by calling _rpi_c2c_addprocs(), etc. I'll explain why "lamd" is in quotes later).
Specifically, the sequence of events on a single MPI rank will be something like the following:
- Receive SIGUSR2.
-
nrecv() a message indicating three things:
- One or more MPI ranks is going to migrate.
- Whether this rank needs to checkpoint.
- Whether this rank is going to migrate.
- Flush all C2C and local "lamd" communications.
- Close down all C2C connections.
- Close down connection to the local "lamd".
- If this rank is to checkpoint:
- If this rank is to migrate, call
checkpoint_and_exit(). The steps below will commence when the rank has been migrated and starts up again, and returns from checkpoint_and_exit(). - If this rank is not going to migrate, call
checkpoint().
- Re-establish a local socket with the local "lamd".
-
nrecv() a message with new location information on all MPI ranks. - Repeatedly invoke
_rpi_c2c_addprocs() (and whatever else is necessary, perhaps _cpi_c2c_init()?) to re-establish C2C communication channels. - Return from SIGUSR2 handler and continue processing in user code as if nothing had happened.
I think that's essentially it. There's a bunch of details in there, of course, particularly in the re-initializing C2C connections bit, but that should all be resolvable with some clear and potentially clever re-entrant C2C init code. Hence, when we go through this checkpoint/migrate phase and re-establish C2C communications, we essentially re-initialize the C2C subsystem -- do the exact same thing as when we do it the first time. That would probably be the cleanest approach.
Checkpointing for migration: the non-checkpointed ranks
Upon further thought, I guess there is little difference between checkpointed ranks and non-checkpointed ranks. There could be a slight optimization in that it is really only necessary to send new location information for ranks that have migrated -- the old location information is sufficient for any rank that has not migrated. However, it may make it easier in terms of less complexity to only have one code path -- just receive all new location information.
However, the question does arise -- when one MPI rank out of a parallel job is migrated, what happens to the other ranks while the rank is in process of moving? There are two approaches:
- Make the other ranks freeze and wait for the migrating ranks to be restored and C2C communications have been re-established. This certainly makes implementation of the MPI side easier -- the non-migrating ranks can just sit blocking on the
nrecv() waiting for new location information. The underlying "lamd" can just delay sending the new location information until the migrating ranks have been restored.
- Allow the other ranks to continue in the user program while the MPI rank(s) in question migrate. They would have to freeze at the first blocking communication involving the rank(s) that are being migrated. Any non-blocking communication can continue (e.g., Isend, Send_init, etc.), but would have to be "suspended", indicating that they just get put in a queue, and will only be attempted when the destination rank(s) are actually restored from migration and C2C communication has been restored to them.
This will add complexity to the MPI implementation, and it slightly changes the scheme presented above -- the non-migrating ranks will have to delay the second part of the scheme (i.e., starting with the nrecv() to get the new location information) until they get a second signal indicating that one or more of the migrating ranks are now ready.
This could get arbitrarily complicated -- take the case where N ranks migrate. What if they get restored at different times? i.e., if one rank gets restored much earlier than the rest -- does the underlying "lamd" signal the other ranks in the job with just the new location information for that one rank? Or does it wait for all N ranks to be restored before signaling everyone? The coarse-grain approach is clearly easier; the question is what actually happens most of the time: does Condor (and others) piecemeal restore migrated processes, or all at once?
So this raises some interesting questions:
- With the "easy" model of making all MPI ranks wait until all migrated processes are restored, is there really much of a difference in migrating one rank versus migrating all ranks? Since they all block waiting for the one migrated node to be restored, particularly if that one rank can't be restored immediately. For example, the MPI rank that was migrated was running on an idle workstation that suddenly became non-idle, forcing the MPI rank to migrate. But say that there are no more idle workstations available, so this MPI rank must wait in limbo for a while for another machine to become idle. But during this time, the entire rest of the MPI application must also wait. What happens to the accounting records during this time? Are Condor users "charged" with the time that the rest of their MPI ranks are blocking?
- There is also the argument that most MPI programs tend to operate at least in some kind of lock-step. i.e., the MPI ranks are at least loosely synchronized (e.g., per iteration). So even if the non-migrating ranks are allowed to continue, they'll eventually block anyway because they'll try to communicate with a rank that is in process of migrating (or, by the domino effect, try to communicate with a rank who is blocking trying to communicate with a rank that is in progress of migrating, etc.), which could potentially (and usually!) eventually cause the whole MPI process to block anyway. More to the point: is there anything gained by allowing non-migrating MPI ranks to continue while one or more MPI ranks are in process of migrating? My gut feeling says no.
Hence, it may make sense to really only migrate the entire MPI process at once, or only migrate ranks when it is known that they can be placed immediately. This may not be possible, so it may be easiest to just make all MPI ranks block until migrated ranks are restored and C2C communication is restored. The accounting issue still needs to be addressed, though.
However, I have very little experience in the dynamic process migration area -- I'm curious to what the Condor folks have to say about these ideas and questions.
Checkpointing for saving state (no migration)
For checkpoints that do not involve migration -- i.e., checkpointing just for the purpose of saving state -- it may or may not be necessary to close all communications channels. On the one hand, no rank is migrating, so it would seem silly to close and re-establish communications with the exact same location information. On the other hand, if we want to re-start the checkpointed process later, the re-started process will return from the checkpoint() (notice -- not checkpoint_and_exit()) function. If we re-start the process on an entirely different set of nodes (e.g., a PBS or Condor job is checkpointed and then later fails because someone powers off a node, so we restart the job in a later PBS/Condor job -- the ranks will be on entirely different machines and have a different topology), we will need to re-learn the location knowledge and re-establish C2C channels.
Using this argument, it's probably better to treat a backup/save checkpoint (even with no migration involved) as a checkpoint with all ranks migrating (per the procedures shown in the previous section), so that all ranks close all communications channels and then receive new location information from the underlying system (lamd/Condor) and then re-establish all communication channels.
This would allow the most flexibility for re-starting a job. That is, even if the job does get restarted from a set of migration files, it doesn't matter if it is on the same set of nodes or not -- it will re-establish all C2C communication channels and continue from where it left off.
lamd problems
The lamd is really helpful in standalone environments. But does it really make sense in a Condor (or other run-time system)? We mainly use the lamd for the following kinds of services:
- Process control (startup, shutdown, abort)
- Out-of-band messaging
- key=value publishing
- File transfer (mainly for non-uniform filesystems)
- Scoping mechanism
Normally, each MPI rank is associated with a single lamd that is located on the same machine. They communicate through a named unix pipe. When the lamd sends a message to an MPI rank, it pushes a message down the socket and then tweaks the process with SIGUSR2.
Note that there may be multiple MPI ranks per lamd --
it is common to run multiple MPI ranks on a single machine. In this case, they all share a common lamd (although the MPI ranks don't know or care that they are sharing a lamd).
It should also be noted that the out-of-band messaging can also be the primary message channel for an MPI job. That is, C2C communications aren't necessarily setup. It's a run-time flag to mpirun -- the user can specify to use the lamd for all communication instead of C2C. Although this imposes extra hops on the all messages (even MPI_Send / MPI_Recv messages), it can provide true asynchroncity (sp?) for non-blocking messages. That is, LAM/MPI is single threaded, so it can only make progress on messages while it is inside of LAM/MPI function calls. In the "lamd" mode, once a message is given to the lamd, the lamd is a separate process, so it can make progress on the message independently of the main thread of control in the user program. While this may seem counterintuitive and incur too much extra overhead, several LAM users who rely on non-blocking message passing have told us that they can get significant speedup using this mode as opposed to C2C.
So LAM's normal model is that each MPI rank has a single lamd that it is associated with. This may be problematic with Condor (or any other run-time system) for multiple reasons:
- If the MPI rank ever migrates off a given machine, the
lamd will also have to be migrated with it. Hence, both processes will need to be treated as a single process by Condor, which I assume would create some special exceptions in the Condor code. This is not attractive.
- Even worse, if multiple MPI ranks are sharing a single
lamd, if one of those MPI ranks migrates and the others do not, what happens to the lamd? It would seem that we need to create a new one on the machine where the MPI rank migrates to, and then have the network of lamd's reorient themselves to include the new lamd. Or, if the MPI rank migrates to a node that already has a lamd, it can just join that lamd, and no new lamd is necessary. But this would seem quite complex to implement!
Hence, it would seem desirable to be able to ditch the lamd when running in some other run-time environment (such as Condor).
Possibilities with Condor
Our short conversation with the Condor folks is that a LAM/MPI program will need to interact with their "starter" somehow, or have a custom LAM/MPI starter written that knows things about MPI programs.
My first impression (and admittedly, I don't know much about how Condor works) is that the least-cost solution here would be to have a custom LAM/MPI "starter" that can mimic the lamd services. It would seem that Condor must already provide most of what we need; the starter can simply provide a translation between what LAM/MPI expects and the native Condor underlying services. Hence, the majority of LAM/MPI wouldn't need to change -- it just opens up a local unix socket to what it thinks is the lamd, but in reality it's a Condor "starter" (or whatever).
More specifically, some of LAM's calls such as nsend(), nrecv(), rploadgo(), rpdoom(), etc., can probably translate to Condor semantics without too much trouble. So if Condor can open a socket and effectively have an nrecv() implemented locally, it can receive local packets from MPI ranks, and then process and interpret them.
Admittedly, this would put more of a burden on the Condor folks, but I think we could help out a bit as well. :-)
Checkpointing without Condor
In a non-Condor environment, it would still be highly desirable to be able to checkpoint. Can we do this without the rest of Condor? I would assume that we could make it so. I think that the key for doing this outside of Condor would be a new pseudo-daemon in the lamd to handle these kinds of things -- to furnish the new location data, for example. We'll probably also need a command like rempirun to restart a checkpointed job. Possible scenarios include:
- A separate LAM executable (
mpicheckpoint) that can checkpoint a running MPI program to a set of rank files. The checkpointing will follow the same scheme as outlined above. A run-time flag can specify whether the job should stop or continue after the checkpoint. It might also be desirable to provide a LAM-specific API call for this as well (MPIL_Checkpoint(char* directory, int stop_flag) or something). Note: we're not talking about migrating here; see below.
- A separate LAM executable (
rempirun) can take a set of rank files from mpicheckpoint and restart the job on an arbitrary set of nodes. Note that this would not have to happen in the same LAM universe -- it could have much later, for example, after the LAM universe that the original job was running in has been destroyed and a new one takes its place. Some extra condor-checkpoint-library bootstrapping is probably necessary to restart the job, but after that, it just uses the lamd to get the new location data, etc., just like it would in a Condor environment.
- A separate LAM executable (
lammoverank) can migrate one or more ranks to different nodes within the current LAM universe. This can work exactly the same way as it does in Condor. As mentioned above, this will require an extra pseudo-daemon in the lamd to know where ranks are moving and provide new location data to all the ranks.
-
Making this portable
There is desire to run LAM/MPI in other run-time environments (as alluded to in comments above) in addition to Condor. Scyld is an obvious target, since they have their own set of process control stuff (bproc) and whatnot. Scyld might be a bit more challenging because they seem to only support process control, not the other services that we need. Someone (Jeremiah?) suggested that we might be able to get away with one lamd somewhere in the system; I'm not quite sure that this would work, but it will definitely take a) further thought on the issue, and b) investigation of bproc and the rest of the Scyld infrastructure.
PBS is another obvious target (as well as any other batch schedulers). It would be nice to ditch the lamd in a batch environment, and rely on the batch system's underlying services for process control (the benefits are obvious, not the least of which is job accounting and guaranteed cleanup, a notorious problem for non-native support in batch schedulers), but the out-of-band messaging and global publishing still need to happen as well. PBS's TM can do the process control and can do the global publishing too (IIRC), but I don't think it provides any kind of out-of-band messaging. That will require more thought... Our initial ideas about PBS/TM (from a while ago) didn't include ditching the lamd, but perhaps this is a bit more natural extension of making this whole concept portable (i.e., replacing the lamd with underlying services, when available).
Or will a "one lamd" idea work here, too? Not sure how such an idea will work, but it's worth thinking about.
The real trick, however, will be to do this in a run-time-decidable way. That is, it would be nice, at run time to decide which underlying service to use -- native lamd, Condor, PBS/TM, Scyld, etc. That is, a user can take the same executable (assuming that their LAM was compiled for support for all of them) between all systems without having to recompile/relink. That would be nice, but not an absolutely necessary goal.
Upon a moment's reflection, from the proposed schemes above, the difference between native lamd and Condor would not be known to the MPI process -- if Condor truly emulates the lamd, there's no need to know. Whether or not the LAM has been compiled with checkpoint/migrate support is an entirely different issue (because I assume we'll need to get some Condor headers/libraries and some #if code for the checkpoint/migrate LAM code).
In order to make this workable for PBS/TM and/or Scyld (i.e., to keep the abstraction level clean), we'll have to implement lamd services in the lower levels of PBS/TM and Scyld as well. Hmm. I guess we'll have to cross the line into the root-level services earlier than we thought!
For PBS/TM, all the TM stuff is in one file, so extending that should be easy. But to do true messaging, it may take a bit more --
we may have to do some actual hacking in the MOM itself. It could be as simple as adapting the lamd's to fit in the MOM. We'll have to see. As for Scyld, I have no idea. :-)
Other problems
- Voluntary vs. involuntary checkpointing. Is there much of an issue here? Probably not -- I don't see why involuntary checkpointing can't work just like voluntary checkpointing.
- How about open files and whatnot? Particularly after a migration? Condor can proxy this stuff back to the original node, but does this make sense in a batch situation? What if we don't own those nodes anymore? This might be ok for Condor, but about about PBS / Scyld? It would seem bad for PBS. :-(
- Are we trying to solve the "node goes down" problem? i.e., involuntary checkpoint at timed intervals (to files, not sockets...?), and if a node crashes at some point, we can
rempirun the set of checkpoint files (which would seem highly desirable). But what about open files, etc.? If the node crashes, there's no Condor proxy to take the request back to on the original node ('cause it's down). So does checkpointing with the Condor library solve the "node goes down" problem? Or perhaps only in a limited scope (i.e., your open files won't be preserved)...? Granted, anything outside of the MPI API is outside the scope of what we need to worry about, but this does seem to be a "real world" concern that would be good to take care of. Even if it just means setting open file descriptors to -1 or NULL upon restoration of the process so that the job can know that the files are closed or something.
- So what happens to
lamboot and lamhalt under Condor? Does they effectively become noops (we can't ditch them, because users will still invoke them)? And then mpirun talks to various Condor services (for example) to do the things that the lamd would have done? One of the current functions of mpirun is to serve as a rendezvous point for the ranks so that they can all become aware of each other. Does this still need to be? It would seem that it would need to be changed somehow -- since the migration problem changes all the location information anyway, Condor itself must provide a way to get this information, potentially making mpirun's rendezvous point irrelevant.
- Does this (running under Condor, PBS/TM, or Scyld) make sense with the
MPI_COMM_CONNECT and MPI_COMM_ACCEPT models? i.e., how does a Condor job get more nodes? Or how do multiple Condor jobs join together? In vanilla LAM, only jobs in a single universe can join together. Will this be true in Condor (etc.)? More to the point:
- What would it mean to allow multiple LAM universes together? What about the obvious security concerns with this?
- How will a universe be defined in Condor? Will you have to (for example) ask for M nodes and start M different jobs and have them
CONNECT / ACCEPT to each other?
- If this is the case (still only connect within a single universe), is
CONNECT / ACCEPT useful within a Condor context?
- The same question applies to
SPAWN -- does the user have to request a maximum number of nodes ahead of time? Or, when SPAWN is invoked, does this have to allocate nodes from Condor dynamically and then spawn on them? This scheme would seem attractive, but it may cause the MPI application to hang while waiting for nodes to become available?
- In a dynamic environment like Condor, is dynamic processing useful at all, given that a
SPAWN may have to block waiting for the underlying system to make nodes available? Does the whole MPI application (or, at least the ranks who invoke SPAWN) have to block waiting for this to happen? (no one has answered this yet -- it's not even defined in the MPI standard)
Summary
So these are my initial thoughts. In spite of all the unanswered questions listed above, I believe that this can work. Some trips Wisconsin<-->South Bend and some teleconferencing and a ton of e-mail will likely be necessary. But this is ultra cool stuff, and will be immediately useful to lots of people in the real world. Plus, we'll get lots of papers out of it, become famous, and one or two people might degrees out of it. :-)