Got my 10/100 switch yesterday. Woo hoo! Finally plugged it in this morning and got it operating (plug-n-play -- hah!); no more ethernet collisions -- yeah! All 3 of my linux boxen immediately switched to 100Mbps/full duplex, but the windoze box stayed at 10Mbps/half duplex. Figures.
Yeah, ok, I'm still behind a 1.5Mbps DSL line. So what. (actually, I'm streaming MP3s around here behind my firewall, so ethernet collisions were becoming a bit of a problem in terms of performance)
I got version one of my scalable booting working. It does an n-ary tree-based boot across a group of machines. Seems to work pretty well, but is not 100% bug-free yet. That is, it still hangs sometimes -- I think it is because it has done an rsh to a remote node and fails. Here's some preliminary results (units are expressed in seconds):
| Number of nodes | lamboot | Binary tree | Trinary tree | Quad tree |
|---|---|---|---|---|
| 16 | 1:02 | 0:12.8 | 0:09.8 | 0:06.4 |
| 32 | 3:07 | 0:46.5 | 0:29.3 | 0:27.5 |
| 147 | 14:06 | 1:28 | 1:00 | 1:07 |
Pretty good looking so far. Some notes...
- All results were with the binary already AFS cached.
- The 16 node tests were conducted on unloaded machines. The 32 and 147 node tests contained nodes that were in use, some of which were heavily loaded (shh!). So these numbers are not perfect. But they are a good ballpark.
- The difference between 3 and 4 children is sometimes small. This can make sense -- consider the 32 node case. With 3 children each, the farthest leaf from the root will be 3 hops. With 4 children each, it is the same. Hence, with each of 3 and 4 children, we still have the same number of "timesteps".
- Also, the algorithm is sub-optimal, particularly where there are heavily loaded hosts. I believe that this explains why 3 and 4 children on the 167 test results seem weird (it seems that some of the key parent nodes in the 4 node tests were heavily loaded -- I checked). This is not conclusive proof -- I would need a large number of unloaded machines to be able to test this theory. :-( See below.
I had a brief discussion last night with Lummy about this. I presented some timings of lamboot vs. the tree boots. He wants me to run a ring boot as well, and compare. I initially didn't see why he wanted me to do this -- indeed, I thought that it would be the same as lamboot (and I'm still not convinced that it's not the same -- the majority of time is dominated by rsh latency), but he made the good point that I don't have any numbers to back this theory up. As such, I don't know for a fact that they're not different. And I do agree, they are different topologies, so there could be a difference. They're different code bases, too, so subtle differences could mean a lot (although the scaleboot stuff derived from the inetexec.c that is central to LAM's lamboot). I'll code up the ring and see what happens...
The current implementation essentially works like this:
- Invoke the program with the
-masterswitch and provide a hostfile. - The program figured out that it is the master, and decides 1) that it has no parent, and b) reads in the hostfile.
- Switching into "parent" mode, it does what I call "multi-rsh" for its number of children (default is 2, but can be overridden on the command line). i.e., it fork/exec's
rsh's into the background to the children's hosts. This is more complicated than it sounds...- The multi-rsh routine is given a list of username/hostname pairs, and a list of
argvto execute on each. - First, you have to send a "
echo $SHELL" command to the remote host to see what the user's shell is. - When that comes back, if they are running Bourne shell (and you'd be surprised at how many people do...), the Real
argv(denoted byfood) has to be surrounded with "( . ./profile foo )" so that their.profilewill be executed, and paths will be setup, etc., etc. Goofy, but true. - Once this is determined, fork/exec the
rshwith the real command to be executed. - Keep in mind that there are multiple
rsh's fork/execed into the background simultaneously; they all have to be tracked by watching theirstdoutandstderrto determine where they are. - Additionally, when an "
echo $SHELL" finishes, it has to be replaced with the real argv and re-launched.
This results in one big-ol' state machine. It's somewhat hairy, but once I figured out some nice abstractions in C++, it worked out ok.
- The multi-rsh routine is given a list of username/hostname pairs, and a list of
- After all the commands are executed, the parent waits for its children to contact it (we passed some command line parameters to each child indicating the parent's IP address and the port number that it was waiting on). This means sitting on an
accept()N times, waiting for each of the N children to connect. - As each child connects, give them a list of (M - N) / N username/hostname pairs to boot (where M == total number of hosts that this parent has to boot).
- The children go off and do their thing, potentially booting grandchildren.
- As each child finishes multi-rsh'ing its children (but before doing the
accept()s to give its children work to do), it sends a number upstream to its parent indicating how many children were launched. These numbers all filter up to the root/master so that cumulative stats can be kept about how far along the boot is. - The cycle is broken in two conditions (they're actually the same condition, but I call it two conditions here for ease of explanation):
- A child is executed who has no children. When it contacts its parent to get a list of children to boot, it will receive "0" and therefore recognize that it is a leaf in the overall boot tree. It will then send a "-1" up to the parent and close the socket.
- When a child has received "-1"'s from all of its children, it will send a "-1" up to its parent and close the socket.
Hence, these "-1"s are propagated up the tree to the root/master, so that when everyone finishes booting, the master knows. It would be fairly easy to put in a fan out after this fan in and complete the barrier process so that the whole tree knows when it has booted, but it wasn't necessary for these tests.
There is a limitation to this approach: we have to wait for the multi-rsh to finish before we can give work to our children. Depending on the number of children used, and depending on the relative speed of the children of a given parent, this may involve some children waiting an a period of time before being given work. This is conjecture at this point, but 1) it seems reasonable, and 2) I hope to prove it with the following...
A single-threaded approach was actually fairly difficult. It involved some big select() statements, and a lot of lookups and extra bookkeeping. i.e., when select() returns, you have to scan down the list of available file descriptors, figure out which socket is ready, figure out where in the process that socket is, and then react. This created a lot of code (thank god for the STL and hash maps!). While the approach seemed to work, I think a multi-threaded approach will be much simpler in design.
With a multi-threaded design, we can have a thread for each rsh. It therefore only needs to monitor its own progress. We don't even need to have select() statements, because it's ok for each thread to block on read() statements, waiting for I/O from the remote process. I believe that the whole programming model will become significantly easier. And, as I mentioned above, there's a chance that there will be greater performance because each child will be able to go at its own speed and not be forced to wait for any of its siblings.
So I'm off to go implement the multi-threaded approach. I should be able to scavenge parts from lamboot and the scaleboot stuff...