« What we need around here is an anti-whining ordanance! | Main | It's good to be a gangsta »

I think Joe saw us in the movie theater last night

I've gotten an unexpected result from my thread booter.

When booting across the ND helios cluster of 161+ sparcs (some of which should fail, BTW -- at least 2-3 are down at any given time, and about 5-10 are a different version of Solaris than the rest such that there are shared library linker problems trying to run on them).

Even with about 10-20 nodes expected to fail, about 1/3 of them fail to boot properly on a regular basis. This is many more than expected.

The main reason is that the parent that is trying to boot them times out. i.e., if the child does not callback on the socket within N seconds, the parent decides that the remote boot must have failed (even if the boot does succeed at some point later, and the child does try to boot). The parent rules that that child is dead and moved on to the next.

The weird thing is that this was happening a large percentage of the time; much more than I expected. Worse than that, it was inconsistent -- I would get different results every time I did a helios-wide boot (even if they were only separated by only a few seconds). This is clearly not good enough.

One solution is to increase the timeout time (I was using timeout values of 5, 10, and 30 seconds -- the problem occurs with all the values). Increasing the timeout value to 120 seconds seems to make it work most of the time; most bootable helios machines actually boot properly. However, this significantly adds to the overall boot time because we now have to wait 2 minutes for each individual failure before moving on to the next child, which is undesirable.

So I think I need to change my booting algorithm yet again (this is the point of research, isn't it?).

  • Still keep the basic tree-based structure.

  • To overcome the problem with slow children, we need a system where the work of one child can be given to another, but need to keep this in a tree-based structure (vs. a monolithic server) so that we don't run out of resources. That is, some kind of first-come, first-serve basis, since we know that if a child requests work, it is ready to go. Faster children will naturally ask for more work.

  • Right now, each parent node receives a list of all the children that it is responsible for booting. It divides this list up into N sub-lists (where N is the number of immediate children that it will boot), spawns a thread for each, and gives each thread one of the sub lists. This needs to change.

  • Instead, spawn off N threads and give them each one child to boot. The parent thread keeps the rest of the list of nodes that it is ultimately responsible for booting.

  • If a child fails to boot by some kind of immediate failure (e.g., a ping to that child fails), the parent can kill that thread and launch a new thread and give it the next node from its master list.

  • When [if] a child actually boots successfully (which is defined by the grandchild opening a socket back to the child and saying, "Ok, I'm ready -- gimme some work"), it asks the parent for a subset of nodes from the parent's pool. The parent will give a list roughly of size total_size/N so that each descendant's subtree will be about the same size, which then child then passes on to the grandchild.

  • Aside from the parent keeping the list of children, this is more or less how it happens now.

  • Here's the new part: when a parent finishes (actually, when any node finishes -- whether it was originally a parent or a leaf), it sends an "all done -- got any more work?" message to its parent.

    • If the parent's parent has any more work, i.e., it has some nodes left in its pool because one or more of its children were slow to boot, it will give a subset list (of about the same size as it has given out to every other node who queried) to the querying child.

    • If the parent's parent doesn't have any more work, it passes the request to its parent, where the same procedure is repeated. If any work is eventually found, it is sent back down the tree to the original child who queried.

As such, with this scheme, it is possible for a grandchild (or some node even further down) to steal work from a slow child. This scheme can allow for long timeouts which may be necessary (particularly in an active workstation environment), but still allow for speed in the overall boot -- we just eliminate the blocking on slow children by potentially taking away their work from them.

A side-effect of this is that the overall tree may become lop-sided. But that doesn't really matter much, because the idea here is to parallelize in time, not in space. So if we have a slightly taller-than-optimal meta-tree at the end, it doesn't matter -- the meta tree is only for booting and will be discarded anyway.

Post a comment

(If you haven't left a comment here before, you may need to be approved by the site owner before your comment will appear. Until then, it won't appear on the entry. Thanks for waiting.)

About

This page contains a single entry from the blog posted on February 21, 2001 1:40 AM.

The previous post in this blog was What we need around here is an anti-whining ordanance!.

The next post in this blog is It's good to be a gangsta.

Many more can be found on the main index page or by looking through the archives.

Powered by
Movable Type 3.34