
San Demas High School football RULES!

I hit a problem the other day: how do you tell when a machine is down? That is, given an arbitrary hostname or IP address, how do you tell that it is down without a really lengthy timeout?

For example, try sshing to a machine that is down. Not one that doesn't have ssh enabled -- that rejection is almost immediate. And not to an IP that doesn't exist, or isn't reachable by you -- those rejections are also immediate. But ssh to a machine that is currently powered off, or not connected to the net.

It can take a long time to time out.

A Solaris 7 machine takes almost 4 minutes (3:45) for a telnet to time out to a host that isn't there. It takes 15 minutes for ssh to time out (again, on Solaris). Quick testing showed that the majority of the 3:45 was spent inside a single connect() call.

But a Linux machine takes 3 seconds for telnet to time out to a host that isn't there. What's it doing differently? How can it tell so quickly that the machine is not there?

Interestingly enough, the Solaris telnet reported "Connection timed out", whereas the Linux telnet reported "No route to host". So they're definitely doing something differently. Hmmm...

I ran my connect() test on both Solaris and Linux, and the results were identical to telnet -- Solaris sits for a long time on connect(), and then eventually times out. Linux only sits for a few seconds in connect() and then returns with a "no route to host" error.
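For reference, a connect() test along these lines can be sketched in a few lines of Python. This is not the original test program, just a minimal sketch of the same idea; the function name timed_connect is made up for illustration, and the host and port are whatever you want to probe.

```python
import errno
import socket
import time


def timed_connect(host, port):
    """Do one blocking connect(); return (seconds_elapsed, errno_name or None).

    Hypothetical helper for illustration only. The OS decides how long the
    blocking connect() takes to give up; nothing here sets a timeout.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    start = time.monotonic()
    try:
        # connect_ex() returns the errno value instead of raising an exception.
        err = sock.connect_ex((host, port))
    finally:
        sock.close()
    elapsed = time.monotonic() - start
    # Translate e.g. 110 -> "ETIMEDOUT", 113 -> "EHOSTUNREACH" (values vary by OS).
    name = errno.errorcode.get(err, str(err)) if err else None
    return elapsed, name
```

Run against a powered-off host, the elapsed time and the errno name are the two things that differed between Solaris and Linux above.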

Hmm. If connect() does not report the same error in the same way across multiple OSes, how do I do this? Indeed, Linux's behavior is great -- but what do I do on Solaris (and on anything else that doesn't return in 3 seconds)?


I got to thinking about the problem, and decided to look at some network and hacking tools. ping was my first stop. ping works in interesting ways. I didn't realize that it had its own protocol stack (like TCP and UDP). It works like this: you open an ICMP socket (you don't bind it to a port). From that socket, you send packets to the ping recipient. The ICMP stack on the other side will reply right back to you. Here's the catch: all ICMP replies come to a single point -- so if you have multiple ping programs running simultaneously, they'll see each other's ping replies (makes sense, if you think about it). Hence, you have to put some encoding in the payloads of the ping requests (which the remote ICMP stack will echo right back at you) to know which replies are yours and which you can discard.
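To make the "which replies are mine" part concrete, here is a rough sketch of building an ICMP echo request and recognizing the matching reply. It only packs and unpacks bytes; actually sending the packet needs a raw ICMP socket, which is not shown. The function names are made up for illustration. One assumption to flag: besides the echoed payload, the ICMP echo header itself carries a 16-bit identifier and sequence number, and this sketch matches on those (a process ID is a common choice for the identifier) rather than on a token in the payload.

```python
import struct

ICMP_ECHO_REQUEST = 8  # ICMP type for "echo request"
ICMP_ECHO_REPLY = 0    # ICMP type for "echo reply"


def checksum(data):
    """Internet checksum: one's-complement sum of 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length data with a zero byte
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo_request(ident, seq, payload=b""):
    """Pack type, code, checksum, identifier, sequence, then the payload."""
    header = struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, 0, ident, seq)
    csum = checksum(header + payload)  # computed with the checksum field zeroed
    header = struct.pack("!BBHHH", ICMP_ECHO_REQUEST, 0, csum, ident, seq)
    return header + payload


def is_our_reply(icmp_packet, ident, seq):
    """True if icmp_packet (IP header already stripped) answers our request."""
    if len(icmp_packet) < 8:
        return False
    r_type, r_code, _csum, r_ident, r_seq = struct.unpack("!BBHHH", icmp_packet[:8])
    return (r_type == ICMP_ECHO_REPLY and r_code == 0
            and r_ident == ident and r_seq == seq)
```

A ping-style program would call is_our_reply() on everything it reads from the socket and silently drop whatever belongs to someone else.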

Hence, here's a nice way that you can tell if a machine is up -- send it an ICMP packet. If you don't get one back within a relatively short timeout (probably even user-runtime-settable), rule it as "down". No problem.

Wait -- there's a catch. You have to run as root, 'cause the ICMP stuff is protected. Crap. We don't like setuid programs.

nmap was my next stop. They've got all kinds of goodies in there. SYN scans, FIN scans, etc., etc. They note, however, that many of these are not available to non-root users. Hence, they try the connect() thing as well when a non-root user tries to scan a machine. Again, Linux bails in 6 seconds saying "machine is not up" (this must be due to Linux's short connect() timeout). Solaris, however, takes much longer -- 1 minute. But that is significantly less than the 3:45 that we saw in both telnet and the raw connect() call.

Some poking around in nmap revealed the following:

  • It's actually pretty small; only a dozen or so .c files. For something as full featured as nmap, I would have guessed that it would have been larger. Who knew?

  • It seems to be pretty well coded -- I could actually read the code pretty easily. They have good voodoo; color me impressed.

  • The non-root ping scan tries a connect(), but does it in a non-blocking way, and repeatedly uses select() to check if the connect() has finished yet. A neat trick -- this allows them to set their own timeout (evidently somewhere around a minute; I didn't bother checking what it actually was).
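
The non-blocking connect() trick in the last bullet can be sketched roughly as follows. This is not nmap's code. It is a minimal sketch that assumes POSIX-style sockets, uses a single select() call with a timeout instead of polling repeatedly, and uses a made-up function name (probe) and a default timeout picked arbitrarily.

```python
import errno
import select
import socket


def probe(host, port, timeout=3.0):
    """Return "up", "refused", "timeout", or an errno name for other failures.

    Hypothetical helper for illustration; assumes POSIX socket semantics.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    try:
        # On a non-blocking socket, connect() usually returns EINPROGRESS at once.
        err = sock.connect_ex((host, port))
        if err in (errno.EINPROGRESS, errno.EWOULDBLOCK):
            # Wait until the socket becomes writable, or until *our* timeout.
            _, writable, _ = select.select([], [sock], [], timeout)
            if not writable:
                return "timeout"
            # Writable means connect() finished; SO_ERROR says how it finished.
            err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
        if err == 0:
            return "up"
        if err == errno.ECONNREFUSED:
            return "refused"
        return errno.errorcode.get(err, str(err))
    finally:
        sock.close()
```

Note that "refused" still means the machine is there (it answered, it just isn't listening on that port), so for an up/down check only "timeout" and the unreachable-type errors count as down.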

So I'm going to have to try this -- code up my own non-blocking connect() and put it in my threaded booter and give it a whirl. Too tired right now, though -- this will be tomorrow's activity.


"I'd actually like to see a non-blocking MPI_WAIT."
- Shane Herbert, MPI-2 Forum.

Comments (2)

KK:

What is San Demas close to????


About

This page contains a single entry from the blog posted on October 3, 2000 11:54 AM.
