
May 2001 Archives

May 2, 2001

Dave, do you think Ernest Hemingway ever gave a reading that bad?

I wrote up a journal entry the other day, but it got lost. :-(

I was working on my laptop -- I had just installed Mandrake 8.0 and was playing with my new wireless card. It mostly works, but not entirely.

  • emacs seems to have arbitrarily wide wrap lengths; 78 or 79 chars in text mode. Gotta figure out how to change that.

  • Aurora is a cute GUI boot screen. However, it only seems to want to run in "NewStyle" mode, not "Traditional" mode (I'd prefer the Traditional mode, because it still shows each item as it starts, whereas the "NewStyle" only shows a small number of amorphous icons at the bottom of the screen and alternates highlighting them -- I have no idea what it means).

  • I haven't tried to play MP3s yet.

  • The fonts are somewhat icky. It took me a while of playing around with my konsoles to find a reasonable font. The konsole scrolling is pretty slow, too -- it wasn't slow before.

  • I set the non-framebuffer kernel to be the default; it seems to give slightly faster video (e.g., scrolling in konsoles).

  • Konqueror is cool, but it crashes a lot. And the crashes are repeatable, too. So I'm still stuck with Netscape...

  • My wireless card was detected and installed properly without me having to do anything. Cool.

  • pine wasn't installed by default. Dunno why; it's possible that I deselected it somewhere during the install (but I don't recall doing that...). It was on the CD, so I just RPM installed it. It had built in support for SSL and IMAP, so it rocked right off the bat.

  • dig also wasn't installed (same disclaimer). I found bind-utils on the 'drake CD and installed that, too.

  • I don't know anything about postfix, so I uninstalled it and put in sendmail instead.

  • It seems that there's a new wireless NIC driver called orinoco_cs (as opposed to wvlan_cs, which 'drake installed for me by default). However, 'drake didn't compile orinoco_cs as a module, so I set about compiling my own kernel. Needless to say, on my little laptop, it took several hours. I used the config file that 'drake supplied, and just tweaked a few values. It finally all built, but when I rebooted with it, I got a mysterious "cannot boot from that root device" error. Dunno what that's all about. I restored my original modules and all was well. I'd really like to use the orinoco_cs driver, though, 'cause it works with WEP and whatnot. Might have to play a little more there...


In the meantime, my DSL has been really crappy this week. It was out for about 3/4 of this past weekend, and multiple times through Monday and Tuesday. It was out for about a half-hour this morning, and is now out again.

This really sucks. Especially when I'm trying to do stuff in nd.edu -- I just get cut off. Arrghh!!

I started a primitive script to log how much I'm offline, and when. It just does a ping to the Telocity DNS servers, a ping to www.lsc.nd.edu, and a ping to www.excite.com every minute. Why Excite? Well, it seems that I'm never offline to them. Just about everything else goes (www.yahoo.com, for example), but Excite seems to stay reachable. The problem almost always seems to be the Telocity routers in Atlanta (traceroutes stop there).

I wish that whoever was maintaining those routers would get their act together!


I learned something about xmms's thread leak -- it only happens when streaming files to it via http. It doesn't happen when playing files from a real CD, or MP3s from a local filesystem.

Weird, eh? Must be a bug strictly in the http streaming code.



There are 240 copies of xmms running, out of 326 total processes (73%).

May 4, 2001

You wake up at C-TAC, SFO, LAX

Our great room furniture came today. Woo hoo!

I didn't tell Tracy that it was coming, of course (they called yesterday to set up a delivery time). So she was extra pleased to see it when she came home. It's the little things in life. :-)


Today is the Oaks race in Louisville. It's basically the locals' version of the Kentucky Derby -- it's a Big Event. Tomorrow, Louisville will be absolutely flooded with billions of people from outside of Louisville, so there's a lot of local pride wrapped up in the Oaks race.

We got invited to go through GE, but I had to turn it down so that I could work work work... Maybe next year.


I did some rough stats on my monitoring so far and found out that DSL has been up 44% of the time since I started monitoring (Wed, May 2, 10:11am). Granted, that's really only 2 days, but as Holly said, "That's a poor IQ for a glass of water!"

Ok, it was much funnier when Holly said it. And it made sense, too.

The point is that in the past two days, DSL has been down more than it has been up. And it has cost me a lot of work. Bonk. :-(
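For the curious, a rough uptime figure like that can be had by counting per-minute probe results. This is only a minimal sketch: it assumes a hypothetical log format of one `<timestamp> <status>` line per probe, with the status being `up` or `down`, which is not necessarily what my script writes.

```python
def uptime_percent(log_lines):
    """Return the percentage of probes that reported 'up'.

    Assumes a hypothetical log format: '<timestamp> <status>' per line,
    where status is 'up' or 'down'. Blank lines are ignored.
    """
    statuses = [line.split()[-1] for line in log_lines if line.strip()]
    if not statuses:
        return 0.0
    up = sum(1 for status in statuses if status == "up")
    return 100.0 * up / len(statuses)


# Made-up sample data, one probe per minute.
sample = [
    "2001-05-02T10:11 up",
    "2001-05-02T10:12 down",
    "2001-05-02T10:13 down",
    "2001-05-02T10:14 up",
]
print(f"{uptime_percent(sample):.0f}% up")
```

Counting probes rather than wall-clock time only works if the probe really does fire once a minute, which is the assumption here.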

And since it's Bell South's problem, it's not like switching to a different DSL carrier will fix the problem -- they all use Bell South since Bell South is the local telco. Double plus unbonk. :-(


I accidentally killed xmms earlier, so the stats there are pretty low right now.

It's unusual for me to send so many journal updates in one day -- normally, I leave the journal window open for quite a while and let it accumulate, but given how flaky DSL is, I'm going to submit now so that I know it gets recorded properly...



There are 17 copies of xmms running, out of 104 total processes (16%).

Jimmy James: Macro Business Donkey Wrestler

My DSL is still down. This sucks.

That is, it has been up periodically, but only for about 30 minutes to 2 hours at a time. The real work that I have been able to do this week is negligible. Arrggh!!!

Telocity is still blaming Bell South for this, and they're probably right. The packets either end up in a router loop right outside my DSL modem or make it down to Atlanta (which is only 1 step further) and then die. That seems to be consistent with having my local phone provider just sucking horribly. :-(

All in all, my internet uptime this week is probably well under 25%. :-(


I beefed up my monitoring script -- it runs via cron every minute and checks my connection to DNS, Notre Dame, and Excite (for some reason, I can usually reach Excite, but just about everything else is unreachable). I had to re-write it in Perl because it was becoming too complicated for a shell script.

I've never played with Perl's CPAN modules -- they're pretty cool. I was pleased to discover that they have a Ping module and several HTTP modules. The Ping module offers all three kinds: tcp, udp, and icmp. And you can do anything you want with the HTTP modules.

So I ICMP ping the Telocity DNS servers and ND, and HTTP get / from Excite (so that they don't think I'm trying to DoS them by pinging every minute... that would have to trip some kind of alarm, I'm sure! :-).
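The real script is Perl, but the shape of the check is easy to sketch. The following is a minimal Python sketch, not the actual check_up.pl: the function names and the log line format are made up for illustration, ICMP goes through the system `ping` command (with the Linux `-c 1` "send one packet" option) rather than a raw socket, and the Telocity DNS addresses are left out because they aren't given here.

```python
import http.client
import subprocess
import time
import urllib.request


def icmp_ping(host, timeout=5):
    """Return True if one ICMP echo to host succeeds (uses the system ping)."""
    try:
        result = subprocess.run(
            ["ping", "-c", "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def http_get_ok(url, timeout=5):
    """Return True if an HTTP GET of url completes without error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        return False


def check_once(probes, now=None):
    """Run each (name, callable) probe and return one log line."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    results = " ".join(
        f"{name}={'up' if probe() else 'down'}" for name, probe in probes
    )
    return f"{stamp} {results}"


# Stand-in probes so the example runs offline. In real use the lambdas would
# call icmp_ping("www.lsc.nd.edu") and http_get_ok("http://www.excite.com/").
print(check_once([("nd", lambda: False), ("excite", lambda: True)]))
```

Keeping the probes as plain callables means cron can run the whole thing once a minute, and the logging can be tested without a network.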

HAH! I think we just came back on the air -- 9:20am. We'll see how long this lasts....


Since I was off the air most of yesterday, I spent a little time reinstalling my laptop again. After trying to install yet another package and realizing that some component hadn't been installed, I said "screw it" and just reinstalled the whole thing, and selected "install everything". Not that that actually installs everything on the install CDs, but it does install most things that you need (but still not pine, curiously... I guess they want you to use Evolution or KMail. <shrug>).

Anyway, I got the laptop reinstalled and only had to manually install a handful of RPMs. I got everything working, and even managed to get the WEP going on my orinoco card at 11Mbps. It seems that linux distros don't do what the PCMCIA package recommends that they do (e.g., /etc/pcmcia/wireless.opts is not where the options go), but I managed to find where 'drake puts wireless options and to get it all going.

I saved instructions on what I did, because:

  • 'drake 8.0 doesn't come with the orinoco_cs driver module compiled, although it does come with the source code (which I thought was weird). It took a bit of futzing around and some helpful suggestions from Brian to get it compiled properly.

  • The default wvlan_cs driver that comes with 'drake 8.0 doesn't seem to support WEP.

  • Others have essentially the same laptop that I do, so if you want the instructions, let me know. I have no idea if RedHat uses the same location for the wireless options, but I'll bet that if it's not, it's very similar (/etc/sysconfig/network-scripts/ifcfg-ethX).

  • Soon enough I'll have a new laptop and need to repeat the procedure again...

As an experiment, I plugged the audio out of my laptop into the AUX input of my stereo, downstairs.

Yes, indeed -- soon I was streaming MP3s from the server upstairs to my laptop downstairs, and out through my stereo. How cool is that?!? I was pumping out Fatboy Slim at very loud volumes.

Even cooler -- I had forgotten that I was still streaming MP3s to my desktop upstairs. It seems that that tiny little pentium that I have working as my router and my MP3 server does pretty well. Let's hear it for old technology -- it can still be the work horse for all those "little" jobs that you don't want to have to buy a new, hefty (and expensive!) machine for!

Yummy!


Let's start the Fire Marshall debate!

I forgot to mention one thing about reinstalling my laptop...

When I reinstalled, I selected to use XFree86 3.something instead of 4.0.3. This resulted in much better X performance -- the Konsole scrolling issue that I was complaining about in a prior journal entry was nonexistent. Indeed, it was back up to performance levels that I was used to.

However, I'm pretty sure that I was previously running XFree86 4.something before I installed 'drake 8.0 (i.e., under 'drake 7.2), and I didn't have these issues.

Oh well. What the heck do I know about the difference, anyway? Nothing. It works great, and is back being nice and fast, so that's all that I care about.

:-)



There are 489 copies of xmms running, out of 574 total processes (85%).

May 11, 2001

I really have no idea, Dave. I've been stone-cold drunk since about 8 this morning.

Oops. The last rant should have been under the "technical" category.

Mary had a great response:

If a manager doesn't spawn, it would be shot. At the very least, its demons should be exorcized. Get thee to a rectory.


A few weeks ago, I found the Andromeda software package for streaming MP3s from a web server. It was much slicker than the thingy that I was using, so I installed it (trivial install -- just a single .php file). It works nicely. It only lacks one feature -- the ability to enqueue arbitrary directory trees (something I only recently added to my thingy, but quite handy).

I pinged the author with my thoughts, not expecting much (per most freeware development, IME). He actually responded, and we had a good chat (via e-mail, of course).

He contacted me a few days ago with a beta for the next release of Andromeda. The big new feature is playlists. We found a few issues, and he fixed them. We also discovered an inherent limitation of cookies that I, at least, wasn't aware of. Cookies have a maximum length on Apache servers -- about 8k. That is, the sum total length of all cookies sent to a given server must be <= ~8k (remember that all cookies are sent in a single HTTP request header line). Apparently, IIS allows a bit longer than this.

This is a big bummer, since Andromeda was storing playlists in cookies. Either way, there's a finite limit for the playlist. Bonk!
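To get a feel for how quickly a playlist hits that wall, here is a minimal sketch that estimates the size of the single `Cookie:` header. It uses Apache's default `LimitRequestFieldSize` of 8190 bytes as the ~8k bound; the one-cookie, comma-separated playlist layout and the track paths are made up for illustration and are not how Andromeda actually encodes things.

```python
APACHE_FIELD_LIMIT = 8190  # Apache's default LimitRequestFieldSize, in bytes


def cookie_header_size(cookies):
    """Size in bytes of the single 'Cookie: a=1; b=2' request header line."""
    body = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return len(f"Cookie: {body}".encode("utf-8"))


def fits_in_header(cookies, limit=APACHE_FIELD_LIMIT):
    """True if all cookies together fit under the header size limit."""
    return cookie_header_size(cookies) <= limit


# Hypothetical playlist stored as one cookie of comma-separated track paths.
tracks = [f"/music/album{i:03d}/track{i:03d}.mp3" for i in range(400)]
playlist = {"playlist": ",".join(tracks)}
print(cookie_header_size(playlist), fits_in_header(playlist))
```

With paths of this length, a few hundred tracks is already too many, and any other cookies for the same server eat into the same budget.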

We pondered over this for quite some time, actually. There's just no way to do this without some form of server-side storage (files, a database, sessions, whatever). And to do that properly without allowing a DoS, you have to have both a login and some kind of finite bound on the playlists anyway.

Urgh. :-( (one of the wonderful points about Andromeda is that it's a single .php file with no extra storage required). Adding this complexity is not attractive.

Indeed, I think there is a real missing chunk of software that allows client/server stuff without a database -- flat files only. Such packages would be extremely useful when you are running your software on some ISP's web servers, and database usage costs extra. Flat files would be a bit more bookkeeping, and probably less efficient, but if you need a non-high-performance web package, what would it matter?


Indeed, I have found that I am using the word "indeed" a lot lately.

I blame Arun.


Epiphany continues to have problems with Outlook Express. Bobbe in particular is having a horrid time. OE is doing random things. Sometimes it freezes on the splash screen. Sigh.

I think her machine has just degraded to the point of being non-functional. It's a Windoze 95 box, several years old. I think that 'doze itself has just degraded enough to the point of non-determitiveness (is that a word? Probably not). It probably wasn't helped by the fact that I got all the latest "updates" from Microsoft. Ugh.

I really don't want to reinstall the whole machine. Particularly since that machine has all the databases and whatnot that have all the parish records, etc. Ugh.

So my solution is to loan them one of my old machines so that Bobbe has something to use for e-mail. At the same time, their fiscal year starts in July, and they'll be replacing that machine. So this stopgap is good enough for now.


At the same time, they got some donation money to get a new machine to replace one of their other machines -- a P100 with 8MB of RAM, IIRC. You have no idea how painful it is to use the machine (it's on the desktop of one of the church staff members). They gave $1500 to get a new computer.

They're Gateway folk, so I perused the GW web pages, and noticed that they were running P4 specials. Since we had to use all the money, we ended up getting a 1.3GHz P4 with 128MB of RAM. Way more than necessary. But then again, perhaps it just means that this machine will last 4 years instead of 2.


I've been reading "Exceptional C++ : 47 engineering puzzles...".

I think Kevin, Jeremy F., Arun, Brian, and I will use this book as the basis for an e-mail version of C++ Friday Lunch. Perhaps doing one puzzle a week or so. I created a GNU mailman list for this on my DSL router, but had to reconfigure DNS to make this happen. It'll take 2 days to propagate around the rest of the world before we can really start.


I'm heading to ND next week. It'll be Arun's last LAM meeting, and graduation is that weekend. My specific purpose is to attend the graduate awards dinner to receive the SGI HPCC award (and prize check -- whoo hoo!!).

They listed me on the ND HPCC web page. Yay.


Lots of discussion on the OSCAR lists this week. Summary of decisions:

  • Move OSCAR development to sourceforge
  • Have 4 lists: oscar-announce, oscar, oscar-dev, oscar-core. The first three are typical open source lists, the last is "members only" for administrative kinds of things.
  • Interesting discussion occurring about how to have multiple MPI implementations on the cluster. I had a really long proposal which I thought was elegant, but then someone pointed out that it was functionally the same as modules. Duh. But modules are good things, so if we put modules in OSCAR, by associativity, that will be a good thing.


You know that you have a large uptime when the average history in your command windows is around 4500 commands.


Excellent! The Lone Gunmen tonight used a song off one of the Fatboy Slim CDs that I just bought -- Weapon of Choice. That song is cool. Seeing it on the Lone Gunmen was double extra chocco latte cool.



There are 456 copies of xmms running, out of 530 total processes (86%).

That's a good ploy, Dave, to pretend that the ship is sinking.

Linux really sucks sometimes.

I'm working heavily on Tucson, and since yesterday morning I've been fighting a bug where the manager wouldn't spawn children properly. LAM/MPI would return an error and say that rpcreatev() (one of the underlying functions under MPI_COMM_SPAWN that is used in LAM to actually spawn a remote process) had failed.

I couldn't figure out why this routine was failing -- it's used successfully in many different places. It's used in mpirun itself, and isn't failing there, for example. So why is it failing in MPI_COMM_SPAWN?

I tried to use gdb and ddd to track the problem down, but gdb kept segfaulting. Sigh. Linux debuggers are generally useless. I was reduced to printf debugging in a multi-threaded, parallel program. Do you have any idea how painful that is? Sigh.

It took me quite a while, but I finally figured out what the problem was.

Each LAM client has a global structure named _kio that contains, among other things, the PID of the process that is using LAM. That is, each MPI program has to call MPI_INIT, which, in turn, calls the internal LAM function kinit, which opens a socket to the local LAM daemon and does some other bookkeeping things. One of the things that it does is cache the PID of the kinit-calling (i.e., MPI_INIT-calling) process on this global _kio struct. That way, if you fork and then invoke a LAM function from the child, LAM will know that this process is not registered with the LAM daemon and can therefore throw an error.

Note that only some MPI functions will end up doing this compare-the-PID thing. One class of examples is MPI functions that need to send out-of-band (OOB) information, such as MPI_COMM_SPAWN.

This scheme actually works fine, and has prevented me from doing stupid things in the past.

However, it has caused me much grief over the last 24 hours because Linux implements threads as processes. Hence, each thread has a different PID. End result: MPI_COMM_SPAWN will end up comparing the thread's PID with the one cached on _kio. If they don't match, boom.

This is a problem if any thread other than the one that invoked MPI_INIT invokes these MPI functions. I.e., even if we guarantee that only one thread is "in MPI" at any given time, the current scheme in LAM will fail because each thread has a different PID.

ARRRGGHHH!!!

I don't quite know how to solve that in LAM yet (there's probably some way to get a unified PID for all the threads in a single process... need to look that up...), but I do know how to solve it in Tucson: force all MPI calls to be in a single thread. What a pain.

ARRRGGHHH!!!
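For the record, the compare-the-PID scheme boils down to something like the toy model below. It is only a sketch: the `Client` class and its method names are made up, not LAM's, and the PID source is injected so that a "thread" reporting a different PID (as Linux threads do here) can be simulated without real threads.

```python
import os


class Client:
    """Toy stand-in for a LAM client: remembers which PID registered."""

    def __init__(self, getpid=os.getpid):
        self._getpid = getpid
        self.registered_pid = None

    def init(self):
        # Analogous to kinit caching the caller's PID on _kio.
        self.registered_pid = self._getpid()

    def spawn(self):
        # Analogous to an out-of-band call that compares PIDs first.
        if self._getpid() != self.registered_pid:
            raise RuntimeError("caller is not the process that called init")
        return "spawned"


# Simulate three calls: init and one spawn from "PID 100", then a spawn
# from a second thread that reports "PID 101".
pids = iter([100, 100, 101])
client = Client(getpid=lambda: next(pids))
client.init()
print(client.spawn())
try:
    client.spawn()
except RuntimeError as exc:
    print("error:", exc)
```

The check is doing exactly what it was designed to do; it just can't tell a forked child apart from a sibling thread when both show up with a PID that differs from the cached one.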



There are 377 copies of xmms running, out of 460 total processes (81%).

May 12, 2001

Dave, we're *not* sinking!

I started the C++ Friday Lunch list today.

I subscribed everyone. We'll start next week after everyone has a chance to get the book.


Been working on Tucson heavily.

It took a lot longer to do the MPI queue than I thought. Particularly with respect to arrays of requests. Every time I thought I had it right, I realized that the abstractions were just slightly off, and that would cascade into a whole chain of side-effects and whatnot.

Urrrghhh...

Took quite a while to get it right. I think I've got it right now -- it all compiles -- but I'm too tired to try it (it can't possibly work -- it's hundreds of lines of code that's all brand new). I'll debug tomorrow.

I really want to have it working -- or at least major parts of it working, so that I can have some kind of reportable results on Tuesday for my meeting w/ Lummy.


I'm seeing some really weird cron behavior on queeg. Until now, I thought the problem was with my script somehow and so I ignored it. The problem is that I sometimes get double entries in my checking-DSL-connectivity log. That is, it's fired up by cron every minute to check my DSL connectivity. Sometimes I get an entry in the log at xx:xx:59 and xx:xx:00.

I thought my script was just mucking up somehow (it is actually somewhat complicated), so I never bothered to check, because both entries in the log were correct. But today I noticed that cron itself is actually launching the script twice.

My line in crontab is:

 * * * * * /usr/local/bin/check_up.pl 

Watching /var/log/messages, sometimes I see double entries:

 May 12 22:42:59 queeg CROND15952: (jsquyres) CMD (/usr/local/bin/check_up.pl)
 May 12 22:43:00 queeg CROND15954: (jsquyres) CMD (/usr/local/bin/check_up.pl)
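
Spotting these by eye gets old fast. A minimal sketch for flagging them automatically, assuming syslog-style lines that start with a `May 12 22:42:59` timestamp; the 5-second window, the year default, and the first sample line (added for contrast) are made up.

```python
from datetime import datetime


def parse_stamp(line, year=2001):
    """Parse the leading 'May 12 22:42:59' syslog timestamp of a line."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


def double_launches(lines, window_seconds=5):
    """Return pairs of consecutive entries that are closer than the window."""
    stamps = [parse_stamp(line) for line in lines if line.strip()]
    return [
        (first, second)
        for first, second in zip(stamps, stamps[1:])
        if (second - first).total_seconds() < window_seconds
    ]


log = [
    # Made-up "normal" entry, followed by the two real ones from above.
    "May 12 22:42:00 queeg CROND15950: (jsquyres) CMD (/usr/local/bin/check_up.pl)",
    "May 12 22:42:59 queeg CROND15952: (jsquyres) CMD (/usr/local/bin/check_up.pl)",
    "May 12 22:43:00 queeg CROND15954: (jsquyres) CMD (/usr/local/bin/check_up.pl)",
]
for first, second in double_launches(log):
    print("double launch:", first.time(), second.time())
```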

<shrug>


DSL dropped out twice today, each for <= 30 minutes. But still annoying, nonetheless. Same old problem -- packets stopping in Atlanta. Gumdangit, BellSouth!

Can't get to anything, though -- not even Excite.

<shrug>


Stupid Linux thread model. I know that I saw a web page once that went through it and said why it was a good thing that threads are different processes (other than "it was an easy hack"). I did some web searches and can't find it.

<shrug>

This is going to be a problem for LAM itself when we make it multithreaded, because of what I described in a previous journal entry. I did find the function pthread_atfork, though, and I think it can be used to fix this problem. There will have to be a cached value of getpid(), and at fork time, we'll have to zero out the cached value.

This can work. I haven't fully thought this out yet, but I'm quite sure that this scheme can work. It may require an additional configure test, too, which may be a bummer, but possibly not.


xmms crashed earlier today. I notice that I have xmms 1.2.3, and 1.2.5pre1 was announced on freshmeat today.



There are 98 copies of xmms running, out of 173 total processes (56%).

May 23, 2001

Nice "Big House" humor, sir

Ok, it's been a while. Cope.


GNU Mailman is smart. I created the cfl list and added a bunch of people to it, then sent a few posts across it. Pete then asked to be on cfl, so I added him, and bounced all three posts to him.

Oops -- I bounced one of them to cfl, not to Pete! Doh! But GNU mailman must have recognized that it was a duplicate (or a resend), because it didn't resend it across the list.


Pennsylvania has free birth certificate copies for military members.

Cool!

I ordered 3.

It's like when I was driving back to SBN from Ft. Knox one night after duty (I had to be in SBN for a meeting first thing in the morning or something), and was still in my BDUs when I stopped for dinner at a McDonald's. I ordered a Big Mac combo meal and pulled out my wallet to pay. The manager walked up and said, "Meals are free for military members". "Cool!", I said, "Can I super-size that?"


My passport expires soon. I went to http://www.firstgov.gov/, found the passport page, and downloaded the renewal forms. They're expired. They expired April 30, 2001. <sigh>

I went to the post office to get renewal forms instead. They had the same out of date forms. <sigh>


I found some more old bugs in jjc.

  • I found a nonterminated C string. I can't imagine how that didn't cause jjc to crash all the time.

  • I also freed some static memory.

  • I also found an endless loop when <> was in a rant (i.e., an empty HTML tag).

They're all fixed now. Anyone who uses jjc, lemme know if you want a new copy.

Actually, it still appears that there's a little problem with jjc identifying which line unterminated HTML or special characters are on. I'll fix that one someday. Not today.

----

Some pieces of wisdom that I have learned:

  • If you vector.resize(-1), or vector.reserve(0), Bad Things happen.

  • If you setenv LAMHCC mpicc; mpicc hello.c -o hello, Bad Things happen (even though the computer appears to be doing nothing).


Epiphany got their new computer -- a w2k box. I put it together and got it running. It came with Office XP when we explicitly asked for Office 2000. We ordered Office 2k; should be here in the next few days.


The high tension power lines over the exercise track in my neighborhood have a distinctive hum.


I went to ND last week to receive my SGI award. I saw Dr. Eileen there, which was pretty cool. She's actually at IU/Bloomies now, and came back for graduation weekend.


While driving back from ND, I chatted with my C-* Terry for a good 45 minutes about her wedding, furniture, etc. It was a good chat.

I stopped in Indy to see Kelly and Matt on the way home. We had lunch and were generally silly for several hours. I met the crazy brown dog, who actually is crazy, brown, and a dog. They're moving to Chi-town. Good for them, bummer for me!


I'm in Bloomington (Bloomies); I met a bunch of people in the CS department today, saw Lindley Hall, the student union, the woods, got lost on the campus, parked illegally, had lunch with Todd, got a guest key, planned equipment purchases and layout, and did other nefarious deeds.

I'll head back to Looieville tomorrow.


More news on Tucson in a different journal entry.

May 25, 2001

Stinkbutt

I think that the rat-bastard ice cream man is trying to kill my spirit.

It seems that entirely different muscles are required for running vs. roller blading. This is quite unfortunate. Why can't they be the same? IMHO, roller blading is much more fun than running. Running is so boring.
Unfortunately, the army isn't quite modern enough to offer competitive roller blading as part of their standard physical tests (just imagine: the soldiers of tomorrow blading around on the battlefield on special track-mounted foot adaptors... what an edge!), so I still have to run at my test next month.

So I decided that I had better start actually running rather than blading for exercise. Needless to say, I was smoked within minutes. But being a stubborn idiot, I pressed on for quite a while (mainly because my army duty is only a few weeks away). I ran around my neighborhood a bit, and did the exercise track (situps and pushups) down by one of the two lakes here.

And there's a big-ass hill between the exercise track and our home -- and it's downhill the wrong way. Yes, I have to run uphill to get home. Woe is me! I guess it makes me a better person.

Anyway, the ice cream truck, playing its loud jingle-jangle pied piper music, came down the street just as I was dying up the hill towards home.

Did I say "dying"? I meant "running".

Several thoughts entered my head, almost simultaneously:

  1. I shouldn't have any ice cream; I'm trying to lose some weight here!

  2. I don't have any money with me.

  3. I wonder if he'd give me a ride home.

Woe was me. He even stopped for a bunch of kids that I went by so that they could run screaming into their houses, "MOM!!! The ice cream man is here!!!" (reminded me of the old Eddie Murphy ice-cream man schtick. "It's like sprinkles.").

But I survived. Without ice cream. I have declared the ice cream man to be my nemesis. It's a battle of wills between us. I will prevail.

'morning sir. Are you going to introduce me to your bi-atch?

Tucson. a.k.a., "I never knew that queues could be so complicated!"


DSL went out while I was typing this. <sigh>

Looking through my log, I see that connections to ND were really spotty yesterday (indeed, I felt that heavily as I was working), including a full 20 minute outage around 2pm, a 10 minute outage on Wednesday, an outage from 4am to 7pm on Saturday (although that may well have been my router getting hosed -- when the power blinked recently, the router froze until I rebooted it), fairly crappy connectivity last Monday-Wednesday, and some sustained outages on the previous Saturday....

Overall, it's not as bad as it sounds. Sustained outages (like the one I'm having right now...) aren't too frequent, and seem to usually be the fault of Bell South (packets dropping in Atlanta). Spotty connectivity does happen not infrequently, but ND might well be to blame for that, because their external router is so overwhelmed, and the internal network is, well, less than perfect (oh for the days when Shawn was running the network...). Indeed, it's quite possible that my connectivity to IU will be better than my connectivity to ND if I only have to worry about the periodic sustained outages and not spotty connectivity.

We'll see how it works out; I don't have accounts at IU yet, but that paperwork is crunching through the vast papermills... I haven't decided yet on how to change my e-mail address. I might wait until after my defense; haven't decided yet.



  • Wow; the MPI queue turned out to be quite complicated; I mostly
    worked out the model in one day, but spent all the next day
    working on the engine itself, and a few details (bugs) in the
    outer parts.

  • everything seems to work except some MPI_Cancels at the end (I
    think the requests are already dead), and it's slow. Gonna
    have to revamp the enqueueing/dequeueing so that you can
    enqueue/dequeue lots of things at once, not just one at a time
    (why didn't I learn my lesson the first time?)


  • enqueueing/dequeueing a list at a time really helps
    (std::list<>::splice() is very handy -- O(1), baby!)

  • BRAINSTORM: don't enqueue a list (complicates matters greatly,
    especially w.r.t. temporary buffers and whatnot) -- just get
    control of MPI from the event manager and therefore can do
    direct sends/receives. This also allows for arbitrary and
    potentially interactive send/recv protocols for user data. WOW
    -- that makes an AMAZING difference -- <1sec vs 45-60 seconds!
    (test case is particularly painful: 1024 short messages to each
    slave, each message contains 1 int)

  • Also had another thought -- this polling model in the MPI event manager is definitely sub-optimal (have to periodically steal cycles from the other threads to check for MPI progress). The main reason for the polling model is that there are always pending receives (from children) -- the children may send to their parent at any time. So the parent always needs to have pending receives posted. So we have to check periodically if any of the receives have finished -- hence, the polling model. But what if there was a way to block and wait for such progress? I'm talking about using a mechanism outside of MPI. That is, open a secondary socket that is used just for signaling. When a child sends a message to its parent, it does the MPI_Send, and then tweaks the socket. The parent can be blocking on a select() of all the sockets from its children. When select() indicates that one of them is ready, the parent knows to go complete the MPI_Recv. This is somewhat icky because we have to go outside MPI to do it, but it would work, and potentially could save a lot of time since the progressively-slower polling model can make receives wait an arbitrary length of time before they can actually complete. I may or may not pursue this, but I wanted to record the idea...

  • I seem to have gotten everything working now -- for single-level only (i.e., only one RelayCalc). The current scheme won't work with multiple levels because of the way RelayCalc distributes input data and expects to collect output data, and the way that RelayOut sends back data.

  • Had to overhaul the EOF/EOI progression through the queues and relays a bit to make them work and to ensure that there would be no memory leaks. Children now assign their own stream IDs to each input data set; they receive a chunk of input data from the parent, give it a unique stream ID, enqueue it all, and then immediately enqueue an EOF for that stream. The parent keeps track of the "real" stream ID by associating it with the child's ID; when the child returns output data, the parent uses the ID of the child to look up the stream ID of the data that is being returned. This scheme allows multiple things:

    • Children can completely clean up state after each chunk of data from their parents is processed (trust me on this one), because each chunk of data from a parent is treated as a discrete, complete stream in itself.

    • The RelayOut can wait to send all of its output data to the parent until it gets the EOF on that stream. When it gets the EOF, it knows that the entire chunk of input data that was initially received from the parent has been processed, and it can send it all back en masse. This component (buffering output data) is needed to allow multiple levels to work, as mentioned above -- it hasn't been implemented yet.

  • Other things that are still needed:

    • Handling of faults -- children (and by induction, their children) will die when their parent dies. Parents need to mark a child down when it dies, and do the necessary bookkeeping to back out of any current transactions with that child, and ensure that that child will be ignored for the rest of the computation.

    • Startup "all at once" with no spawning model. This will be necessary for IMPI runs. This will be more software engineering than rocket science (although it won't be trivial :-\ ) -- the software has to support both models.

    • Support both MPI and non-MPI models. I had some preliminary infrastructure in there for that (i.e., configure/compile without MPI -- just support a single SMP), but I've long since broken it -- you can't compile without MPI. I likely won't fix this until after my defense...
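
For the record, here's a rough Python sketch of the signaling-socket idea from the list above. This is not Tuscon code: socket.socketpair() stands in for the real parent/child connections, the MPI_Send / MPI_Recv calls appear only as comments, and names like wait_for_ready_children are made up for illustration. All it tries to show is a parent blocking in select() until a child tweaks its socket, instead of polling.

```python
import select
import socket


def wait_for_ready_children(signal_socks, timeout=None):
    """Block until at least one child has tweaked its signaling socket.

    signal_socks maps a (hypothetical) child ID to the parent's end of
    that child's signaling socket. Returns the sorted IDs of the children
    that have signaled.
    """
    by_sock = {sock: child_id for child_id, sock in signal_socks.items()}
    readable, _, _ = select.select(list(by_sock), [], [], timeout)
    ready = []
    for sock in readable:
        sock.recv(1)  # consume the one-byte "tweak" (assumed convention)
        ready.append(by_sock[sock])
    return sorted(ready)


def demo():
    # Stand-ins for the parent<->child signaling connections.
    parent_ends, child_ends = {}, {}
    for child_id in range(3):
        parent_ends[child_id], child_ends[child_id] = socket.socketpair()

    # Child 1 would do its MPI_Send here, and then tweak the socket.
    child_ends[1].send(b"!")

    # The parent sleeps in select() rather than spinning in a polling loop.
    ready = wait_for_ready_children(parent_ends, timeout=5.0)

    # For each ready child, the parent would now go complete the MPI_Recv.
    for sock in list(parent_ends.values()) + list(child_ends.values()):
        sock.close()
    return ready
```

Calling demo() returns [1], since only child 1 tweaked its socket. The one-byte-per-message convention is just an assumption; anything the parent can drain from the socket would do.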

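The stream ID bookkeeping described above can be sketched in a few lines of Python as well. Again, this is only an illustration under assumptions: the class and method names are invented, the string "EOF" is a placeholder for whatever marker the queues really use, and the parent side assumes each child has at most one chunk outstanding at a time (which the entry doesn't actually say).

```python
import itertools


class ChildStreams:
    """Child side: every chunk from the parent becomes its own local stream."""

    def __init__(self):
        self._next_id = itertools.count()
        self.queue = []  # stand-in for the child's input queue

    def accept_chunk(self, items):
        stream_id = next(self._next_id)  # unique local stream ID
        for item in items:
            self.queue.append((stream_id, item))
        # Immediately close the stream so state can be cleaned up per chunk.
        self.queue.append((stream_id, "EOF"))
        return stream_id


class ParentBookkeeping:
    """Parent side: remember which "real" stream each child is working on."""

    def __init__(self):
        self._stream_for_child = {}

    def sent_chunk(self, child_id, real_stream_id):
        self._stream_for_child[child_id] = real_stream_id

    def output_returned(self, child_id):
        # Look up the real stream by the ID of the child returning the data.
        # Dropping the entry here is an assumption made for this sketch.
        return self._stream_for_child.pop(child_id)
```

The point of the split is that the child never needs to know the parent's stream ID; the parent can recover it purely from which child the output came from.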


There are 483 copies of xmms running, out of 562 total processes (85%).

May 28, 2001

1-800-J-JAMES

More quickies. Some are technical. Cope.

  • I discovered today why grip sucks. I previously have had problems with grip refusing to rip a track or two. For example, it wouldn't rip a track at the end of Fatboy Slim's On the Floor of the Boutique. I had always assumed that the CD was defective. Today, I was ripping a CD that Tracy had just bought, and ran into exactly the same problem. grip reported the time for the track as 5:37, while the CD jacket reported it being something like 2:50. Hmm. I tried three different CD drives and they all did the same thing. I put the CD into a real CD player and the track played fine. Hum!

    So I ripped it manually with cdparanoia, and it ripped fine (which is weird, because grip uses cdparanoia to rip). Then I encoded it with bladeenc, and the resulting MP3 is fine. I did the same with the Fatboy Slim track, too. I found the problem -- each of these two CDs has an "enhanced track" at the end, which screws up the next-to-last track somehow. grip not only specifies the track for cdparanoia to rip, it also specifies the sectors. So somehow grip is getting the wrong sectors, which causes it to fail. If I give just the track number to cdparanoia, it works just fine. Weird.

  • Internet connectivity has absolutely sucked for the past 72 hours. To ND especially. I am guessing that the networking upgrade that ND did on Saturday morning may have mucked things up... but IIRC, they were just replacing some UPSs, not changing any configurations. Hmm. But then again, there could just be lots more traffic on my DSL segment due to the holiday weekend. I dunno. I've been seeing 50-60% packet loss to nd.edu.

  • I found, by accident, today that the latest versions of xmms fix the thread leaks. Turns out that it was apparently leaking sockets, too. I was doing some Tuscon testing and noticed that a ps took 10-15 seconds to complete. This is because there were so many dangling threads (arrgghh... stupid linux thread/process model!). So I went and checked http://www.xmms.org/, and sure enough, there was a new version. I got the latest (1.2.4), and it fixed the problem. I noticed that they released 1.2.5pre1 recently, so I grabbed that, compiled it (with ogg support, of course), and it seems to be working fine. Check out the xmms stats at the end. Amazing!

  • ogg seems to be coming along. I've been rather inactive in it while trying to finish the dissertation. I updated my CVS copy of it today when I compiled xmms; there's a DOS file that doesn't compile 'cause of preprocessing badness (trying to have a multi-line macro with a carriage return after the '\' causes the preprocessor to be unhappy). Monty just checked in what sounded like much audio goodness (I don't follow much of that stuff, but it sounded good... hahaha... very punny...).

  • I'm getting an account on the American Museum of Natural History's 260-node cluster; they have problems over 229 nodes with LAM. Will be a good place to test lamtree, too. But I must graduate...

  • For days, I've been looking for a memory leak in Tuscon. My sample passthrough test app was allocating memory at an alarming rate -- but only in the root master process. The children all looked like they were nicely memory bounded. I finally found the problem today -- it wasn't a memory leak at all. Turns out that my input thread stupidly allocated space for the entire input file at the beginning of time rather than asking for a bunch of buffers, filling and enqueuing them, asking for more buffers, etc., etc. (i.e., the whole concept of buffer pooling). Nope -- I just asked for buffers for all the data up front. Duh. :-\
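
Since I brought up buffer pooling, here's a minimal Python sketch of what the input thread should have been doing. It's an illustration only, not the real Tuscon code: input_thread, process_all, the None sentinel, and the tiny buffer sizes are all made up. A fixed set of buffers gets filled, enqueued, and handed back, so memory stays bounded no matter how big the input is.

```python
import io
import queue
import threading


def input_thread(stream, free_bufs, filled):
    """Fill recycled buffers from `stream`; nothing is allocated in here."""
    while True:
        buf = free_bufs.get()       # blocks until a buffer is handed back
        n = stream.readinto(buf)
        if n == 0:                  # end of input
            free_bufs.put(buf)
            filled.put(None)        # sentinel standing in for an EOF marker
            return
        filled.put((buf, n))


def process_all(stream, pool_size=4, buf_size=8):
    """Copy `stream` through a fixed pool of `pool_size` buffers."""
    free_bufs = queue.Queue()
    for _ in range(pool_size):
        free_bufs.put(bytearray(buf_size))  # the only buffer allocations
    filled = queue.Queue()

    reader = threading.Thread(target=input_thread,
                              args=(stream, free_bufs, filled))
    reader.start()

    out = bytearray()
    while True:
        item = filled.get()
        if item is None:
            break
        buf, n = item
        out += buf[:n]              # stand-in for the real computation
        free_bufs.put(buf)          # recycle the buffer for the input thread
    reader.join()
    return bytes(out)


if __name__ == "__main__":
    data = bytes(range(100))
    assert process_all(io.BytesIO(data), pool_size=2) == data
```

The input thread can never get more than pool_size buffers ahead of the consumer, which is the whole point: a slow consumer throttles the reader instead of letting the root process balloon.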



There are 8 copies of xmms running, out of 79 total processes (10%).

May 30, 2001

I'm sorry, that's just the way we do things around here -- the new guy has to sit next to Matthew.

"I love the smell of dropped packets in the morning."

My DSL connectivity still sucks. 50-60% packet loss to just about everywhere. Arrgh.


I saw a few episodes of the FX X-Files Memorial Day Marathon. They were showing some of the really classic X-Files episodes, like the monster/Cher Halloween episode, the X-Cops episode, the Queen Mary/Nazi episode, etc.

Classic.


Tuscon seems to be working! Added some simplification functions that I rolled into a "simple" example.

By the end of the day, I had broken Tucson again, all in the name of making the user interface better...


I finally got sick of this horrendous connectivity and called Telocity to report the problem. It seems that Telocity was bought out last week by Hughes Electronics, and is now DirecTV DSL. Hmm. I don't know whether to be worried or not...

The technician lady that I got was totally clueless. She asked me if I was frequently deleting my "catch files". It was only 10-15 minutes later that I realized that she meant "cache files" (which has absolutely nothing to do with the problem, which is why I didn't realize what she meant at the time).

The connectivity problems that I was having were mostly to the rest of the internet. Even though my route to other Telocity machines goes through Bell South, that appeared to be working [mostly] ok. For example, my DNS connectivity was [mostly] ok.

She ran some ping tests between me and her and said, "I'm not seeing this 50-60% loss that you're talking about..." I tried to explain that my Telocity connectivity was fine, but connectivity to the rest of the internet sucked. She said, "but I don't have any connectivity problems to Yahoo (for example)". <sigh>

She finally came up with a 2% packet loss between me and her, and decided to report that. She was convinced that this was the Big Problem. I made her put down that I was seeing 50-60% loss out to Yahoo, even though I think she didn't believe me. Arrghh...

[several hours later] The recorded message on Telocity's tech help line says that "customers in the southeast may experience intermittent service... there is no estimated time to repair, but it may take up to a week to resolve these issues..." ARRGGGHHH!!!! Hopefully it will be less than that. :-(


I got an account on the American Museum of Natural History cluster today. They punched a hole in their firewall for ssh for my IP (fixed IP/DSL does have some advantages...). It appears to be related to the Zoology department... hum! I wonder what they use such a big cluster for.

But my connectivity sucks; I couldn't really do anything.


The Army finally authorized me a rental car for my AT today. Woo hoo!


DSL connectivity finally came back around 4-5pm. Everything appears to be normal now.

Hey Dave! Welcome to Gattaca.

I suppose, just for posterity, I should include the xmms stats, even though they're now deliciously low...



There are 8 copies of xmms running, out of 82 total processes (9%).

About May 2001

This page contains all entries posted to JeffJournal in May 2001. They are listed from oldest to newest.
