Ok, so I didn't spend much (any) time on the Condor/LAM stuff yesterday. I spent most of the day finishing up the Password Storage and Retrieval system (PSR) originally written by Dale Southard. We use it with our batch queueing system (PBS) to get AFS tokens when jobs are submitted, and to automatically refresh tokens before they expire so that AFS authentication lasts throughout the entire submitted job.
It's pretty cool stuff -- it uses public/private keys for storing the user's password and whatnot. I've made it fully automake-ized, cleaned it up a bunch, added it to CVS, fixed a few bugs, ensured that it works with both Transarc's proprietary development AFS libraries and the krb4 freeware AFS libraries, and updated the patch to the OpenPBS source code (it's dynamically generated now, too). I finished early this morning and sent it off to Dale for review, and to Bob at PBS so that he can give the patch a once-over.
Hopefully that will be it, and I'll be able to release it and get it out of my hair.
Today will be spent answering 3 old LAM emails and working on the LAM/Condor description:
- Keith from Citifinancial: he has discovered that when in fault tolerant mode, if you mpirun before the lamd's have discovered that one of the other lamd's is down, mpirun will get the wrong information and sit forever trying to spawn a job on a node where the lamd is gone. Hence, deadlock. Need to fix this.
- Dave from GE: wants to get the native signal/error handler fired when LAM intercepts a SIGSEGV, SIGBUS, SIGFPE. Seems like a reasonable request; need to work with him a little more to get the details right.
- Patricia from Dec: thinks that she has found a problem with MPI_Intercomm_merge in LAM. Need to check this out; I think she sent a sample program that shows the error.
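
For reference, Dave's request boils down to handler chaining: when a library installs its own handler for SIGSEGV, SIGBUS, or SIGFPE, it keeps whatever action was installed before and calls it after doing its own work. A rough sketch in C with POSIX sigaction() follows. This is not LAM's code, and every name in it (install_chained_handler, library_handler, app_handler, and so on) is made up for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* expose sigaction() and siginfo_t */
#endif
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* All names here are hypothetical; this is not LAM's actual code. */

/* Signals the library intercepts, each with a slot that remembers the
 * action that was installed before the library took over. */
static const int chained_signals[] = { SIGSEGV, SIGBUS, SIGFPE };
#define NUM_CHAINED (sizeof chained_signals / sizeof chained_signals[0])
static struct sigaction previous_actions[NUM_CHAINED];

/* Last signal seen by each side, so the chaining can be observed. */
volatile sig_atomic_t library_last_signal = 0;
volatile sig_atomic_t app_last_signal = 0;

/* Stand-in for the application's own ("native") handler. */
void app_handler(int signo)
{
    app_last_signal = signo;
}

/* Find the saved action for a signal, or NULL if we do not chain it. */
static struct sigaction *previous_for(int signo)
{
    for (size_t i = 0; i < NUM_CHAINED; i++) {
        if (chained_signals[i] == signo)
            return &previous_actions[i];
    }
    return NULL;
}

/* Library-side handler: do the library's work, then forward the signal
 * to whatever handler was installed before. */
static void library_handler(int signo, siginfo_t *info, void *context)
{
    library_last_signal = signo;   /* stand-in for real cleanup/reporting */

    struct sigaction *prev = previous_for(signo);
    if (prev == NULL)
        return;

    if (prev->sa_flags & SA_SIGINFO) {
        /* Previous handler uses the three-argument form. */
        if (prev->sa_sigaction != NULL)
            prev->sa_sigaction(signo, info, context);
    } else if (prev->sa_handler != SIG_DFL && prev->sa_handler != SIG_IGN) {
        /* Previous handler uses the classic one-argument form. */
        prev->sa_handler(signo);
    }
}

/* Install the library handler for signo and save the old action.
 * Returns 0 on success, -1 on failure or for a signal we do not chain. */
int install_chained_handler(int signo)
{
    struct sigaction *prev = previous_for(signo);
    if (prev == NULL)
        return -1;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = library_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    return sigaction(signo, &sa, prev);
}
```

If the application registers app_handler for SIGFPE first and install_chained_handler(SIGFPE) runs afterwards, a raise(SIGFPE) goes through library_handler and then app_handler. The sketch only shows the forwarding. What to do when the previous action is SIG_DFL, and the fact that returning from a handler after a real hardware fault just re-executes the faulting instruction, are the sort of details that still need to be worked out.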
Off to work...