First entry in a while... (started this entry yesterday)
I am thoroughly worn out. I have just spent about 48 hours trying to get PBS configured properly for the hydra after all the nodes and front ends were upgraded to Solaris 7/64 bit. I was finally [mostly] victorious, but I am left feeling cynical and wondering why I wasted 2 days of valuable time when I could have been working on my dissertation. I must rant.
\begin{rant}
First, some technical details about why this was difficult, and why a seemingly simple thing took so long to do. We don't use a vanilla distribution of PBS. We use a clever patch (and a few extra executables) from Dale Southard to enable proper AFS authentication when our PBS jobs are run. This package needs libraries to perform AFS authentication; you can use the proprietary Transarc AFS libraries or the freeware krb4 libraries (http://www.pdc.kth.se/kth-krb/).
My initial goal was to build everything in 64 bit mode, because a) Curt had some horror stories about trying to run 32 bit AFS binaries in 64 bit mode, and b) it seemed the Right Thing to Do. Knowing that everything had to be 64 bit in order to link properly, I set about trying to build PBS in 64 bit mode.
It took a bit of research (thanks docs.sun.com!) to figure out how to compile in 64 bit mode in the first place. It took further research into the PBS docs to figure out all the ./configure flags that I wanted, etc., etc. (the PBS docs are somewhat hard to read, IMHO...). This all took a good chunk of time -- I just wanted to build a vanilla PBS first, and then try to build Dale's stuff (and probably recompile PBS to integrate it).
Being a forward-thinking person, I took Dale's stuff and updated all of it (because I foresee the need to do this whole process again in the not-too-distant future). I put in a proper automake process, with a full configure script to automagically figure out all the things that it needs to figure out so that you don't have to go fill in the Makefile yourself. That took a while, but I believe that it was worthwhile to do.
After a bunch of experimenting and poking around, I determined that the provided Transarc libraries are all 32 bit. Useless. So I went and got the krb4 package, and tried to compile it in 64 bit mode. Unfortunately, krb4 didn't want to compile in 64 bit -- it complained about some missing types.
<sigh>
So I said "Fuck it, I'll just build everything in 32 bit mode. Who cares?"
And I did.
And it worked.
...sort of.
The PBS moms would periodically die at random. I figured that it was because Dale's PBS patch had bit rotted, and was causing badness in the mom. So I put in all kinds of syslog() calls trying to track down where the problem was. I never saw any of the syslog messages. It made me think that the problem wasn't with the AFS code (!).
Luckily, Bob Henderson of PBS/Veridian came to my rescue and informed me that if PBS is to be run in a 64 bit environment (like Solaris 7), it, too, must be compiled in 64 bit mode so that it can read /proc properly. Without that, PBS will surely crash.
Arf. So now I have to get everything to compile in 64 bit mode.
krb4 took some tweaking (it's missing some typedefs that don't appear to be a problem if you compile in 32 bit mode -- go figure), but I finally got it to compile properly.
Dale's stuff also uses the RSA encryption routines from rsaref, so I had to compile that, too. Wow -- that thing must have been written a long time ago, 'cause about nothin' is standard. It's weird as hell. For example, it compiles to rsaref.a, not librsaref.a. Weird...
After that, I got Dale's stuff to link properly. It wasn't until much later that I discovered that rsaref wasn't happy in 64 bit mode. Trying to generate some keys, it sat and spun endlessly instead of actually producing output. Dale actually rescued me here -- he pointed to a web page that indicated that there is a bad typedef for UINT4 in rsaref/source/global.h that is an unsigned long instead of an unsigned int -- hence, in Solaris 7/64 bit mode, it was coming up as 8 bytes instead of 4 bytes. Changing the typedef and recompiling rsaref fixed everything, but figuring that out and fixing it took quite a while.
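For the curious, here is a minimal sketch of why a too-wide UINT4 can quietly break code that assumes exactly 32 bits. This is an illustration only, not a claim about where rsaref actually got stuck: the rotate helpers and the WIDE_UINT4 name are hypothetical, and unsigned long long stands in for a 64 bit unsigned long so that the sketch behaves the same on any platform. The static_assert (C11) is the kind of compile-time guard that would have flagged the bad typedef right away.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>   /* CHAR_BIT */

/* The corrected typedef described above: unsigned int is 32 bits in
 * both the 32 bit and the 64 bit Solaris data models. */
typedef unsigned int UINT4;

/* Hypothetical stand-in for the broken typedef. Under a 64 bit build,
 * 'unsigned long' is 8 bytes; 'unsigned long long' is used here so the
 * demonstration is 64 bits wide everywhere. */
typedef unsigned long long WIDE_UINT4;

/* Compile-time guard: refuse to build if UINT4 is not exactly 32 bits. */
static_assert(sizeof(UINT4) * CHAR_BIT == 32, "UINT4 must be exactly 32 bits");

/* Rotate left by n (1..31), relying on the type being exactly 32 bits
 * wide so that bits shifted past bit 31 are discarded. */
UINT4 rotl32(UINT4 x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* The same expression with a 64 bit type: bits shifted past bit 31 are
 * kept instead of discarded, so the result is no longer a 32 bit rotate. */
WIDE_UINT4 rotl32_wide(WIDE_UINT4 x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}
```

With these definitions, rotl32(0x80000000u, 1) yields 1, while rotl32_wide(0x80000000ULL, 1) yields 0x100000001 -- same source expression, different answer, and no compiler complaint.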
Dale's stuff links into the PBS mom, and I had some serious linker issues here. Turns out that both AFS (krb4) and PBS define routines for some MD5 stuff. It took a bit of creative side stepping, and changing Dale's patch, but I finally got it to work right.
After that, I had ended up creating 3 different PBS configurations: one for the PBS server (heracles), one for all the PBS client machines (athos, etc.), and one for all the hydra nodes. And I wrote a script to install each one. Not difficult, but not trivial either -- it took a lot of iterations to get the three scripts right.
So all in all, it took the better part of 2 days to get this all figured out and working properly. Ugh.
So why am I unhappy about this? It's not the work -- I don't mind that. And I learned some good stuff while doing this. But I really need to be working on graduating. And this is not such work. Even worse, we're doing this for people who don't care -- they expect that we do this. They use the hydra much more heavily than we do, but we have to take all the pain of administrating it.
Don't get me wrong, I like all of our users -- they're nice people, after all -- but they all don't have a clue as to how much work it takes to keep it running (which, in retrospect, is the mark of a good sysadmin). We're basically doing this out of the goodness of our hearts, and losing valuable time because of it. After about 10am this morning, I couldn't help thinking to myself, "Why am I doing this? Is anyone going to notice? Is anyone going to care?" These questions are quite cynical, and reflected my frustration at the time. Indeed, the answer to the second question is "no", which is one of the reasons that I can say that we are good sysadmins (more about this below).
The long and the short of this ends up at my philosophy of system administration: system administration is like sex. Any system administration is good system administration. But you don't know bad system administration until you've had good system administration. An admittedly biased view, as I consider myself a fairly good sysadmin (I'm certainly not perfect, but I'm pretty good for a part-time sysadmin who's trying to get a degree). I say this because I've seen a lot of sysadmins, and I've seen a lot of bad sysadmins. Hence, I know what good and bad sysadmin is.
Just to be clear -- I'm defining "good sysadmin" from the viewpoint of a user. Users who have a good sysadmin barely know that they have a sysadmin; for the most part, things just "work". They don't have to keep continually updating their personal work habits to work with their computing environment. i.e., there is one environment, and it stays more or less uniform so that after users make the initial adjustment to work within it, they are rewarded with a fairly constant look and feel. This is not a hard and fast definition, but I think you can get the sense of what I am trying to say.
Bad sysadmins have exceptions for foo, make you update your .cshrc to get the new version of bar, have no plans for uniform distributed environments, no backup schedules, no cohesive set of services, don't check their system logs, etc., etc. Users, however, unless they have had a good sysadmin, don't know the difference. In a society that tolerates (nay... expects) rebooting a Windoze machine multiple times a day, having exceptions for foo, or needing to type the full pathname to get the new version of bar, seems like no big deal. It's extra pain that I (the user) must go through to do my real job; that's just the way it is. Users don't realize that it can be better.
But is this bad? If people don't realize that they have a sub-optimal arrangement, and just get used to dealing with the constant change, some things working and others not -- if they really don't know any better, what difference does it make? Probably little. However, I think this disturbs me philosophically at some level.
I have walked into 2 professional organizations where I worked as a system administrator (both, coincidentally, for the army). Both had horrendous (IMHO) sysadmins before me. Here's an example conversation that I had on my third day in the second organization (in a networked Unix environment):
Me: "I've installed the new version of Netscape; the one that was out there was a few versions back from the current release."
User: "Great! How do I access it?"
Me (puzzled): "What do you mean?"
User: "How do I bring up the new version?"
Me (still puzzled): "Well how do you bring up netscape now?"
User: "It's on one of the pulldown menus in my window manager."
Me: "Just access it the same way -- the next time you fire up netscape, it will be the new version."
The user literally sat there blinking at me for a few seconds. He had no concept of just doing the same thing and having updates automagically appear. This is one of many examples as to why I maintain that they had a bad sysadmin before me. Not that I'm self-aggrandizing, but doesn't it seem odd that when I announce the installation of a new version, the users assume that they'll have to do something different? ("You mean I don't have to reboot my unix machine multiple times a day? Why not? I think I'd still feel better if I rebooted it anyway." -- actual quote from a user when their desktop workstation was converted to unix)
The goal of good sysadmin is not only to keep everything working, but to hide as much of the work as possible from the user. The users have enough to worry about; they have their real jobs to do -- they shouldn't need to worry about fighting their computer to get their job done. It's the sysadmin's job to keep the computer running and make all of its services [relatively] easily accessible to its users. A sysadmin who does not make a "seamless" (i.e., as much as can be -- it cannot be 100% seamless) work environment for users is not doing their job, IMHO. A computer is supposed to be a labor-saving device -- this should be the sysadmin's mantra. More to the point, the technology itself should not make users' work harder than it already is (and I'm not talking about the evolution from hand written foils that took 5 minutes to create to picture-perfect powerpoint presentations that take endless hours to create -- this is artificial demand that has been created by users; this is a different discussion).
Hence, the two organizations where I have done professional sysadmin (outside of ND that is) -- and again, I'm not trying to be self-aggrandizing -- now have a completely different view of sysadmin. They now expect a lot more from their sysadmin (as they should, IMHO). They don't want to fight the system anymore, to have to remember the three different ways to access netscape, etc., etc. They just want netscape, and they just want it "to work".
So how does this all tie in to how I'm annoyed with the hydra?
Well, to be blunt and arrogant, we've done a pretty darn good job with the hydra. Yes, we've screwed up a few times -- :-( -- but all in all, that system is pretty darn reliable and uniform. It "just works" for the most part, and users have had very few complaints. As such, our users have never had bad sysadmin. I dare say that we are underappreciated mainly because we set the initial level of service too high (most of the credit actually goes to our boss, Lummy, who infused me with many of the qualities of good sysadmin that I described above early in my graduate career, and I have done my best to pass these on to other grad students. i.e., these qualities don't just apply to sysadmin; they apply to research [and so on] in general. But that's a different conversation...).
I cite two reasons why we are "good" sysadmins:
- Our users have no concept how much work it takes to run the hydra (sure, once it's running, it pretty much runs itself, but, for example, this weekend's upgrade took the combined resources of two good sysadmins for multiple long-hour days to accomplish).
- Multiple groups have come to us asking if we'd sysadmin their cluster for them.
The second point kills me; it's further proof of the first point. Being a sysadmin is not our business; being a grad student is our business. If I were being paid to be a sysadmin, I'd be happy to do it without complaining. But how many other research assistants have to put in double digit hours a week on keeping their own (and others'!) systems running? This is a job that someone should be paid to do, not a job that someone should have to do in their spare time, or at the expense of their real job.
Not that I'm faulting anyone here -- indeed, I have learned a lot as a sysadmin over the years, and I honestly think that it has made me a better computer scientist. And I do seem to recall that we volunteered to sysadmin the hydra, etc., not really realizing what a big job it would be. As such, it's probably our own fault for raising user expectation levels so high -- they've always gotten this service for free, and don't realize that sysadmins out in Silicon Valley get 6 digit salaries to do what we do.
Indeed, we have taken pretty much the same attitude with the LSC software trees for Solaris 7. That is, for Solaris 2.5.1 and 2.6, the LSC had extensive software trees out in AFS that many, many users at ND used because the OIT-provided software trees were inadequate. The OIT trees were out of date, didn't include all the software that we needed, etc., etc. Hence, we made our own trees, and maintained them fairly well. For Solaris 7, we have pretty much refused to do this because -- just like the hydra -- it is just a time sink. We end up supporting all kinds of people instead of just us, and this takes time away from our real jobs. So if we end up with software trees for Solaris 7 (we haven't really yet, because the OIT has some Quality people who are actively being good sysadmins), we might very well lock them to LSC personnel only. We're not in the sysadmin business... but people think that we are.
More to the point -- look at the investment/reward ratio for the hydra. We barely use the hydra. The main users of it are CHEGs and civil engineers. We use the hydra for some development work, but we don't consume the vast majority of cycles on it. But we're still investing huge amounts of time in the hydra when we get very little back out of it. This is time that could be spent elsewhere, doing things that are relevant to our own work, not others' work.
So at the end of this long ramble, I guess I have no one to blame but myself. When you provide a good service (which is rare in today's society), people expect it to keep going -- especially when it's free. Indeed, our users don't even have the concept that such work should cost money. The very aspects of what makes a good sysadmin create a self-perpetuating cycle of raising the level of service that is simply not sustainable by someone who is not a full time sysadmin. However, we never put down limits on what we would do as sysadmins, so we continually fed the fire of user demand without realizing that we were damning ourselves even more.
As such, I think it is time for us to gracefully back out of this business. We have provided a good service for several years to our users, but we just can't do it anymore. Indeed, Lummy has already initiated this process -- he assures me that we will not be sysadmining the hydra past 1 Jan, 2001. I do feel bad about this, because I feel a commitment to my users, but it is necessary for us to continue our real jobs. :-(
\end{rant}
Please keep these comments in context -- I have no ill feelings toward any of our users at all. Indeed, I find most of them to be very friendly people, and we have gotten along well with them for several years. And I don't think that I'm the best sysadmin in the world; in the field of computer science, there is always more to learn. I've met some highly accomplished sysadmins who make my sysadmin knowledge look like belly button lint. I'm not trying to make myself sound like the world's best sysadmin; I only mention these kinds of things to give my rant a frame of reference.
The above rant is just out of frustration because I'm trying to get my own work done, and can't because of artificial demands placed on me. Miles to code before I sleep...