bcheck just simply rocks.
After beating my head against a wall for 2 days looking for a memory bug in LAM/MPI using valgrind (a memory-checking debugger for Linux), bcheck found the error within about 3 test runs on Solaris.
Don't get me wrong -- valgrind rocks as well. valgrind is a fabulous tool and I'm extremely glad that it's available (many thanks Julian!). But bcheck somehow provides more detailed information than valgrind does.
...actually, I guess that's not entirely true. I was sitting here thinking about it while writing this entry and I figured out why valgrind didn't tell me the same information that bcheck did. Here's the scoop:
In this case, the problem was both a read from unallocated memory and a duplicate free within LAM's Myrinet network device. bcheck reported these problems, but valgrind did not. Why?
It all comes back to Myrinet -- arrgh! On Linux systems, LAM/MPI has to use its own memory allocator (a derivation of the venerable ptmalloc) so that it can catch calls to sbrk() and guarantee that memory is unpinned before it is returned to the OS. Hence, valgrind is probably not intercepting these calls because it doesn't know that LAM's replacements are the "real" free(), sbrk(), etc.
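To make the allocator swap described above a bit more concrete, here's a toy sketch in C. It is emphatically not LAM's ptmalloc-derived allocator; the names (arena, pool_alloc, pool_free, demo_duplicate_free) are all made up for illustration. The point is only that blocks handed out by a private allocator never pass through the libc malloc()/free(), so a checker that watches only those functions has nothing to flag when the same block is released twice. The only reason the duplicate free gets noticed here is that the toy allocator keeps its own bookkeeping.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy allocator, for illustration only. */
#define POOL_BLOCKS 8
#define BLOCK_SIZE  64

static unsigned char arena[POOL_BLOCKS][BLOCK_SIZE]; /* private memory, not from malloc() */
static int in_use[POOL_BLOCKS];                      /* 1 if the block is handed out */

/* Hand out the first unused block, or NULL if the pool is exhausted. */
void *pool_alloc(void)
{
    for (int i = 0; i < POOL_BLOCKS; i++) {
        if (!in_use[i]) {
            in_use[i] = 1;
            return arena[i];
        }
    }
    return NULL;
}

/* Release a block. Returns 0 on success, -1 if the pointer is not one
 * of our blocks or if the block has already been released. */
int pool_free(void *p)
{
    for (int i = 0; i < POOL_BLOCKS; i++) {
        if (p == (void *)arena[i]) {
            if (!in_use[i]) {
                return -1;          /* duplicate free caught by our own bookkeeping */
            }
            in_use[i] = 0;
            return 0;
        }
    }
    return -1;                      /* pointer did not come from this pool */
}

/* Free the same block twice. Returns 1 if the first free succeeded and
 * the second one was rejected as a duplicate. */
int demo_duplicate_free(void)
{
    void *p = pool_alloc();
    assert(p != NULL);

    int first  = pool_free(p);
    int second = pool_free(p);
    return first == 0 && second == -1;
}
```

Calling demo_duplicate_free() from a main() of your own exercises the double release without ever touching the libc free(), which is roughly the blind spot I suspect I ran into.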
This doesn't happen on Solaris because Solaris has a bug deep within its kernel such that gm can't atomically allocate-and-pin memory, and therefore LAM/MPI doesn't need to replace malloc/free/etc. (that's the short version, omitting all the juicy details). Hence, bcheck is able to see/report on the "true" malloc/free, but valgrind isn't.
So Valgrind rocks! Bcheck rocks! Memory-checking debuggers are life!