So, a server rebooted. Perhaps I’ve been lucky in the developers I work with, but whenever a host reboots, I always think hardware failure first. I treated this case no differently and jumped onto the host to run some quick diagnostics.
Since I deal in Dell equipment, my first stop is always check_openmanage. It gives a quick command-line report on CPUs, DIMMs, chassis, power supplies, etc., and takes about 30 seconds to run. It rarely fails me.
However, this time it did. Everything returned [OK]. No hardware failures.
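For anyone who wants to script that first check rather than eyeball it, here is a minimal sketch. It assumes check_openmanage is on the PATH and, being a Nagios plugin, reports its overall result through the standard plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The helper names `classify_exit` and `run_plugin` are hypothetical and not part of the plugin itself.

```python
import subprocess

# Standard Nagios plugin exit codes; anything else is treated as UNKNOWN.
NAGIOS_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def classify_exit(code):
    """Map a plugin exit code to its Nagios state name."""
    return NAGIOS_STATES.get(code, "UNKNOWN")


def run_plugin(argv):
    """Run a plugin command and return (state, report text).

    A missing executable is reported as UNKNOWN instead of raising.
    """
    try:
        proc = subprocess.run(argv, capture_output=True, text=True)
    except FileNotFoundError as exc:
        return "UNKNOWN", str(exc)
    return classify_exit(proc.returncode), proc.stdout.strip()


# Example call (assumes the plugin is installed under this name):
# state, report = run_plugin(["check_openmanage"])
```

An "OK" state here only means the plugin found nothing wrong, which is exactly the situation I was in.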
I double-checked this claim on the iDRAC, just in case this was the one time in history a GUI was more accurate than the modules running directly on the box. Nope, everything was green.
So I loaded up the kernel dump.