Ironically, after yesterday’s post about Oracle’s cavalier attitude toward dealing with communities built around open software, a piece of dyed-in-the-wool proprietary software stepped out of Oracle’s (ever-growing) stable of software, neighed a couple of times, stamped its feet, and kicked me in the head. Specifically, one of our 11g database servers shot up to one-minute load averages a little above 2,000. Yes, two zero zero zero.
Despite the comically high load, our M5000 server running Solaris 10 withstood the onslaught and the system was still quite usable. A testament to the engineering teams at Sun if I ever saw one. A quick jump to the old performance standbys (mpstat, iostat, prstat, and vmstat) revealed a flood of involuntary context switches and a run queue in the thousands. My DBA colleague and I went about the business of diagnosing the problem, and he promptly ran an AWR report. Below is what we found.
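For the curious, the first pass looked roughly like this (a sketch, not a transcript; the five-second intervals are arbitrary):

```
# Sample the usual Solaris suspects every five seconds
vmstat 5        # 'r' column: run queue depth (ours was in the thousands)
mpstat 5        # 'icsw' column: involuntary context switches per CPU
prstat -mL 5    # per-thread microstates; high LAT means threads starved for CPU
iostat -xnz 5   # sanity-check that the disks aren't the real bottleneck
```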
“Wait, is that billions?” I asked naively.
“Yeah… something’s wrong,” my DBA colleague replied in his usual calm, understated way.
Oracle Support confirmed that, yes, four billion mutex waits in the span of an hour appeared to be the cause of our pain. Luckily for us, this is not undiscovered country: a quick Google search revealed that 11g is notorious for this particular type of pain. The fix, of course, was a patch. Specifically, 10411618: Add different wait schemes for mutex waits.
Annoyingly, had Oracle been gracious enough to add DTrace probes to their enterprise products, we could’ve saved a lot of heartache with a one-liner.
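To be clear, nothing stops you from pointing DTrace at the problem anyway; without vendor-supplied probes you just get anonymous stack traces instead of named mutexes. Something along these lines (a sketch; the sampling rate, stack depth, and 30-second window are arbitrary) would at least show where the oracle processes were burning their time:

```
# Sample on-CPU user stacks of Oracle processes for 30s, keep the ten hottest
dtrace -n 'profile-997 /execname == "oracle"/ { @[ustack(20)] = count(); } tick-30s { trunc(@, 10); exit(0); }'
```

With real USDT probes, the aggregation key could have been the mutex itself rather than a pile of raw addresses.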
Also, here’s a good intro to Oracle DB mutexes and latches.