Debugging 500 Errors

For the past two years, the AlliedModders site has been randomly dishing out “HTTP 500 – Internal Server Error” responses. It’s been pretty frustrating.

I actually tried debugging the errors two years ago. I got far enough to attach GDB to PHP, and saw it blocked on mysql_query(), but didn’t know how to proceed. We knew that during these incidents, the server load would spike to obscene levels – like 100-150. Lots of PHP processes were getting spawned. “iowait” time was through the roof, and CPU usage was normal. Something couldn’t read the disk, or the disk was getting hammered. I put MySQL on a separate hard drive, but the problems persisted and I gave up. Keep in mind: I didn’t know anything about running servers, and I still don’t – I just have a vague notion of how to debug things.

Fast forward to a few weeks ago, we had a serious hard drive crash and suffered some painful downtime. We lost a lot of data. I clung to a hope that our years of errors had been caused by a failing disk. Predictably, we weren’t so lucky, and the website problems continued. Someone joked, “Hey, I’m glad to see you got the server back up, even the 500 errors!” Yeah… it was time to fix this. Somehow.

Step One: Apache Logs

There were only two notable Apache errors. They’d appear over and over, but only in large clumps:

[Tue Jan 25 04:15:19 2011] [warn] [client xx.xxx.xxx.xx] (110)Connection timed out: mod_fcgid: ap_pass_brigade failed in handle_request function
[Tue Jan 25 04:15:25 2011] [warn] [client xxx.xxx.xx.xx] mod_fcgid: read data timeout in 60 seconds

It was pretty clear that somehow, communication was breaking down between mod_fcgid (Apache) and PHP processes. I read, I think, every single webpage on the Internet involving these two errors. Twice. They were all dead ends, both for us and both for the person reporting their own problems. A little sad. I wanted to blame mod_fcgid, since historically it has been really buggy for us, but I wanted to pin absolute blame.

So the next step was to break out a debugger.

Step Two: Debugging PHP, Apache

AM developer Fyren edited the mod_fcgid source code, such that instead of killing PHP, it would attach gdb in a separate tty. It usually took around three hours to get a debugging session going. The call stacks looked like:

(gdb) bt
...
#4 0x00007f1f7cad804f in net_write_command () from /usr/lib/libmysqlclient.so.15
#5 0x00007f1f7cad4c51 in cli_advanced_command () from /usr/lib/libmysqlclient.so.15
#6 0x00007f1f7cad1751 in mysql_send_query () from /usr/lib/libmysqlclient.so.15
#7 0x00007f1f7cad17b9 in mysql_real_query () from /usr/lib/libmysqlclient.so.15
...
(gdb) fin
Run till exit from #0 0x00007f1f7cad17b9 in mysql_real_query () from /usr/lib/libmysqlclient.so.15

At this point, the session would hang, indicating that MySQL was not responding. I really had no idea what to do, but we got lucky. We noticed during these incidents, Apache’s server-status was stuck at “200 requests being processed,” all in the “W” (“Sending reply”) state. Our MaxClients setting was exactly 200. Now we had a hypothesis: MySQL was intermittently not responding for short bursts of time. This blocked PHP, which in turn made mod_fcgid block and timeout, causing Apache to spawn a bunch of workers that would lock up as well.

Unfortunately, the intermittent failures only happened a few times a day, so debugging went slow. We managed to catch one, and ran iostat:

Device:          r/s       w/s   rsec/s    wsec/s    avgrq-sz  avgqu-sz   await  svctm  %util
sdc              320.80    4.40  3540.80    54.40    11.06     4.32   13.01   3.08 100.00
sdc1             320.80    4.40  3540.80    54.40    11.06     4.32   13.01   3.08 100.00

Someone was reading tons of data off the disk, and iotop confirmed MySQL was the culprit.

Diagnosing MySQL

I managed to find this enlightening presentation by Percona. Case study #3 looked identical to ours, except in their case, MySQL was writing. But their conclusion about bad queries seemed plausible. I made a script that would peek at server-status every second, log a ton of data, and e-mail me if it went above 100 concurrent requests. The data included oprofiling, iotop spew, SHOW FULL PROCESSLIST, SHOW OPEN TABLES, SHOW GLOBAL STATUS, etc.

Oprofile’s report wasn’t too interesting, but the MySQL status showed around 150 queries waiting for a table lock. The active queries, in all cases, were SELECTs. This explained the reads, but not the table locks. This didn’t make sense, so I generated a few core files of MySQL from within GDB, then Fyren and I did some post-mortem analysis.

The analysis revealed that MySQL had about 200 threads, most of which were waiting on locks (pthread_cond_timedwait) with a timeout of one year. The locks were almost all read locks for SELECTs, and the actually succeeding queries were exclusively SELECTs. Fyren tried to track down the thread owning the locks, but it ended up being apparently circular (thread X’s lock was held by thread Y, and thread Y was waiting on a lock held by thread Y).

That was all pretty mysterious (and still is), but rather than try to understand exactly what locks are in MyISAM tables, I took a look at the non-deadlocked queries. They were all one of: the AMX Mod X plugin search, the SourceMod plugin search, or the vBulletin member list search.

It turns out all three of these queries were extremely expensive – containing ORs, joins, full table scans of 800MB tables, etc. On average they each took about 500ms to complete. That’s without MySQL freaking out. I turned off vBulletin member searching (most large boards do), and replaced the plugin searches with faster queries off a much a smaller table. This brought those page load times down to 20ms. Twelve hours later, no 500 errors had occurred, and there were no mod_fcgid failures in Apache’s logs.

Too Early to Declare Victory

While waiting to see if any more 500 errors would show up, suddenly the server load spiked, and lo and behold – mod_fcgid became overloaded. Our monitoring script kicked in, and this time, MySQL was nowhere in the profile! It turned out some forum users were effectively DoS’ing the site by holding down F5 on their profile pages to bump up the “view” count. OProfile suggested that 20% of CPU time, during this incident, went to PHP parsing and lexing source code. I installed eAccelerator and now that time appears to be gone. (And for good measure, I removed profile view counts.)

Growth and Traffic

It’s unclear what our load and traffic really is. I think we’re pretty good though. CPU utilization is usually under 20%, disk I/O is under 10%, and we rarely see more than 30 concurrent requests. We don’t need to invest in a separate MySQL server yet, and we definitely aren’t growing fast enough to expect to need one anytime soon.

Community Involvement

I would like to thank Fyren for his help in looking into these problems, and asherkin for helping beef up our non-existent monitoring. Also, thanks to AzuiSleet, Drunken_Fool, MatthiasVance, and devicenull for sticking around in IRC and answering random questions.

Diagnose, then Solve

Many people, well-intentioned, said “switch to lighttpd, nginx, IIS, use my favorite host, use a VPS, buy more servers,” etc. I got caught up in this too, and fueled it by initially blaming mod_fcgid. It’s really important to make sure a solution has targeted a specific problem, and I had to keep telling myself to keep a cool head. Sometimes I had to slap myself or wait for Fyren to slap me. They all might be good ideas in their own right, but without diagnosing the actual cause of the errors, there was no guarantee of a fix. By careful debugging (albeit in painful, hour-per-day sprees), we got pretty close to the root cause, whereas throwing our infrastructure into a salad bowl would have delayed the inevitable.

Part of the reason our original drive is still at the recovery lab is because someone didn’t diagnose before “solving.” When it failed, a technician (not one of us) ran fsck on the disk instead of doing a hardware diagnostic first. I have no doubt this caused untold extra damage.

Conclusion

tl;dr, the forums should be much faster now, and it shouldn’t error anymore. I’ve got the system set up to e-mail me and dump diagnostics when our load goes way above normal, so I’ll continue monitoring for potential problems.

I would like to sincerely thank the whole AlliedModders community for the successful donation drive. I won’t lie, the past month has been pretty sucky. Despite our failings, the fact that everyone cares enough to see the whole thing keep going is really awesome. Our seven year anniversary was just a few weeks ago. Here’s to an AlliedModders for 2011!

Leave a Reply

Your email address will not be published.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong> <pre lang="" line="" escaped="">