Debugging 500 Errors

For the past two years, the AlliedModders site has been randomly dishing out “HTTP 500 – Internal Server Error” responses. It’s been pretty frustrating.

I actually tried debugging the errors two years ago. I got far enough to attach GDB to PHP, and saw it blocked on mysql_query(), but didn’t know how to proceed. We knew that during these incidents, the server load would spike to obscene levels – like 100-150. Lots of PHP processes were getting spawned. “iowait” time was through the roof, and CPU usage was normal. Something couldn’t read the disk, or the disk was getting hammered. I put MySQL on a separate hard drive, but the problems persisted and I gave up. Keep in mind: I didn’t know anything about running servers, and I still don’t – I just have a vague notion of how to debug things.

Fast forward to a few weeks ago, we had a serious hard drive crash and suffered some painful downtime. We lost a lot of data. I clung to a hope that our years of errors had been caused by a failing disk. Predictably, we weren’t so lucky, and the website problems continued. Someone joked, “Hey, I’m glad to see you got the server back up, even the 500 errors!” Yeah… it was time to fix this. Somehow.

Step One: Apache Logs

There were only two notable Apache errors. They’d appear over and over, but only in large clumps:

[Tue Jan 25 04:15:19 2011] [warn] [client xx.xxx.xxx.xx] (110)Connection timed out: mod_fcgid: ap_pass_brigade failed in handle_request function
[Tue Jan 25 04:15:25 2011] [warn] [client xxx.xxx.xx.xx] mod_fcgid: read data timeout in 60 seconds

It was pretty clear that, somehow, communication was breaking down between mod_fcgid (Apache) and the PHP processes. I read, I think, every single webpage on the Internet involving these two errors. Twice. They were all dead ends, both for us and for the people who had originally reported them. A little sad. I wanted to blame mod_fcgid, since historically it has been really buggy for us, but I wanted to pin down absolute blame first.

So the next step was to break out a debugger.

Step Two: Debugging PHP, Apache

AM developer Fyren edited the mod_fcgid source code so that instead of killing an unresponsive PHP process, it would attach gdb to it in a separate tty. It usually took around three hours to get a debugging session going. The call stacks looked like:

(gdb) bt
...
#4 0x00007f1f7cad804f in net_write_command () from /usr/lib/libmysqlclient.so.15
#5 0x00007f1f7cad4c51 in cli_advanced_command () from /usr/lib/libmysqlclient.so.15
#6 0x00007f1f7cad1751 in mysql_send_query () from /usr/lib/libmysqlclient.so.15
#7 0x00007f1f7cad17b9 in mysql_real_query () from /usr/lib/libmysqlclient.so.15
...
(gdb) fin
Run till exit from #0 0x00007f1f7cad17b9 in mysql_real_query () from /usr/lib/libmysqlclient.so.15

At this point, the session would hang, indicating that MySQL was not responding. I really had no idea what to do, but we got lucky. We noticed that during these incidents, Apache’s server-status was stuck at “200 requests being processed,” all in the “W” (“Sending reply”) state. Our MaxClients setting was exactly 200. Now we had a hypothesis: MySQL was intermittently not responding for short bursts of time. This blocked PHP, which in turn made mod_fcgid block and time out, causing Apache to spawn a bunch of workers that would lock up as well.

Unfortunately, the intermittent failures only happened a few times a day, so debugging went slowly. We managed to catch one in the act and ran iostat:

Device:          r/s       w/s   rsec/s    wsec/s    avgrq-sz  avgqu-sz   await  svctm  %util
sdc              320.80    4.40  3540.80    54.40    11.06     4.32   13.01   3.08 100.00
sdc1             320.80    4.40  3540.80    54.40    11.06     4.32   13.01   3.08 100.00

Someone was reading tons of data off the disk, and iotop confirmed MySQL was the culprit.

Step Three: Diagnosing MySQL

I managed to find this enlightening presentation by Percona. Case study #3 looked identical to ours, except in their case, MySQL was writing. But their conclusion about bad queries seemed plausible. I made a script that would peek at server-status every second, log a ton of data, and e-mail me if it showed more than 100 concurrent requests. The data included oprofile output, iotop spew, SHOW FULL PROCESSLIST, SHOW OPEN TABLES, SHOW GLOBAL STATUS, etc.
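
For the curious, the core of that script was simple. Here is a sketch of the idea in Python – a reconstruction rather than the actual script, with the status URL, threshold, and e-mail address as placeholders:

import smtplib
import subprocess
import time
import urllib.request
from email.mime.text import MIMEText

STATUS_URL = "http://localhost/server-status?auto"  # needs mod_status enabled
THRESHOLD = 100                  # alert above 100 concurrent requests
ALERT_TO = "admin@example.com"   # placeholder address

def busy_workers():
    # The "?auto" view of server-status is machine-readable, one
    # "Key: value" pair per line; BusyWorkers counts in-flight requests.
    with urllib.request.urlopen(STATUS_URL) as resp:
        for line in resp.read().decode().splitlines():
            if line.startswith("BusyWorkers:"):
                return int(line.split(":")[1])
    return 0

def gather_diagnostics():
    # Snapshot everything worth having when the server is on fire.
    cmds = [
        ["mysql", "-e", "SHOW FULL PROCESSLIST"],
        ["mysql", "-e", "SHOW OPEN TABLES"],
        ["mysql", "-e", "SHOW GLOBAL STATUS"],
        ["iotop", "-b", "-n", "1"],
    ]
    chunks = []
    for cmd in cmds:
        result = subprocess.run(cmd, capture_output=True, text=True)
        chunks.append("$ " + " ".join(cmd) + "\n" + result.stdout)
    return "\n\n".join(chunks)

while True:
    if busy_workers() > THRESHOLD:
        msg = MIMEText(gather_diagnostics())
        msg["Subject"] = "server-status over %d concurrent requests" % THRESHOLD
        msg["To"] = ALERT_TO
        with smtplib.SMTP("localhost") as smtp:
            smtp.sendmail("watchdog@localhost", [ALERT_TO], msg.as_string())
    time.sleep(1)  # peek at server-status every second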

Oprofile’s report wasn’t too interesting, but the MySQL status showed around 150 queries waiting for a table lock. The active queries, in all cases, were SELECTs. That explained the reads, but not the table locks – SELECTs only take read locks, and read locks shouldn’t block each other. Since this didn’t make sense, I generated a few core files of MySQL from within GDB, and then Fyren and I did some post-mortem analysis.

The analysis revealed that MySQL had about 200 threads, most of which were waiting on locks (in pthread_cond_timedwait) with a timeout of one year. The locks were almost all read locks for SELECTs, and the queries that were actually completing were exclusively SELECTs. Fyren tried to track down the thread owning the locks, but the ownership ended up being apparently circular (thread X was waiting on a lock held by thread Y, and thread Y was waiting on a lock held by thread X).
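
To make the “circular” part concrete: for each blocked thread you note which thread holds the lock it wants, then walk that chain until it either ends or loops. A toy sketch in Python, with made-up thread names:

def find_wait_cycle(waits_for, start):
    # waits_for maps a thread to the thread holding the lock it wants.
    # Follow the edges from `start`; revisiting a thread means a cycle.
    path = [start]
    seen = set([start])
    current = start
    while current in waits_for:
        current = waits_for[current]
        path.append(current)
        if current in seen:
            return path  # looped back: a deadlock (or a self-wait)
        seen.add(current)
    return None  # the chain ends at a thread that can actually run

# Two threads each waiting on a lock the other holds:
print(find_wait_cycle({"X": "Y", "Y": "X"}, "X"))  # ['X', 'Y', 'X']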

That was all pretty mysterious (and still is), but rather than try to understand exactly how MyISAM table locking works, I took a look at the non-deadlocked queries. They were all one of three things: the AMX Mod X plugin search, the SourceMod plugin search, or the vBulletin member list search.

It turns out all three of these queries were extremely expensive – full of ORs, joins, and full table scans over 800MB tables. On average, each took about 500ms to complete, and that’s when MySQL wasn’t freaking out. I turned off vBulletin member searching (most large boards do), and replaced the plugin searches with faster queries off a much smaller table. This brought those page load times down to 20ms. Twelve hours later, no 500 errors had occurred, and there were no mod_fcgid failures in Apache’s logs.
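
EXPLAIN is the easiest way to spot queries like these before they melt a server: a full table scan shows up as type=ALL with a large row estimate. A hypothetical sketch using Python and the MySQLdb module (the query, table names, and credentials are all made up):

from MySQLdb import connect
from MySQLdb.cursors import DictCursor

conn = connect(host="localhost", user="user", passwd="pass", db="forum")
cur = conn.cursor(DictCursor)

# A stand-in for the kind of query that was killing us: ORs, a join,
# and no usable index.
query = ("SELECT p.* FROM plugins p LEFT JOIN authors a ON a.id = p.author "
         "WHERE p.name LIKE '%teleport%' OR a.name LIKE '%teleport%'")

cur.execute("EXPLAIN " + query)
for row in cur.fetchall():
    # type=ALL means a full table scan; "rows" is the optimizer's estimate
    # of how many rows it will have to examine.
    if row["type"] == "ALL" and (row["rows"] or 0) > 100000:
        print("full scan of ~%d rows on %s" % (row["rows"], row["table"]))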

Too Early to Declare Victory

While waiting to see if any more 500 errors would show up, the server load suddenly spiked, and lo and behold – mod_fcgid became overloaded. Our monitoring script kicked in, and this time, MySQL was nowhere in the profile! It turned out some forum users were effectively DoS’ing the site by holding down F5 on their profile pages to bump up the “view” count. Oprofile suggested that during this incident, 20% of CPU time went to PHP parsing and lexing source code. I installed eAccelerator, and now that time appears to be gone. (And for good measure, I removed profile view counts.)

Growth and Traffic

It’s unclear what our load and traffic really is. I think we’re pretty good though. CPU utilization is usually under 20%, disk I/O is under 10%, and we rarely see more than 30 concurrent requests. We don’t need to invest in a separate MySQL server yet, and we definitely aren’t growing fast enough to expect to need one anytime soon.

Community Involvement

I would like to thank Fyren for his help in looking into these problems, and asherkin for helping beef up our non-existent monitoring. Also, thanks to AzuiSleet, Drunken_Fool, MatthiasVance, and devicenull for sticking around in IRC and answering random questions.

Diagnose, then Solve

Many well-intentioned people said things like “switch to lighttpd, nginx, or IIS,” “use my favorite host,” “use a VPS,” “buy more servers,” etc. I got caught up in this too, and fueled it by initially blaming mod_fcgid. These might all be good ideas in their own right, but without diagnosing the actual cause of the errors, there was no guarantee any of them would be a fix. It’s really important to make sure a solution targets a specific problem, and I had to keep telling myself to keep a cool head. Sometimes I had to slap myself, or wait for Fyren to slap me. By careful debugging (albeit in painful, hour-per-day sprees), we got pretty close to the root cause, whereas throwing our infrastructure into a salad bowl would only have delayed the inevitable.

Part of the reason our original drive is still at the recovery lab is that someone didn’t diagnose before “solving.” When the drive failed, a technician (not one of us) ran fsck on it instead of running a hardware diagnostic first. I have no doubt this caused untold extra damage.

Conclusion

tl;dr: the forums should be much faster now, and they shouldn’t error out anymore. I’ve got the system set up to e-mail me and dump diagnostics whenever our load goes way above normal, so I’ll continue monitoring for potential problems.

I would like to sincerely thank the whole AlliedModders community for the successful donation drive. I won’t lie, the past month has been pretty sucky. Despite our failings, the fact that everyone cares enough to see the whole thing keep going is really awesome. Our seven-year anniversary was just a few weeks ago. Here’s to AlliedModders in 2011!

5 thoughts on “Debugging 500 Errors”

  1. Aryeh Gregor

    I’ve been tech admin of twcenter.net since late 2005. We also run on only one server (until we just got a second for failover), but our vBulletin is significantly bigger than yours, so maybe I can offer some useful advice from experience.

    If you’re having locking problems with vBulletin, switching from MyISAM to InnoDB will solve them, since InnoDB uses only row-level locking for table modifications (instead of table-level), and no locking at all for selects (via the magic of MVCC). This isn’t a substitute for fixing slow queries, but MyISAM simply doesn’t scale except on very specific workloads. It takes out table-level locks for every query, and that just doesn’t work for concurrency.

    InnoDB doesn’t support MySQL fulltext search, so you’ll have to use the semi-official vB Sphinx plugin (http://www.vbulletin.com/forum/showthread.php/350427). But that’s a big performance improvement in itself; MyISAM’s fulltext search is hideously slow. Since you apparently haven’t upgraded to 4.0, you can use the totally unofficial one instead (http://www.vbulletin.org/forum/showthread.php?t=127868); my site has been using that for a few years with no problems. Even without that, you can switch all tables without fulltext indexes to InnoDB, but that leaves the post table unconverted, and of course that’s your biggest one.
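
    For example, a conversion pass might look like this – just a rough sketch in Python; the connection details are placeholders, and of course take a backup first:

    import MySQLdb

    conn = MySQLdb.connect(host="localhost", user="user", passwd="pass", db="forum")
    cur = conn.cursor()

    # Tables with a FULLTEXT index have to stay MyISAM, so find them first.
    cur.execute(
        "SELECT DISTINCT table_name FROM information_schema.statistics "
        "WHERE table_schema = DATABASE() AND index_type = 'FULLTEXT'")
    fulltext_tables = set(row[0] for row in cur.fetchall())

    cur.execute(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = DATABASE() AND engine = 'MyISAM'")
    for (table,) in cur.fetchall():
        if table in fulltext_tables:
            print("skipping %s (has a FULLTEXT index)" % table)
            continue
        print("converting %s..." % table)
        cur.execute("ALTER TABLE `%s` ENGINE=InnoDB" % table)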

    If you switch to InnoDB, make sure you reconfigure MySQL appropriately (this post will be helpful: http://www.mysqlperformanceblog.com/2007/11/01/innodb-performance-optimization-basics/). Basically, key_buffer_size is used only by MyISAM; the InnoDB equivalent is innodb_buffer_pool_size. But the latter should be set around twice as large as the former if you switch from all-MyISAM to all-InnoDB, because MyISAM uses the OS disk cache for data (key_buffer_size is only for indexes, as the name suggests) while InnoDB doesn’t use the OS disk cache at all. The only really important InnoDB tuning setting is the buffer pool size; raise it until “SHOW INNODB STATUS\G” in MySQL shows a buffer pool hit rate of, ideally, 1000/1000 (should be attainable if you have a decent amount of memory).
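
    If you want to script that check, something like this works (again just a sketch; connection details are placeholders):

    import re
    import MySQLdb

    conn = MySQLdb.connect(host="localhost", user="user", passwd="pass")
    cur = conn.cursor()
    cur.execute("SHOW ENGINE INNODB STATUS")
    status_text = cur.fetchone()[-1]  # the Status column is the last one

    m = re.search(r"Buffer pool hit rate (\d+) / (\d+)", status_text)
    if m:
        # You want this as close to 1000/1000 as possible.
        print("buffer pool hit rate: %s / %s" % m.groups())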

    InnoDB has other useful features too. If all your tables are InnoDB, you can use mysqldump --single-transaction to get consistent backups without taking out any locks, so no site downtime regardless of how large your database is. mysqlhotcopy will no longer work, if you were using that, and mysqldump is much slower to restore than mysqlhotcopy, but having SQL backups rather than just file copies is good anyway in the event of database corruption.

    Generally, InnoDB is a far superior storage engine to MyISAM. MyISAM only clung on to life for so many years because InnoDB was owned by a different company and MySQL AB didn’t want to become dependent on it, so it kept developing MyISAM and left it as the default. Oracle now owns both MySQL and InnoDB, so they did the sensible thing and made InnoDB the default in 5.5, now that it had no conflict of interest. At least one fork of MySQL did the same thing.

    For detecting slow queries in MySQL, you want to enable the slow query log (see http://dev.mysql.com/doc/refman/5.1/en/slow-query-log.html). mysqldumpslow will then tell you what all the problematic queries are, and you can try optimizing them. If your problem is locking, though, the results won’t be very useful: it will tell you which queries were blocked rather than which ones took out the locks. Switching to InnoDB will fix this.

    I’ve publicly recorded most of the optimization stuff I did with my forum over the last three and a half years here: http://www.twcenter.net/forums/forumdisplay.php?f=321 (maybe some of it will be useful to you). If you have any questions, I might be able to give you further advice. I’m happy to help out other open-source software developers when I can.

    1. dvander (post author)

      Aryeh – thank you for the in-depth reply! In fact, we aren’t using fulltext searches right now (we’re using the weird postindex one), so we might just be able to straight-up convert to InnoDB.

      Yeah, we had the slow query log, but as you said, the locking made it noise.

      It’s great you’ve documented your optimization pains. I’ll read over this today. Your forum was started around the same time as ours and has about 8X as many posts – and it’s clearly handling the load much better. From the tech posts on your forums, it seems like you might use more than one server? Do you rent, or colo?

      I noticed you also haven’t upgraded to vB 4.0. I’m on the fence about this upgrade – the underlying code doesn’t look too different and we’d have to retheme again. So it doesn’t seem like there’s a performance motivation, but maybe there is a strong feature one.

      1. Aryeh Gregor

        We colocate two servers, with hosting costs paid by advertising and hardware costs paid by donation drives. We always used to be on one server, but a few months ago we had a one-week outage (the longest since I became admin in 2005) caused by repeated hardware failures combined with me and the on-site tech admin being away/busy at various points. At that point we got a second server for failover, but we aren’t doing any load-balancing, so for performance purposes we have only one server.

        (It’s a pretty beefy server, though: sixteen cores, 16 GB of RAM, and now four 10k RPM SATA drives on the new machine. The disks are relatively weak, but that’s okay, since our data fits in memory well. If our dataset got too big, it would probably be cheaper to upgrade RAM instead of disks.)

        We haven’t upgraded to vB 4.x because I haven’t gotten around to doing the necessary work. I plan to get to it sometime after we switch to our new server (which has somewhat better hardware than the current one), which should happen soon. I don’t think it has any very interesting new features, but the fact is, as soon as 4.2 is released 3.8 is no longer getting security fixes. We’ve been attacked by bots exploiting vB security flaws before, so we can’t really take the risk. We don’t want to be forced to upgrade in a hurry to thwart a sudden attack. We will have to reskin, but we have a whole bunch of artsy sorts who’d be happy to do the work for us if it means their skin gets to be the new default. AFAIK, there’s no performance motivation to upgrade.

        Overall, vB’s scalability is decent but not great. I’m pretty sure most vB forums my forum’s size are on multiple servers, but I was able to keep things working efficiently on a single server with suitable hacks. MySQL is no problem for us at all at this point; our bottleneck is almost purely CPU. In addition to switching to InnoDB, and to Sphinx instead of MySQL’s fulltext search, I wound up adding some indexes to certain tables and tweaking some queries.

        As far as MySQL performance goes, I’d recommend the book “High Performance MySQL, Second Edition”. Trying to debug MySQL performance problems using gdb is an interesting approach, but probably not the most efficient way to go about things. :)

  2. Lennie

    I’m really surprised you got that far with that storage engine and website traffic without knowing what to do.

    I know that sounds like an insult; it isn’t meant as one. I mean without experience in this field/subject, or without finding the right advice.

    As Aryeh Gregor mentioned, gdb and so on is probably not the best way to solve problems in such a complicated piece of software.

    Still, an interesting read to see how people (try to) solve things.

