AWS RDS PHP connections hanging during failover

As part of our ongoing efforts to improve redundancy for our customers, Learnosity has been actively researching and preparing to migrate all of our EC2 databases into Amazon RDS. After putting RDS through its paces with a barrage of tests we identified a few easily solved issues, however we were left with one major head scratcher.

When a test is completed, it is sent to an SQS queue before a PHP process on one of our workers collects, processes and stores it in our databases, however when a failover occurs between RDS nodes, any connections to the database hang indefinitely, leaving our PHP workers locked up and in a non recoverable state.

We wrote a simple stand-alone script that would query from the database in an endless loop to see the process hanging. 20 lines of PHP later, we had a script that would connect to the DB and run a simple select from the dummy table DUAL and it would still freeze during a failover. (See this link for a copy of the code)

At this stage it became apparent the issue was strictly on the PHP layer that communicated with the Database. We then tried with mysqli instead of PDO with the same result. Research led us to read about the two underlying drivers: MySQL native driver (mysqlnd) and MySQL client server library (libmysqlclient).

We then read through the vast array of possible configuration options before trialling net_read_timeout which by default is set to 1 year, with 3 reattempts. This seemed the likely culprit of PHP just stalling when the failover took place as normally there would be a network level interrupt that would occur before this massive PHP timeout would occur, however due to the seamless nature of RDS failovers, this wasn’t occurring.

** Note: Despite mysqlnd being installed by default by PHP 5.5, under ubuntu it will overwrite this with the default libmysqlclient instead of php5-mysqlnd, you will need to apt-get install this package for this to be resolved.

After updating our infrastructure to utilise mysqlnd and testing again the timeout was respected and after 20 seconds, the query was aborted with the error “2006 – MySQL Server has gone away”.

We decided on a timeout value of 60 seconds, which means if we have a long running query that takes more than 60 seconds to complete, we’ll get a 2006 error. The reason for this value is that currently our MySQL servers sit behind an ELB and that has a 60 second timeout, and the workers seem to work fine, so we should be fine with the same value.

Hopefully by sharing this blog post with others, we might prevent a lucky engineer from spending a day on this too.

This post was posted in , , , , , , , by on