E-NDMS
E-NDMS Traffic in May
As in each of the last 2-3 years, E-NDMS has struggled to accommodate the significant increase in traffic during May. Typically, traffic spikes during the week of the 15th and slows down afterward. This year, the pattern changed. As usual, the week of the 15th was a problem, but this week, the week after Memorial Day, presented serious problems of its own. I'm aware there have been performance challenges all month, but this week the system was basically unusable.
Back in November, when I moved our systems to Amazon EC2, you might recall that my contingency plan for May was to upgrade the database to a larger server. Yesterday, I finally took that step. The big question is, why did I wait? In short, I waited because the database server was not the primary culprit in the problems we were experiencing earlier in the month.
If you were unlucky, you probably saw a number of Server Down or Server Unavailable messages between the 11th and 14th. These were due to resource exhaustion on the web server, which should never happen with E-NDMS. E-NDMS traffic simply shouldn't come close to overwhelming a web server. Upon further investigation, I found that the newer version of Linux I used set a critical TCP/IP network stack parameter to a different value than I have seen before. In fact, it's a parameter I've never had to tune for E-NDMS. Once it was corrected, the Server Down and Server Unavailable messages mostly went away.
When traffic slowed on the 14th and 15th and didn't increase the following week, I naively thought we had survived the traffic spike. At the beginning of this week, I found out I was mistaken. In researching the continued performance problems, I found that a database temporary file path, critical to Competency and Contact reports, was running out of drive space. When reports exhausted that space, other operations and reports were affected as well. Unfortunately, correcting that didn't do enough to keep the system running.
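For those technically inclined, a check along the following lines could flag a temporary path that is running low on space before reports exhaust it. This is only a minimal sketch and not the monitoring actually used on E-NDMS: the function names, the 10% threshold, and the use of the system temp directory as a stand-in for the database temporary path are all assumptions for illustration.

```python
import shutil
import tempfile


def free_fraction(path):
    """Return the fraction (0.0 to 1.0) of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def is_low_on_space(path, min_free_fraction=0.10):
    """Return True when less than `min_free_fraction` of the filesystem is free.

    The 10% default is an arbitrary, assumed threshold.
    """
    return free_fraction(path) < min_free_fraction


if __name__ == "__main__":
    # Hypothetical stand-in: the real database temporary path is not given in this post.
    temp_path = tempfile.gettempdir()
    percent_free = free_fraction(temp_path) * 100
    status = "LOW" if is_low_on_space(temp_path) else "ok"
    print(f"{temp_path}: {percent_free:.1f}% free ({status})")
```

Run on a schedule, a check like this would give some warning before large reports fill the drive.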
With the serious problems corrected, the final remedy was to move to a larger server. For those technically inclined, the Amazon EC2 server we were using was equivalent to a single 1.5 GHz processor with 1.5 GB of memory. The new server is equivalent to five 1.5 GHz processors with 7.5 GB of memory. Today, the new server never strayed much higher than 50% of CPU capacity. Competency and Contact Reports were running in less than 5 minutes, and Follow-up Reports were taking little more than half a second.
Needless to say, I wish I could reverse the order in which I applied corrections, but I made decisions based on symptoms and historic traffic patterns. In the future, we will be better prepared to deal with the May rush.
Posted at 11:15PM May 29, 2009 by Jason Koeninger in General | Comments[0]