The Digging Continues
Mar. 29th, 2012 10:41 pm
So you remember the locking fixes that I've been working on for our server since August or thereabouts? The ones that were "finished"?
Yeah, not quite. My changes got through five days of successful Uptime testing (in that the server did not crash), but then the tester clicked on the tab in our admin app that would show the transaction results and the server GPF'd. Sadly, we don't know where: no one was logged in on the server machine, so the just-in-time debugging didn't kick in. The logical assumption is that something was wrong in the code that would have returned the transaction results.
I cleaned that up, then spent some time looking at other code in the class that manages the transactions. Finding at least one locking error, I decided to rewrite it and bulletproof it, as much as is possible. That's almost done.
In the meantime, we weren't entirely happy with the memory profile that the version with the locking fixes was showing, so our QE team went back to run the Uptime test against the baseline version.
And the server crashed within a day. More than once, as they restarted it and tried again.
I feel somewhat vindicated. :)
no subject
Date: 2012-03-30 12:55 pm (UTC)
I once worked in a place with a reasonably large, cross-platform codebase where, before my arrival, the CTO had gone directly to a junior programmer and told him to no-op some of the locking macros for one specific system type because he thought they were slowing it down. Nobody else knew it had been done (which was a different and tougher problem). The port of the product to that platform was alpha-ish and the very few related hangs/crashes weren't an immediate priority. At some point they ran it under load on an 8-CPU system and you can imagine what all broke loose. :)
(*The CTO was promoted -- no connection to this incident -- and relocated to a place about 2500 miles away which was better for the tech environment, not so good for the company as a whole.)
no subject
Date: 2012-03-31 03:37 am (UTC)
On the subject of your former company, well, they shoot CTOs, don't they?