[personal profile] billroper
So you remember the locking fixes that I've been working on for our server since August or thereabouts? The ones that were "finished"?

Yeah, not quite. My changes got through five days of successful Uptime testing (in that the server did not crash), but then the tester clicked on the tab in our admin app that shows the transaction results and the server GPF'd. Sadly, we don't know where: no one was logged in on the server machine, so the just-in-time debugging didn't kick in. The logical assumption is that something was wrong in the code that would have returned the transaction results.

I cleaned that up, then spent some time looking at other code in the class that manages the transactions. Having found at least one locking error there, I decided to rewrite the class and bulletproof it as much as possible. That's almost done.
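
As a rough illustration only (this is not the actual server code; `TransactionLog`, `record`, `snapshot`, and `run_workers` are hypothetical names, and Python stands in for whatever language the server is written in), one common way to guard against this kind of bug is to put the shared results behind a single lock and hand readers a copy rather than a live reference:

```python
import threading


class TransactionLog:
    """Hypothetical stand-in for a class that manages transaction results."""

    def __init__(self):
        self._lock = threading.Lock()
        self._results = []

    def record(self, result):
        # Writers take the lock so concurrent appends cannot interleave.
        with self._lock:
            self._results.append(result)

    def snapshot(self):
        # Readers take the same lock and receive a copy, so a caller
        # (say, an admin UI) never walks a list that is being modified.
        with self._lock:
            return list(self._results)


def run_workers(log, workers=4, per_worker=1000):
    """Record results from several threads at once, then return a snapshot."""
    def work(worker_id):
        for i in range(per_worker):
            log.record((worker_id, i))

    threads = [threading.Thread(target=work, args=(w,)) for w in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log.snapshot()
```

Returning a copy costs some memory on every read, but it means the code displaying the results can never hold a reference into data that another thread is changing.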

In the meantime, we're not entirely happy with the memory profile that the version with the locking fixes was showing, so our QE team went back to run the Uptime test against the baseline version.

And the server crashed within a day. More than once, as they restarted it and tried again.

I feel somewhat vindicated. :)

Date: 2012-03-30 12:55 pm (UTC)
From: [identity profile] phillip2637.livejournal.com
Locking in a multithreaded system is tough...also not well understood.

I once worked in a place with a reasonably large, cross-platform codebase where, before my arrival, the CTO had gone directly to a junior programmer and told him to no-op some of the locking macros for one specific system type because he thought they were slowing it down. Nobody else knew it had been done (which was a different and tougher problem). The port of the product to that platform was alpha-ish and the very few related hangs/crashes weren't an immediate priority. At some point they ran it under load on an 8-CPU system and you can imagine what all broke loose. :)

(*The CTO was promoted -- no connection to this incident -- and relocated to a place about 2500 miles away which was better for the tech environment, not so good for the company as a whole.)
