Sunday, September 28, 2008

CPU Affinity Masking plus High CPU = Login Timeouts

Recently we saw an alarming issue with one of our database servers. Under heavy but not prohibitive load (60-80% CPU), it stopped accepting new connections intermittently. Obviously this is bad for any server, especially one that handles thousands of client connections simultaneously.

Some of the hard drives were under heavy load, especially tempdb, but another anomaly was a long-running query using many linked server connections (11!) that had been killed but was stuck in rollback several hours later. It was consuming near 100% of the cycles on one CPU core. We tried a variety of the usual DBA tricks to get rid of this spid, but nothing worked. It wasn't clear to us how this could cause the server to stop accepting connections, however.

Another oddity was the appearance in the System logs of messages like this:

The time stamp counter of CPU on scheduler id 2 is not synchronized with other CPUs.

I asked our Systems guys about this, and they said that this had been noticed a few months ago and a workaround had been put in place as per KB 931279 - the CPU affinity mask had been set in SQL Server.

Hmm.

This happened again a week later, minus the spid stuck in rollback, but with one CPU slammed at 100% again. We had engaged Microsoft PSS to assist with the problem, but so far all they had told us was that we had tempdb IO issues, which we knew. (DVNation, please finish the Windows drivers for your IODrives so we can use them for tempdb!)

So here's my theory, cooked up with one of the other DBAs: the login issues are being caused by the combination of affinity masking and one CPU at 100%. This could happen because schedulers are affinitized by the mask to a single CPU, making them unable to hop CPUs when one is under heavy load. User logins round-robin between schedulers, so if a scheduler is stuck to a single CPU and that CPU is not making enough cycles available to log someone in, eventually the login will timeout and fail.

Plausible? Anyone else seen this kind of issue?

UPDATE:

I was right. After removing the affinity masking, we no longer saw login timeouts, even when the server was near 100% CPU load. Be careful with those affinity masks.

No comments: