Sunday, September 28, 2008

CPU Affinity Masking plus High CPU = Login Timeouts

Recently we saw an alarming issue with one of our database servers. Under heavy but not prohibitive load (60-80% CPU), it stopped accepting new connections intermittently. Obviously this is bad for any server, especially one that handles thousands of client connections simultaneously.

Some of the hard drives were under heavy load, especially tempdb, but another anomaly was a long-running query using many linked server connections (11!) that had been killed but was stuck in rollback several hours later. It was consuming near 100% of the cycles on one CPU core. We tried a variety of the usual DBA tricks to get rid of this spid, but nothing worked. It wasn't clear to us how this could cause the server to stop accepting connections, however.

Another oddity was the appearance in the System logs of messages like this:

The time stamp counter of CPU on scheduler id 2 is not synchronized with other CPUs.

I asked our Systems guys about this, and they said that this had been noticed a few months ago and a workaround had been put in place as per KB 931279 - the CPU affinity mask had been set in SQL Server.

Hmm.

This happened again a week later, minus the spid stuck in rollback, but with one CPU slammed at 100% again. We had engaged Microsoft PSS to assist with the problem, but so far all they had told us was that we had tempdb IO issues, which we knew. (DVNation, please finish the Windows drivers for your IODrives so we can use them for tempdb!)

So here's my theory, cooked up with one of the other DBAs: the login issues are being caused by the combination of affinity masking and one CPU at 100%. This could happen because schedulers are affinitized by the mask to a single CPU, making them unable to hop CPUs when one is under heavy load. User logins round-robin between schedulers, so if a scheduler is stuck to a single CPU and that CPU is not making enough cycles available to log someone in, eventually the login will timeout and fail.

Plausible? Anyone else seen this kind of issue?

UPDATE:

I was right. After removing the affinity masking, we no longer saw login timeouts, even when the server was near 100% CPU load. Be careful with those affinity masks.

Monday, September 22, 2008

psexec Permission denied error

In the course of my DBA and software dev tasks, I make extensive use of Mark Russinovich's excellent PSTools. Today I ran across an odd error while running a remote command using psexec. Filenames have been changed, obviously, but here's the gist of it:

> psexec \\remotemachine python \\fileserver\script.py param1 param2

python: can't open file '\\fileserver\script.py': [Errno 13] Permission denied

So it was obviously opening python, which I double-checked by running with the -i option and checking the remote machine's desktop using VNC Viewer. The python shell was opening successfully, but it couldn't access the file server. Executing the script directly from the remote machine's command line worked fine.

Then I realized that this was the same issue that one gets from too (two? ;) many linked server hops using integrated authentication - it's a two-hop Kerberos problem. Google that if you don't know what I'm talking about. Fortunately, psexec offers the option of passing a username and password, and as soon as I added those options, my command executed successfully.

Tuesday, September 16, 2008

DBA rant

I'm not big on posting links, but this is the best angry DBA post I've read in some time.

Tuesday, September 9, 2008

SSIS Data Flow Task Error caused by too many Data Flow tasks

I just threw together a very basic SQL 2005 SSIS package, the main component of which is a Data Flow task with 25 sub-flows. The flows are very simple, just direct transfers of a table on one server to the same table name on another server. As an aside, it took WAY too long to create this package - for each little flow, I had to select the source connection, wait for it to populate the table drop-down, then scroll through that drop-down through hundreds of tables to find the one I wanted, then do it again for the destination. This could be better in a number of ways:
  1. Allow typing in the table field. It would have been much quicker to allow me to type or copy and paste the table name rather than waiting for the drop-down and then scrolling through it. 50 times.
  2. Improve copy and paste of flow tasks. If I could have constructed one and duplicated it 25 times, it would have been much quicker, but when I attempted that, I got strange warnings about field truncation that were cleared up only by deleting the source and re-creating it.
  3. Create a wizard that facilitates auto-generating this kind of simple package better than the Transfer Objects wizard.
Maybe one or more of these has been implemented in SQL 2008, but I can't use that yet, since my office has no plans to upgrade our SSIS runtime any time soon.

Anyway, when attempting to execute the package, I was consistently getting a failure from one flow with this error:

TCP Provider: An existing connection was forcibly closed by the remote host

This didn't make much sense to me, as it worked fine when run by itself. I found this article and added that registry key, but it did not fix the problem (although it did seem to increase throughput).

Then I noticed that the two largest flows in the Data Flow task (on the order of millions of rows) were executing at the time when the problem task was throwing an error. I thought SSIS would be able to handle the sequencing of parallel tasks, but I decided to enforce my own execution order by placing the two large flows in a separate Data Flow task.

Voila. Works fine. So if you're having issues with an overpopulated Data Flow task throwing that TCP error, one simple workaround is to manually sequence/reduce the parallel execution of the tasks.

Monday, September 8, 2008

Programming Languages I know

To pick up the latest blog mini-meme (courtesy of Corey Goldberg), here's The List of programming languages I have learned:

1. AppleScript
2. C
3. Java
4. Lisp (Scheme)
5. C++
6. JavaScript
7. ANSI SQL*
8. PHP
9. T-SQL
10. VBScript
11. C#
12. Python

Kind of a strange order, no? Mostly driven by expediency, except Scheme and C++, which I learned in college.

UPDATE: Dan points out that ANSI SQL is not Turing complete and should be asterisked, so I guess I'm down to 11.5. Let's see your monster list, Dan - mcfunley.com has been pretty quiet lately.

Friday, September 5, 2008

DTS Editing doesn't work in SQL 2008?

I've been running the SQL 2008 RTM for a few weeks now, and generally it works fine. But today I needed to open an ancient DTS package (it's not from my team, all our stuff is SSIS), and couldn't do it. First it told me I needed the SQL 2005 Backwards Compatibility components. "That's odd," I thought, "I could have sworn I installed all of that when I ran the SQL 2008 setup. And why is it asking for the 2005 components?"

It appears that Microsoft has simply updated the 2005 BC component install to work with 2008. Fine. I installed that, and tried to open the package again. No dice.


So I went to find those components. Found the package, installed it... to no effect whatsoever! I still get the same message! Note to Microsoft: if you don't want me to edit DTS packages in 2008 and have deprecated that functionality entirely, fine. Just TELL ME instead of popping up worthless error messages whose instructions I follow to no avail.