Monday, May 21, 2007

Standardize!

I was trying to restore a database backup from one server onto another this morning, and despite the two having the same drive letters for data and logs, I ran into problems because the directory names were different. For a company that runs everything based on rules and standards to the point of excess, this is silly. So I'm making a quick list of SQL Server-related settings that I feel should be standardized across all servers in one's environment. It just makes things easier!

1) Directory structures
2) Default data and log locations
3) Drive letters, if possible
4) Build levels (Service Pack/hotfix)
5) Logins, although these may be different between Dev, QA, and Prod
6) Install locations - I have never found a valid reason to install SQL Server on the C: drive of a non-desktop, but I keep seeing it

I'm sure I'll add to this list as I come up with more obnoxiously un-standardized settings in our environment.
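
As a starting point for auditing a couple of these, here's a quick sketch that assumes SQL Server 2005. Note that xp_instance_regread is an undocumented extended procedure, so treat the default-location part as a convenience rather than gospel.

-- Build level and edition (item 4)
SELECT
    SERVERPROPERTY('ProductVersion') AS build,
    SERVERPROPERTY('ProductLevel')   AS service_pack,
    SERVERPROPERTY('Edition')        AS edition;

-- Default data and log locations (item 2)
EXEC master.dbo.xp_instance_regread
    N'HKEY_LOCAL_MACHINE',
    N'Software\Microsoft\MSSQLServer\MSSQLServer',
    N'DefaultData';
EXEC master.dbo.xp_instance_regread
    N'HKEY_LOCAL_MACHINE',
    N'Software\Microsoft\MSSQLServer\MSSQLServer',
    N'DefaultLog';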

Wednesday, May 16, 2007

WRITELOG waits

My team has been assisting another team off and on for several months with performance tuning their overnight DB processes. There are many possible bottlenecks in their process, but since the app writing into the DB is highly complex and multi-tiered, we're trying to focus on the internal DB issues first.

The first big win was dropping triggers from the tables being written to, which cut overall job time by 20%. Now the biggest WAITTYPE during the process is WRITELOG, accounting for about 50-60% of the overall time in both QA and Prod. NETWORKIO is next, and I'd really like to SET NOCOUNT ON, since these writes are issued as individual transactions and suppressing the rowcount messages should reduce the NETWORKIO substantially, but that's been problematic so far.
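
For reference, here's the kind of quick check I use to see where the time is going. This is a sketch assuming SQL Server 2005, where the old DBCC SQLPERF(WAITSTATS) numbers are exposed through the sys.dm_os_wait_stats DMV; the counters are cumulative since the last restart (or since they were last cleared).

-- Top waits since the counters were last cleared (SQL Server 2005)
SELECT TOP 10
    wait_type,
    waiting_tasks_count,
    wait_time_ms,
    signal_wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;

-- Optionally reset the counters right before the overnight run starts
DBCC SQLPERF('sys.dm_os_wait_stats', CLEAR);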

Digging into the WRITELOG waits, I found the disk queues are pretty low (in the 0-5 range against 80 or so physical disks on a big Hitachi SAN with a ton of cache), so the storage guys have dismissed the disk as the problem. However, I think it's still the disk, just not in the usual way.

The standard problem is that the disk subsystem runs out of write cache and has to start de-staging data to disk, and the physical disks can't write data as fast as it's coming in. That's not the case here: the cache on the SAN is huge (70+ GB, according to the SAN guys), it can accept data faster than the logs are being generated, and it hasn't come close to the high watermark at which de-staging kicks in. So what's the problem?

Well, in this case, the disks presented to the SQL Server have been attached as a single logical drive, so Windows queues all the IOs against that one drive. Therein lies the problem, I believe: the SAN is fast, but every operation goes through that same queue, so one slow read for a piece of data not in the cache can hold up all the writelog IOs queued behind it. That, I think, is what's creating the WRITELOG waits.
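
One way to sanity-check this theory is to look at IO stalls per file instead of the drive-level queue counters. Again assuming SQL Server 2005, sys.dm_io_virtual_file_stats breaks stall time out by individual data and log file, so log files stuck behind a congested shared queue should stand out with high average write stalls.

-- Average write stall per file; high numbers on log files point at the log path
SELECT
    DB_NAME(database_id) AS database_name,
    file_id,
    num_of_writes,
    io_stall_write_ms,
    io_stall_write_ms / NULLIF(num_of_writes, 0) AS avg_write_stall_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL)
ORDER BY avg_write_stall_ms DESC;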

The proposed solution is simple: add another logical drive dedicated to logs. Even if it's hitting the same physical disks, the addition of another disk queue should allow the log IOs to flow through without being blocked by slower data IOs or data reads. Stay tuned to see if it works.
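
If the new drive does pan out, relocating an existing log file onto it is simple enough in SQL Server 2005. The database and drive names below are made up for illustration; the new path only takes effect after the database has been cycled offline and back online, with the .ldf file copied to the new location in between.

-- Point the log file at the dedicated log drive (names and path are hypothetical)
ALTER DATABASE SomeDB
    MODIFY FILE (NAME = SomeDB_log, FILENAME = N'L:\SQLLogs\SomeDB_log.ldf');

ALTER DATABASE SomeDB SET OFFLINE WITH ROLLBACK IMMEDIATE;
-- copy SomeDB_log.ldf to L:\SQLLogs\ at the OS level here
ALTER DATABASE SomeDB SET ONLINE;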

Addendum: We tried something like this before at Capital IQ, splitting sets of disks into multiple metaLUNs and comparing them against the same number of disks in a single metaLUN, and saw barely any performance gain (within the margin of measurement error). However, the difference between that situation and this one is the amount of cache involved - we were always blowing past the cache on the CapIQ SAN, so the physical disks were the bottleneck. Here I suspect it's the queue.

Tuesday, May 8, 2007

Cool new clustering stuff

That I just learned while perusing some Longhorn clustering docs.

1) GPT cluster disks are now supported. This should mean no more 2TB limit.
2) The Supported Hardware list is gone - now anything that validates is officially supported.
3) Clustering of virtual servers is now officially supported, so no more hacking virtual disks to get them to show up in multiple virtual instances. Caveat: The Parallel-SCSI model is deprecated, so it looks like virtual iSCSI is the way to get this to work.

and my personal favorite, quoted straight from the Q&A session:

4) "In Windows Server 2003 the way disks were identified was by the Disk Signature. In Longhorn clustered disks are identified by multiple attributes, including the Disk Signature and the SCSI Inquiry data. If a disk can't be found by one attribute, it will attempted to be found in another way. If found the service will self heal and update it's records of the disk. So in Longhorn disks will self heal and should elimate support issues of disk signatures changing and disks not coming online."

Finally! No more deleting and re-adding cluster disks when the cluster service loses them!

Longhorn clustering should be cool. Now I just have to find another cluster to manage.