NYC DBA: 2007

Tuesday, December 11, 2007

More fun with ranking functions

I don't think I really understood why Mike F at Capital IQ was so excited about the SQL 2005 ranking functions until just recently, but I'm using them all over the place now. Minutes after kicking off the time series load I described in my last post, I came across another use for RANK.

EDIT: As I review this, I recognize that this is pretty esoteric and may not have much practical application for most people. You've been warned.

Like many financial companies, we get data from multiple vendors, and we have to decide which vendor's data to use in each case. So I've got a set of data provided by multiple Vendors, grouped by ObjectId and Date, that is ranked for each date by some business logic built into the procedure that loads the data. The best (lowest, in this case) ranked row for each ObjectId and Date gets a bit flag set; we'll call it IsBest.

However, it was uncovered that there was an extra negation in the business logic for many cases, so many of the rows had the bit flag set incorrectly. I needed to find a way to determine what portion of the data had the flag set incorrectly. This could be accomplished using any number of data warehousing techniques - Analysis Services is well-suited to it, for example, or Microstrategy, or even a complicated series of GROUP BYs. But I didn't want to go to all that trouble just to produce a single ad hoc report, so I found another way.

Some (totally made up, to prevent my employer from having me whacked) sample data, in which only objectId 4 has the correct settings for the bit flag:

I needed to find rows in which the highest-ranked VendorId had the IsBest flag set for a given ObjectId and Date, and any rows in which a lower-ranked VendorId did not have the IsBest flag set. Any other combination is incorrect. If I'm only worried about how many rows need to be fixed, I can stop there, or I can divide the sum of those two rowcounts by the total number of rows to get a percentage of data that is correct.

A query to return the rowcounts follows. I'm using DENSE_RANK rather than RANK to keep the ranking correct despite duplicate rows (as in objectId 3).

SELECT
IsBest, DRank, count(*)
FROM
(SELECT
[Date]
, ObjectId
, VendorId
, IsBest
, DENSE_RANK() OVER (PARTITION BY Date, ObjectId ORDER BY Priority) as DRank
from tbSource
) as allRows
group by IsBest, DRank
order by IsBest, DRank

Additional code could be added to this query to automatically sum the correct cases and determine the percentages, but I will leave that as an exercise for the reader. Here are the raw counts, with the correct cases highlighted:

Another cool use for RANK() - Time Series

I've got a big set of time series data across many object instances that needs to be inserted into a table that uses a trigger to set the "End Time" of each segment in the series. In other words, data like this:

When a new row for a given ObjectId is inserted, the previous row in the series is given an EndTime equal to the StartTime for the new row. Usually, this is done one row at a time, with only one row coming into the table for each ObjectId at a time.

However, in this case, I need to backfill a lot of data, which will cause the trigger to not behave as expected. I have a few options to work around this:

Insert each row one at a time. Turns out there are over a million rows to be inserted and this is a heavily-used table, so this option is undesirable.
Write a script to generate the EndTimes for each row ahead of time. This could be tricky and CPU-intensive, and I'd rather avoid it if possible.
Partition the data such that the trigger will work as it's supposed to, but large batches of data can be inserted instead of inserting one row at a time.

I'm currently investigating option 3, especially because it lets me use the RANK() function. The basic idea is this: for each ObjectId, rank the StartTimes, then loop through and insert all rows where RANK = 1, then 2, 3, etc. This will insert all the ObjectIds with their initial StartTime, then the next StartTime, and so on, allowing the trigger to fire on each round and update the EndTimes en masse.

The query I'm using (generalized for mass consumption):

INSERT INTO tbTemp
SELECT
ObjectId
, Value
, StartTime
, RANK() OVER (PARTITION BY ObjectId ORDER BY StartTime) as Ord
FROM
tbSource

Then the data in tbTemp can be looped through by rank and inserted into the destination table.

And the resulting data should look like this:

Thursday, November 29, 2007

Many-to-many physical table design

I was creating a new set of tables (we'll say 2 for simplicity) that will contain rows that can have many-to-many relationships, and I was pondering how best to design the association table so as to optimize IO.

In this table scenario, joins can come from either direction, so it could be one row from the first table that joins to many in the second, or vice versa. I'm generally a big fan of a clustered index on some column, especially an integer ID if it's useful, but in this case, I think the best performance would be achieved by dispensing with the clustered index and ID column altogether. The basic table design would be as follows:

CREATE TABLE User
(
UserId INT IDENTITY(1,1)
, ...
)

CREATE TABLE Group
(
GroupId INT IDENTITY(1,1)
, ...
)

CREATE TABLE UserToGroup
(
UserId INT
, GroupId INT
, ?
)

A clustered index on UserId would speed up queries starting with a User and joining to Groups, but it would fragment as old users added groups, splitting pages to insert UserIds. The reverse would be true of clustering on GroupId. Unless the query pattern supported such an unbalanced design which would require frequent index defragmentation, either of these clustered indices would be less than ideal.

Adding a surrogate key identity column would be no better; it would prevent index fragmentation, but the index wouldn't be used for seeks, and the column would add 50% more space on disk between the columns that we really care about, making I/O proportionately slower.

The best design should be a heap, with nonclustered indices on UserId, GroupId and GroupId, UserId. Or more specifically, in SQL Server 2005, indices on UserId and GroupId, with the other column set up as an "included column" to make each index covering.

I haven't tested this yet, though, so I welcome any comments or contradictions.

Tuesday, November 27, 2007

Odd SSIS bug

Here's what happened: I added a column to a source table, went into the Data Flow source, refreshed the table list, selected the table, and checked all the output columns that I wanted. Then I resolved all the naming issues with the downstream tasks.

But I'm getting an error (#1 below) when I try to build, because for some reason SSIS will not automatically add the new column to the Error Output, and it won't let me add a column to that output, even in the Advanced Editor. This is annoying. I'm probably going to need to delete and add this data source, which is going to screw up all my transformations.

Error Message #1
Error at Data Flow Task [tbCA Source [1]]: The output column "CaxSubTypeId" (1052) on the non-error output has no corresponding output column on the error output.

Error Message #2
TITLE: Microsoft Visual Studio
------------------------------

Error at Data Flow Task [tbCA Source [1]]: The component does not allow adding columns to this input or output.

Tuesday, October 9, 2007

off topic again

While catching up on the thousands of RSS articles that showed up in Google Reader while I was on vacation for a week, I read Scott Hanselman's post about typing speed and speech recognition and immediately had to take the test.

Number of words typed: 261
Test duration: 3 min
Speed: 87.0 words/min. (435 keystrokes/min.)
Error penalty: 10
Accuracy: 96.2%

Ha! I win! (He managed 72.6)

Actually, I'm surprised I managed 87 wpm, but I have to admit, this test was slightly stressful. I was trying hard. I probably average around 60 most of the time.

Thursday, September 20, 2007

Regex for linked servers

As part of the DTS migration I mentioned in my last post, I wrote a few functions to find less-than-ideal code. One thing we wanted to get rid of was linked-server joins, and the easiest way to find those was to check each Execute SQL task using a regular expression (long live Regulazy). So once again I'm using this blog as my personal source repository and re-producing the regex for this.

Regex for linked server queries:
(\S+)\.(\S+)\.dbo\.(\S+)

I tried to generate an expression to find "SELECT...INTO" where ... did not contain @ or FROM, but couldn't figure it out with the meager regex capabilities of VBScript. Seems like a lookahead or lookbehind would probably work for this purpose, but according to the 4 Guys From Rolla, VBScript regex does not support these.

UPDATE: Spent a little more time on it and figured out a suitable regular expression to find instances of SELECT INTO that excludes INSERT INTO and SELECT FROM.

Regex for SELECT INTO:
select((?!from)(?!insert).\n)*into

UPDATE 2: My first SELECT INTO regex was foiled by something like "SELECT foo, bar, admintool from ...", so I fixed it to make sure there's whitespace before and after the SELECT and the INTO.

select[\s]((?!from)(?!insert).\n)*[\s]into[\s]

Tuesday, September 11, 2007

Parsing the filesystem for files

I'm sure this has been done many times before, but this is my implementation. I needed it to run a large DTS migration from files stored on disk, so first I needed a list of those files. They were contained in a directory tree in no particular order, and in many levels of subdirectories. I wanted a generalized solution to the parsing, so here it is.

Pasting it into Blogger butchered my formatting, but it feels like a waste of time to fix it, since it will probably not paste properly back into an editor anyway. I made the (minimal) comments green just to break it up a bit.

As usual, no warranty or even functionality of this code is expressed or implied. Use at your own risk, especially since I'm not sure the xp_subdirs extended proc is supported and xp_cmdshell can definitely be dangerous.

/*************************************************
** get list of dirs to search for files
*************************************************/

declare @currentdir varchar(1000)
declare @fullpath varchar(1000)
declare @basepath varchar(1000)

set @basepath = 'c:\stuff'
set @fullpath = @basepath

if exists (select 1 from tempdb.dbo.sysobjects where name like '#dirs___%')
drop table #dirs

create table #dirs
(
directory varchar(1000)
, done bit
)

if exists (select 1 from tempdb.dbo.sysobjects where name like '#dirtemp___%')
drop table #dirtemp

create table #dirtemp
( directory varchar(1000) )

insert into #dirtemp exec master.dbo.xp_subdirs @fullpath

insert into #dirs
select
'\' + directory as directory
, 0 as done
from #dirtemp

Loop: -- xp_subdirs spits out errors when a dir has no subdirs, so they have to be caught with this GOTO Loop construct
while exists (select 1 from #dirs where done = 0)
begin
if exists (select 1 from tempdb.dbo.sysobjects where name like '#subdirs___%')
drop table #subdirs

create table #subdirs
( directory varchar(255) )

select top 1 @currentdir = directory from #dirs where done = 0

set @fullpath = @basepath + @currentdir

exec master.dbo.xp_subdirs @fullpath
if (@@ERROR <> 0)
begin
update #dirs set done = 1 where @currentdir = directory
GOTO Loop
end

insert into #subdirs exec master.dbo.xp_subdirs @fullpath
insert into #dirs select @currentdir + '\' + s.directory, 0 as done from #subdirs s
update #dirs set done = 1 where @currentdir = directory
end

if exists (select 1 from tempdb.dbo.sysobjects where name like '#dirtemp___%')
drop table #dirtemp

if exists (select 1 from tempdb.dbo.sysobjects where name like '#subdirs___%')
drop table #subdirs

/*************************************************
** get list of files
*************************************************/

declare @command varchar(1000)
declare @fileextension varchar(5)

set @fileextension = 'dts'

if exists (select 1 from tempdb.dbo.sysobjects where name like '#files___%')
drop table #files

create table #files
(
filepath varchar(1000)
)

update #dirs set done = 0

while exists (select 1 from #dirs where done = 0)
begin
if exists (select 1 from tempdb.dbo.sysobjects where name like '#filetemp___%')
drop table #filetemp

create table #filetemp
(
filepath varchar(1000)
)

select top 1 @currentdir = directory from #dirs where done = 0

set @command = 'dir ' + @basepath + @currentdir + '\*.' + @fileextension + ' /B'

insert into #filetemp exec master.dbo.xp_cmdshell @command

insert into #files select @basepath + @currentdir + '\' + filepath from #filetemp

update #dirs set done = 1 where directory = @currentdir
end

if exists (select 1 from tempdb.dbo.sysobjects where name like '#filetemp___%')
drop table #filetemp

delete from #files where ISNULL(filepath, 'x') not like '%' + @fileextension + '%'

select * from #files

Wednesday, August 29, 2007

Completely off-topic

My globe-trotting brother has started a blog. He's currently hanging out in Moscow doing translation work for a big Russian firm, and while I'm not sure what the typical content will be like, he's a good writer, so it should be interesting.

http://russiatranslated.blogspot.com/

Tuesday, August 28, 2007

SQL Management Studio Setup

First off, I'm back at a desk after an excellent time in Europe - a whirlwind tour through Budapest, Bratislava, Krakow, and Vienna. I may have to move to Vienna one of these days, and I may be seeking a trophy wife in Bratislava. So today is my second day at my new employer, an entity I have been informed prefers not to be named publically, so all references to it from here out will be oblique, or possibly substituted with "The Black Chamber."

In any case, I needed to start setting up my new desktop (that was already built when I got here, by the way, and with three (!) 19" monitors mounted on extremely solid articulating arms), and realized that I can never remember the miscellaneous settings I change in SMS to get it set up the way I like for maximum productivity. Generally I just stumble into them after I try to do something the way I usually do and it doesn't work. So like many other posts on this blog, this is purely for my reference purposes, and if it helps anyone else out, even better.

ProFont - Mike Forman introduced me to this, and I find it absolutely essential these days. I use it as my main text editing and printing font, at 9 point, and it's clearer and smaller at that font size than anything else I've found. I'll make it available for download as soon as I manage to get access to my own FTP server again. Until then, I'm sure it can be googled.

Environment->Keyboard - I add the following:
- Ctrl+F1: sp_helptext
- Ctrl+3: sp_who2
- Ctrl+4: sp_spaceused

Text Editor->Plaintext - I change the tab spacing to 2. Just my personal preference.

Query Results->SQL Server->Results to Grid - I check "Display results in a separate tab" and "Switch to results tab after the query executes." I prefer having as much screen real estate as possible for both query and results, and I find it quick to switch between tabs with F6.

The Change Connection button - Another one from Mike, I use the Customize menu option to display the Change Connection icon as Image and Text, and change the Name from "C&hange Connection..." to "Chan&ge Connection...". This allows quick switching of the connection via the keyboard shortcut of Alt-G. The default name needs to be changed because Alt-H brings up the Help menu.

I know there are a few others, but I can't recall them at the moment.
UPDATE 2012-02-16
SET ARITHABORT OFF - refer to http://nyc-dba.blogspot.com/2012/02/query-plan-troubleshooting.html

UPDATE 2013-11-05
Fix the Management Studio tab text - http://www.brentozar.com/archive/2011/06/fixing-sql-server-management-studios-tab-text/

Tuesday, August 7, 2007

Change of Scenery

It's about that time again. The Big Corporation atmosphere wasn't turning out to be what I wanted, so I'm going small again, moving to [redacted], a hedge fund based in Dallas. I'll be in the New York office, where I'll be working with some very bright people, who will hopefully point out to me how little I know about SQL, databases, and computer science in general, and then help to further my education in these subjects.

I have no idea what the rules are regarding any further disclosure about my current and future employers, so that's all I'm going to say for now.

Monday, June 25, 2007

Precision

There's probably a better way to do this, but this is what occurred to me:

select
max(
len(
rtrim(
replace(
substring(
cast(position as varchar(50))
, charindex(
'.'
, cast(position as varchar(50)))
, 50)
, '0', '')
)
)
)
from ...

Tuesday, June 5, 2007

Sprawl

Just looked at a refreshed copy of our server list, and we have more servers than users.

Monday, May 21, 2007

Standardize!

I was trying to restore a database backup from one server onto another this morning, and despite the two having the same drive letters for data and logs, I ran into problems because the directory names were different. For a company that runs everything based on rules and standards to the point of excess, this is silly. So I'm making a quick list of SQL Server-related settings that I feel should be standardized across all servers in one's environment. It just makes things easier!

1) Directory structures
2) Default data and log locations
3) Drive letters, if possible
4) Build levels (Service Pack/hotfix)
5) Logins, although these may be different between Dev, QA, and Prod
6) Install locations - I have never found a valid reason to install SQL Server on the C: drive of a non-desktop, but I keep seeing it

I'm sure I'll add to this list as I come up with more obnoxiously un-standardized settings in our environment.

Wednesday, May 16, 2007

WRITELOG waits

My team has been assisting another team off and on for several months with performance tuning their overnight DB processes. There are many possible bottlenecks in their process, but since the app writing into the DB is highly complex and multi-tiered, we're trying to focus on the internal DB issues first.

The first big win was dropping triggers from the tables being written to, which resulted in a 20% reduction in overall job time. Now the biggest WAITTYPE during the process is WRITELOG, with about 50-60% of the overall time in both QA and Prod. NETWORKIO is next, and I'd really like to SET NOCOUNT ON, since these writes are created as individual transactions and dropping the count should reduce the NETWORKIO substantially, but that's been problematic so far.

Digging into the WRITELOG waits, the disk queues are pretty low (0-5 range going against 80 or so physical disks on a big Hitachi SAN with a ton of cache), so the storage guys have dismissed the disk as the problem. However, I think it's still the disk, just not in the usual way.

The standard problem is that the disk subsystem runs out of write cache and has to start de-staging data to disk, and the physical disks can't write data as fast as it's coming in. That's not the case here, because the cache on the SAN is huge (70+ GB, according to the SAN guys), it can accept data faster than the logs are created, and it hasn't gotten close to the high watermark, when it starts de-staging. So what's the problem?

Well, in this case, the disks attached to the SQL Server have been attached as a single logical drive, so all the IOs get queued by Windows against that drive. Therein lies the problem, I believe. The SAN is fast, but every operation goes against that same queue, so one slow read for a piece of data not in the cache can slow down all the writelog IOs queued up behind it. This is what I believe is creating the WRITELOG waits.

The proposed solution is simple: add another logical drive dedicated to logs. Even if it's hitting the same physical disks, the addition of another disk queue should allow the log IOs to flow through without being blocked by slower data IOs or data reads. Stay tuned to see if it works.

Addendum: We tried this before at Capital IQ, splitting up sets of disks into multiple metaLUNs and comparing them against the same number of disks in one metaLUN and showed barely any performance gain (within the margin of measurement error). However, the difference between that situation and this one was the amount of cache involved - we were always past cache on the CapIQ SAN, so the disks were the bottleneck. Here I suspect it's the queue.

Tuesday, May 8, 2007

Cool new clustering stuff

That I just learned while perusing some Longhorn clustering docs.

1) GPT cluster disks are now supported. This should mean no more 2TB limit.
2) The Supported Hardware list is gone - now anything that validates is officially supported.
3) Clustering of virtual servers is now officially supported, so no more hacking virtual disks to get them to show up in multiple virtual instances. Caveat: The Parallel-SCSI model is deprecated, so it looks like virtual iSCSI is the way to get this to work.

and my personal favorite, quoted straight from the Q&A session:

4) "In Windows Server 2003 the way disks were identified was by the Disk Signature. In Longhorn clustered disks are identified by multiple attributes, including the Disk Signature and the SCSI Inquiry data. If a disk can't be found by one attribute, it will attempted to be found in another way. If found the service will self heal and update it's records of the disk. So in Longhorn disks will self heal and should elimate support issues of disk signatures changing and disks not coming online."

Finally! No more deleting and re-adding cluster disks when cluster services loses them!

Longhorn clustering should be cool. Now I just have to find another cluster to manage.

Friday, April 27, 2007

Mysterious HTTP 400 errors caused by large Kerberos tickets

First off, I know there's been a lack of DBA material lately. This is because I'm not doing anything interesting, database-wise. I can tell you all about the woes of interfacing SQL Server with an Oracle system to which I have minimal access and no table schema, but that's pretty boring.

The most interesting problem I've fixed lately was the mysterious HTTP 400 errors plaguing one of the users of our system. Briefly, the application consists of a .NET web app that performs business-specific functions and then runs Microstrategy 8.0.1 in an iFrame to run OLAP-style reports. One of the users of the app was regularly experiencing HTTP 400 errors while using the app, and several others were experiencing these errors intermittently.

After reading the links below (and a bunch of others), I checked the HTTP error logs at C:\WINNT\system32\LogFiles\HTTPERR and found that the HTTP 400 errors were associated with a RequestLength error type. According to Microsoft, the default maximum request size is 16384. I installed Ethereal packet capture software on the QA web server and captured packets while the user attempted to run reports. A check of the request packets showed that GET requests for certain pages exceeded 16384 bytes immediately before HTTP 400 errors (see image).

I verified this by capturing packets while I ran reports, and my requests were in the 2000-3000 byte range (see image).

Apparently the larger size of his packets could be caused by an excessively large Kerberos authentication ticket combined with cookies and a long request string. The Kerberos ticket can be larger because of membership in many groups – the user is a member of at least 19 that I can see in Active Directory. This could be temporarily cleared up in some cases by deleting cookies to reduce request size, which is why I believe that other users who have had this problem have had temporary success at clearing it up by deleting their cookies.

According to posts I read on this topic (see the top link for the most useful exchange), using the IP address instead of the hostname will avoid Kerberos ticketing and therefore not cause the problem. This was the case in our situation – when using the IP address of the QA server, the user had no problems running reports. He also had no problems after I added the two registry settings above and cycled the required services. The only alternative I could see was to ask the user to use the IP address of the server instead of the DNS name, but this may be against company policy and would reduce flexibility if we need to change the DNS entry in the future.

Registry keys added under HKLM\System\CurrentControlSet\Services\HTTP\Parameters

MaxFieldLength 32768
MaxRequestBytes 32768

After adding these keys, all users stopped experiencing HTTP 400 errors.

Main links used for reference:

http://www.issociate.de/board/post/314237/HTTP_400_Bad_Request.html
http://support.microsoft.com/kb/820129
http://www.microsoft.com/technet/prodtechnol/WindowsServer2003/Library/IIS/c7d368f0-f175-4d58-b7a8-977e1d67bd07.mspx?mfr=true

Thursday, April 12, 2007

Overspecialization

My new pet peeve is overspecialization.

My esteemed employer emphasizes (at an institutional level) hiring top talent for very specific positions. I, for example, was hired as a SQL Server 2000 Developer. I took an extensive test that asked all kinds of very specific SQL Server 2000 questions. Fine, I know SQL Server, no problem.

Then I start work and find out that we're not supposed to touch any low-level SQL stuff because we have a DBA Team for that. Oookay, I'd rather do these things myself rather than break my train of thought to email somebody, and then wait for them to do something, but I'll give it a shot.

The first time I had to interact with the DBA team was over an issue with our backups. They weren't working. The DBAs have this dramatically complicated procedure that they install on every SQL Server that goes back and retrieves server metadata from a central DB and then dynamically creates a command to back up your databases. Great, except that it's not flexible enough to keep 3 days worth of backups, which is what we wanted. So they had a DBA Team member write a customized version of their stored proc that would add a timestamp to the files and then delete the old ones after 3 days.

Except it didn't work. I won't go into exactly why, but there were 3 different bugs within a relatively short stored proc. So I took it over, re-wrote it from scratch, and now it works fine, completely independent of the metadata repository which wasn't gaining us anything anyway.

Why did this happen? Because the guy spends all day every day installing SQL on new servers, installing SPs on old ones, setting up replication, and troubleshooting deadlocks. And he's probably been doing this for years. Even if he once was a great developer, skills (like brains) atrophy if unused. Which is not even his fault, because his job is designed to make him really good at the few things he does and not let him do anything else.

This causes problems not only for the individual, but for every team. I'm of the opinion that each team should be capable of functioning as an autonomous unit. If you need an outside expert occasionally for some piece of knowledge completely outside of your usual purview, fine. But a team with the skills to accomplish most dev tasks on its own can stay in high gear during development, instead of moving in fits and starts every time it has to beg somebody else to do something.

Getting back to the individual: of course a broad range of skills is a good thing! For me, learning new skills is one of the main things that keeps me interested in a job. Also, every time I interview somebody who has done one thing for his whole career, I have misgivings about hiring him, because

a) who knows if he can do anything else?
b) he has nothing extra to contribute to the team
c) having only specialized knowledge reduces your creativity when it comes to sticky problems

My favorite part about CapIQ was getting to work on everything, and in fact one of the reasons that I left was that I felt I was getting pigeon-holed into being just the DB hardware guy. Little did I know that other offices could be much more restrictive...

Wednesday, April 11, 2007

Infrastructure

I was going to post a comment on Allan Leinwand's post titled Web 2.0 & Death of the Network Engineer on GigaOM, but there were too many already and I didn't feel like reading them all. So I'm writing a quick response over here instead.

The CTO cited in the article knew nothing about the infrastructure details supporting his Web 2.0 venture. I wish him luck, but I think he's on the path to trouble.

The difference between good and great software engineers (or any professional, really) is knowledge of and attention to details. Anyone can write a stored procedure or a C# app, but a great developer's version will be 10% faster and more stable. Over time this can translate into huge savings, especially if that piece of code is run 10,000 times (or in some shops, 10,000 times A DAY).

Infrastructure works similarly. Specified, set up, and maintained properly, a company can squeeze maximum performance and capacity out of its infrastructure. This may not mean much when talking about one $10,000 web server here or there, but let's look at another piece of infrastructure that can make a much bigger difference: database hardware.

Database servers and disks have been by far the most expensive pieces of hardware owned by each team I've worked on, and the most susceptible to underutilization. For example, you can buy a 4-socket quad-core Clovertown server with cycles out the wazoo for under $30k, but if you only put $10k into the disks for it, it's not going to do much. And even if you do put more money into it, you need to know where it's going - will you go with fast, cheap, inflexible SAS storage, a soon-outgrown entry-level SAN like a CX3-10, or a monster DMX3000 but with no budget left over for spindles? Then when you buy it, who's going to have the knowledge to set it up so that it performs properly? If you dump all your DBs on one set of disks, you're potentially giving up a lot of performance, which means you'll have to buy more disks constantly as you scale up your user base.

A CTO who doesn't know these things is bad enough, but one who doesn't care is potentially leading his company into a financial quagmire. If you don't know what you're doing, it's easy to spend millions on infrastructure when you could have spent thousands. And the difference is knowing what you need and knowing how to use it.

Thursday, March 29, 2007

Old Thinkpad Nightmares

This is barely relevant to anything, I just want to get this information out on the net.

If you are having problems installing a fresh copy of Windows XP on an IBM Thinkpad T20 because the windows installer / setup program crashes repeatedly in random places, this is the fix:

Remove the battery.

I don't know why it works, but I'm guessing something to do with either power management drivers or CPU speed-stepping. I spent probably 6 hours thrashing on this stupid thing before locating an obscure post on an MIT newsgroup. Hopefully this blog will be slightly easier to find if anyone else is ever doing this.

Thursday, March 15, 2007

Too Much Information

I've noticed lately that my Google Reader inbox is more full than ever. It's not because of all the new feeds I'm adding - I've actually unsubscribed from a number of them. I don't have stats to back this up, but I believe it's because overall post volume on many blogs and news sites is increasing.

Fine. I like to stay informed, who doesn't?

Except that a lot of these posts are just for the sake of posting. When I read about disaster preparedness on Seeking Alpha and entertainment news (read: gossip) on The Register, something is wrong.

My friend Kevin Ho hates mainstream media. I disagree with him on most of his reasons, but he always points out that when the average person reads a newspaper article, he accepts that the facts and general position presented therein are accurate. Until he reads an article about a topic that he knows something about, and then he says, "What are these idiots talking about? They're missing the point entirely!"

Which is one of the reasons why I love blogs - it allows an expert in a certain subject to post about topics in an area in which he/she has extensive knowledge. And then I can aggregate the topics I care about in my RSS reader. No more amateurs writing about topics in which they have little background or understanding.

Unfortunately, many of these sites seem to be getting a big head about their traffic numbers and have started to assume that their readers live on their sites and go nowhere else, or else are attempting to squeeze more pageviews to generate those all-important ad dollars. At least, those are the possible rationales I've come up with for their explosions in posting, especially about topics outside their areas of expertise. Posts which are at best misguided, and at worst spam.

Even posts within a site's usual topical areas can be annoying if they're frivolous. For example, tech news outlets The Register and The Inquirer both post about 30-40 articles per day. Occasionally I'll click through to one with an interesting title, only to find that it's one paragraph long and contains no real information. Leave it out!

Mind you, I'm okay with a breadth of posts on personal blogs - the volume is low enough that ignoring the chaff doesn't take long, and some of it may even be interesting. And not all big sites are guilty of this posting diarrhea - Pitchfork only posts about music, Tom's Hardware about computer hardware, The Real Deal about real estate. But sites like Jalopnik and GigaOM, you're on notice. Keep up the crappy posts, and I'm dropping your feed.

(Yes, I realize that by posting this I'm guilty of the offense about which I'm complaining. I'm okay with the hypocrisy.)

Wednesday, February 14, 2007

"Unable to perform a SETUSER to the requested username" error

This was a new one on me:

Backstory: We have an application that runs a SQL job every time a set of data is checked out. (I didn't design it.) Our development server was recently demolished by the SAN team while they were trying to add a new HBA, so it had to be rebuilt from scratch. After the rebuild, SQL was at version 8.0.2039 (SP4 w/ no hotfixes) and the job wouldn't run. It would spit out the above error each time, which was breaking the app.

This error looked like a permissions issue to me, so I started out by checking the user mapping. We'll call the login that runs the application "CORP\AppUser." This user was a member of dbowner in the database in the job step, so that wasn't the problem. Tried sp_start_job from Query Analyzer using those credentials, which worked, but still produced the error. Tried starting the job using my credentials instead, which produced the same error.

Then I changed the job owner to the same account that SQLAgent runs under, and the job worked. So it was definitely a permissions issue. I decided to check out the specific sequence of events underneath firing off a job, so I ran a Profiler trace and found this...

exec sp_setuserbylogin N'CORP\AppUser', 1

Hmm, what's that? It's not in BOL, and I couldn't find it in the master DB. But clearly SQL is trying to execute it, which might explain why the error it's returning is looking for user ''. I dug around a bit more and found some code to add it back into master.

exec sp_addextendedproc N'sp_setuserbylogin', N'(server internal)'

Voila! As soon as I ran this code, the job worked like a charm. Apparently a botched SP3 or 4 install can miss this proc, which is what probably happened here. So if you come across this error, make sure you've got sp_setuserbylogin properly installed.

Wednesday, January 31, 2007

Stupid general-purpose connection errors

Had to debug a stupid error today being thrown by the VBscript we use to deploy DTS packages. Here's the text of the error:

[DBNETLIB][ConnectionRead (WrapperRead()).]

Pretty helpful, huh?

Although it wasn't difficult to debug after I got access to the script source. Here's the line that was producing the error:

Set oPackageSQLServer = oApplication.GetPackageSQLServer(sDestServer, sUser , sPwd , 0)

I tried connecting to the server using the same credentials, which broke with another unhelpful error. Then I logged into the server directly and tried logging in there, where it informed me that the default database for that user was unavailable. Aha! That DB was in the middle of a restore. So I changed the default database and was able to transfer packages.

I'm posting this so that somebody trying to track down this error with the PackageSQLServer object might land here and see this message:

Check your login credentials!

Tuesday, January 30, 2007

Blocking Chain script

I had a need to follow some quickly changing blocking chains today, and couldn't find any simple blocking chain scripts that I liked, so I wrote one. It's not the prettiest method, but it's simple, easy to modify, and doesn't rely on lots of hard-to-debug dynamic SQL.

The next updates I'll probably make will be to display only the top-level blockers rather than all blocking chains, and to get rid of the 0s at the end of each blocking chain line.

Here's the SQL:

if object_id('tempdb..#blockers0') is not null
drop table #blockingchain

create table #blockingchain
(
blockingchain varchar(255)
)

insert into #blockingchain
select cast(spid as varchar(5)) + ' -> ' + cast(blocked as varchar(5))
from master..sysprocesses
where blocked <> 0

while exists (select 1 from #blockingchain where cast((substring(right(blockingchain, 5), 1 + charindex('>', right(blockingchain, 5)), 5)) as smallint) <> 0)
begin

update bc set blockingchain = blockingchain + ' -> ' + cast(sp.blocked as varchar(5))
from #blockingchain bc
inner join master..sysprocesses sp
on cast((substring(right(bc.blockingchain, 5), 1 + charindex('>', right(bc.blockingchain, 5)), 5)) as smallint) = sp.spid

end

select * from #blockingchain
order by len(blockingchain) asc,
substring(blockingchain, 1, charindex('-', blockingchain)) asc

Sample output:

Wednesday, January 17, 2007

Changing the Domain for a SQL 2005 Cluster

NOTE: It looks like my old DBAIQ blog may be down for the count due to Blogger issues, and I didn't want to lose my post on changing the domain for a SQL 2005 cluster, so I retrieved it from Google's search cache, and am reprinting it here. Ironic that Google was both the cause of its destruction (taking over Blogger and breaking our team blog) and its recovery (search cache).

We're moving our Dev environment to a new domain for some unfathomable corporate reason, and it's going to be a major headache. Among our 20+ SQL Servers in the environment, we've got a few SQL 2005 clusters. According to Microsoft, we can't move them without uninstalling SQL.

http://support.microsoft.com/kb/915846/en-us

Needless to say, this would add more pain to the migration process. It wouldn't be the end of the world, but this is weekend work, and uninstalling and re-installing a couple active/active SQL clusters is just the thing to turn 1 day of work into 2+.

However, I think they're wrong.

I built two virtual machines on VMWare Server, added them to a cluster (see Repeatable Read for instructions on creating shared disks in VMWare) on our current domain, and created some databases. Verified failover capability and connectivity.

Then I moved both nodes to the new domain using the following steps:

1) Take all cluster resources offline (except the quorum, which cannot be taken offline)
2) Stop the cluster service on both nodes
3) Change the cluster service startup type to Manual
4) Change the domain of each machine to the new domain and reboot
5) After reboot, on each machine, change the cluster and SQL service accounts to accounts in the new domain
6) Run gpedit.msc or otherwise access Local Security Policy Settings (see below), and grant the following rights:

Cluster Service Account
Act as part of the operating system
Adjust memory quotas for a process
Debug programs
Increase scheduling priority
Manage auditing and security log
Restore files and directories

SQL Service Account
Adjust memory quotas for a process
Lock pages in memory
Log on as a batch job
Log on as a service
Replace a process level token

7) Add the cluster and SQL service accounts to the local Adminstrators group. NOTE: This should not be necessary for SQL, and I will update this with the minimum required permissions as soon as I sort them out. It is necessary for the cluster account, however.
8) Start the cluster service on both machines
9) Bring cluster resources online
10) Go enjoy the rest of your weekend

If you missed some permissions, the cluster service will likely fail to start with an error 7023 or 1321, and will helpfully output an error in the system log with eventId 1234 that contains a list of the necessary user rights that still need to be assigned. Now that's error reporting!

Comprehensive testing is still pending, but the preliminary results look good. After this process, SQL Server comes online on my test cluster, as do SQL Agent and Fulltext. I don't have any machines on the new domain with SQL Management Studio installed, but I could connect to SQL using osql directly on one of the cluster nodes. If anyone out there has any different experiences or comments, I'd love to hear them.

UPDATE:
My previous post left out one small but significant detail: the domain groups under which the SQL Server service accounts run. When one installs SQL 2005 on a cluster, the setup program requires domain groups to be entered for each service account. So for example:

SQL Server service account: OLDDOMAIN\SQLService
SQL Agent service account: OLDDOMAIN\SQLAgentService
SQL Browser service account: OLDDOMAIN\SQLBrowserService

Domain groups:

OLDDOMAIN\SQLServiceGroup
OLDDOMAIN\SQLAgentServiceGroup
OLDDOMAIN\FullTextGroup

Then it comes time to move your cluster, and you've followed my steps above or done your own hacking, and you've changed the service accounts to NEWDOMAIN\SQLService and so on. But the domain groups remain the same. Your cluster will come online and fail over and operate fine, but you won't be able to change it.

This was made evident when I tried to add a node to an existing cluster after moving it to a new domain. It gave me a message like "Cannot add NEWDOMAIN\SQLService to OLDDOMAIN\SQLServiceGroup." Arrgh. Microsoft had already claimed that this was not supported, so I suppose I shouldn't have been surprised.

So I started searching for the reference to OLDDOMAIN\SQLServiceGroup. And couldn't find it. Not in a config file, or a system table (I know, that was dumb, but I was desperate), or the registry, where I expected to find it. Eventually, I started combing the registry key by key within the SQL hives and came across this in HKLM\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.1\Setup...

(okay, I tried to upload another image here to illustrate, but Blogger hates me, so screw it.)

The keys AGTGroup, FTSGroup, and SQLGroup contain the SIDs for whatever OLDDOMAIN groups you set up when installing SQL. Find the SIDs for your new domain groups (the VBScript below is how I did it), enter those in place of the old ones, restart SQL, and your cluster is moved. You should now be able to add or remove nodes, install hotfixes, etc. You'll need to update the SIDs for each SQL installation (MSSQL.2, MSSQL.3, etc.)

As with any unsupported operation, your mileage may vary, but let me know if you have a different experience with this or you run into additional problems.

SQL Litespeed works better when unregistered

I recently set up a new DR server that had recently been built from scratch, and the DBAs who set it up didn't have a product key for SQL Litespeed, which we use to backup all of our databases. If you don't know about Litespeed, go check it out, it's handy. So the DBAs set it up using a trial version of the software.

After that, I set up an automatic restore process on the DR servers, but it was taking forever. A backup that would restore in 4-5 hours on the hardware-inferior DR QA server would take 18-20 hours on the Production DR server. This gave us a potentially long recovery time and precluded using the DR machine for reporting or other tasks. I ran a series of diagnostics against the server, and eventually even involved the SAN team, but everything looked fine - disk queues and CPU usage were low, the drives were clearly capable of much higher IOs, etc.

Then one day, the restores sped up. Not by a little bit, but by 10x - from 18 hours to 85-90 minutes. Hey, cool. Except I didn't know what had happened. But it was working and I had other stuff to do, so I left it alone.

A week later, I attempted to take a backup of a DB on that server and got an error message informing me that SQL Litespeed's trial period had expired. I checked that the backups were still restoring, which they were, and sent another request for a Litespeed license.

The request came through a few weeks later, and on the day that Litespeed was officially registered, the backups started taking 18 hours again. Bizarre? Yeah. I'm tempted to uninstall the licensed version and run it in trial mode again unless Quest can give me a good explanation as to why this is happening and how to fix it.

Here's some sample output from Litespeed on our server, with sensitive data redacted. Emphasis is mine.

Executed as user: [user]. Warning: Null value is eliminated by an aggregate or other SET operation. [SQLSTATE 01003] (Message 8153) Processed 29616984 pages for database '[dbname]', file '[dbname]_Data' on file 1. Processed 26037664 pages for database '[dbname]', file '[dbname]_Data2' on file 1. Processed 23631992 pages for database '[dbname]', file '[dbname]_Data3' on file 1. Processed 16 pages for database '[dbname]', file '[dbname]_Log' on file 1. RESTORE DATABASE successfully processed 79286656 pages in 5051.204 seconds (128.586 MB/sec). CPU Seconds: 8190.55 Environment: Intel(R) Xeon(TM) MP CPU 3.00GHz CPUs: 8 logical, 8 cores, 8 physical packages. [SQLSTATE 01000] (Message 1). The step succeeded.

Executed as user: [user]. Warning: Null value is eliminated by an aggregate or other SET operation. [SQLSTATE 01003] (Message 8153) Processed 29616984 pages for database '[dbname]', file '[dbname]_Data' on file 1. Processed 26037664 pages for database '[dbname]', file '[dbname]_Data2' on file 1. Processed 23631992 pages for database '[dbname]', file '[dbname]_Data3' on file 1. Processed 16 pages for database '[dbname]', file '[dbname]_Log' on file 1. RESTORE DATABASE successfully processed 79286656 pages in 69992.264 seconds (9.279 MB/sec). CPU Seconds: 7422.05 Environment: Intel(R) Xeon(TM) MP CPU 3.00GHz CPUs: 8 logical, 8 cores, 8 physical packages. [SQLSTATE 01000] (Message 1). The step succeeded.