Tuesday, December 11, 2007

More fun with ranking functions

I don't think I really understood why Mike F at Capital IQ was so excited about the SQL 2005 ranking functions until just recently, but I'm using them all over the place now. Minutes after kicking off the time series load I described in my last post, I came across another use for RANK.

EDIT: As I review this, I recognize that this is pretty esoteric and may not have much practical application for most people. You've been warned.

Like many financial companies, we get data from multiple vendors, and we have to decide which vendor's data to use in each case. So I've got a set of data, provided by multiple vendors and grouped by ObjectId and Date, that is ranked for each date by some business logic built into the procedure that loads the data. The best-ranked row (lowest rank, in this case) for each ObjectId and Date gets a bit flag set; we'll call it IsBest.

However, we discovered an extra negation in the business logic that affected many cases, so many of the rows had the bit flag set incorrectly. I needed a way to determine what portion of the data had the flag set incorrectly. This could be accomplished using any number of data warehousing techniques - Analysis Services is well suited to it, for example, or MicroStrategy, or even a complicated series of GROUP BYs - but I didn't want to go to all that trouble just to produce a single ad hoc report, so I found another way.

Some sample data (totally made up, to prevent my employer from having me whacked), in which only ObjectId 4 has the correct settings for the bit flag:
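If you want to play along at home, here's roughly what the table looks like, with invented values; the Priority column is what the load procedure's business logic produces, and ObjectId 3 deliberately gets duplicate priorities:

CREATE TABLE tbSource
(
    [Date] DATETIME
    , ObjectId INT
    , VendorId INT
    , Priority INT
    , IsBest BIT
)

-- ObjectId 4: correct (best priority has the flag)
INSERT INTO tbSource VALUES ('2007-12-01', 4, 1, 1, 1)
INSERT INTO tbSource VALUES ('2007-12-01', 4, 2, 2, 0)
-- ObjectId 1: flag inverted
INSERT INTO tbSource VALUES ('2007-12-01', 1, 1, 1, 0)
INSERT INTO tbSource VALUES ('2007-12-01', 1, 2, 2, 1)
-- ObjectId 3: duplicate priorities, flag on the wrong row
INSERT INTO tbSource VALUES ('2007-12-01', 3, 1, 1, 0)
INSERT INTO tbSource VALUES ('2007-12-01', 3, 2, 1, 1)
INSERT INTO tbSource VALUES ('2007-12-01', 3, 3, 2, 0)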


I needed to find rows in which the highest-ranked VendorId for a given ObjectId and Date had the IsBest flag set, plus rows in which a lower-ranked VendorId did not have the IsBest flag set; any other combination is incorrect. If I'm only worried about how many rows need to be fixed, I can stop there, or I can divide the sum of those two rowcounts by the total number of rows to get the percentage of the data that is correct.

A query to return the rowcounts follows. I'm using DENSE_RANK rather than RANK to keep the ranking correct despite duplicate rows (as in ObjectId 3).

SELECT
    IsBest, DRank, COUNT(*) AS RowCnt
FROM
    (SELECT
        [Date]
        , ObjectId
        , VendorId
        , IsBest
        , DENSE_RANK() OVER (PARTITION BY [Date], ObjectId ORDER BY Priority) AS DRank
    FROM tbSource
    ) AS allRows
GROUP BY IsBest, DRank
ORDER BY IsBest, DRank

Here are the raw counts, with the correct cases highlighted (IsBest = 1 with DRank = 1, and IsBest = 0 with DRank > 1):

Additional code could be added to this query to automatically sum the correct cases and determine the percentages, but I will leave that as an exercise for the reader.
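If you want a head start on that exercise, something like this should do it (untested; the CASE just encodes the two correct combinations described above):

SELECT
    SUM(CASE WHEN (IsBest = 1 AND DRank = 1)
              OR (IsBest = 0 AND DRank > 1)
             THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS PctCorrect
FROM
    (SELECT
        IsBest
        , DENSE_RANK() OVER (PARTITION BY [Date], ObjectId ORDER BY Priority) AS DRank
    FROM tbSource
    ) AS allRows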

Another cool use for RANK() - Time Series

I've got a big set of time series data, across many object instances, that needs to be inserted into a table that uses a trigger to set the EndTime of each segment in the series. In other words, data like this:
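Made-up values, but this is the shape of it; the most recent row for each ObjectId has no EndTime yet:

ObjectId  Value  StartTime         EndTime
1         10.5   2007-01-01 00:00  2007-06-01 00:00
1         11.2   2007-06-01 00:00  NULL
2         3.4    2007-03-15 00:00  NULL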



When a new row for a given ObjectId is inserted, the trigger gives the previous row in the series an EndTime equal to the StartTime of the new row. Usually this happens one row at a time, with only one new row arriving per ObjectId at any given moment.
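The real trigger isn't mine to post, but the idea is something like this (tbDest is a stand-in name for the destination table):

CREATE TRIGGER trSetEndTime ON tbDest
AFTER INSERT
AS
-- close out the previously open segment for each ObjectId just inserted
UPDATE d
SET EndTime = i.StartTime
FROM tbDest d
JOIN inserted i ON i.ObjectId = d.ObjectId
WHERE d.EndTime IS NULL
    AND d.StartTime < i.StartTime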

However, in this case, I need to backfill a lot of data, which will cause the trigger to not behave as expected. I have a few options to work around this:
  1. Insert each row one at a time. Turns out there are over a million rows to be inserted and this is a heavily-used table, so this option is undesirable.
  2. Write a script to generate the EndTimes for each row ahead of time. This could be tricky and CPU-intensive, and I'd rather avoid it if possible.
  3. Partition the data such that the trigger will work as it's supposed to, but large batches of data can be inserted instead of inserting one row at a time.
I'm currently investigating option 3, especially because it lets me use the RANK() function. The basic idea is this: for each ObjectId, rank the StartTimes, then loop through and insert all rows where RANK = 1, then 2, 3, etc. This will insert all the ObjectIds with their initial StartTime, then the next StartTime, and so on, allowing the trigger to fire on each round and update the EndTimes en masse.

The query I'm using (generalized for mass consumption):

INSERT INTO tbTemp
SELECT
    ObjectId
    , Value
    , StartTime
    , RANK() OVER (PARTITION BY ObjectId ORDER BY StartTime) AS Ord
FROM
    tbSource


Then the data in tbTemp can be looped through by rank and inserted into the destination table.
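In sketch form, with tbDest again standing in for the real destination table:

DECLARE @r INT, @maxr INT
SELECT @maxr = MAX(Ord) FROM tbTemp
SET @r = 1

WHILE @r <= @maxr
BEGIN
    -- each pass inserts one row per ObjectId (barring tied StartTimes), so the trigger behaves
    INSERT INTO tbDest (ObjectId, Value, StartTime)
    SELECT ObjectId, Value, StartTime
    FROM tbTemp
    WHERE Ord = @r

    SET @r = @r + 1
END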

And the resulting data should look like this:

Thursday, November 29, 2007

Many-to-many physical table design

I was creating a new set of tables (we'll say two for simplicity) whose rows can have many-to-many relationships with each other, and I was pondering how best to design the association table so as to optimize I/O.

In this scenario, joins can come from either direction: one row from the first table may join to many in the second, or vice versa. I'm generally a big fan of a clustered index on some column, especially an integer ID if it's useful, but in this case I think the best performance would be achieved by dispensing with the clustered index and the ID column altogether. The basic table design would be as follows:

CREATE TABLE [User]    -- brackets because USER is a reserved word
(
    UserId INT IDENTITY(1,1)
    , ...
)

CREATE TABLE [Group]   -- likewise GROUP
(
    GroupId INT IDENTITY(1,1)
    , ...
)

CREATE TABLE UserToGroup
(
    UserId INT
    , GroupId INT
    , ?
)

A clustered index on UserId would speed up queries that start with a User and join to Groups, but it would fragment as existing users added groups, splitting pages to insert the new rows. The reverse is true of clustering on GroupId. Unless the query pattern were so unbalanced that it justified one of these designs, along with the frequent index defragmentation they would require, either clustered index would be less than ideal.

Adding a surrogate key identity column would be no better; it would prevent index fragmentation, but the index would never be used for seeks, and the column would add 50% more space on disk between the two columns we actually care about, making I/O proportionately slower.

The best design should be a heap, with nonclustered indices on (UserId, GroupId) and (GroupId, UserId). Or more specifically, in SQL Server 2005, indices on UserId and on GroupId, each with the other column set up as an "included column" to make the index covering.
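In DDL terms, something like this (the index names are mine):

CREATE NONCLUSTERED INDEX IX_UserToGroup_User
    ON UserToGroup (UserId) INCLUDE (GroupId)

CREATE NONCLUSTERED INDEX IX_UserToGroup_Group
    ON UserToGroup (GroupId) INCLUDE (UserId)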

I haven't tested this yet, though, so I welcome any comments or contradictions.

Tuesday, November 27, 2007

Odd SSIS bug

Here's what happened: I added a column to a source table, went into the Data Flow source, refreshed the table list, selected the table, and checked all the output columns that I wanted. Then I resolved all the naming issues with the downstream tasks.

But I'm getting an error (#1 below) when I try to build, because for some reason SSIS will not automatically add the new column to the Error Output, and it won't let me add the column to that output manually, even in the Advanced Editor (error #2). This is annoying. I'm probably going to need to delete and re-add this data source, which is going to screw up all my transformations.


Error Message #1
Error at Data Flow Task [tbCA Source [1]]: The output column "CaxSubTypeId" (1052) on the non-error output has no corresponding output column on the error output.

Error Message #2
TITLE: Microsoft Visual Studio
------------------------------

Error at Data Flow Task [tbCA Source [1]]: The component does not allow adding columns to this input or output.

Tuesday, October 9, 2007

off topic again

While catching up on the thousands of RSS articles that showed up in Google Reader while I was on vacation for a week, I read Scott Hanselman's post about typing speed and speech recognition and immediately had to take the test.

Number of words typed: 261
Test duration: 3 min
Speed: 87.0 words/min. (435 keystrokes/min.)
Error penalty: 10
Accuracy: 96.2%

Ha! I win! (He managed 72.6)

Actually, I'm surprised I managed 87 wpm, but I have to admit, this test was slightly stressful. I was trying hard. I probably average around 60 most of the time.

Thursday, September 20, 2007

Regex for linked servers

As part of the DTS migration I mentioned in my last post, I wrote a few functions to find less-than-ideal code. One thing we wanted to get rid of was linked-server joins, and the easiest way to find those was to check each Execute SQL task using a regular expression (long live Regulazy). So once again I'm using this blog as my personal source repository and reproducing the regex here.

Regex for linked server queries:
(\S+)\.(\S+)\.dbo\.(\S+)
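That picks up four-part names like SomeServer.SomeDb.dbo.SomeTable (the schema is hard-coded to dbo, which is all we use).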

I tried to generate an expression to find "SELECT...INTO" where the ... did not contain @ or FROM, but couldn't figure it out with the meager regex capabilities of VBScript. It seemed like a lookahead or lookbehind would work for this purpose, but according to the 4 Guys From Rolla, VBScript regex does not support lookbehind (negative lookahead, as the updates below show, turned out to work).

UPDATE: Spent a little more time on it and figured out a suitable regular expression to find instances of SELECT INTO while excluding INSERT INTO and SELECT ... FROM.

Regex for SELECT INTO:
select((?!from)(?!insert)(.|\n))*into

UPDATE 2: My first SELECT INTO regex was foiled by something like "SELECT foo, bar, admintool from ..." (note the "into" hiding inside "admintool"), so I fixed it to make sure there's whitespace before and after the SELECT and the INTO.

select[\s]((?!from)(?!insert)(.|\n))*[\s]into[\s]
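To illustrate with some made-up snippets, the final pattern (matched case-insensitively) hits the first of these and skips the other two:

select id, name
into #temp

insert into tbFoo select * from tbBar

select foo, bar, admintool from tbBaz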

Tuesday, September 11, 2007

Parsing the filesystem for files

I'm sure this has been done many times before, but this is my implementation. I needed it to run a large DTS migration from files stored on disk, so first I needed a list of those files. They were scattered across a directory tree, in no particular order and many levels of subdirectories deep. I wanted a generalized solution to the parsing, so here it is.

Pasting it into Blogger butchered my formatting, but it feels like a waste of time to fix it completely, since it will probably not paste properly back into an editor anyway.

As usual, no warranty or even functionality of this code is expressed or implied. Use at your own risk, especially since I'm not sure the xp_subdirs extended proc is supported and xp_cmdshell can definitely be dangerous.


/*************************************************
** get list of dirs to search for files
*************************************************/


declare @currentdir varchar(1000)
declare @fullpath varchar(1000)
declare @basepath varchar(1000)

set @basepath = 'c:\stuff'
set @fullpath = @basepath

if exists (select 1 from tempdb.dbo.sysobjects where name like '#dirs___%')
    drop table #dirs

create table #dirs
(
    directory varchar(1000)
    , done bit
)

if exists (select 1 from tempdb.dbo.sysobjects where name like '#dirtemp___%')
    drop table #dirtemp

create table #dirtemp
( directory varchar(1000) )

-- seed the work list with the top-level subdirectories
insert into #dirtemp exec master.dbo.xp_subdirs @fullpath

insert into #dirs
select
    '\' + directory as directory
    , 0 as done
from #dirtemp


Loop: -- xp_subdirs spits out errors when a dir has no subdirs, so they have to be caught with this GOTO Loop construct
while exists (select 1 from #dirs where done = 0)
begin
    if exists (select 1 from tempdb.dbo.sysobjects where name like '#subdirs___%')
        drop table #subdirs

    create table #subdirs
    ( directory varchar(255) )

    select top 1 @currentdir = directory from #dirs where done = 0

    set @fullpath = @basepath + @currentdir

    -- dry run purely to catch the no-subdirs error before the INSERT...EXEC below
    exec master.dbo.xp_subdirs @fullpath
    if (@@ERROR <> 0)
    begin
        update #dirs set done = 1 where @currentdir = directory
        GOTO Loop
    end

    insert into #subdirs exec master.dbo.xp_subdirs @fullpath
    insert into #dirs select @currentdir + '\' + s.directory, 0 as done from #subdirs s
    update #dirs set done = 1 where @currentdir = directory
end

if exists (select 1 from tempdb.dbo.sysobjects where name like '#dirtemp___%')
    drop table #dirtemp

if exists (select 1 from tempdb.dbo.sysobjects where name like '#subdirs___%')
    drop table #subdirs

/*************************************************
** get list of files
*************************************************/


declare @command varchar(1000)
declare @fileextension varchar(5)

set @fileextension = 'dts'

if exists (select 1 from tempdb.dbo.sysobjects where name like '#files___%')
    drop table #files

create table #files
(
    filepath varchar(1000)
)

-- reset the work list so every directory gets searched for files
update #dirs set done = 0

while exists (select 1 from #dirs where done = 0)
begin
    if exists (select 1 from tempdb.dbo.sysobjects where name like '#filetemp___%')
        drop table #filetemp

    create table #filetemp
    (
        filepath varchar(1000)
    )

    select top 1 @currentdir = directory from #dirs where done = 0

    -- quotes around the path in case any directory names contain spaces
    set @command = 'dir "' + @basepath + @currentdir + '\*.' + @fileextension + '" /B'

    insert into #filetemp exec master.dbo.xp_cmdshell @command

    insert into #files select @basepath + @currentdir + '\' + filepath from #filetemp

    update #dirs set done = 1 where directory = @currentdir
end

if exists (select 1 from tempdb.dbo.sysobjects where name like '#filetemp___%')
    drop table #filetemp

-- xp_cmdshell output includes NULL rows and noise like "File Not Found"; keep only real file paths
delete from #files where ISNULL(filepath, 'x') not like '%' + @fileextension + '%'

select * from #files