Friday, February 22, 2008
I've been working with all sorts of SQL CLR integration lately - User Defined Types (which are a bad idea), stored procs, and UDFs. In general, I'm pretty happy with the stored procedures, although I've run up against a number of irritating (but understandable) limitations, like the inability to dynamically load libraries and the requirement that only SQL data types be passed in and out of the CLR stored procs. But I haven't really broken any new ground, so I haven't written about it.
Today's issue is also not ground-breaking, but it is odd, so I thought I'd mention it. I wrote a function to convert a date into a decimal. Better that you not ask why. Here's the code:
using System;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

public partial class UserDefinedFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction]
    public static SqlDecimal fnTimeSeriesCalcFracYrFromDateS(SqlDateTime inputDT)
    {
        DateTime dt = (DateTime)inputDT;
        // Note: 365 / 12 is integer division (= 30), which is what produces the values shown below.
        double temp = dt.Year + (Math.Round(((double)dt.Month * (365 / 12) + (double)dt.Day) * 1000000 / 365) / 1000000);
        SqlDecimal temp2 = (SqlDecimal)temp;
        return temp2;
    }
};
Obviously I could have dispensed with almost all of that by casting the data types within a single line, but I broke it out for debugging purposes.
Here's the funny part: when placing a breakpoint at the return statement, here are the values of the variables:
inputDT {1/2/2008 12:00:00 AM} System.Data.SqlTypes.SqlDateTime
dt {1/2/2008 12:00:00 AM} System.DateTime
temp 2008.087671 double
temp2 {2008.0876710000000} System.Data.SqlTypes.SqlDecimal
All as they should be. But the output of the function is 2008.
Why? I have no idea. I'm guessing it has something to do with the SqlDecimal return type of the function, because when I changed it to SqlDouble, it returned 2008.087671 as it should. Anybody know why this is happening?
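For reference, the symptom from the T-SQL side looks something like this (assuming the function lands in the default dbo schema):
SELECT dbo.fnTimeSeriesCalcFracYrFromDateS('20080102')
-- comes back as 2008, not 2008.087671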
UPDATE: Figured it out. Visual Studio deploys the UDF without specifying a precision and scale, which causes the scale to be set to 0. One can circumvent this by dropping and re-creating the function manually within SQL Server and declaring the decimal data type with the desired precision and scale (in this case, (10,6)).
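For example, the manual re-create might look something like this - the assembly name here is just a placeholder for whatever your project actually deploys as:
DROP FUNCTION dbo.fnTimeSeriesCalcFracYrFromDateS
GO
-- [MyClrAssembly] is a placeholder assembly name
CREATE FUNCTION dbo.fnTimeSeriesCalcFracYrFromDateS (@inputDT DATETIME)
RETURNS DECIMAL(10,6)
AS EXTERNAL NAME [MyClrAssembly].[UserDefinedFunctions].[fnTimeSeriesCalcFracYrFromDateS]
GO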
However, this is annoying because it turns a one-button deploy/debug process into a multi-step process, since one can't use the test script functionality in Visual Studio to run a script with multiple "GOs" to drop and re-create the function and then run it. The workaround I devised was to
1) Deploy the function to SQL via Visual Studio.
2) Drop and re-create the function manually in SQL Management Studio.
3) Set a breakpoint in VS.
4) Attach VS to the sqlservr.exe process.
5) Run the function. It should hit the breakpoint, allowing you to debug when it has the correct variable type settings.
But it would be nice to get back to where one press of F5 deployed and debugged the whole thing. Maybe I can figure out a way to set up a custom deploy script from VS...
UPDATE 2: When attempting to deploy the CLR functions after manually dropping them and re-creating them with the desired precision on the decimal return types, you get a Deploy error like the following:
The assembly module 'fnTimeSeriesCalcFracYrFromDateS' cannot be re-deployed because it was created outside of Visual Studio. Drop the module from the database before deploying the assembly.
To get around this, you can write a SQL script to drop the functions if they exist, save it into your project directory, and use the Pre-build event command line on the Build Events tab of the project properties to execute that drop script before each deploy. One less step per build!
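A minimal version of such a drop script (the file name and the sqlcmd connection details in the comment are illustrative):
-- DropClrFunctions.sql
-- Point the Pre-build event at it with something like:
--   sqlcmd -S YourServer -d YourDatabase -E -i "$(ProjectDir)DropClrFunctions.sql"
IF OBJECT_ID(N'dbo.fnTimeSeriesCalcFracYrFromDateS') IS NOT NULL
    DROP FUNCTION dbo.fnTimeSeriesCalcFracYrFromDateS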
Next I have to figure out how to get the post-build to execute the script to drop and recreate the functions with the proper precision.
Monday, January 28, 2008
Arggh
I wrote this nice little piece of code to eliminate tons of redundant rows from a 70-million row table, tested it in a vacuum in Dev, put it in the Integration environment for 3 weeks... and had it blow up spectacularly when applied to Production this weekend because of triggers on the table that I hadn't accounted for. Stupid legacy code!
There's no real content to this story, I'm just venting. The only moral is TEST YOUR CODE.
Tuesday, December 11, 2007
More fun with ranking functions
I don't think I really understood why Mike F at Capital IQ was so excited about the SQL 2005 ranking functions until just recently, but I'm using them all over the place now. Minutes after kicking off the time series load I described in my last post, I came across another use for RANK.
EDIT: As I review this, I recognize that this is pretty esoteric and may not have much practical application for most people. You've been warned.
Like many financial companies, we get data from multiple vendors, and we have to decide which vendor's data to use in each case. So I've got a set of data provided by multiple Vendors, grouped by ObjectId and Date, that is ranked for each date by some business logic built into the procedure that loads the data. The best (lowest, in this case) ranked row for each ObjectId and Date gets a bit flag set; we'll call it IsBest.
However, it was uncovered that there was an extra negation in the business logic for many cases, so many of the rows had the bit flag set incorrectly. I needed to find a way to determine what portion of the data had the flag set incorrectly. This could be accomplished using any number of data warehousing techniques - Analysis Services is well-suited to it, for example, or Microstrategy, or even a complicated series of GROUP BYs. But I didn't want to go to all that trouble just to produce a single ad hoc report, so I found another way.
Some (totally made up, to prevent my employer from having me whacked) sample data, in which only objectId 4 has the correct settings for the bit flag:

I needed to find rows in which the highest-ranked VendorId had the IsBest flag set for a given ObjectId and Date, and any rows in which a lower-ranked VendorId did not have the IsBest flag set. Any other combination is incorrect. If I'm only worried about how many rows need to be fixed, I can stop there, or I can divide the sum of those two rowcounts by the total number of rows to get a percentage of data that is correct.
A query to return the rowcounts follows. I'm using DENSE_RANK rather than RANK to keep the ranking correct despite duplicate rows (as in objectId 3).
SELECT
    IsBest, DRank, COUNT(*)
FROM
    (SELECT
        [Date]
        , ObjectId
        , VendorId
        , IsBest
        , DENSE_RANK() OVER (PARTITION BY [Date], ObjectId ORDER BY Priority) AS DRank
    FROM tbSource
    ) AS allRows
GROUP BY IsBest, DRank
ORDER BY IsBest, DRank
Additional code could be added to this query to automatically sum the correct cases and determine the percentages, but I will leave that as an exercise for the reader. Here are the raw counts, with the correct cases highlighted:
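For the curious, a sketch of that exercise, building on the query above (a "correct" row is either IsBest = 1 with DRank = 1, or IsBest = 0 with DRank > 1):
SELECT
    100.0 * SUM(CASE WHEN (IsBest = 1 AND DRank = 1)
                       OR (IsBest = 0 AND DRank > 1) THEN 1 ELSE 0 END) / COUNT(*) AS PctCorrect
FROM
    (SELECT
        IsBest
        , DENSE_RANK() OVER (PARTITION BY [Date], ObjectId ORDER BY Priority) AS DRank
    FROM tbSource
    ) AS allRows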
Another cool use for RANK() - Time Series
I've got a big set of time series data across many object instances that needs to be inserted into a table that uses a trigger to set the "End Time" of each segment in the series. In other words, data like this:


When a new row for a given ObjectId is inserted, the previous row in the series is given an EndTime equal to the StartTime for the new row. Usually, this is done one row at a time, with only one row coming into the table for each ObjectId at a time.
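The trigger isn't shown in this post, but based on that description it's presumably something along these lines (tbDestination, the trigger name, and the column names are stand-ins for the real ones):
CREATE TRIGGER trDestination_SetEndTime ON tbDestination
AFTER INSERT
AS
BEGIN
    -- Close out the previously open segment for each ObjectId just inserted.
    -- This works when one new row per ObjectId arrives at a time; a multi-row
    -- backfill per ObjectId is exactly what confuses it.
    UPDATE prev
    SET prev.EndTime = i.StartTime
    FROM tbDestination prev
    JOIN inserted i ON prev.ObjectId = i.ObjectId
    WHERE prev.EndTime IS NULL
      AND prev.StartTime < i.StartTime
END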
However, in this case, I need to backfill a lot of data, which will cause the trigger to not behave as expected. I have a few options to work around this:
- Insert each row one at a time. Turns out there are over a million rows to be inserted and this is a heavily-used table, so this option is undesirable.
- Write a script to generate the EndTimes for each row ahead of time. This could be tricky and CPU-intensive, and I'd rather avoid it if possible.
- Partition the data such that the trigger will work as it's supposed to, but large batches of data can be inserted instead of inserting one row at a time.
The third option is the one I went with. The query I'm using (generalized for mass consumption):
INSERT INTO tbTemp
SELECT
ObjectId
, Value
, StartTime
, RANK() OVER (PARTITION BY ObjectId ORDER BY StartTime) as Ord
FROM
tbSource
Then the data in tbTemp can be looped through by rank and inserted into the destination table.
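A sketch of that loop, assuming the destination table is called tbDestination (as in the trigger sketch above) and that tbTemp picked up the Ord column from the query:
DECLARE @i INT, @maxOrd INT
SET @i = 1
SELECT @maxOrd = MAX(Ord) FROM tbTemp

WHILE @i <= @maxOrd
BEGIN
    -- Each pass inserts at most one row per ObjectId (assuming StartTimes are
    -- unique within an ObjectId), in StartTime order, so the trigger sees the
    -- same pattern as a normal single-row load.
    INSERT INTO tbDestination (ObjectId, Value, StartTime)
    SELECT ObjectId, Value, StartTime
    FROM tbTemp
    WHERE Ord = @i

    SET @i = @i + 1
END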
And the resulting data should look like this:
Thursday, November 29, 2007
Many-to-many physical table design
I was creating a new set of tables (we'll say 2 for simplicity) that will contain rows that can have many-to-many relationships, and I was pondering how best to design the association table so as to optimize IO.
In this table scenario, joins can come from either direction, so it could be one row from the first table that joins to many in the second, or vice versa. I'm generally a big fan of a clustered index on some column, especially an integer ID if it's useful, but in this case, I think the best performance would be achieved by dispensing with the clustered index and ID column altogether. The basic table design would be as follows:
CREATE TABLE User
(
UserId INT IDENTITY(1,1)
, ...
)
CREATE TABLE Group
(
GroupId INT IDENTITY(1,1)
, ...
)
CREATE TABLE UserToGroup
(
UserId INT
, GroupId INT
, ?
)
A clustered index on UserId would speed up queries starting with a User and joining to Groups, but it would fragment as old users added groups, splitting pages to insert UserIds. The reverse would be true of clustering on GroupId. Unless the query pattern supported such an unbalanced design which would require frequent index defragmentation, either of these clustered indices would be less than ideal.
Adding a surrogate key identity column would be no better; it would prevent index fragmentation, but the index wouldn't be used for seeks, and the column would add 50% more space on disk between the columns that we really care about, making I/O proportionately slower.
The best design should be a heap, with nonclustered indices on (UserId, GroupId) and (GroupId, UserId). Or, more specifically, in SQL Server 2005: an index on UserId and an index on GroupId, each with the other column set up as an "included column" so that both indexes are covering.
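Spelled out in 2005 syntax, building on the UserToGroup definition above (the index names are mine):
-- Leave UserToGroup as a heap: no clustered index, just two covering
-- nonclustered indexes, one for each join direction.
CREATE NONCLUSTERED INDEX IX_UserToGroup_UserId
    ON UserToGroup (UserId) INCLUDE (GroupId)

CREATE NONCLUSTERED INDEX IX_UserToGroup_GroupId
    ON UserToGroup (GroupId) INCLUDE (UserId)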
I haven't tested this yet, though, so I welcome any comments or contradictions.
Labels:
indexing,
physical database design,
sql server 2005
Tuesday, November 27, 2007
Odd SSIS bug
Here's what happened: I added a column to a source table, went into the Data Flow source, refreshed the table list, selected the table, and checked all the output columns that I wanted. Then I resolved all the naming issues with the downstream tasks.
But I'm getting an error (#1 below) when I try to build, because for some reason SSIS will not automatically add the new column to the Error Output, and it won't let me add a column to that output, even in the Advanced Editor. This is annoying. I'm probably going to need to delete and add this data source, which is going to screw up all my transformations.
Error Message #1
Error at Data Flow Task [tbCA Source [1]]: The output column "CaxSubTypeId" (1052) on the non-error output has no corresponding output column on the error output.
Error Message #2
TITLE: Microsoft Visual Studio
------------------------------
Error at Data Flow Task [tbCA Source [1]]: The component does not allow adding columns to this input or output.
Tuesday, October 9, 2007
off topic again
While catching up on the thousands of RSS articles that showed up in Google Reader while I was on vacation for a week, I read Scott Hanselman's post about typing speed and speech recognition and immediately had to take the test.
Number of words typed: 261
Test duration: 3 min
Speed: 87.0 words/min. (435 keystrokes/min.)
Error penalty: 10
Accuracy: 96.2%
Ha! I win! (He managed 72.6)
Actually, I'm surprised I managed 87 wpm, but I have to admit, this test was slightly stressful. I was trying hard. I probably average around 60 most of the time.