Tuesday, December 11, 2007

More fun with ranking functions

I don't think I really understood why Mike F at Capital IQ was so excited about the SQL 2005 ranking functions until just recently, but I'm using them all over the place now. Minutes after kicking off the time series load I described in my last post, I came across another use for RANK.

EDIT: As I review this, I recognize that this is pretty esoteric and may not have much practical application for most people. You've been warned.

Like many financial companies, we get data from multiple vendors, and we have to decide which vendor's data to use in each case. So I've got a set of data provided by multiple Vendors, grouped by ObjectId and Date, that is ranked for each date by some business logic built into the procedure that loads the data. The best (lowest, in this case) ranked row for each ObjectId and Date gets a bit flag set; we'll call it IsBest.

However, we discovered an extra negation in the business logic, so many of the rows had the bit flag set incorrectly. I needed a way to determine what portion of the data had the flag set wrong. This could be accomplished using any number of data warehousing techniques - Analysis Services is well-suited to it, for example, or MicroStrategy, or even a complicated series of GROUP BYs - but I didn't want to go to all that trouble just to produce a single ad hoc report, so I found another way.

Some (totally made up, to prevent my employer from having me whacked) sample data, in which only objectId 4 has the correct settings for the bit flag:
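The original sample table didn't survive, so here's a stand-in with values I've invented to match the description - only objectId 4 is flagged correctly, and objectId 3 contains duplicate-priority rows:

```
Date        ObjectId  VendorId  Priority  IsBest
2007-12-01  1         100       1         0      <- wrong: best rank, flag not set
2007-12-01  1         200       2         1      <- wrong: lower rank, flag set
2007-12-01  3         100       1         1
2007-12-01  3         200       1         1      <- duplicate priority, same rank
2007-12-01  3         300       2         1      <- wrong: lower rank, flag set
2007-12-01  4         100       1         1      <- correct
2007-12-01  4         200       2         0      <- correct
```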

I needed to find rows in which the highest-ranked VendorId has the IsBest flag set for a given ObjectId and Date, and rows in which a lower-ranked VendorId does not have the flag set - those are the two correct cases. Any other combination is incorrect. If I'm only worried about how many rows need to be fixed, I can stop there; otherwise, I can divide the sum of those two rowcounts by the total number of rows to get the percentage of data that is correct.

A query to return the rowcounts follows. I'm using DENSE_RANK rather than RANK to keep the ranking correct despite duplicate rows (as in objectId 3).

select IsBest, DRank, count(*)
from (
	select Date
	, ObjectId
	, VendorId
	, IsBest
	, DENSE_RANK() OVER (PARTITION BY Date, ObjectId ORDER BY Priority) as DRank
	from tbSource
) as allRows
group by IsBest, DRank
order by IsBest, DRank

Additional code could be added to this query to automatically sum the correct cases and determine the percentages, but I will leave that as an exercise for the reader. Here are the raw counts, with the correct cases highlighted:

Another cool use for RANK() - Time Series

I've got a big set of time series data across many object instances that needs to be inserted into a table that uses a trigger to set the "End Time" of each segment in the series. In other words, data like this:
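The illustration didn't survive, so here's an invented example of the shape of the data - each segment's EndTime matches the next segment's StartTime for that ObjectId, and the current segment has a NULL EndTime:

```
ObjectId  Value  StartTime            EndTime
1         10.0   2007-01-01 00:00:00  2007-02-01 00:00:00
1         10.5   2007-02-01 00:00:00  NULL
2         99.0   2007-01-15 00:00:00  NULL
```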

When a new row for a given ObjectId is inserted, the previous row in the series is given an EndTime equal to the StartTime for the new row. Usually, this is done one row at a time, with only one row coming into the table for each ObjectId at a time.
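The post doesn't show the trigger itself, but a minimal sketch of what such a trigger might look like (the table name tbTimeSeries and the trigger body are my assumptions) is:

```sql
create trigger trSetEndTime on tbTimeSeries
after insert
as
begin
	-- Close out the previous open segment for each ObjectId in this batch:
	-- set its EndTime to the newly inserted row's StartTime.
	update prev
	set prev.EndTime = i.StartTime
	from tbTimeSeries prev
	join inserted i
		on prev.ObjectId = i.ObjectId
		and prev.EndTime is null
		and prev.StartTime < i.StartTime;
end
```

A set-based trigger like this only behaves correctly when each ObjectId receives at most one new row per statement - which is exactly why the backfill below has to be partitioned by rank.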

However, in this case, I need to backfill a lot of data, which will cause the trigger to not behave as expected. I have a few options to work around this:
  1. Insert each row one at a time. Turns out there are over a million rows to be inserted and this is a heavily-used table, so this option is undesirable.
  2. Write a script to generate the EndTimes for each row ahead of time. This could be tricky and CPU-intensive, and I'd rather avoid it if possible.
  3. Partition the data such that the trigger will work as it's supposed to, but large batches of data can be inserted instead of inserting one row at a time.
I'm currently investigating option 3, especially because it lets me use the RANK() function. The basic idea is this: for each ObjectId, rank the StartTimes, then loop through and insert all rows where RANK = 1, then 2, 3, etc. This will insert all the ObjectIds with their initial StartTime, then the next StartTime, and so on, allowing the trigger to fire on each round and update the EndTimes en masse.

The query I'm using (generalized for mass consumption):

select ObjectId
, Value
, StartTime
, RANK() OVER (PARTITION BY ObjectId ORDER BY StartTime) as Ord
into tbTemp
from tbSource

Then the data in tbTemp can be looped through by rank and inserted into the destination table.
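A sketch of that loop, assuming a staging table tbTemp with an Ord column as above (the destination name tbTimeSeries and the variable names are my inventions):

```sql
declare @rank int, @maxRank int;
select @maxRank = max(Ord) from tbTemp;
set @rank = 1;

while @rank <= @maxRank
begin
	-- Each pass inserts at most one row per ObjectId, so the trigger
	-- fires once per round and closes out the previous segments en masse.
	insert into tbTimeSeries (ObjectId, Value, StartTime)
	select ObjectId, Value, StartTime
	from tbTemp
	where Ord = @rank;

	set @rank = @rank + 1;
end
```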

And the resulting data should look like this:
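The result-set illustration is also gone; with invented values, the backfilled data for one ObjectId would end up looking like this, with each EndTime supplied by the trigger from the next segment's StartTime and only the latest segment left open:

```
ObjectId  Value  StartTime            EndTime
1         10.0   2007-01-01 00:00:00  2007-02-01 00:00:00
1         10.5   2007-02-01 00:00:00  2007-03-01 00:00:00
1         11.0   2007-03-01 00:00:00  NULL
```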