Tuesday, December 11, 2007

More fun with ranking functions

I don't think I really understood why Mike F at Capital IQ was so excited about the SQL 2005 ranking functions until just recently, but I'm using them all over the place now. Minutes after kicking off the time series load I described in my last post, I came across another use for RANK.

EDIT: As I review this, I recognize that this is pretty esoteric and may not have much practical application for most people. You've been warned.

Like many financial companies, we get data from multiple vendors, and we have to decide which vendor's data to use in each case. So I've got a set of data provided by multiple Vendors, grouped by ObjectId and Date, that is ranked for each date by some business logic built into the procedure that loads the data. The best (lowest, in this case) ranked row for each ObjectId and Date gets a bit flag set; we'll call it IsBest.

However, it was uncovered that there was an extra negation in the business logic for many cases, so many of the rows had the bit flag set incorrectly. I needed to find a way to determine what portion of the data had the flag set incorrectly. This could be accomplished using any number of data warehousing techniques - Analysis Services is well-suited to it, for example, or Microstrategy, or even a complicated series of GROUP BYs. But I didn't want to go to all that trouble just to produce a single ad hoc report, so I found another way.

Some (totally made up, to prevent my employer from having me whacked) sample data, in which only objectId 4 has the correct settings for the bit flag:


I needed to find rows in which the highest-ranked VendorId had the IsBest flag set for a given ObjectId and Date, and any rows in which a lower-ranked VendorId did not have the IsBest flag set. Any other combination is incorrect. If I'm only worried about how many rows need to be fixed, I can stop there, or I can divide the sum of those two rowcounts by the total number of rows to get a percentage of data that is correct.

A query to return the rowcounts follows. I'm using DENSE_RANK rather than RANK to keep the ranking correct despite duplicate rows (as in objectId 3).

SELECT
IsBest, DRank, count(*)
FROM
(SELECT
[Date]
, ObjectId
, VendorId
, IsBest
, DENSE_RANK() OVER (PARTITION BY Date, ObjectId ORDER BY Priority) as DRank
from tbSource
) as allRows
group by IsBest, DRank
order by IsBest, DRank

Additional code could be added to this query to automatically sum the correct cases and determine the percentages, but I will leave that as an exercise for the reader. Here are the raw counts, with the correct cases highlighted:

No comments: