Tuesday, December 11, 2007

Another cool use for RANK() - Time Series

I've got a big set of time series data across many object instances that needs to be inserted into a table that uses a trigger to set the "End Time" of each segment in the series. In other words, data like this:

When a new row for a given ObjectId is inserted, the previous row in the series is given an EndTime equal to the StartTime for the new row. Usually, this is done one row at a time, with only one row coming into the table for each ObjectId at a time.

However, in this case, I need to backfill a lot of data, which will cause the trigger to not behave as expected. I have a few options to work around this:
  1. Insert each row one at a time. Turns out there are over a million rows to be inserted and this is a heavily-used table, so this option is undesirable.
  2. Write a script to generate the EndTimes for each row ahead of time. This could be tricky and CPU-intensive, and I'd rather avoid it if possible.
  3. Partition the data such that the trigger will work as it's supposed to, but large batches of data can be inserted instead of inserting one row at a time.
I'm currently investigating option 3, especially because it lets me use the RANK() function. The basic idea is this: for each ObjectId, rank the StartTimes, then loop through and insert all rows where RANK = 1, then 2, 3, etc. This will insert all the ObjectIds with their initial StartTime, then the next StartTime, and so on, allowing the trigger to fire on each round and update the EndTimes en masse.

The query I'm using (generalized for mass consumption):

, Value
, StartTime
, RANK() OVER (PARTITION BY ObjectId ORDER BY StartTime) as Ord

Then the data in tbTemp can be looped through by rank and inserted into the destination table.

And the resulting data should look like this:

No comments: