Thursday, November 29, 2007

Many-to-many physical table design

I was creating a new set of tables (we'll say 2 for simplicity) that will contain rows that can have many-to-many relationships, and I was pondering how best to design the association table so as to optimize IO.

In this table scenario, joins can come from either direction, so it could be one row from the first table that joins to many in the second, or vice versa. I'm generally a big fan of a clustered index on some column, especially an integer ID if it's useful, but in this case, I think the best performance would be achieved by dispensing with the clustered index and ID column altogether. The basic table design would be as follows:

CREATE TABLE User
(
UserId INT IDENTITY(1,1)
, ...
)

CREATE TABLE Group
(
GroupId INT IDENTITY(1,1)
, ...
)

CREATE TABLE UserToGroup
(
UserId INT
, GroupId INT
, ?
)

A clustered index on UserId would speed up queries starting with a User and joining to Groups, but it would fragment as old users added groups, splitting pages to insert UserIds. The reverse would be true of clustering on GroupId. Unless the query pattern supported such an unbalanced design which would require frequent index defragmentation, either of these clustered indices would be less than ideal.

Adding a surrogate key identity column would be no better; it would prevent index fragmentation, but the index wouldn't be used for seeks, and the column would add 50% more space on disk between the columns that we really care about, making I/O proportionately slower.

The best design should be a heap, with nonclustered indices on UserId, GroupId and GroupId, UserId. Or more specifically, in SQL Server 2005, indices on UserId and GroupId, with the other column set up as an "included column" to make each index covering.

I haven't tested this yet, though, so I welcome any comments or contradictions.

Tuesday, November 27, 2007

Odd SSIS bug

Here's what happened: I added a column to a source table, went into the Data Flow source, refreshed the table list, selected the table, and checked all the output columns that I wanted. Then I resolved all the naming issues with the downstream tasks.

But I'm getting an error (#1 below) when I try to build, because for some reason SSIS will not automatically add the new column to the Error Output, and it won't let me add a column to that output, even in the Advanced Editor. This is annoying. I'm probably going to need to delete and add this data source, which is going to screw up all my transformations.


Error Message #1
Error at Data Flow Task [tbCA Source [1]]: The output column "CaxSubTypeId" (1052) on the non-error output has no corresponding output column on the error output.

Error Message #2
TITLE: Microsoft Visual Studio
------------------------------

Error at Data Flow Task [tbCA Source [1]]: The component does not allow adding columns to this input or output.