I was creating a new set of tables (we'll say 2 for simplicity) that will contain rows that can have many-to-many relationships, and I was pondering how best to design the association table so as to optimize IO.
In this table scenario, joins can come from either direction, so it could be one row from the first table that joins to many in the second, or vice versa. I'm generally a big fan of a clustered index on some column, especially an integer ID if it's useful, but in this case, I think the best performance would be achieved by dispensing with the clustered index and ID column altogether. The basic table design would be as follows:
CREATE TABLE User
UserId INT IDENTITY(1,1)
CREATE TABLE Group
GroupId INT IDENTITY(1,1)
CREATE TABLE UserToGroup
, GroupId INT
A clustered index on UserId would speed up queries starting with a User and joining to Groups, but it would fragment as old users added groups, splitting pages to insert UserIds. The reverse would be true of clustering on GroupId. Unless the query pattern supported such an unbalanced design which would require frequent index defragmentation, either of these clustered indices would be less than ideal.
Adding a surrogate key identity column would be no better; it would prevent index fragmentation, but the index wouldn't be used for seeks, and the column would add 50% more space on disk between the columns that we really care about, making I/O proportionately slower.
The best design should be a heap, with nonclustered indices on UserId, GroupId and GroupId, UserId. Or more specifically, in SQL Server 2005, indices on UserId and GroupId, with the other column set up as an "included column" to make each index covering.
I haven't tested this yet, though, so I welcome any comments or contradictions.