Showing posts with label indexing. Show all posts
Showing posts with label indexing. Show all posts

Wednesday, March 10, 2010

FILLFACTOR is really not that complicated

I keep coming across code that explicitly sets the FILLFACTOR option on an index for no good reason. FILLFACTOR is not complicated - it leaves some percentage of index pages empty in order to prevent page splits and resulting fragmentation and low page density during inserts into the middle of a key range. It can be a useful option iff:

- the table will actually get rows inserted into it!
- there is no monotonically increasing index key (like a primary key identity column)
- the author is intimately familiar with the usage pattern of the table and knows how to balance the additional I/O required to read the extra pages with the I/O saved by preventing page splits

Otherwise you're just taking an automatic IO performance hit of n% (where n is 100-(FILLFACTOR)) for no good reason.

Particularly ridiculous is the code that I've found which does something like this:

CREATE TABLE #tempstuff (foo int, bar varchar(20))

INSERT INTO #tempstuff SELECT foo, bar FROM dbo.sometable

CREATE INDEX ix_reallyNotUseful ON #tempstuff (foo, bar) WITH FILLFACTOR = 90

First off, make the index clustered. If this table will not see any more inserts, there's no reason to leave it as a heap and then have to create surrogate keys in order to build the non-clustered index. Might as well physically order it by the index keys and gain the additional performance of the clustered index.

Second, the table is not getting any more inserts after this. Why would you care about leaving extra space on the pages? Set the FILLFACTOR to 100!

This has been a public service rant from your irritable neighbor database engineer.

Thursday, November 29, 2007

Many-to-many physical table design

I was creating a new set of tables (we'll say 2 for simplicity) that will contain rows that can have many-to-many relationships, and I was pondering how best to design the association table so as to optimize IO.

In this table scenario, joins can come from either direction, so it could be one row from the first table that joins to many in the second, or vice versa. I'm generally a big fan of a clustered index on some column, especially an integer ID if it's useful, but in this case, I think the best performance would be achieved by dispensing with the clustered index and ID column altogether. The basic table design would be as follows:

CREATE TABLE User
(
UserId INT IDENTITY(1,1)
, ...
)

CREATE TABLE Group
(
GroupId INT IDENTITY(1,1)
, ...
)

CREATE TABLE UserToGroup
(
UserId INT
, GroupId INT
, ?
)

A clustered index on UserId would speed up queries starting with a User and joining to Groups, but it would fragment as old users added groups, splitting pages to insert UserIds. The reverse would be true of clustering on GroupId. Unless the query pattern supported such an unbalanced design which would require frequent index defragmentation, either of these clustered indices would be less than ideal.

Adding a surrogate key identity column would be no better; it would prevent index fragmentation, but the index wouldn't be used for seeks, and the column would add 50% more space on disk between the columns that we really care about, making I/O proportionately slower.

The best design should be a heap, with nonclustered indices on UserId, GroupId and GroupId, UserId. Or more specifically, in SQL Server 2005, indices on UserId and GroupId, with the other column set up as an "included column" to make each index covering.

I haven't tested this yet, though, so I welcome any comments or contradictions.