
Monday, October 27, 2008

Using local disks in a cluster

With the advent of (relatively) cheap, fast Solid State Disks, DBAs everywhere are realizing that they could potentially see huge performance gains by incorporating these new disks into their database servers. This is straightforward if you've got a monolithic box: buy 2 SSDs, create a new RAID 1, move tempdb and/or logs onto it, restart SQL, shake well, and watch your IO waits go to 0! Hopefully, anyway.

However, if you've got a cluster, things are a bit trickier. The SAN vendors are still ramping up their SSD support, so it can be difficult to get an SSD into a shared storage enclosure unless you roll your own - especially if you pick up one of the most promising new disks, the FusionIO ioDrive. This disk attaches directly to the PCIe bus rather than to a SAS or SATA backplane, so it can't be put into most commercially available shared storage devices (SANs, NASs, etc.).

Even so, we were sufficiently impressed by the IO stats on this disk to pick up a couple of them, but we still wanted to put them in our production cluster. We can't use them for shared data, obviously, as the data wouldn't persist between machines. But in theory they could be perfect for tempdb, which gets wiped out on each restart and so only needs to see the same path on each machine. I figured there was probably a way to get the Microsoft Cluster Service to see the local SSDs on each machine as the same disk.

To test this theory, my coworker Doug H. and I took a pair of virtual machines, added drives to simulate a standard local disk, shared SAN storage, and the new SSD, built them into a cluster, and installed SQL Server 2005 on the R: drive:

Disk 0 (C:) - local - boot disk
Disk 1 (Q:) - shared - quorum
Disk 2 (R:) - shared - SQL data
Disk 3 (D:) - local - simulated SSD

When the VMs were initially configured, the fake SSDs were created on the same SCSI bus as the boot disk. Cluster Administrator quickly informed us that this would not fly - disks on the same bus as the boot disk cannot be used as shared resources. So we moved the D: drives to a new bus, separate from both the C: (simulated backplane) and Q:/R: (simulated SAN) drives.

The next step was to add the D: drive to the cluster. I assumed this wouldn't work without some hacking, but I decided to try a "baseline" attempt without any modifications. Cluster Administrator's New Resource wizard allowed me to add the D: drive on one of the nodes, but the resource wouldn't come online. So much for the baseline attempt.

I knew that Windows 2003 clustering (though maybe not 2008) relies on disk signatures to identify disks for cluster operations, so I decided to synchronize the signatures of the two fake SSDs. The signature is just a four-byte value stored in the MBR itself, so a sector-level tool can rewrite it, and I found one - MBRWizard - that lets you manually set the signature of an MBR disk.

Using this utility, I found the signature of the D: drive on machine #1:

mbrwiz /Disk=3 /Signature

and then overwrote the signature of the D: drive on machine #2:

mbrwiz /Disk=3 /Signature=xxxxxxxx

Then I added the disk as a cluster resource again, and brought it online. Success! Failover worked too, so I failed the SQL Server group back and forth between nodes and created a Data directory on each D: drive.


The next step was to make sure it worked with SQL Server. I took SQL offline, added disk D: as a dependency of the SQL Server resource, and brought it back up. Then I altered tempdb's location so it would use the D: drive after a restart:

alter database tempdb
modify file (NAME=tempdev, FILENAME = 'D:\Data\tempdb.mdf')
GO
alter database tempdb
modify file (NAME=templog, FILENAME = 'D:\Data\templog.ldf')
GO
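
As an aside, the dependency change doesn't have to be done through the Cluster Administrator GUI; cluster.exe can add it as well. Something like the following should do it - the resource names here are illustrative, so substitute your own:

cluster res "SQL Server" /adddep:"Disk D:"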

Finally I took SQL offline and brought it back up, with fingers crossed. And it worked! Failover worked too.
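
You can verify the new locations after the restart by checking the catalog - sys.master_files shows the paths the server actually started up with:

select name, physical_name
from sys.master_files
where database_id = db_id('tempdb')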

So my proof of concept for using a local SSD in a cluster (for tempdb only!) was successful. Stay tuned to see if it works in the real world.

Wednesday, January 17, 2007

Changing the Domain for a SQL 2005 Cluster

NOTE: It looks like my old DBAIQ blog may be down for the count due to Blogger issues, and I didn't want to lose my post on changing the domain for a SQL 2005 cluster, so I retrieved it from Google's search cache, and am reprinting it here. Ironic that Google was both the cause of its destruction (taking over Blogger and breaking our team blog) and its recovery (search cache).

We're moving our Dev environment to a new domain for some unfathomable corporate reason, and it's going to be a major headache. Among the 20+ SQL Servers in our environment, we've got a few SQL 2005 clusters. According to Microsoft, we can't move them without uninstalling SQL.

http://support.microsoft.com/kb/915846/en-us

Needless to say, this would add more pain to the migration process. It wouldn't be the end of the world, but this is weekend work, and uninstalling and re-installing a couple of active/active SQL clusters is just the thing to turn 1 day of work into 2+.

However, I think they're wrong.

I built two virtual machines on VMWare Server, added them to a cluster (see Repeatable Read for instructions on creating shared disks in VMWare) on our current domain, created some databases, and verified failover capability and connectivity.

Then I moved both nodes to the new domain using the following steps (a rough command-line equivalent of the service-related steps appears after the list):

1) Take all cluster resources offline (except the quorum, which cannot be taken offline)
2) Stop the cluster service on both nodes
3) Change the cluster service startup type to Manual
4) Change the domain of each machine to the new domain and reboot
5) After reboot, on each machine, change the cluster and SQL service accounts to accounts in the new domain
6) Run gpedit.msc or otherwise open the Local Security Policy settings, and grant the following rights:

Cluster Service Account
Act as part of the operating system
Adjust memory quotas for a process
Debug programs
Increase scheduling priority
Manage auditing and security log
Restore files and directories

SQL Service Account
Adjust memory quotas for a process
Lock pages in memory
Log on as a batch job
Log on as a service
Replace a process level token



7) Add the cluster and SQL service accounts to the local Administrators group. NOTE: This should not be necessary for SQL, and I will update this with the minimum required permissions as soon as I sort them out. It is necessary for the cluster account, however.
8) Start the cluster service on both machines
9) Bring cluster resources online
10) Go enjoy the rest of your weekend
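
For reference, most of the service-related steps can be scripted instead of clicked through. A rough sketch using the stock tools - the group name is a placeholder (repeat for each of your groups, leaving the quorum group up), and ntrights.exe comes from the Windows Server 2003 Resource Kit:

rem Steps 1-3: take each group offline, stop the cluster service, set it to Manual
cluster group "SQL Group" /offline
net stop clussvc
sc config clussvc start= demand

rem Step 6 (alternative to gpedit): grant a right from the command line, e.g.
ntrights +r SeServiceLogonRight -u NEWDOMAIN\SQLService

rem Steps 8-9: restart the cluster service on both nodes, then bring the groups online
net start clussvc
cluster group "SQL Group" /online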

If you've missed any permissions, the cluster service will likely fail to start with error 7023 or 1321, and will helpfully write an error to the system log (event ID 1234) listing the user rights that still need to be assigned. Now that's error reporting!

Comprehensive testing is still pending, but the preliminary results look good. After this process, SQL Server comes online on my test cluster, as do SQL Agent and Fulltext. I don't have any machines on the new domain with SQL Management Studio installed, but I could connect to SQL using osql directly on one of the cluster nodes. If anyone out there has any different experiences or comments, I'd love to hear them.
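
For the record, the osql check was nothing fancy - something like this, with your own virtual server name in place of SQLVS1:

osql -E -S SQLVS1 -Q "select @@servername"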

UPDATE:
My previous post left out one small but significant detail: the domain groups to which the SQL Server service accounts belong. When you install SQL 2005 on a cluster, the setup program requires a domain group to be entered for each service account. So for example:

SQL Server service account: OLDDOMAIN\SQLService
SQL Agent service account: OLDDOMAIN\SQLAgentService
SQL Browser service account: OLDDOMAIN\SQLBrowserService

Domain groups:

OLDDOMAIN\SQLServiceGroup
OLDDOMAIN\SQLAgentServiceGroup
OLDDOMAIN\FullTextGroup

Then it comes time to move your cluster, and you've followed my steps above or done your own hacking, and you've changed the service accounts to NEWDOMAIN\SQLService and so on. But the domain groups remain the same. Your cluster will come online and fail over and operate fine, but you won't be able to change its configuration.

This was made evident when I tried to add a node to an existing cluster after moving it to a new domain. It gave me a message like "Cannot add NEWDOMAIN\SQLService to OLDDOMAIN\SQLServiceGroup." Arrgh. Microsoft had already claimed that this was not supported, so I suppose I shouldn't have been surprised.

So I started searching for the reference to OLDDOMAIN\SQLServiceGroup. And couldn't find it. Not in a config file, not in a system table (I know, that was dumb, but I was desperate), and not anywhere obvious in the registry, where I expected it to be. Eventually, I started combing the registry key by key within the SQL hives and came across this in HKLM\SOFTWARE\Microsoft\Microsoft SQL Server\MSSQL.1\Setup...

(okay, I tried to upload another image here to illustrate, but Blogger hates me, so screw it.)

The keys AGTGroup, FTSGroup, and SQLGroup contain the SIDs of whatever OLDDOMAIN groups you set up when installing SQL. Find the SIDs for your new domain groups (a VBScript like the one below is how I did it), enter those in place of the old ones, restart SQL, and your cluster is moved. You should now be able to add or remove nodes, install hotfixes, etc. You'll need to update the SIDs for each SQL installation (MSSQL.2, MSSQL.3, etc.).
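
This minimal version pulls a group's SID via WMI - substitute your own domain and group names for the placeholders:

' Print the SID for a domain group using WMI.
' NEWDOMAIN and SQLServiceGroup are placeholders - use your own names.
strDomain = "NEWDOMAIN"
strGroup = "SQLServiceGroup"
Set objWMI = GetObject("winmgmts:\\.\root\cimv2")
Set objGrp = objWMI.Get("Win32_Group.Domain='" & strDomain & "',Name='" & strGroup & "'")
WScript.Echo objGrp.Name & ": " & objGrp.SID

Save it as getsid.vbs and run it with cscript on a machine joined to the new domain.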

As with any unsupported operation, your mileage may vary, but let me know if you have a different experience with this or you run into additional problems.