Friday, May 31, 2013

Why SQL Server Replication depends on Database Mirroring

Had a convoluted replication problem this morning that turned out to be caused by mirroring.  Users noticed a lack of data on a replicated server that was quickly traced back to replication failing to deliver commands. The Log Reader Agent noted the following:

Replicated transactions are waiting for next Log backup or for mirroring partner to catch up

There was nothing wrong with log backups, so mirroring was the next thing to be checked.  It was suspended, and attempting to resume it failed.  Checking the Mirroring Monitor yielded nothing helpful, but the SQL Server error log on the host server showed the following:

Date x/xx/xxxx x:xx:xx AM
Log SQL Server (Current - x/xx/xxxx x:xx:00 AM)

Source spid34s

'TCP://', the remote mirroring partner for database 'SomeDB', encountered error 3624, status 1, severity 20. Database mirroring has been suspended.  Resolve the error on the remote server and resume mirroring, or remove mirroring and re-establish the mirror server instance.

On the mirrored server, there were additional errors, including a stack dump, but this was the most relevant:

Date x/xx/xxxx x:xx:xx AM
Log SQL Server (Current - x/xx/xxxx x:xx:00 AM)

Source spid39s

Error: 3624, Severity: 20, State: 1.


Date x/xx/xxxx x:xx:xx AM
Log SQL Server (Current - x/xx/xxxx x:xx:00 AM)

Source spid39s

SQL Server Assertion: File: , line=807 Failed Assertion = 'result == LCK_OK'. This error may be timing-related. If the error persists after rerunning the statement, use DBCC CHECKDB to check the database for structural integrity, or restart the server to ensure in-memory data structures are not corrupted.

Severity 20 == bad.  This basically comes down to corruption on the mirror target, which irreparably broke mirroring.  This, in turn, broke replication.  

So why, the technical user asked, does replication between business-critical Production servers depend on mirroring to a disaster recovery environment?  I had to think about that one, but there's a good reason.

Assume replication does not depend on mirroring.  Server A is both the replication publisher and mirror source.  It replicates rows to Server B and mirrors to Server C.  Mirroring breaks at time t0.  Rows continue to replicate, and are marked as replicated on Server A.

Server A fails over to Server C at time t1.  Replication picks up from there with Server C as published.  Server C republishes the rows between t0 and t1 to Server B because it does not have a record of those rows having been replicated.

Makes sense now, right?

1 comment:

Ellis White said...
This comment has been removed by a blog administrator.