Friday, May 31, 2013

Why SQL Server Replication depends on Database Mirroring

Had a convoluted replication problem this morning that turned out to be caused by mirroring.  Users noticed a lack of data on a replicated server that was quickly traced back to replication failing to deliver commands. The Log Reader Agent noted the following:

Replicated transactions are waiting for next Log backup or for mirroring partner to catch up

There was nothing wrong with log backups, so mirroring was the next thing to be checked.  It was suspended, and attempting to resume it failed.  Checking the Mirroring Monitor yielded nothing helpful, but the SQL Server error log on the host server showed the following:



Date x/xx/xxxx x:xx:xx AM
Log SQL Server (Current - x/xx/xxxx x:xx:00 AM)

Source spid34s

Message
'TCP://foo.bar.com:5022', the remote mirroring partner for database 'SomeDB', encountered error 3624, status 1, severity 20. Database mirroring has been suspended.  Resolve the error on the remote server and resume mirroring, or remove mirroring and re-establish the mirror server instance.




On the mirrored server, there were additional errors, including a stack dump, but this was the most relevant:



Date x/xx/xxxx x:xx:xx AM
Log SQL Server (Current - x/xx/xxxx x:xx:00 AM)

Source spid39s

Message
Error: 3624, Severity: 20, State: 1.

---


Date x/xx/xxxx x:xx:xx AM
Log SQL Server (Current - x/xx/xxxx x:xx:00 AM)

Source spid39s

Message
SQL Server Assertion: File: , line=807 Failed Assertion = 'result == LCK_OK'. This error may be timing-related. If the error persists after rerunning the statement, use DBCC CHECKDB to check the database for structural integrity, or restart the server to ensure in-memory data structures are not corrupted.

Severity 20 == bad.  This basically comes down to corruption on the mirror target, which irreparably broke mirroring.  This, in turn, broke replication.  

So why, the technical user asked, does replication between business-critical Production servers depend on mirroring to a disaster recovery environment?  I had to think about that one, but there's a good reason.


Assume replication does not depend on mirroring.  Server A is both the replication publisher and mirror source.  It replicates rows to Server B and mirrors to Server C.  Mirroring breaks at time t0.  Rows continue to replicate, and are marked as replicated on Server A.

Server A fails over to Server C at time t1.  Replication picks up from there with Server C as published.  Server C republishes the rows between t0 and t1 to Server B because it does not have a record of those rows having been replicated.

Makes sense now, right?

Tuesday, September 18, 2012

sqlcmd swallows errors by default

I'm sure this is well-known out in SQL land, but I keep forgetting about it, so I'm writing this short post as a way to cement the knowledge in my brain: sqlcmd, the current command line utility for running T-SQL and replacement for OSQL, does not return a DOS ERRORLEVEL > 0 by default, even when the command being executed fails.

You need to pass the -b flag to raise an ERRORLEVEL 1 when a real error (SQL error severity > 10) occurs, which is crucial when running batch jobs in job systems that report status based on DOS ERRORLEVELs.

Thursday, July 12, 2012

pyodbc autocommits when using "with"

Discovered an interesting variation in pyodbc behavior today.  A coworker asked me if he needed to commit explicitly when using pyodbc to execute a stored procedure.  I told him he did, but he protested that his code worked fine without it.  Here's a paraphrased version:

def db_exec(qry):
    with pyodbc.connect(cnxn_str) as db:
        db.execute(proc)

I tested it, and what do you know?  It works.  But the following does not:

db = pyodbc.connect(cnxn_str)
db.execute(proc)
db.close()

In this case, the results of proc will be rolled back automatically when closing the connection.  So evidently the way that the with handler is coded includes an autocommit.

Wednesday, June 20, 2012

Stripping time from GETDATE(), 2008 edition

Stripping the time from GETDATE() to get the date at 00:00:00 is a common practice, and has been accomplished by various methods in the past, including CASTing to VARCHAR and back, using DATEADD and DATEDIFF, and using FLOOR after casting to INT.  It should be less necessary these days since the introduction of the DATE data type, but sometimes it's still a useful comparison trick.

And now because of DATE there's a simpler way to do the conversion: CAST(@dt as DATE).  But is this any faster?

The short answer is no, but it doesn't seem to be any slower either.  I ran a quick test:


declare @dtest datetime
declare @i int = 0, @j int = 0
declare @start datetime

set @start = GETDATE()

while @i < 1000000
begin
    set @dtest = DATEADD(dd, 0, DATEDIFF(dd, 0, GETDATE()))
    set @i = @i + 1
end
    
print datediff(ms, @start, getdate())
set @start = GETDATE()

while @j < 1000000
begin
    set @dtest = CAST(GETDATE() as DATE)
    set @j = @j + 1
end

print datediff(ms, @start, getdate())

In half a dozen runs, the two versions never varied by more than a handful of ms.  Feel free to try on your own and let me know if you see anything different.

Friday, March 30, 2012

Another hazard of Laptopistan - slow WiFi

This is hardly unexpected, but one of the side effects of everyone using Starbucks as a de facto workplace and internet cafe is that their WiFi is heavily stressed. People watching online video, using Remote Desktop (like me), probably even BitTorrent. I don't know what sort of routers they put in these coffee shops, but they often seem to be unequal to the load they're handling.

This morning I'm in a Starbucks, and my connection went from perfectly fine, at 8:30 AM before anyone was here, to barely usable now that I'm surrounded by laptops. One more reason for actual workspace.

Friday, March 16, 2012

Sliding Window partitioning

Implemented my first "sliding window" partitioned table. Wound up being easier than I expected, especially since the table will only have 2 partitions for now - Current and Archive.

Most sliding window schemes I've seen out there on the intertubes make use of a date as the partition key and update the sliding window based on the current date. In this case, the data comes in somewhat sporadically, so although it's date-based (a month key), I made the partitioning data-driven instead of date-driven. So for starters we have a RANGE LEFT scheme like this:

Partition Month Key
Current -
Archive 88

Say we're currently on month 90. So months 89 and 90 are in Current, and 1-88 are in Archive. When the sliding window proc runs, it does essentially this:

SELECT @NewBoundary = MAX(monthKey)-2 FROM tbPartitionedFoo
SELECT @CurBoundary = CAST(prv.value as smallint) FROM sys.partition_functions AS pf
JOIN sys.partition_range_values as prv
ON prv.function_id = pf.function_id
WHERE pf.name = 'partfn_foo'

(If there was more than one partition range value, a MIN() or something would be needed to grab the appropriate value.)

IF @NewBoundary > @CurBoundary
BEGIN
ALTER PARTITION FUNCTION partfn_foo() SPLIT RANGE (@NewBoundary)
ALTER PARTITION FUNCTION partfn_foo() MERGE RANGE (@CurBoundary)
END

So when we start getting monthKey 91 data and the maintenance proc runs, it will do this:

IF 89 > 88
BEGIN
ALTER PARTITION FUNCTION partfn_foo() SPLIT RANGE (89)

Now we have 3 partitions:

Partition Month Key
Current -
temp 89
Archive 88

ALTER PARTITION FUNCTION partfn_foo() MERGE RANGE (88)
END

And now we're back to 2:

Partition Month Key
Current -
Archive 89

Monday, March 5, 2012

Python Gets Things Done

A quick example of why I've grown to love Python.

My girlfriend had a laboriously scanned document in two pieces that needed to be combined. I had a .NET command line utility that I wrote a couple years ago that does just that, but because one of the scanned pages was a different size, it got cut off in the resulting output file. So I could either try to dig up some docs on the library I used, play around with Intellisense in Visual Studio, or see if there was a Python library that might work.

2 minutes of Googling later I had pyPDF, and maybe 10 minutes after that, I had reset the mediaBox on the outsized page, written it to a new pdf file, and emailed it off. Q.E.D.