Friday, August 2, 2013

Dealing with runaway DBMail messages

Say you have a particular type of alert configured on your SQL Server.  Maybe it sends an email any time there's blocking that lasts longer than a particular threshold.

Then you have a long-running multi-threaded process that runs on a cluster and hammers your SQL Server with a couple hundred threads for hours, and some of them wind up blocking.  You build up a lot of alerts, say 50,000.  Your SMTP server can't handle all of them at once, so you (and your team) wind up getting a heavy stream of them for the next few hours or maybe even days.

Sound like fun?  Yeah, not to me either.

There are multiple ways to handle this, but the cleanest is probably to kill the messages at their source. I'm assuming in this example that the messages have already been queued, so fixing your alert will have to come later.  Here's the simplest way to deal with your message problem - stop your DBMail service broker objects, delete the messages, and start things back up:

use msdb

exec sysmail_stop_sp

exec sysmail_delete_mailitems_sp @sent_before = '[DATE_OF_SCREWUP]'

exec sysmail_start_sp

Your mail server admin and team members will thank you.

Testing something

{
"Python":8,
"C++":4,
"SQL":9,
"NoSQL":5,
"C#":6
}

Friday, July 26, 2013

Commandments for scraping public data sources

I've used a couple different flavors of publicly available data as data sources, and I can tell you that there's a good reason why firms pay for clean data.  Not that even vendor-processed data is usually 100% clean, but that's a different story.  Anyway, I've learned a few things the hard way about scraping public data:

  1. Use a resilient HTML parsing engine.  BeautifulSoup is great, and makes it very easy to explore HTML structures, but you'll almost certainly want the LXML backend to avoid blowing up on unclosed tags and nonstandard nesting.

  2. Always call strip().  I've had a couple of import busts in which things weren't matching that really looked like they should.  90% of the time it was because there was random whitespace that had infiltrated the actual data.  This also leads me to...

  3. Anticipate spacing changes. ESPECIALLY with hand-entered data, even if there's a template, there can always be an extra line between the logo and the header, a blank line between data rows, or new spacing for the file date.  Whenever possible, search for a reference value that points the way to the data, and throw away those newlines rather than expecting data by default.

  4. Keep original copies. The first thing you should do with any parsed source is make a raw copy of it (if you can) so that you can refer back to it when #5 happens.

  5. Expect change!  No warranty for most public data sources is granted, implied, or hinted at, and sometimes the exact opposite is the case: public data "providers" happily change formats, addresses, and datasets available to stymie anyone making systematic use of such.  Make sure you've got good logging and debugging set up so you can figure out quickly where and how something changed.

  6. Make things modular.  Building on #5 above, if you've got your parser set up to quickly swap out the downloader, parser, normalizer, or persister (for example) for any given source and toggle those sources on and off easily, it'll make things much easier when you need to quickly hack up a new one because somebody changed something somewhere.

  7. Handle multiple formats.  I've parsed many Excel spreadsheets that usually have an XLDate in a column but, every so often, spit out a text string instead because of the way someone typed it or copied it in.  If your date parser function knows about this, you don't have to think about it.

  8. Don't hammer.  When you're using someone else's data that they're not explicitly making available in a structured format, be nice, download one copy and build your import off of that.  Otherwise you're putting undue load on their servers and exposing your IP to possible banning if they decide you're violating their ToS.

  9. Inheritance is your friend.  Most of the parsers I've written have a lot of common structure that needs a little tweaking for certain sources.  If you've built a solid class hierarchy, you can easily override that save() method in the 2 subclasses that need it while only needing to write a basic one for the other dozen.  Any time I find myself copying code, I generally try to move it up to the superclass.

  10. Pad your dev time estimate!  This is a general problem I have, but I always look at a source, pretty quickly shred the data out with bs4 or something similar, and go "yeah, this'll take 2 days."  Sure, to code-complete.  Then it'll take a week to figure out every stupid corner case, extra whitespace location, and placeholder for None.  Trust me (and this goes double for you, future me, when you're re-reading this).

Wednesday, June 26, 2013

GetBytes error when using a DATE column

Quick troubleshooting post. Given this error from a colleague upon execution of a pretty simple function that used a DATE column:

An error occurred while executing batch. Error message is: Invalid attempt to GetBytes on column 'RelDate'.  The GetBytes function can only be used on columns of type Text, NText, or Image.

I couldn't reproduce the problem on my machine - the same function worked fine for me.  Checked all his query execution settings and didn't see anything, and then he realized he was running SQL Management Studio 2005 connected to a 2008 R2 instance.  As soon as he fired up SMS 2008 R2, the function worked fine.

Friday, May 31, 2013

Why SQL Server Replication depends on Database Mirroring

Had a convoluted replication problem this morning that turned out to be caused by mirroring.  Users noticed a lack of data on a replicated server that was quickly traced back to replication failing to deliver commands. The Log Reader Agent noted the following:

Replicated transactions are waiting for next Log backup or for mirroring partner to catch up

There was nothing wrong with log backups, so mirroring was the next thing to be checked.  It was suspended, and attempting to resume it failed.  Checking the Mirroring Monitor yielded nothing helpful, but the SQL Server error log on the host server showed the following:



Date x/xx/xxxx x:xx:xx AM
Log SQL Server (Current - x/xx/xxxx x:xx:00 AM)

Source spid34s

Message
'TCP://foo.bar.com:5022', the remote mirroring partner for database 'SomeDB', encountered error 3624, status 1, severity 20. Database mirroring has been suspended.  Resolve the error on the remote server and resume mirroring, or remove mirroring and re-establish the mirror server instance.




On the mirrored server, there were additional errors, including a stack dump, but this was the most relevant:



Date x/xx/xxxx x:xx:xx AM
Log SQL Server (Current - x/xx/xxxx x:xx:00 AM)

Source spid39s

Message
Error: 3624, Severity: 20, State: 1.

---


Date x/xx/xxxx x:xx:xx AM
Log SQL Server (Current - x/xx/xxxx x:xx:00 AM)

Source spid39s

Message
SQL Server Assertion: File: , line=807 Failed Assertion = 'result == LCK_OK'. This error may be timing-related. If the error persists after rerunning the statement, use DBCC CHECKDB to check the database for structural integrity, or restart the server to ensure in-memory data structures are not corrupted.

Severity 20 == bad.  This basically comes down to corruption on the mirror target, which irreparably broke mirroring.  This, in turn, broke replication.  

So why, the technical user asked, does replication between business-critical Production servers depend on mirroring to a disaster recovery environment?  I had to think about that one, but there's a good reason.


Assume replication does not depend on mirroring.  Server A is both the replication publisher and mirror source.  It replicates rows to Server B and mirrors to Server C.  Mirroring breaks at time t0.  Rows continue to replicate, and are marked as replicated on Server A.

Server A fails over to Server C at time t1.  Replication picks up from there with Server C as published.  Server C republishes the rows between t0 and t1 to Server B because it does not have a record of those rows having been replicated.

Makes sense now, right?