NYC DBA: sql server

Showing posts with label sql server. Show all posts

Friday, August 2, 2013

Dealing with runaway DBMail messages

Say you have a particular type of alert configured on your SQL Server. Maybe it sends an email any time there's blocking that lasts longer than a particular threshold.

Then you have a long-running multi-threaded process that runs on a cluster and hammers your SQL Server with a couple hundred threads for hours, and some of them wind up blocking. You build up a lot of alerts, say 50,000. Your SMTP server can't handle all of them at once, so you (and your team) wind up getting a heavy stream of them for the next few hours or maybe even days.

Sound like fun? Yeah, not to me either.

There are multiple ways to handle this, but the cleanest is probably to kill the messages at their source. I'm assuming in this example that the messages have already been queued, so fixing your alert will have to come later. Here's the simplest way to deal with your message problem - stop your DBMail service broker objects, delete the messages, and start things back up:

use msdb

exec sysmail_stop_sp

exec sysmail_delete_mailitems_sp @sent_before = '[DATE_OF_SCREWUP]'

exec sysmail_start_sp

Your mail server admin and team members will thank you.

Tuesday, September 18, 2012

sqlcmd swallows errors by default

I'm sure this is well-known out in SQL land, but I keep forgetting about it, so I'm writing this short post as a way to cement the knowledge in my brain: sqlcmd, the current command line utility for running T-SQL and replacement for OSQL, does not return a DOS ERRORLEVEL > 0 by default, even when the command being executed fails.

You need to pass the -b flag to raise an ERRORLEVEL 1 when a real error (SQL error severity > 10) occurs, which is crucial when running batch jobs in job systems that report status based on DOS ERRORLEVELs.

Wednesday, January 4, 2012

More fun with SSIS type conversions

Just finished painfully debugging a strange SSIS issue about which I could find no documentation anywhere on the web, so I'm noting it here.

I had changed a type within a large unwieldy data flow from DT_R4 (float) to DT_R8 (double), since the smaller data type was causing some weirdness around the least significant figures on some option prices. This seemed to work in testing, but I got a cryptic error about a conditional operation failing during a derived column transformation when I deployed it to Production. I opened the package, added an error redirection for the component and a data viewer, and sure enough, it was failing on a transformation that utilized some of the columns I had modified.

I couldn't find any reason why the conditional would fail, since it was faithfully spitting out BOOLs, but the resulting transformation split the float into Integer and Decimal by casting the value into a WSTR, 20 and then finding the decimal point. After trying a dozen other things, I tried adjusting the size of the WSTR cast to a WSTR, 40. That worked.

Note that the values that were failing would not even come close to 40 characters (ex: 110.68), but apparently the possibility of running into an overflow breaks something in the SSIS runtime.

So, a word to the wise: if casting from float (DT_R4 or DT_R8) to Unicode string (WSTR), make sure you leave enough room to cast any possible value.

Wednesday, November 30, 2011

SQL could use some real string manipulation

In order to remove the first "-" from a string, this is the query I just had to run:

update tbFoo set BBG_ticker = substring(BBG_ticker, 1, charindex('-', BBG_ticker) - 1) + ' ' + substring(BBG_ticker, charindex('-', BBG_ticker)+1, LEN(BBG_ticker) - charindex('-', BBG_ticker))

Does that seem like overkill to anyone else? Let's compare it to Python...

BBG_ticker.replace("-"," ",1)

Wednesday, October 5, 2011

Loading Fixed Width files with BCP / Bulk Insert

I had to craft some format files in order to load a couple of fixed width files using the SQL Server BULK INSERT / bcp tools, and had some issues with the documentation when trying to get them to work.

So let me state this explicitly: when attempting to load a file with no line breaks, you do not need a terminator for any field in your format file. Just specify all the field lengths, set the xsi:type to CharFixed, and the whole thing should stream in. Ex:

<record>
<field id="1" type="CharFixed" length="12" />
<field id="2" type="CharFixed" length="3" />
<field id="3" type="CharFixed" length="5" />
</record>
<row>
<column source="1" name="series" type="SQLCHAR" />
<column source="2" name="pool" type="SQLCHAR" />
<column source="3" name="deal" type="SQLCHAR" />
</row>

You can still mix and match - have a terminator for the last field in each line, for example - but if your file has 0 line breaks, you don't need it.

Also, the easiest way I found to handle skipping columns was to use OPENROWSET(BULK). Just make sure you're selecting the names of the columns from the ROW section of the format file, not the RECORD section.

i.e.

SELECT series, deal FROM
OPENROWSET(BULK 'source.txt',
FORMATFILE='sample.xml'
) as x;

The documentation covers that part a bit better, but just wanted to reproduce it for my own sake, since I'm sure I'll forget how I did it between now and the next time I use BULK INSERT a few years from now.

P.S. Gotta throw in a plug for tohtml.com here for making my code actually paste into Blogger and look decent to boot.

Tuesday, August 23, 2011

SSIS, MSDTC, and SQL Server clusters

Note to self: when receiving the following error from an SSIS package, the first thing to check is the Authentication setting for MSDTC.

Description: The SSIS Runtime has failed to enlist the OLE DB connection in a distributed transaction with error 0x8004D00E "The transaction has already been implicitly or explicitly committed or aborted".

The security setting should be set to No Authentication Required. Mutual does not appear to work with clusters. This is especially significant if, like me, you have a non-clustered instance in QA and a clustered one in Prod and you have been beating your head against the wall trying to figure out why MSDTC, with all the same settings, works in the former but not the latter.

Friday, July 1, 2011

The Query Did Not Run

Why thank you, Excel, that's a very helpful error message. Except I was watching the query execute on SQL Server, and it DOES run. Perhaps there could be a different problem?

The actual, exact text of the error is

The query did not run, or the database table could not be opened.

The sheet in question was running a stored procedure, which had been working fine, albeit very slowly. A coworker tuned the proc and happened to use a temp table to accumulate the results for multiple queries instead of a UNION, producing better execution plans for each. However, despite being 4x faster, the proc stopped running from Excel, throwing the error message above.

At first I thought there might be a problem with the final DROP #table statement in the proc, so I removed that. No help. I adjusted various properties of the database connection from Excel. No change. I fiddled with the table setup in Excel. Nothing.

Then I remembered that I had some difficulties with pyodbc a while back that were resolved by setting NOCOUNT ON in the query. I can't recall the exact cause, but it's something about the driver seeing the first count returned as the query result, when of course it's not in a proc that populates a table with a few queries and then selects from it as the final step. Anyway, I set NOCOUNT ON at the beginning of the proc and OFF at the end and voila, it worked. So try that if you're having difficulties with Excel not executing your stored proc correctly.

Tuesday, June 21, 2011

So you've got a problem, and you want to solve it using SQL_VARIANT

Now you've got a bigger problem, because using SQL_VARIANT for anything is a terrible idea. It's a memory and storage hog, it generally can't be indexed, it produces inconsistent and unpredictable behavior in code, and it's not supported by most client libraries. Don't use it.

Of course, you may not have a choice, if some esteemed predecessor has implemented database structures with it. In that case, you should definitely read up on it, but I just wanted to throw out one particularly useful function that I encountered yesterday while tearing my hair out trying to solve a problem with our resident SQL_VARIANT infection.

CREATE TABLE dbo.tbFoo (

ID INT IDENTITY(1,1),

bar SQL_VARIANT)

That function would be SQL_VARIANT_PROPERTY(). The primary use of this function is to get the BaseType of the SQL_VARIANT column or variable, as in:

SELECT SQL_VARIANT_PROPERTY(bar, 'BaseType') FROM dbo.tbFoo

This will let you know whether your variable or column is actually being used as an int, varchar, nvarchar, etc. There are other properties that can be checked as well using this function; essentially anything that sp_help would usually give you.

Knowing the base type can be enormously helpful. For example, when you encounter code that joins the table to itself to compare SQL_VARIANT values directly...

SELECT * FROM dbo.tbFoo f1

INNER JOIN dbo.tbFoo f2

ON f1.bar = f2.bar

AND f1.ID < f2.ID

Now say you have dozens of processes inserting into this table, and they're not particularly standardized. Say one of them inserts non-unicode strings (VARCHAR) while another inserts unicode strings (NVARCHAR).

INSERT INTO dbo.tbFoo SELECT CAST('A1B2C3' as varchar)

INSERT INTO dbo.tbFoo SELECT CAST('A1B2C3' as nvarchar)

If you had CAST them in your query, they would be comparable, but without the cast, they may or may not be, and you might not get the row back that you expected.

The best fix is to avoid writing this code, and be sure to CAST your SQL_VARIANTs into the type you're expecting to use, but if you need to restore functionality to this type of code, you can update your SQL_VARIANT column base type.

UPDATE dbo.tbFoo SET bar = CAST(bar as NVARCHAR) WHERE SQL_VARIANT_PROPERTY(bar, 'BaseType') = 'varchar'

All this may or may not help you with your SQL_VARIANT problems, as there's plenty of unexplored territory even within the issues I've touched upon here. But it would have helped me yesterday. Good luck.

Friday, April 1, 2011

In SSIS, memory corruption error == bad parameter passing

Had a strange situation with an SSIS package that would occasionally produce the following error message from an Execute SQL task:

"Attempted to read or write protected memory. This is often an indication that other memory is corrupt."

After several rounds of machine reboots, file checking, and messing with the connection manager, I finally checked the actual query being executed (I know, way to work backwards). It had a parameter being passed in that must have been hand-entered, because it was lacking the namespace, i.e.

ParamName1

instead of

User::ParamName1.

Corrected that, and everything worked fine. So the error message is thoroughly misleading, but if you investigate the query execution, you may find a bad parameter name.

Wednesday, March 10, 2010

FILLFACTOR is really not that complicated

I keep coming across code that explicitly sets the FILLFACTOR option on an index for no good reason. FILLFACTOR is not complicated - it leaves some percentage of index pages empty in order to prevent page splits and resulting fragmentation and low page density during inserts into the middle of a key range. It can be a useful option iff:

- the table will actually get rows inserted into it!

- there is no monotonically increasing index key (like a primary key identity column)

- the author is intimately familiar with the usage pattern of the table and knows how to balance the additional I/O required to read the extra pages with the I/O saved by preventing page splits

Otherwise you're just taking an automatic IO performance hit of n% (where n is 100-(FILLFACTOR)) for no good reason.

Particularly ridiculous is the code that I've found which does something like this:

CREATE TABLE #tempstuff (foo int, bar varchar(20))

INSERT INTO #tempstuff SELECT foo, bar FROM dbo.sometable

CREATE INDEX ix_reallyNotUseful ON #tempstuff (foo, bar) WITH FILLFACTOR = 90

First off, make the index clustered. If this table will not see any more inserts, there's no reason to leave it as a heap and then have to create surrogate keys in order to build the non-clustered index. Might as well physically order it by the index keys and gain the additional performance of the clustered index.

Second, the table is not getting any more inserts after this. Why would you care about leaving extra space on the pages? Set the FILLFACTOR to 100!

This has been a public service rant from your irritable neighbor database engineer.

Thursday, January 14, 2010

Daily WTF: SSIS edition

With apologies to Alex P of the real Daily WTF, here's my latest legacy code annoyance:

We have a variety of data imports that employ multiple SSIS packages in order to break the import logic into manageable, consistent stages. So far, so good. Except that a few particularly intrepid former team members of mine preferred to construct parent packages that called child packages, complete with variable passing to maintain state awareness, directory location consistency, dynamic environment configuration, etc. instead of just calling them from the job management system in successive tasks.

Oh, and none of these packages are in source control, they're just scattered about the file system like leaves in an ill-kept yard.

All of this leads me to my current exercise in SSIS surgery - I needed to change the behavior of one of these packages, so I decided to update it to our modern standards of SSIS behavior and move it to a new location. This one has a final step that calls the next package, a process I wasn't quite prepared to rework, so I wanted to simply update the directory in which it expects to find the next package. "Hopefully it does it on the fly," I thought, "since all the related packages are kept in the same directory and it could just pull the working directory at runtime..."

No such luck.

Instead, the package uses a variable to hold the path of the next package, which is configured via an expression that pulls from another variable, which pulls from another variable, which pulls from a package configuration, which pulls from a registry string. Which I don't have on my machine.

Got that?

Registry Key -> Package Configuration -> Variable A -> Variable B -> Variable C -> Script Task

I've got news for you, former developer: it's called IO.Directory.GetCurrentDirectory().

Thursday, September 17, 2009

Regex to remove end-of-line white space

Utilizing the blog as my public scratchpad (as usual), here's a regex for removing whitespace from the end of all lines. Useful when pulling up the text of a SQL object with sp_helptext and then creating it again.

Replace: [ \t]+\n

With: \n

Wednesday, June 3, 2009

Conflicting BULK INSERTs

Let me start by saying that this is not a post about concurrent BULK INSERTs into a single table or SELECTs during a BULK INSERT, both of which the SQL CAT has covered pretty effectively, along with many other questions about bulk data loading. My problem is with concurrent BULK INSERTs into different tables on the same database.

Basically, there are two different data sets, one large and one small, that get BULK INSERTed into two different tables in the same DB on different schedules. Like Alan's bicycle in Cryptonomicon, every so often the two schedules will align and the chain will fall off, or more accurately, the big BULK INSERT will hold up the small BULK INSERT. There doesn't seem to be a reason for the SQL engine to block one with the other; they're not hitting the same table, so locking should not be a problem, and the machine in question is a dual quad-core, so SOS_SCHEDULER_YIELDs are not prevalent.

OK, if it's not the SQL engine, it's probably a shared resource issue. Sure enough, an inspection of the wait types during a concurrent execution of these inserts revealed a lot of PAGEIOLATCH_SH, indicating a disk problem. But this database is in Simple recovery mode, so it shouldn't be doing a lot of logging on a bulk-logged operation like BULK INSERT. We should just be writing dirty pages to memory and then checkpointing at our leisure, right? Why are the disks getting hit so hard? Unless we don't have enough memory to write those dirty pages...

Which a perfmon counter log revealed to be precisely the case. The box only has 3.5 GB of RAM, has a number of other applications running, and shows high levels of paging and low Page Life Expectancy. The lack of memory available to SQL means that the larger dataset being BULK INSERTed won't fit into RAM, so checkpointing starts almost immediately and blocks the smaller BULK INSERT from completing until the slow local disks can finish with the bigger one. The easiest and cheapest solution will almost certainly be to add RAM, if only I can talk Systems Engineering into digging some up.

Wednesday, April 22, 2009

Performance of LIKE vs = in SQL queries

Got a question from a coworker yesterday about the relative performance of LIKE vs = in a SQL query, and had to think about it a bit. He wanted to know whether there was a performance penalty to using LIKE if there were no wildcards. In other words, if this:

SELECT * FROM tbFoo WHERE colA LIKE 'bar'

would perform equivalently to this:

SELECT * FROM tbFoo WHERE colA = 'bar'

My knee-jerk response was that the = would be faster, but I thought about it and realized that the query optimizer would actually see them as the same thing. A check of the query plans against a quickly-created tbFoo confirmed it. So that's what I told him.

Except that I realized a moment later that there was a major caveat - the query optimization depends on how the statement is parameterized. If it's purely ad hoc SQL and being compiled at run-time, then the statements are equivalent, but if there's going to be any plan re-use, either by including the statement in a stored proc or preparing it and executing via sp_executesql, the LIKE will impose a significant penalty.

This is because the optimizer doesn't know at compile time whether the parameter to the LIKE operator will contain a wild card, so it can't use the more specific optimization (including index selection) that it could with an = operator. So if you mostly pass parameters without wildcards, you will be executing a suboptimal query plan. Keep this in mind when designing your query!

---

I realized after writing this that this topic has been covered in more depth and with more eloquence by Erland Sommarskog and any number of other SQL MVPs and bloggers, but this post is already written, so I'm releasing it into the wild anyway.

Friday, March 13, 2009

pyodbc doesn't seem to like BLOBs

I tried to pull a Binary Large OBject out of a SQL Server database yesterday via pyodbc, and it blew up.

In [4]: c = pyodbc.connect('DRIVER={SQL Server};SERVER=servername;')
In [5]: r = c.execute('select Blob from dbo.tbDoc where DocID = 8')
In [6]: row = r.fetchone()
assertion failed: getdata.cpp(171)
cbRead == 0 || pch[cbRead-element_size] != 0

And crashed my IPython shell, to boot.

Python's DB API spec (aka PEP 249) doesn't appear to specify exactly how to handle BLOBs, but it does mention them, so I presume they're supposed to be supported. I wound up grabbing this particular BLOB in a C# app instead, but this seems like it could be an issue in the future, so if anyone knows what's going on, please let me know. Otherwise I'll dig into this issue next time I have some spare time or a pressing need to manipulate BLOBs in Python.

UPDATE: I wasn't on the latest build of pyodbc, so I updated to 2.1.3. After updating, I was able to fetch the binary stream from that row. Hooray!

Although when attempting to print it out, I got some weird behavior - my PC's internal speaker started beeping like mad and the window stopped responding. As my colleague said:

P.V. [1:57 PM]:
What the hell was stored in that blob?
Ira Pfeifer [1:58 PM]:
just a hex stream
P.V. [1:58 PM]:
This is like something from Gibson.

Strange...

Monday, February 2, 2009

Weird SQLCMD issue

I've got a server that runs the open source Job Scheduler to execute scheduled tasks, and I encountered a strange problem with SQLCMD today. Basically, a very simple script fails to execute:

2009-02-02 13:52:01.224 [info] C:\Program Files\scheduler>sqlcmd -S TestServer1 -E -i "c:\temp\test.sql"
2009-02-02 13:52:01.224 [info] SCHEDULER-915 Process event
2009-02-02 13:52:01.224 [ERROR] SCHEDULER-280 Process terminated with exit code 1 (0x1)

This script executes without a problem on the command line, and

echo %errorlevel%

after its execution returns 0. I thought this might be an authentication issue, but switching to SQL authentication and entering the credentials explicitly on the command line produced the same result.

The strangest part of the whole thing is that switching from SQLCMD to OSQL makes the script work fine. I'd rather use SQLCMD, but I don't have time to investigate the minute difference in output that's causing JS to barf on it, so I guess it's back to the old reliable OSQL. At least until it's no longer supported.

Friday, December 5, 2008

When an OUTER JOIN becomes an implicit INNER JOIN

I meant to blog about SQL PASS 2008, which I attended in Seattle and where I learned some good stuff, but I haven't had time. Plus there were so many blogs and tweets from it that most of what I would have said was covered. I will note that my favorite sessions were Itzik Ben-Gan's pre-conference session on Advanced T-SQL Querying and Bob Ward's talk about SQL Server memory. If you get a chance to hear either of these guys talk, jump on it - they're both extremely sharp, and good presenters to boot.

I'm posting this quick note because I had a question from one of our developers today regarding style for various joins. He wanted to know whether a condition on a secondary table in a query should go in the ON clause or the WHERE clause. This was my response:

I prefer to see conditions on the primary table in the WHERE clause, with conditions on the secondary tables in the ON clauses. This is because a condition in the WHERE clause for the secondary table when using an OUTER JOIN will effectively convert that to an INNER JOIN, making it confusing to read the query. However, putting that condition in the ON clause keeps the OUTER JOIN functioning as expected. Does that make sense?

Here’s an example with actual tables.

Notice how the first and second queries return the same thing even though the second is a left outer join, because putting the t2 condition in the WHERE clause makes it required for the whole result set instead of just the join. In the third query, the t2 condition is in the ON clause, making it only required for that join.

create table t1
(alfa int,
bravo int)

create table t2
(alfa int,
charlie int)

insert into t1
select 1, 2
union all select 2, 3
union all select 3, 4
union all select 4, 5

insert into t2
select 1, 4
union all select 2, 9
union all select 3, 16

select * from t1
inner join t2
on t1.alfa = t2.alfa
and t2.charlie = 16
where t1.alfa < 5

select * from t1
left outer join t2
on t1.alfa = t2.alfa
where t1.alfa < 5
and t2.charlie = 16

select * from t1
left outer join t2
on t1.alfa = t2.alfa
and t2.charlie = 16
where t1.alfa < 5

drop table t1, t2

Tuesday, October 28, 2008

Optimizing tempdb and log disk performance

We implemented a new SAN fairly recently: an HP EVA. In the enthusiasm to move to this new device, some tuning decisions were overlooked or made quickly, and are now being revisited. For example, all the drives allocated to database servers were created on VRAID5 (essentially equivalent to RAID 5, refer to HP's IT Resource Center for the nuances).

In general, it's a known fact that RAID 10 (aka VRAID1) outperforms RAID 5 (aka VRAID5) for write-heavy workloads, but at a higher cost per GB. I wanted to revisit the cost-benefit analysis that led us to use RAID 5, because I've always used RAID 10 for at least logs and tempdb in the past, and for all SQL data when possible.

To make sure that my analysis was relevant to our particular IO load, I planned to generate load on each type of drive using IOMeter, for which I would need to construct an appropriate workload. I recorded a Counter Log with the following counters from each drive using Windows Performance Monitor from our main OLTP server over the course of several days.

Avg Disk Bytes/Read
Avg Disk Bytes/Write
Avg Disk Queue Length
Disk Reads/sec
Disk Writes/sec

Using these counters, I constructed a characteristic workload for each drive.

To do this, I first bucketed each reading of Avg Disk Bytes/[Read, Write] into a size category (.5K, 1K, ...64K, 128K, etc.) and chose enough categories to add up to 80% of the total in order to determine what size IOs to generate.

Then I added the averages of Disk Reads/sec and Disk Writes/sec in order to get the percentages of reads and writes against each disk, and used the average of Avg Disk Queue Length to determine how many IOs outstanding to queue up.

Finally, I ran through this whole exercise again after deciding to filter for Disk Queues > 2. I did this because I figured that the SAN probably wasn't having any trouble when disk queues were low - RAID 5 vs RAID 1 would only show a difference in latency during those times. The trouble comes when loads (for which disk queues are a proxy) are high; that's when you need maximum throughput and minimum response time.

After all this, I came up with the following characteristic workloads for our system (yours may vary substantially):

TEMPDB
Reads - 2%
- 68% 64K read size
- 32% 8K read size
Writes - 98%
- 83% 64K write size
- 17% 512B write size
600 IOs outstanding

LOGS
Reads - 36%
- 87% 128K read size
- 13% 1024K read size
Writes - 64%
- 81% 64K write size
- 19% 16K write size
4 IOs outstanding

Next I worked with our Systems Engineering group to create a VRAID5 and a VRAID1 drive of the same size and attached both to the same server. I created Access Specifications with IOMeter for each workload shown above (see image below), and then ran those Access Specifications against each drive with the requisite number of IOs outstanding.

This yielded some fairly conclusive results. I won't post the actual throughput numbers, but here are the margins by which VRAID1 outperformed VRAID5 for each workload:

TEMPDB

IOPS	MB/s	Avg IO Response Time
117.65%	117.55%	51.48%

LOGS

IOPS	MB/s	Avg IO Response Time
80.05%	65.63%	39.69%

So according to my testing, VRAID1 is vastly superior for these types of workloads, to the point where it's probably worth the higher cost per GB. We'll see how this bears out in further testing and in real-world usage shortly...

Monday, July 14, 2008

pyodbc returns an int instead of rows in a cursor

Just a quick note about pyodbc: if you're getting a TypeError from a pyodbc cursor because it's an int instead of a rowset, it's probably because the proc or query being called is returning extra info or messages. Try turning off anything that's returning extraneous info, like ANSI_WARNINGS.

Tuesday, June 10, 2008

Returning rows from a stored procedure with pyodbc, pt 2

Pyodbc really doesn't like extraneous information to be returned with its (ONE AND ONLY ONE!) result set from a stored procedure.

In my latest fight with pyodbc, I had a stored proc returning a rowset with a few thousand rows. This worked fine most of the time, except in a few cases with certain input parameters. In those cases, the cursor returned by execute() would contain an int, so any cursor method I called on it would break.

At first I thought I might have hit some kind of row size limit or rowcount limit, but realized the result set wasn't big enough for that to be the case. The proc wasn't throwing off any errors when run in SQL Management Studio with the same input parameters, so that wasn't the problem.

Then I noticed that it was returning a single warning message in the Messages tab.

Warning: Null value is eliminated by an aggregate or other SET operation.

This is a fairly common warning in SQL land, so I hadn't paid any attention to it. But it is possible to suppress, using the following statement:

SET ANSI_WARNINGS OFF

Sure enough, this did the trick. The moral of this story (just like last time) is that pyodbc wants a single rowset and that's it; returning any data in any of the alternate ODBC channels seems to break it, or at least, not yield the cursor in the result set that I expected.