Thursday, December 22, 2011

pyodbc isn't playing well with multiprocessing

Wrote my first Python script yesterday that incorporates the multiprocessing module in order to parallelize some CPU-intensive calculations that were going extremely slowly when run in the database. Mission accomplished there, and it was remarkably easy in Python, but then I hit a brick wall when trying to load everything back to the database.

As it turns out, the total run time is dominated by the database IO (~25 mil rows in and out), so I went about trying to optimize that as best I could. The reads are coming off an SSD, so they're fine, but the writes are UPDATEs, and I didn't want to issue them one at a time and incur all the connection and transaction overhead for each one. Instead, I created a staging table as a heap and batched up INSERTs of my data in a specified batch size (1000 to start) using SQL 2008 row constructors. Then the main table gets updated from the staging table right at the end, so only one giant transaction and UPDATE statement is necessary.
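Roughly, the batching side of it looks like this - a minimal sketch, with the connection string, staging table, and column names made up for illustration:

import pyodbc

BATCH_SIZE = 1000  # rows per INSERT; also the max a SQL 2008 row constructor allows

def insert_batches(rows, conn_str):
    # rows is a list of (id, calculated_value) tuples produced by the calc step
    conn = pyodbc.connect(conn_str, autocommit=True)
    cursor = conn.cursor()
    for start in range(0, len(rows), BATCH_SIZE):
        batch = rows[start:start + BATCH_SIZE]
        values = ", ".join("(%d, %.10f)" % (row_id, calc) for row_id, calc in batch)
        cursor.execute("INSERT INTO dbo.tbStaging (ID, CalcValue) VALUES " + values)
    conn.close()

# The very last step is a single set-based statement, something like:
# UPDATE m SET m.CalcValue = s.CalcValue
# FROM dbo.tbMain m JOIN dbo.tbStaging s ON m.ID = s.ID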

This strategy appeared to work fine when single-threading, but then I figured I'd throw in multiprocessing there, too, since I already had the module imported and had crafted the functions around it. Didn't work.

The error message was 42S02, aka Invalid object name - in other words, the INSERT can't find the staging table, even though I'm creating it within the Python script, during a single-threaded section before I Pool.map() out different calculated shards to be inserted. SQL Server can obviously handle about as many concurrent connections as you want to throw at it, so that's not the problem. The table exists, which I confirmed by adding a breakpoint (well, a set_trace()) and looking at it through Management Studio. It's all using integrated authentication, and I'm a sysadmin on the server, so there shouldn't be any permission or schema-resolution problems, although I threw in some "dbo." references just to make sure. I even tried to trace the SQLNCLI driver operations, but haven't managed to get usable output yet.
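For reference, the overall shape of the script is roughly this (heavily simplified, names made up, and with each worker opening its own pyodbc connection rather than trying to share one across processes):

from multiprocessing import Pool
import pyodbc

CONN_STR = ("DRIVER={SQL Server Native Client 10.0};"
            "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes")

def insert_shard(shard):
    # shard is a list of (id, calculated_value) tuples; the batching from the
    # sketch above is omitted here to keep things short
    conn = pyodbc.connect(CONN_STR, autocommit=True)
    cursor = conn.cursor()
    for row_id, calc in shard:
        cursor.execute("INSERT INTO dbo.tbStaging (ID, CalcValue) VALUES (?, ?)",
                       row_id, calc)  # this is where the 42S02 shows up
    conn.close()

if __name__ == "__main__":
    # single-threaded section: the staging heap is created before the pool exists
    setup = pyodbc.connect(CONN_STR, autocommit=True)
    setup.cursor().execute("CREATE TABLE dbo.tbStaging (ID INT, CalcValue FLOAT)")
    setup.close()

    shards = [[(1, 0.5), (2, 0.25)]]  # placeholder; real shards come from the calc step
    pool = Pool()
    pool.map(insert_shard, shards)
    pool.close()
    pool.join()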

So the only thing I can come up with so far is an incompatibility between pyodbc and multiprocessing. If I get more time to investigate, I'll update the post, but for now I'm just going to have to switch the DB section back to single-threaded. Boo.

Wednesday, November 30, 2011

SQL could use some real string manipulation

In order to replace the first "-" in a string with a space, this is the query I just had to run:

UPDATE tbFoo
SET BBG_ticker = SUBSTRING(BBG_ticker, 1, CHARINDEX('-', BBG_ticker) - 1)
    + ' '
    + SUBSTRING(BBG_ticker, CHARINDEX('-', BBG_ticker) + 1, LEN(BBG_ticker) - CHARINDEX('-', BBG_ticker))

Does that seem like overkill to anyone else? Let's compare it to Python...

BBG_ticker.replace("-"," ",1)


Saturday, October 22, 2011

Meeting new people in Laptopistan

This morning I'm back at Starbucks, as we haven't yet moved closer in to DC where there are alternatives to corporate coffee world, and I made the acquaintance of a woman who

1) made me explain the meaning of my t-shirt
2) asked to use my laptop (instead of hers, for some reason?) to check her email
3) inquired about what I do for a living

I'm pretty sure the next step will be to introduce me to her daughter/niece/granddaughter who's been looking for a nice Jewish boy (which I'm not, but I'm sure it wouldn't matter).

Well, I suppose it's more interesting than doing R regressions while sitting at home.

Friday, October 14, 2011

Life as a contractor

It's my own fault that I'm a contractor and not a full employee - a year ago, my girlfriend decided to go back to school in another state, and I followed her there. My office was gracious (desperate?) enough to keep me on as a remote employee, but I was converted to contractor status, for both employment flexibility (i.e. they can let me go easily if it doesn't work out) and tax liability purposes.

For the most part, this has worked out fine. Working from home can be a double-edged sword, but the non-existent commute is great and the ability to work from alternate locations seamlessly has been pretty nice - this summer we picked up and moved to Denver for 10 weeks with no interruption of my work, only a slight shift in working hours due to the time zone change.

But there are some definite drawbacks. My taxes are far more complicated now, I don't get employer-sponsored health insurance, and there's a bit of a disconnect from the rest of my team and the firm at large. This is both practical - I have difficulty hearing what's going on in many of my meetings, since I'm the only one on the phone instead of there in person - and psychological.

The latest case in point, and the catalyst for this post, occurred this morning. The chairman of my firm sent out an email announcing a minor contest sort of thing in which employees can win an electronic gizmo that shall remain nameless in exchange for updating their employee profile. I dutifully went to update mine, which I hadn't looked at since I was a Full-Time Employee... and found that I no longer have one. This is far from a big deal, but it's a reminder of the fact that although I work like an employee (I'm on a fixed rate, not an hourly one), consider myself part of the firm, and try to act in its best interest, there's a psychic distance between me and the Real Employees.

There are more examples. The firm celebrated its 20th birthday a short time ago, and Employees all received small gift bags. It should probably go without saying that I did not. On one's 5th anniversary as an Employee, one receives a very small token of acknowledgement of service to the firm. My 5th year is coming up, but I presume it will be sans token.

I assume this would be different at firms in which contracting is more widespread - contractors would be held either closer or farther away - but my situation is unusual at my office, so it's not worth HR's time to hold my hand through any weird episodes. Also, these are minor enough issues that I would feel ridiculous complaining about them in person, so I'm using this medium to work through them a little.

Anyway, I know that my blog is occasionally perused by people from my office, so let me reiterate that this is not a big deal and not intended to be a passive-aggressive complaint, it's just some musings on this strange state of employment that I, and a growing number of others, find myself in.

Wednesday, October 5, 2011

Loading Fixed Width files with BCP / Bulk Insert

I had to craft some format files in order to load a couple of fixed width files using the SQL Server BULK INSERT / bcp tools, and had some issues with the documentation when trying to get them to work.

So let me state this explicitly: when attempting to load a file with no line breaks, you do not need a terminator for any field in your format file. Just specify all the field lengths, set the xsi:type to CharFixed, and the whole thing should stream in. Ex:

<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
    <FIELD ID="1" xsi:type="CharFixed" LENGTH="12"/>
    <FIELD ID="2" xsi:type="CharFixed" LENGTH="3"/>
    <FIELD ID="3" xsi:type="CharFixed" LENGTH="5"/>
  </RECORD>
  <ROW>
    <COLUMN SOURCE="1" NAME="series" xsi:type="SQLCHAR"/>
    <COLUMN SOURCE="2" NAME="pool" xsi:type="SQLCHAR"/>
    <COLUMN SOURCE="3" NAME="deal" xsi:type="SQLCHAR"/>
  </ROW>
</BCPFORMAT>

You can still mix and match - have a terminator for the last field in each line, for example - but if your file has 0 line breaks, you don't need it.
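For completeness, the load statement that actually consumes the format file looks something like this - shown here driven through pyodbc, with the paths and table name made up:

import pyodbc

conn = pyodbc.connect("DSN=MyServer;Trusted_Connection=yes", autocommit=True)
conn.cursor().execute("""
    BULK INSERT dbo.tbDeals
    FROM 'C:\\loads\\source.txt'
    WITH (FORMATFILE = 'C:\\loads\\sample.xml')
""")
conn.close()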

Also, the easiest way I found to handle skipping columns was to use OPENROWSET(BULK). Just make sure you're selecting the names of the columns from the ROW section of the format file, not the RECORD section.

i.e.


SELECT series, deal FROM
OPENROWSET(BULK 'source.txt',
FORMATFILE='sample.xml'
) as x;

The documentation covers that part a bit better, but just wanted to reproduce it for my own sake, since I'm sure I'll forget how I did it between now and the next time I use BULK INSERT a few years from now.

P.S. Gotta throw in a plug for tohtml.com here for making my code actually paste into Blogger and look decent to boot.

Laptopistan upgrade

3G vs 4G tethering: night and day. I can now click on things through my remote desktop session and have them actually respond, instead of doing the mental 2-count (or switching to Google Reader while waiting, which is always a productivity killer).

I have some concerns about the battery life of my new HTC Thunderbolt, and I'm not blown away by the form factor, but so far I definitely prefer it to my iPhone 3GS. And the 4G factor alone makes it totally worthwhile.

Tuesday, August 23, 2011

SSIS, MSDTC, and SQL Server clusters

Note to self: when receiving the following error from an SSIS package, the first thing to check is the Authentication setting for MSDTC.

Description: The SSIS Runtime has failed to enlist the OLE DB connection in a distributed transaction with error 0x8004D00E "The transaction has already been implicitly or explicitly committed or aborted".

The security setting should be set to No Authentication Required; Mutual Authentication Required does not appear to work with clusters. This is especially significant if, like me, you have a non-clustered instance in QA and a clustered one in Prod and you have been beating your head against the wall trying to figure out why MSDTC, with all the same settings, works in the former but not the latter.

Friday, July 1, 2011

The Query Did Not Run

Why thank you, Excel, that's a very helpful error message. Except I was watching the query execute on SQL Server, and it DOES run. Perhaps there could be a different problem?

The actual, exact text of the error is

The query did not run, or the database table could not be opened.

The sheet in question was running a stored procedure, which had been working fine, albeit very slowly. A coworker tuned the proc and happened to use a temp table to accumulate the results for multiple queries instead of a UNION, producing better execution plans for each. However, despite being 4x faster, the proc stopped running from Excel, throwing the error message above.

At first I thought there might be a problem with the final DROP #table statement in the proc, so I removed that. No help. I adjusted various properties of the database connection from Excel. No change. I fiddled with the table setup in Excel. Nothing.

Then I remembered that I had some difficulties with pyodbc a while back that were resolved by setting NOCOUNT ON in the query. I can't recall the exact cause, but it's something about the driver treating the first rows-affected count it gets back as the query result - which of course it isn't, in a proc that populates a temp table with a few queries and then selects from it as the final step. Anyway, I set NOCOUNT ON at the beginning of the proc and OFF at the end and voila, it worked. So try that if you're having difficulties with Excel not executing your stored proc correctly.
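The pyodbc side of that, for my own future reference, looks something like this (the proc name and connection string are made up):

import pyodbc

conn = pyodbc.connect("DSN=MyServer;Trusted_Connection=yes")
cursor = conn.cursor()

# Option 1: suppress the rows-affected counts so the SELECT is the first result
cursor.execute("SET NOCOUNT ON; EXEC dbo.spGetResults;")
rows = cursor.fetchall()

# Option 2: leave the counts alone and skip past them to the real result set
cursor.execute("EXEC dbo.spGetResults;")
while cursor.description is None and cursor.nextset():
    pass
rows = cursor.fetchall()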

Tuesday, June 21, 2011

So you've got a problem, and you want to solve it using SQL_VARIANT

Now you've got a bigger problem, because using SQL_VARIANT for anything is a terrible idea. It's a memory and storage hog, it generally can't be indexed, it produces inconsistent and unpredictable behavior in code, and it's not supported by most client libraries. Don't use it.

Of course, you may not have a choice, if some esteemed predecessor has implemented database structures with it. In that case, you should definitely read up on it, but I just wanted to throw out one particularly useful function that I encountered yesterday while tearing my hair out trying to solve a problem with our resident SQL_VARIANT infection.

CREATE TABLE dbo.tbFoo (
ID INT IDENTITY(1,1),
bar SQL_VARIANT)

That function would be SQL_VARIANT_PROPERTY(). The primary use of this function is to get the BaseType of the SQL_VARIANT column or variable, as in:

SELECT SQL_VARIANT_PROPERTY(bar, 'BaseType') FROM dbo.tbFoo

This will let you know whether your variable or column is actually being used as an int, varchar, nvarchar, etc. There are other properties that can be checked as well using this function; essentially anything that sp_help would usually give you.

Knowing the base type can be enormously helpful. For example, when you encounter code that joins the table to itself to compare SQL_VARIANT values directly...

SELECT * FROM dbo.tbFoo f1
INNER JOIN dbo.tbFoo f2
ON f1.bar = f2.bar
AND f1.ID < f2.ID

Now say you have dozens of processes inserting into this table, and they're not particularly standardized. Say one of them inserts non-unicode strings (VARCHAR) while another inserts unicode strings (NVARCHAR).

INSERT INTO dbo.tbFoo SELECT CAST('A1B2C3' as varchar)
INSERT INTO dbo.tbFoo SELECT CAST('A1B2C3' as nvarchar)

If you had CAST them in your query, they would be comparable, but without the cast the varchar and nvarchar values fall into different sql_variant type families and won't compare as equal, so you might not get the row back that you expected.

The best fix is to avoid writing this code, and be sure to CAST your SQL_VARIANTs into the type you're expecting to use, but if you need to restore functionality to this type of code, you can update your SQL_VARIANT column base type.

UPDATE dbo.tbFoo
SET bar = CAST(bar as NVARCHAR(4000))
WHERE CAST(SQL_VARIANT_PROPERTY(bar, 'BaseType') AS sysname) = 'varchar'

All this may or may not help you with your SQL_VARIANT problems, as there's plenty of unexplored territory even within the issues I've touched upon here. But it would have helped me yesterday. Good luck.

Wednesday, May 11, 2011

The Rare, Simple SQL Wait Fix

A "power user" reported some slowness on one of our big servers today while it was being pummeled by about 60 simultaneous executions of the same script. The wait type was PAGELATCH_UP, which I vaguely recalled to mean something about allocation contention. The Microsoft Waits and Queues tuning guide confirmed that, so I checked the wait resources. They all started with 2:6 and 2:9 - database ID 2, which is tempdb - so it definitely had to do with tempdb.

This server is pretty beefy, a 4-socket quad-core with Xeon 7350s, but tempdb only had 8 files. I know that the rule of thumb of 1 file per core is no longer quite so hard and fast, but I figured it probably wouldn't hurt here. Created an extra 8 files, equalized the file sizes on the current ones, and had the user kick off the process again.

No waits! Or at least, no PAGELATCH_UP waits. Some SOS_SCHEDULER_YIELDs and CXPACKETs, but I took that to mean that we had successfully shifted the bottleneck off of allocations and onto CPU, where it should be.

It's rare that 5 minutes of configuration change can effect a significant gain in process speed, but it's pretty satisfying when it happens.

Friday, April 1, 2011

In SSIS, memory corruption error == bad parameter passing

Had a strange situation with an SSIS package that would occasionally produce the following error message from an Execute SQL task:

"Attempted to read or write protected memory. This is often an indication that other memory is corrupt."

After several rounds of machine reboots, file checking, and messing with the connection manager, I finally checked the actual query being executed (I know, way to work backwards). It had a parameter being passed in that must have been hand-entered, because it was lacking the namespace, i.e.

ParamName1

instead of

User::ParamName1.

Corrected that, and everything worked fine. So the error message is thoroughly misleading, but if you investigate the query execution, you may find a bad parameter name.


Wednesday, March 16, 2011

More Resolver One nice-to-haves

Working on Resolver One stuff again, and came up with a few more things I'd like to see.

I've been having problems moving some of my IronPython code out of my worksheet and into an IronPython module, but I assume that's due to my own lack of understanding of the clr - the external libraries seem to get hung up on the import statements and not get any further.

Which leads me to my first request: an easy way to reload a workbook. At the moment, any time I make a change in an external file, recalculating isn't enough to re-import the module - I have to close the workbook (which generally means the entire program) and re-open it. It would be nice to have a "Reload from disk" command or something similar that does a thorough reinitialization of the environment.

The second thing is something I just posted about on the Resolver forums, so I'm hopeful that I'll come upon a solution via that line of inquiry, but basically, I want SSL support in Resolver. I was trying to connect to a gmail server and use starttls() from smtplib, but couldn't do it because of the lack of SSL support in IronPython 2.6.0, which I believe is what's under the hood in Resolver One.
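For the record, this is roughly what I was trying; it works under CPython, but starttls() needs the ssl module, which is exactly what's missing (addresses and credentials are placeholders):

import smtplib

# Standard STARTTLS handshake against Gmail's SMTP endpoint;
# smtplib.SMTP.starttls() relies on the ssl module under the hood.
server = smtplib.SMTP("smtp.gmail.com", 587)
server.ehlo()
server.starttls()
server.ehlo()
server.login("someone@gmail.com", "password")
server.sendmail("someone@gmail.com", ["recipient@example.com"],
                "Subject: test from Resolver One\r\n\r\nHello there")
server.quit()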

I'm using Resolver for a couple projects now, so here's hoping that development continues and that I can see some of these things working in the future.

Friday, March 11, 2011

Context switching is more expensive when bandwidth is limited

This post is just made up of more idle musings and could really be shortened to a tweet, but while on the road this week with access to the office via questionable and/or slow internet connections, I've noticed that context switching, both literal and figurative, is much more expensive with low bandwidth.

This applies across all sorts of situations. I get a lot of IMs from co-workers, for example. Normally I don't notice their impact much, since I have two big monitors' worth of screen real estate and I can switch back and forth between IM windows and coding or testing quickly. But with less screen space (1280 x 800 instead of 3840 x 1080), the IM windows take up proportionally more room and I have to shuffle them around frequently, which eats up time.

Also, with slow internet connections, actually switching between windows is sometimes difficult, as redrawing a bigger chunk of the screen causes the whole connection to freeze for as long as several minutes at a time. And as upstream bandwidth is always more limited than downstream, sometimes my typing or clicks don't get transmitted for extended periods of time. It's very frustrating, to the point where I'm planning to get a mobile internet solution of some kind very soon. The travel has also cost me quite a few mobile phone minutes, as I can't use VOIP over the sketchy internet connections.

In other remote working news, I've noticed two startups in the space mentioned recently on TechCrunch: LiquidSpace and LooseCubes. I'm hoping that one or both will help provide the co-working experience I've been seeking and not finding in DC (and potentially elsewhere).

Thursday, January 27, 2011

The Perils of Laptopistan

I'm out in Laptopistan again today, this time in the principality of Starbucks, but not by choice. My power is out for the 3rd or 4th time since moving to the DC burbs, and in order to get any work done I had to get somewhere with Wifi. This Starbucks is packed, of course, with other people whose power is also out, making the internet access slow as molasses, and seating and power very difficult to come by. Fortunately, I brought a mini power strip, so I could add my laptop and Krystal's phone to an outlet without displacing anyone.

Anyway, my complaint is not so much that the situation is impossible as that this type of remote working carries small extra costs:

- 20 mins looking for street parking in the snow = lost work time
- slow internet = lower productivity
- large latte = 4x the unit cost of coffee at home (which is better anyway, thanks to our Baratza grinder and French press)
- 20 mins spent waiting for a table = lost work time
- no power outlets available = additional capital goods required (power strip) plus the need to remember to bring them
- limited-time parking meters requiring movement of car after 4 hours = lost work time

If anyone at my office is reading this, don't worry, I'll make up the lost time, but my complaint still stands, because that cuts into my free time.

Anyway, obviously some establishments are better-suited to remote working than others; Peregrine Espresso is paradise compared to Starbucks, but it would have taken 90+ mins to get there this morning through the snow and non-working traffic lights. All of these complaints could be remedied by having access to a proper co-working space, so I suppose the best approach is to quantify these costs, find the appropriate break-even point, and seek co-working space at or below that price point.

Wednesday, January 5, 2011

Daily WTF: Financial Data Provider edition

A data provider that shall remain nameless (at least while I'm still employed at my current office) sends us daily CSV files. I was looking into a failure of the load of one of these files and noticed this... (values modified to protect the sanctity of the data)

[header row]
[first data row]
...
DAM, 0.2309674, 0.2309676,
POC, 0.000301957, 0.000301963,
* The [big data provider] Closing Spot Rates provided by [big data provider a] plc in conjunction with [big data provider b].The [big data provider] plc shall not be liable for any errors in or delays in providing or making available the data contained within this service or for any actions taken in reliance on the same except to the extent that the same is directly caused by its or its employees' negligence.

Thanks, [big data provider]! But next time, maybe you could shove that disclaimer someplace other than a file produced and consumed exclusively by machines, which happen to appreciate a specific format for their files, especially when said files are called, explicitly, Comma-Separated Values?
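If I end up having to work around it on our side, it will probably be something like this - a minimal sketch that keeps only the lines with the expected number of fields (the file names and field count are made up):

import csv

EXPECTED_FIELDS = 4  # made-up; whatever the real layout calls for

def data_rows(path):
    # Yield only rows with the right number of fields, dropping the vendor's
    # legal disclaimer (and anything else that isn't data).
    with open(path, "rb") as f:
        for row in csv.reader(f):
            if len(row) == EXPECTED_FIELDS:
                yield row

with open("rates_clean.csv", "wb") as out:
    csv.writer(out).writerows(data_rows("rates_from_vendor.csv"))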