Thursday, December 22, 2011

pyodbc isn't playing well with multiprocessing

Wrote my first Python script yesterday that incorporates the multiprocessing module in order to parallelize some CPU-intensive calculations that were going extremely slowly when run in the database. Mission accomplished there, and it was remarkably easy in Python, but then I hit a brick wall when trying to load everything back to the database.

As it turns out, the calculation time is dominated by the database IO (~25 mil rows in and out), so I went about trying to optimize that as best I could. The reads are coming off an SSD, so they're fine, but the writes are UPDATEs, and I didn't want to issue them one at a time and incur all the connection and transaction overhead for each one. Instead, I created a staging table as a heap and batched up INSERTs of my data in a specified batch size (1000 to start) using SQL 2008 row constructors. Then the main table gets updated from the staging table right at the end, so only one giant transaction and UPDATE statement is necessary.

This strategy appeared to work fine when single-threading, but then I figured I'd throw in multiprocessing there, too, since I already had the module imported and had crafted the functions around it. Didn't work.

The error message was 42S02, aka Invalid object specified - in other words, the INSERT can't find the staging table, even though I'm creating it within the Python script, during a single-threaded section before I Pool.map() out different calculated shards to be inserted. SQL Server can obviously accept just about as many threads as you want to throw at it, so that's not the problem. The table exists, which I confirmed by adding a breakpoint (well, a set_trace()) and looking at it through Management Studio. It's all using integrated authentication, and I'm a sysadmin on the server, so there can't be any problems with schema, although I threw in some "dbo." references just to make sure. I even tried to trace the SQLNCLI driver operations, but haven't managed to get usable output yet.

So the only thing I can come up with so far is an incompatibility between pyodbc and multiprocessing. If I get more time to investigate, I'll update the post, but for now I'm just going to have to switch the DB section back to single-threaded. Boo.

No comments: