NYC DBA

Unintentional Hiatus

2014-12-12T02:43:00.002-08:00

I clearly haven't been posting on this blog, due to multiple factors.

I'm not a DBA any more, so the blog title doesn't make much sense. I've been doing exclusively Front Office-facing software engineering, mostly in Python, for the last 18 months.
I'm not in NYC. Since I started this blog, I've done stints in DC, Denver, and now London, where I'll be based for at least a couple years.
I've been writing internal technical blog posts for my firm.
I've been writing external expat blog posts on a Wordpress blog.
I'm now married, have a dog, and am training for a triathlon, so my supply of spare time has declined.

None of these situations are likely to change in the near future, so I wouldn't expect too much content here. I may spin up a new (more generally-named!) technical blog at some point in the future, and I will link this post to it. Til then, good luck to everyone and please enjoy the archives as long as Blogger keeps them active.

Dealing with runaway DBMail messages

2013-08-02T08:21:00.003-07:00

Say you have a particular type of alert configured on your SQL Server. Maybe it sends an email any time there's blocking that lasts longer than a particular threshold.

Then you have a long-running multi-threaded process that runs on a cluster and hammers your SQL Server with a couple hundred threads for hours, and some of them wind up blocking. You build up a lot of alerts, say 50,000. Your SMTP server can't handle all of them at once, so you (and your team) wind up getting a heavy stream of them for the next few hours or maybe even days.

Sound like fun? Yeah, not to me either.

There are multiple ways to handle this, but the cleanest is probably to kill the messages at their source. I'm assuming in this example that the messages have already been queued, so fixing your alert will have to come later. Here's the simplest way to deal with your message problem - stop your DBMail service broker objects, delete the messages, and start things back up:

use msdb

exec sysmail_stop_sp

exec sysmail_delete_mailitems_sp @sent_before = '[DATE_OF_SCREWUP]'

exec sysmail_start_sp

Your mail server admin and team members will thank you.

Testing something

2013-08-02T08:18:00.001-07:00

{
"Python":8,
"C++":4,
"SQL":9,
"NoSQL":5,
"C#":6
}

Commandments for scraping public data sources

2013-07-26T13:14:00.001-07:00

I've used a couple different flavors of publicly available data as data sources, and I can tell you that there's a good reason why firms pay for clean data. Not that even vendor-processed data is usually 100% clean, but that's a different story. Anyway, I've learned a few things the hard way about scraping public data:

1) Use a resilient HTML parsing engine. BeautifulSoup is great, and makes it very easy to explore HTML structures, but you'll almost certainly want the LXML backend to avoid blowing up on unclosed tags and nonstandard nesting.
2) Always call strip(). I've had a couple of import busts in which things weren't matching that really looked like they should. 90% of the time it was because there was random whitespace that had infiltrated the actual data. This also leads me to...
3) Anticipate spacing changes. ESPECIALLY with hand-entered data, even if there's a template, there can always be an extra line between the logo and the header, a blank line between data rows, or new spacing for the file date. Whenever possible, search for a reference value that points the way to the data, and throw away those newlines rather than expecting data by default.
4) Keep original copies. The first thing you should do with any parsed source is make a raw copy of it (if you can) so that you can refer back to it when #5 happens.
5) Expect change! No warranty for most public data sources is granted, implied, or hinted at, and sometimes the exact opposite is the case: public data "providers" happily change formats, addresses, and datasets available to stymie anyone making systematic use of such. Make sure you've got good logging and debugging set up so you can figure out quickly where and how something changed.
6) Make things modular. Building on #5 above, if you've got your parser set up to quickly swap out the downloader, parser, normalizer, or persister (for example) for any given source and toggle those sources on and off easily, it'll make things much easier when you need to quickly hack up a new one because somebody changed something somewhere.
7) Handle multiple formats. I've parsed many Excel spreadsheets that usually have an XLDate in a column but, every so often, spit out a text string instead because of the way someone typed it or copied it in. If your date parser function knows about this, you don't have to think about it.
8) Don't hammer. When you're using someone else's data that they're not explicitly making available in a structured format, be nice, download one copy and build your import off of that. Otherwise you're putting undue load on their servers and exposing your IP to possible banning if they decide you're violating their ToS.
9) Inheritance is your friend. Most of the parsers I've written have a lot of common structure that needs a little tweaking for certain sources. If you've built a solid class hierarchy, you can easily override that save() method in the 2 subclasses that need it while only needing to write a basic one for the other dozen. Any time I find myself copying code, I generally try to move it up to the superclass.
10) Pad your dev time estimate! This is a general problem I have, but I always look at a source, pretty quickly shred the data out with bs4 or something similar, and go "yeah, this'll take 2 days." Sure, to code-complete. Then it'll take a week to figure out every stupid corner case, extra whitespace location, and placeholder for None. Trust me (and this goes double for you, future me, when you're re-reading this).
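
As a rough illustration of the "always call strip()" and "handle multiple formats" points, here is a minimal sketch of a date parser that accepts either an Excel serial number or a text cell. The function name and the list of text formats are made up for the example; a real feed will need its own list. It also assumes the spreadsheet uses Excel's 1900 date system.

```python
from datetime import date, datetime, timedelta

# For dates from March 1900 onward, Excel's 1900 date system lines up
# with counting days from 1899-12-30.
EXCEL_EPOCH = date(1899, 12, 30)

# Hypothetical list of text formats seen in a source; adjust per feed.
TEXT_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_cell_date(value):
    """Return a date from an Excel serial number or a messy text cell."""
    if isinstance(value, (int, float)):
        # Numeric cell: treat as an Excel serial date, dropping any time part.
        return EXCEL_EPOCH + timedelta(days=int(value))
    text = str(value).strip()  # whitespace sneaks into hand-entered data
    for fmt in TEXT_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError("unrecognized date cell: %r" % (value,))
```

Raising on an unrecognized cell, rather than returning None, makes a format change in the source show up in the logs right away.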

GetBytes error when using a DATE column

2013-06-26T08:36:00.000-07:00

Quick troubleshooting post. Given this error from a colleague upon execution of a pretty simple function that used a DATE column:

An error occurred while executing batch. Error message is: Invalid attempt to GetBytes on column 'RelDate'. The GetBytes function can only be used on columns of type Text, NText, or Image.

I couldn't reproduce the problem on my machine - the same function worked fine for me. I checked all his query execution settings and didn't see anything, and then he realized he was running SQL Server Management Studio 2005 connected to a 2008 R2 instance. As soon as he fired up SSMS 2008 R2, the function worked fine.

Why SQL Server Replication depends on Database Mirroring

2013-05-31T09:06:00.000-07:00

Had a convoluted replication problem this morning that turned out to be caused by mirroring. Users noticed a lack of data on a replicated server that was quickly traced back to replication failing to deliver commands. The Log Reader Agent noted the following:

Replicated transactions are waiting for next Log backup or for mirroring partner to catch up

There was nothing wrong with log backups, so mirroring was the next thing to be checked. It was suspended, and attempting to resume it failed. Checking the Mirroring Monitor yielded nothing helpful, but the SQL Server error log on the host server showed the following:

Date x/xx/xxxx x:xx:xx AM
Log SQL Server (Current - x/xx/xxxx x:xx:00 AM)

Source spid34s

Message
'TCP://foo.bar.com:5022', the remote mirroring partner for database 'SomeDB', encountered error 3624, status 1, severity 20. Database mirroring has been suspended. Resolve the error on the remote server and resume mirroring, or remove mirroring and re-establish the mirror server instance.

On the mirrored server, there were additional errors, including a stack dump, but this was the most relevant:

Date x/xx/xxxx x:xx:xx AM
Log SQL Server (Current - x/xx/xxxx x:xx:00 AM)

Source spid39s

Message
Error: 3624, Severity: 20, State: 1.

---

Date x/xx/xxxx x:xx:xx AM
Log SQL Server (Current - x/xx/xxxx x:xx:00 AM)

Source spid39s

Message
SQL Server Assertion: File: , line=807 Failed Assertion = 'result == LCK_OK'. This error may be timing-related. If the error persists after rerunning the statement, use DBCC CHECKDB to check the database for structural integrity, or restart the server to ensure in-memory data structures are not corrupted.

Severity 20 == bad. This basically comes down to corruption on the mirror target, which irreparably broke mirroring. This, in turn, broke replication.

So why, the technical user asked, does replication between business-critical Production servers depend on mirroring to a disaster recovery environment? I had to think about that one, but there's a good reason.

Assume replication does not depend on mirroring. Server A is both the replication publisher and mirror source. It replicates rows to Server B and mirrors to Server C. Mirroring breaks at time t0. Rows continue to replicate, and are marked as replicated on Server A.

Server A fails over to Server C at time t1. Replication picks up from there with Server C as publisher. Server C republishes the rows between t0 and t1 to Server B because it does not have a record of those rows having been replicated, so Server B receives them a second time.

Makes sense now, right?

sqlcmd swallows errors by default

2012-09-18T07:02:00.002-07:00

I'm sure this is well-known out in SQL land, but I keep forgetting about it, so I'm writing this short post as a way to cement the knowledge in my brain: sqlcmd, the current command line utility for running T-SQL and replacement for OSQL, does not return a DOS ERRORLEVEL > 0 by default, even when the command being executed fails.

You need to pass the -b flag to raise an ERRORLEVEL 1 when a real error (SQL error severity > 10) occurs, which is crucial when running batch jobs in job systems that report status based on DOS ERRORLEVELs.
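
If the job system happens to be a Python wrapper rather than a batch file, the same point applies to the process exit code. Here's a minimal sketch; the function names are hypothetical, and -S, -E, -i, and -b are standard sqlcmd switches (server, trusted connection, input file, and exit on error).

```python
import subprocess


def build_sqlcmd_args(server, script_path):
    """Build a sqlcmd command line that fails loudly on SQL errors."""
    return [
        "sqlcmd",
        "-S", server,       # target server
        "-E",               # trusted (Windows) authentication
        "-i", script_path,  # T-SQL script to run
        "-b",               # exit with a nonzero ERRORLEVEL on error
    ]


def run_sql_script(server, script_path):
    """Run the script and raise if sqlcmd reports a failure."""
    result = subprocess.run(build_sqlcmd_args(server, script_path))
    if result.returncode != 0:
        raise RuntimeError("sqlcmd failed with exit code %d" % result.returncode)
    return result.returncode
```

Without -b in the argument list, the returncode check would pass even when the script itself failed.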

pyodbc autocommits when using "with"

2012-07-12T11:17:00.000-07:00

Discovered an interesting variation in pyodbc behavior today. A coworker asked me if he needed to commit explicitly when using pyodbc to execute a stored procedure. I told him he did, but he protested that his code worked fine without it. Here's a paraphrased version:

import pyodbc

def db_exec(proc):
    with pyodbc.connect(cnxn_str) as db:
        db.execute(proc)

I tested it, and what do you know? It works. But the following does not:

db = pyodbc.connect(cnxn_str)
db.execute(proc)
db.close()

In this case, the results of proc will be rolled back automatically when closing the connection. So evidently the connection's with handler commits the transaction when the block exits without an exception.
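
pyodbc needs a driver and a server to demonstrate this, so as a stand-in, here's a sketch using the standard library's sqlite3 module, whose connection object behaves similarly as a context manager. It illustrates the pattern and proves nothing about pyodbc itself; the table and function names are made up.

```python
import os
import sqlite3
import tempfile


def count_rows(path):
    """Open a fresh connection and count what was actually persisted."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    finally:
        conn.close()


def demo():
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    setup = sqlite3.connect(path)
    setup.execute("CREATE TABLE t (x INTEGER)")
    setup.commit()
    setup.close()

    # Plain connection: insert, then close without committing.
    conn = sqlite3.connect(path)
    conn.execute("INSERT INTO t VALUES (1)")
    conn.close()
    after_plain = count_rows(path)

    # Connection as a context manager: the block commits on a clean exit.
    conn = sqlite3.connect(path)
    with conn:
        conn.execute("INSERT INTO t VALUES (2)")
    conn.close()
    after_with = count_rows(path)

    return after_plain, after_with
```

In both libraries the with block handles the transaction and leaves the connection open, so it still needs an explicit close().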

Stripping time from GETDATE(), 2008 edition

2012-06-20T11:34:00.002-07:00

Stripping the time from GETDATE() to get the date at 00:00:00 is a common practice, and has been accomplished by various methods in the past, including CASTing to VARCHAR and back, using DATEADD and DATEDIFF, and using FLOOR after casting to FLOAT. It should be less necessary these days since the introduction of the DATE data type, but sometimes it's still a useful comparison trick.

And now because of DATE there's a simpler way to do the conversion: CAST(@dt as DATE). But is this any faster?

The short answer is no, but it doesn't seem to be any slower either. I ran a quick test:

declare @dtest datetime
declare @i int = 0, @j int = 0
declare @start datetime

set @start = GETDATE()

while @i < 1000000
begin
    set @dtest = DATEADD(dd, 0, DATEDIFF(dd, 0, GETDATE()))
    set @i = @i + 1
end
    
print datediff(ms, @start, getdate())
set @start = GETDATE()

while @j < 1000000
begin
    set @dtest = CAST(GETDATE() as DATE)
    set @j = @j + 1
end

print datediff(ms, @start, getdate())

In half a dozen runs, the two versions never varied by more than a handful of ms. Feel free to try on your own and let me know if you see anything different.

Another hazard of Laptopistan - slow WiFi

2012-03-30T07:34:00.003-07:00

This is hardly unexpected, but one of the side effects of everyone using Starbucks as a de facto workplace and internet cafe is that their WiFi is heavily stressed. People watching online video, using Remote Desktop (like me), probably even BitTorrent. I don't know what sort of routers they put in these coffee shops, but they often seem to be unequal to the load they're handling.

This morning I'm in a Starbucks, and my connection went from perfectly fine, at 8:30 AM before anyone was here, to barely usable now that I'm surrounded by laptops. One more reason for actual workspace.

Sliding Window partitioning

2012-03-16T14:11:00.002-07:00

Implemented my first "sliding window" partitioned table. Wound up being easier than I expected, especially since the table will only have 2 partitions for now - Current and Archive.

Most sliding window schemes I've seen out there on the intertubes make use of a date as the partition key and update the sliding window based on the current date. In this case, the data comes in somewhat sporadically, so although it's date-based (a month key), I made the partitioning data-driven instead of date-driven. So for starters we have a RANGE LEFT scheme like this:

Partition   Month Key
Current     -
Archive     88

Say we're currently on month 90. So months 89 and 90 are in Current, and 1-88 are in Archive. When the sliding window proc runs, it does essentially this:

SELECT @NewBoundary = MAX(monthKey)-2 FROM tbPartitionedFoo

SELECT @CurBoundary = CAST(prv.value as smallint)
FROM sys.partition_functions AS pf
JOIN sys.partition_range_values as prv
    ON prv.function_id = pf.function_id
WHERE pf.name = 'partfn_foo'

(If there was more than one partition range value, a MIN() or something would be needed to grab the appropriate value.)

IF @NewBoundary > @CurBoundary
BEGIN
    ALTER PARTITION FUNCTION partfn_foo() SPLIT RANGE (@NewBoundary)
    ALTER PARTITION FUNCTION partfn_foo() MERGE RANGE (@CurBoundary)
END

So when we start getting monthKey 91 data and the maintenance proc runs, it will do this:

IF 89 > 88
BEGIN
    ALTER PARTITION FUNCTION partfn_foo() SPLIT RANGE (89)

Now we have 3 partitions:

Partition   Month Key
Current     -
temp        89
Archive     88

    ALTER PARTITION FUNCTION partfn_foo() MERGE RANGE (88)
END

And now we're back to 2:

Partition   Month Key
Current     -
Archive     89
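
To sanity-check the boundary arithmetic outside the database, the walkthrough above can be modeled in a few lines of Python. This is only a sketch of the logic, not of what SQL Server does internally; the function names are made up, and it assumes a single boundary value, as in this table.

```python
from bisect import bisect_left


def partition_number(boundaries, month_key):
    """1-based partition number for a key under RANGE LEFT boundaries."""
    # RANGE LEFT: a boundary value belongs to the partition on its left.
    return bisect_left(boundaries, month_key) + 1


def slide_window(boundaries, max_month_key, keep=2):
    """Mimic the proc: split at the new boundary, then merge the old one."""
    new_boundary = max_month_key - keep
    cur_boundary = boundaries[0]  # single boundary assumed, as in the post
    if new_boundary > cur_boundary:
        boundaries = sorted(boundaries + [new_boundary])           # SPLIT RANGE
        boundaries = [b for b in boundaries if b != cur_boundary]  # MERGE RANGE
    return boundaries
```

With a boundary list of [88], slide_window leaves things alone while the max month key is 90, and moves the boundary to 89 once month 91 shows up, which puts month 89 into the first (Archive) partition.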

Python Gets Things Done

2012-03-05T19:46:00.003-08:00

A quick example of why I've grown to love Python.

My girlfriend had a laboriously scanned document in two pieces that needed to be combined. I had a .NET command line utility that I wrote a couple years ago that does just that, but because one of the scanned pages was a different size, it got cut off in the resulting output file. So I could either try to dig up some docs on the library I used, play around with Intellisense in Visual Studio, or see if there was a Python library that might work.

2 minutes of Googling later I had pyPDF, and maybe 10 minutes after that, I had reset the mediaBox on the outsized page, written it to a new pdf file, and emailed it off. Q.E.D.

DATEs aren't INTs, at least not any more

2012-03-05T13:32:00.003-08:00

I updated a database column today to utilize the SQL 2008 DATE type, upgrading from a SMALLDATETIME which always had a 00:00:00 hh:mm:ss component. I was fairly certain that this would not break any existing code, so I did not perform an exhaustive code search. My mistake.

A function broke with the following message:

Msg 206, Level 16, State 2, Line 1

Operand type clash: date is incompatible with int

"Strange," I thought. "How would an int be getting used as a date?"

Of course there's the basic method of datetime construction from a string: CAST('2012-01-01' as datetime), but that doesn't involve any INTs. Did someone try to use a similar INT (20120101) in place of the string? No, because that would cause an arithmetic overflow...

Well, it turns out that there's another way of using INTs as DATETIMEs:

select cast(CAST('2012-01-10' as datetime) as int)

Apparently this doesn't work any more:

select cast(CAST('2012-01-10' as date) as int)

And this is what was being performed in the function in question:

...

CASE WHEN io_end > 40947 THEN 1 ELSE 0 END

...

Which promptly broke when I changed the column to a DATE. So let this be a warning: even something as innocuous as a DATETIME to DATE change needs QA and code search.
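
When a code search turns up a magic number like 40947, it helps to know which date it stands for. A minimal sketch, assuming the integer is a count of days from SQL Server's DATETIME day zero of 1900-01-01 and the values carry no time component (the helper names are made up):

```python
from datetime import date, timedelta

# SQL Server DATETIME day zero (CAST(0 AS DATETIME)) is 1900-01-01.
SQL_DAY_ZERO = date(1900, 1, 1)


def day_number_to_date(n):
    """Translate an integer day count into the calendar date it represents."""
    return SQL_DAY_ZERO + timedelta(days=n)


def date_to_day_number(d):
    """Inverse: the integer a midnight DATETIME for this date would cast to."""
    return (d - SQL_DAY_ZERO).days
```

By this arithmetic, 40947 is 2012-02-10, so the comparison in the function could be rewritten against a date literal instead.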

Laptopistan is getting crowded

2012-02-23T12:59:00.003-08:00

This is purely anecdotal, of course, but on the occasions when I've ventured out from my home office to get some fresh air or avoid being pestered by my dog, I've seen more and more people working remotely. Your average Starbucks is completely overrun with laptop-facing contractors, remote workers, and students these days, as are most other public areas with seating and free WiFi. Getting a table all to yourself is becoming rare, and forget about finding an open power outlet!

This says to me that there is a serious hole in the co-working space market. Assuming these Starbucks campers order an average of a $4 latte per day (some get black coffee, which is cheaper, but some add a muffin), and the cost of a dedicated desk at Affinity Lab is on the order of $900/mo, that works out to a price differential of $800/mo. Surely there must be a point somewhere in the middle for which one could add certain beneficial services while keeping the cost low enough to attract customers.

Starbucks, after all, does not exactly inspire loyalty - the seating is functional, but rarely comfortable; there are never enough outlets; the coffee is okay but somewhat expensive; the WiFi is often slow; the environment is noisier than a worker would prefer; and the venues themselves hardly encourage hard work. If I could pay $200/mo for a similarly casual but better-outfitted space in which I could work, chat, and drink coffee, I would jump at the chance. I've been trying to check out DCIOLab, but they haven't emailed me back.

Makes me think there's a definite business opportunity here, but certainly a risky one. Timing is important - is NOW the inflection point in density of remote workers, or a year from now? You can quickly go broke renting a commercial space for a year with not enough customers. Is there enough consistency in the office support requirements of the proposed customers? A good coffee pot or two (or a Chemex or Aeropress, maybe) is definitely necessary, but a copier? Fax machine? Phones? A conference room?

These are ways to differentiate a workspace from a Starbucks, but I think they may raise the cost too much and get too little use. I don't need any of those things besides the coffee-making apparatus. In fact, all I want is coffee, power, internet, good light, and preferably high ceilings. Could one provide those things, pack people in as tightly as they do at Starbucks, and make money charging $200/mo?

Query plan troubleshooting

2012-02-16T07:31:00.001-08:00

I was troubleshooting a medium-complexity query problem this morning in which the query execution within an app was orders of magnitude slower than the execution with the same parameters within Management Studio. I knew this could be due to different cached plans, but couldn't remember why.

Fortunately, Erland Sommarskog had written a detailed and helpful article explaining why this happens: http://www.sommarskog.se/query-plan-mysteries.html

And indeed, the app was using ODBC and therefore the results of sys.dm_exec_plan_attributes showed ARITHABORT OFF, while my instance of Management Studio had it set to ON, therefore generating a separate cached plan. I'm going to update my settings in SSMS to set this off by default, especially since it appears that it has no effect with ANSI_WARNINGS set to on, other than to foil attempts to troubleshoot slow procs.

Refreshing Intellisense in SQL Management Studio 2008

2012-01-19T07:31:00.000-08:00

If you're like me, you create and drop a lot of tables while developing, and you may be getting annoyed with SQL Server Management Studio because it fills in the names of old tables while you're typing unless you hit Escape. Turns out that refreshing the Intellisense cache is as easy as

Ctrl-Shift-R

If your DATE comes through pyodbc as a unicode, it's probably the driver

2012-01-18T10:12:00.001-08:00

File this under "should have been obvious." I had a test script break today because I was trying to do date math on an object that wound up being a unicode instead of a date. I had gotten the variable value from a pyodbc query against a SQL 2008 database table with a DATE column. I knew that DATETIME columns came in as datetime.datetime objects, but the DATE column came in as a unicode, which seemed strange.

Turns out it's the ODBC driver. Demo:

In [78]: cnxn = pyodbc.connect("DSN=local;")

In [79]: crsr = cnxn.execute("select cast('2012-01-01' as date) as bar")

In [80]: r = crsr.fetchone()

In [81]: type(r.bar)
Out[81]: unicode

Then, when switching to SQLNCLI10:

In [96]: cnxn = pyodbc.connect("Driver={SQL Server Native Client 10.0};Server=localhost;Trusted_Connection=Yes;")

In [97]: crsr = cnxn.execute("select cast('2012-01-01' as date) as bar")

In [98]: r = crsr.fetchone()

In [99]: type(r.bar)
Out[99]: datetime.date
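
If switching drivers isn't an option, a defensive wrapper is one way to keep date math from breaking. This is a minimal sketch: as_date is a made-up helper, and it assumes the older driver hands DATE values back as 'YYYY-MM-DD' text (a unicode in Python 2, a str in Python 3).

```python
from datetime import date, datetime


def as_date(value):
    """Coerce a DATE column value to datetime.date regardless of driver."""
    if isinstance(value, datetime):
        return value.date()
    if isinstance(value, date):
        return value
    # Assumed: older drivers hand back DATE as 'YYYY-MM-DD' text.
    return datetime.strptime(value.strip(), "%Y-%m-%d").date()
```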

Faster, thinner, lighter

2012-01-17T13:04:00.001-08:00

It's almost time to retire my trusty old Dell E4300. The warranty just expired, which means that if Dell has optimized their average component MTBF vs warranty span correctly, the laptop will implode shortly. I replaced the original hard drive with a 2nd-gen Sandforce SSD a year ago, and it's made a big difference, but I'm ready for a little more speed. And, with the advent of the "Ultrabook" form factor, for even less weight and size. I've decided I want one.

At CES this year, there were plenty of ultrabooks on display (Engadget's roundup has a few, with more on Anandtech). And most of them seemed to get at least some part of the formula right, but I don't think any of them have quite grabbed me yet. Maybe I'm being picky, but I've decided that my top priorities are:

- 1600 x 900 resolution

- Thunderbolt port (for docking)

- decent battery life (5 hrs+)

- light weight

I'm sick of my current 1280 x 800 screen - the drop in my productivity from 2 big 1920 x 1080 screens down to the tiny laptop screen is painfully apparent. 1366 x 768 is arguably worse, since I really need those vertical lines of resolution. And although I move around a lot with a laptop, which is why I want a good battery and light weight, I also use it as my primary workstation, so I need to be able to dock it and use my big monitors. In fact, a laptop that would support 3 screens instead of 2 like my current Dell would be preferable. Few of the ultrabooks seem to have this capability, but at least a Thunderbolt port would make it theoretically possible.

Other requirements are pretty normal. The number of USB ports matters, but not hugely, since I use most peripherals at home while connected to a dock. I do use my SD card reader pretty often, so it would be nice to keep that. A decent keyboard, preferably backlit, would be useful.

Requirements in mind, I took a look at the current crop of ultrabooks just demoed at CES, and... was totally disappointed. Nothing had all the features I wanted. The closest ones were:

Samsung Series 9

- no Thunderbolt but both HDMI and DisplayPort connections

- 1600 x 900 resolution (on what I hear is a great screen)

- light weight (2.5 lbs)

Sony Vaio Z

- Thunderbolt, sorta, via a proprietary connector as usual (thanks a lot, Sony)

- 1600 x 900

- light weight

- ungodly price ($1900+)

Here's my spreadsheet comparing the current crop of ultrabooks and their cousins using my completely biased and proprietary scoring system. It's a definite work in progress and the scoring may change without warning. If something comes along that scores above a 3, I'll probably buy it, but right now, none of these are worth the money.

More fun with SSIS type conversions

2012-01-04T23:00:00.001-08:00

Just finished painfully debugging a strange SSIS issue about which I could find no documentation anywhere on the web, so I'm noting it here.

I had changed a type within a large unwieldy data flow from DT_R4 (float) to DT_R8 (double), since the smaller data type was causing some weirdness around the least significant figures on some option prices. This seemed to work in testing, but I got a cryptic error about a conditional operation failing during a derived column transformation when I deployed it to Production. I opened the package, added an error redirection for the component and a data viewer, and sure enough, it was failing on a transformation that utilized some of the columns I had modified.

I couldn't find any reason why the conditional would fail, since it was faithfully spitting out BOOLs, but the resulting transformation split the float into Integer and Decimal by casting the value to (DT_WSTR, 20) and then finding the decimal point. After trying a dozen other things, I tried widening the cast to (DT_WSTR, 40). That worked.

Note that the values that were failing would not even come close to 40 characters (ex: 110.68), but apparently the possibility of running into an overflow breaks something in the SSIS runtime.

So, a word to the wise: if casting from float (DT_R4 or DT_R8) to Unicode string (DT_WSTR), make sure you leave enough room to cast any possible value.

pyodbc isn't playing well with multiprocessing

2011-12-22T07:28:00.000-08:00

Wrote my first Python script yesterday that incorporates the multiprocessing module in order to parallelize some CPU-intensive calculations that were going extremely slowly when run in the database. Mission accomplished there, and it was remarkably easy in Python, but then I hit a brick wall when trying to load everything back to the database.

As it turns out, the calculation time is dominated by the database IO (~25 mil rows in and out), so I went about trying to optimize that as best I could. The reads are coming off an SSD, so they're fine, but the writes are UPDATEs, and I didn't want to issue them one at a time and incur all the connection and transaction overhead for each one. Instead, I created a staging table as a heap and batched up INSERTs of my data in a specified batch size (1000 to start) using SQL 2008 row constructors. Then the main table gets updated from the staging table right at the end, so only one giant transaction and UPDATE statement is necessary.

This strategy appeared to work fine when single-threading, but then I figured I'd throw in multiprocessing there, too, since I already had the module imported and had crafted the functions around it. Didn't work.

The error message was 42S02, aka Invalid object name - in other words, the INSERT can't find the staging table, even though I'm creating it within the Python script, during a single-threaded section before I Pool.map() out different calculated shards to be inserted. SQL Server can obviously accept just about as many threads as you want to throw at it, so that's not the problem. The table exists, which I confirmed by adding a breakpoint (well, a set_trace()) and looking at it through Management Studio. It's all using integrated authentication, and I'm a sysadmin on the server, so there can't be any problems with schema, although I threw in some "dbo." references just to make sure. I even tried to trace the SQLNCLI driver operations, but haven't managed to get usable output yet.

So the only thing I can come up with so far is an incompatibility between pyodbc and multiprocessing. If I get more time to investigate, I'll update the post, but for now I'm just going to have to switch the DB section back to single-threaded. Boo.
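
For reference, the batching approach described above (multi-row INSERTs into a staging table using row constructors) can be sketched roughly like this. The helper names, table, and columns are hypothetical, and using ? parameter markers instead of literal values is a choice made for this example, not necessarily what the original script did.

```python
def chunked(rows, batch_size):
    """Yield successive slices of rows, each at most batch_size long."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def build_insert(table, columns, batch):
    """Return (sql, params) for one multi-row INSERT using row constructors."""
    one_row = "(" + ", ".join("?" for _ in columns) + ")"
    sql = "INSERT INTO %s (%s) VALUES %s" % (
        table,
        ", ".join(columns),
        ", ".join(one_row for _ in batch),
    )
    # Flatten the rows so the values line up with the ? markers in order.
    params = [value for row in batch for value in row]
    return sql, params
```

Each (sql, params) pair would then go to something like cursor.execute(sql, params). Keep in mind that SQL Server caps a VALUES list at 1000 rows and, as far as I know, a single statement at 2100 parameters, so wide tables need a smaller batch size.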

SQL could use some real string manipulation

2011-11-30T07:00:00.000-08:00

In order to remove the first "-" from a string, this is the query I just had to run:

update tbFoo set BBG_ticker = substring(BBG_ticker, 1, charindex('-', BBG_ticker) - 1) + ' ' + substring(BBG_ticker, charindex('-', BBG_ticker)+1, LEN(BBG_ticker) - charindex('-', BBG_ticker))

Does that seem like overkill to anyone else? Let's compare it to Python...

BBG_ticker.replace("-"," ",1)
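
To make the comparison concrete, here is a small sketch that spells out the SUBSTRING/CHARINDEX logic in Python next to the one-liner. The function names and the sample ticker in the note below are made up for illustration.

```python
def replace_first_dash_manual(ticker):
    """Mirror the SUBSTRING/CHARINDEX logic from the T-SQL above."""
    pos = ticker.find("-")  # 0-based here; CHARINDEX is 1-based
    if pos == -1:
        return ticker       # nothing to replace
    return ticker[:pos] + " " + ticker[pos + 1:]


def replace_first_dash(ticker):
    """The one-liner: the third argument caps the number of replacements."""
    return ticker.replace("-", " ", 1)
```

Both turn a value like "ABC-DE-F" into "ABC DE-F", leaving the second dash alone.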

Meeting new people in Laptopistan

2011-10-22T09:20:00.000-07:00

This morning I'm back at Starbucks, as we haven't yet moved closer in to DC where there are alternatives to corporate coffee world, and I made the acquaintance of a woman who

1) made me explain the meaning of my t-shirt

2) asked to use my laptop (instead of hers, for some reason?) to check her email

3) inquired about what I do for a living

I'm pretty sure the next step will be to introduce me to her daughter/niece/granddaughter who's been looking for a nice Jewish boy (which I'm not, but I'm sure it wouldn't matter).

Well, I suppose it's more interesting than doing R regressions while sitting at home.

Life as a contractor

2011-10-14T06:43:00.001-07:00

It's my own fault that I'm a contractor and not a full employee - a year ago, my girlfriend decided to go back to school in another state, and I followed her there. My office was gracious (desperate?) enough to keep me on as a remote employee, but I was converted to contractor status, for both employment flexibility (i.e. they can let me go easily if it doesn't work out) and tax liability purposes.

For the most part, this has worked out fine. Working from home can be a double-edged sword, but the non-existent commute is great and the ability to work from alternate locations seamlessly has been pretty nice - this summer we picked up and moved to Denver for 10 weeks with no interruption of my work, only a slight shift in working hours due to the time zone change.

But there are some definite drawbacks. My taxes are far more complicated now, I don't get employer-sponsored health insurance, and there's a bit of a disconnect from the rest of my team and the firm at large. This is both practical - I have difficulty hearing what's going on in many of my meetings, since I'm the only one on the phone instead of there in person - and psychological.

The latest case in point, and the catalyst for this post, occurred this morning. The chairman of my firm sent out an email announcing a minor contest sort of thing in which employees can win an electronic gizmo that shall remain nameless in exchange for updating their employee profile. I dutifully went to update mine, which I hadn't looked at since I was a Full-Time Employee... and found that I no longer have one. This is far from a big deal, but it's a reminder of the fact that although I work like an employee (I'm on a fixed rate, not an hourly one), consider myself part of the firm, and try to act in its best interest, there's a psychic distance between me and the Real Employees.

There are more examples. The firm celebrated its 20th birthday a short time ago, and Employees all received small gift bags. It should probably go without saying that I did not. On one's 5th anniversary as an Employee, one receives a very small token of acknowledgement of service to the firm. My 5th year is coming up, but I presume it will be sans token.

I assume this would be different at firms in which contracting is more widespread - contractors would be held either closer or farther away - but my situation is unusual at my office, so it's not worth HR's time to hold my hand through any weird episodes. Also, these are minor enough issues that I would feel ridiculous complaining about them in person, so I'm using this medium to work through them a little.

Anyway, I know that my blog is occasionally perused by people from my office, so let me reiterate that this is not a big deal and not intended to be a passive-aggressive complaint, it's just some musings on this strange state of employment that I, and a growing number of others, find myself in.

Loading Fixed Width files with BCP / Bulk Insert

2011-10-05T06:16:00.000-07:00

I had to craft some format files in order to load a couple of fixed width files using the SQL Server BULK INSERT / bcp tools, and had some issues with the documentation when trying to get them to work.

So let me state this explicitly: when attempting to load a file with no line breaks, you do not need a terminator for any field in your format file. Just specify all the field lengths, set the xsi:type to CharFixed, and the whole thing should stream in. Ex:

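<?xml version="1.0"?>
<BCPFORMAT xmlns="http://schemas.microsoft.com/sqlserver/2004/bulkload/format" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RECORD>
    <FIELD ID="1" xsi:type="CharFixed" LENGTH="12" />
    <FIELD ID="2" xsi:type="CharFixed" LENGTH="3" />
    <FIELD ID="3" xsi:type="CharFixed" LENGTH="5" />
  </RECORD>
  <ROW>
    <COLUMN SOURCE="1" NAME="series" xsi:type="SQLCHAR" />
    <COLUMN SOURCE="2" NAME="pool" xsi:type="SQLCHAR" />
    <COLUMN SOURCE="3" NAME="deal" xsi:type="SQLCHAR" />
  </ROW>
</BCPFORMAT>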

You can still mix and match - have a terminator for the last field in each line, for example - but if your file has 0 line breaks, you don't need it.
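
For checking a file like this before handing it to BULK INSERT, the same layout is easy to mimic in Python. A minimal sketch, using the field names and lengths from the sample format file above; the function name is made up and the input is assumed to be one unbroken string with no line breaks.

```python
# (name, length) pairs taken from the sample format file above.
FIELDS = [("series", 12), ("pool", 3), ("deal", 5)]
RECORD_WIDTH = sum(length for _, length in FIELDS)


def parse_fixed_width(data):
    """Split an unbroken string into dicts, one per fixed-width record."""
    records = []
    # Step through the data one full record at a time; a trailing
    # partial record is ignored.
    for start in range(0, len(data) - RECORD_WIDTH + 1, RECORD_WIDTH):
        record, offset = {}, start
        for name, length in FIELDS:
            record[name] = data[offset:offset + length]
            offset += length
        records.append(record)
    return records
```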

Also, the easiest way I found to handle skipping columns was to use OPENROWSET(BULK). Just make sure you're selecting the names of the columns from the ROW section of the format file, not the RECORD section.

i.e.

SELECT series, deal FROM
OPENROWSET(BULK 'source.txt',
FORMATFILE='sample.xml'
) as x;

The documentation covers that part a bit better, but just wanted to reproduce it for my own sake, since I'm sure I'll forget how I did it between now and the next time I use BULK INSERT a few years from now.

P.S. Gotta throw in a plug for tohtml.com here for making my code actually paste into Blogger and look decent to boot.

Laptopistan upgrade

2011-10-05T06:08:00.000-07:00

3G vs 4G tethering: night and day. I can now click on things through my remote desktop session and have them actually respond, instead of doing the mental 2-count (or switching to Google Reader while waiting, which is always a productivity killer).

I have some concerns about the battery life of my new HTC Thunderbolt, and I'm not blown away by the form factor, but so far I definitely prefer it to my iPhone 3GS. And the 4G factor alone makes it totally worthwhile.