Friday, April 27, 2007

Mysterious HTTP 400 errors caused by large Kerberos tickets

First off, I know there's been a lack of DBA material lately. This is because I'm not doing anything interesting, database-wise. I can tell you all about the woes of interfacing SQL Server with an Oracle system to which I have minimal access and no table schema, but that's pretty boring.

The most interesting problem I've fixed lately was the mysterious HTTP 400 errors plaguing one of the users of our system. Briefly, the application is a .NET web app that handles the business-specific functions and then hosts MicroStrategy 8.0.1 in an iFrame for OLAP-style reporting. One user was hitting HTTP 400 errors regularly while using the app, and several others were hitting them intermittently.

After reading the links below (and a bunch of others), I checked the HTTP error logs at C:\WINNT\system32\LogFiles\HTTPERR and found that the HTTP 400 errors were logged with a RequestLength reason code. According to Microsoft, the default maximum request size is 16384 bytes. I installed the Ethereal packet capture software on the QA web server and captured packets while the user attempted to run reports. The captures showed GET requests for certain pages exceeding 16384 bytes immediately before each HTTP 400 error (see image).



As a point of comparison, I captured packets while I ran the same reports myself; my own requests were only in the 2000-3000 byte range (see image).



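Incidentally, if you want to check your own server for these, the RequestLength entries are easy to pull out of the HTTPERR logs with a few lines of script. Here's a rough sketch of what I mean (just a quick-and-dirty illustration, not anything official – it reads the #Fields header that http.sys writes at the top of each log to find the reason column, and the log path is the one from my box, so adjust as needed):

# Quick-and-dirty scan of the http.sys error logs for RequestLength errors.
# The log directory below is the one from my server -- adjust for yours.
import glob
import os

LOG_DIR = r"C:\WINNT\system32\LogFiles\HTTPERR"

for path in sorted(glob.glob(os.path.join(LOG_DIR, "httperr*.log"))):
    fields = []
    with open(path) as log:
        for line in log:
            line = line.rstrip("\n")
            if line.startswith("#Fields:"):
                # e.g. "#Fields: date time c-ip c-port s-ip s-port ... s-reason s-queuename"
                fields = line.split()[1:]
                continue
            if not line or line.startswith("#") or not fields:
                continue
            entry = dict(zip(fields, line.split()))
            if entry.get("s-reason") == "RequestLength":
                print("%s %s %s -> %s" % (entry.get("date"), entry.get("time"),
                                          entry.get("c-ip"), entry.get("cs-uri")))
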
Apparently the larger size of his requests could be caused by an excessively large Kerberos authentication ticket combined with cookies and a long request string. The Kerberos ticket gets bigger with membership in many groups – this user is a member of at least 19 that I can see in Active Directory. Deleting cookies shrinks the request enough to drop back under the limit for a while, which is probably why other users who have hit this problem have had temporary success clearing it up that way.
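
As a side note, Microsoft publishes a rough sizing formula for the Kerberos token (if memory serves, it's in KB 327825, the article about MaxTokenSize problems for users in lots of groups), so you can ballpark how big a given user's ticket is. The snippet below is just that formula written out; how the 19 groups split across the two buckets is something you'd have to pull out of Active Directory, so the numbers in the example are made up.

# Rough Kerberos token-size estimate: TokenSize ~= 1200 + 40d + 8s, where
#   d = domain local groups + universal groups from outside the user's domain
#       (plus any SIDs in SID history)
#   s = global groups + universal groups from the user's own domain
def estimate_token_size(d, s):
    return 1200 + 40 * d + 8 * s

# e.g. 19 groups split 12/7 (made-up split) -> 1736 bytes, before cookies,
# the query string, and base64 inflation in the Authorization header
print(estimate_token_size(12, 7))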

According to the posts I read on this topic (the first link below has the most useful exchange), using the server's IP address instead of its hostname avoids Kerberos ticketing and therefore sidesteps the problem – as I understand it, the browser won't negotiate Kerberos against a bare IP address, so it falls back to NTLM, which puts a much smaller Authorization header on the request. That matched our situation: when using the IP address of the QA server, the user had no problems running reports. He also had no problems after I added the two registry settings below and cycled the required services. The only alternative I could see was asking the user to use the IP address of the server instead of the DNS name, but that may be against company policy and would cost us flexibility if we ever need to change the DNS entry.

Registry keys added (as DWORD values) under HKLM\System\CurrentControlSet\Services\HTTP\Parameters:

MaxFieldLength = 32768
MaxRequestBytes = 32768

After adding these keys, all users stopped experiencing HTTP 400 errors.
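
For what it's worth, here's the same change as a script instead of regedit clicks – a sketch using Python's winreg module, not anything official. Both values are DWORDs, you need to run it as an administrator, and (if I'm reading KB 820129 right) MaxFieldLength caps any single header while MaxRequestBytes caps the request line plus headers combined; the HTTP service has to be restarted before either takes effect.

# Sketch: set the http.sys request-size limits in the registry.
# Run elevated; restart the HTTP service (or reboot) afterwards.
import winreg

KEY_PATH = r"SYSTEM\CurrentControlSet\Services\HTTP\Parameters"

with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0,
                        winreg.KEY_SET_VALUE) as key:
    winreg.SetValueEx(key, "MaxFieldLength", 0, winreg.REG_DWORD, 32768)
    winreg.SetValueEx(key, "MaxRequestBytes", 0, winreg.REG_DWORD, 32768)

print("Values set; restart the HTTP service for the change to take effect.")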

Main links used for reference:

http://www.issociate.de/board/post/314237/HTTP_400_Bad_Request.html
http://support.microsoft.com/kb/820129
http://www.microsoft.com/technet/prodtechnol/WindowsServer2003/Library/IIS/c7d368f0-f175-4d58-b7a8-977e1d67bd07.mspx?mfr=true

Thursday, April 12, 2007

Overspecialization

My new pet peeve is overspecialization.

My esteemed employer emphasizes (at an institutional level) hiring top talent for very specific positions. I, for example, was hired as a SQL Server 2000 Developer. I took an extensive test that asked all kinds of very specific SQL Server 2000 questions. Fine, I know SQL Server, no problem.

Then I start work and find out that we're not supposed to touch any low-level SQL stuff because we have a DBA Team for that. Oookay, I'd rather do these things myself than break my train of thought to email somebody and then wait for them to do something, but I'll give it a shot.

The first time I had to interact with the DBA Team was over an issue with our backups: they weren't working. The DBAs have this dramatically complicated procedure that they install on every SQL Server; it retrieves server metadata from a central DB and then dynamically builds the command to back up your databases. Great, except that it's not flexible enough to keep 3 days' worth of backups, which is what we wanted. So they had a DBA Team member write a customized version of their stored proc that would add a timestamp to the backup files and then delete the old ones after 3 days.

Except it didn't work. I won't go into exactly why, but there were three different bugs in a relatively short stored proc. So I took it over, rewrote it from scratch, and now it works fine, completely independent of the metadata repository, which wasn't gaining us anything anyway.
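
The logic really is only a few lines. Here's roughly the shape of it – a quick sketch in Python rather than the actual T-SQL proc, with the server name, backup path, and connection string all made up:

# Sketch of the backup/retention logic: back up each user database to a
# timestamped .bak file, then delete any backup more than 3 days old.
import glob
import os
import time
from datetime import datetime

import pyodbc

BACKUP_DIR = r"D:\Backups"   # placeholder path
RETENTION_DAYS = 3

# placeholder server and connection string; autocommit because
# BACKUP DATABASE can't run inside a transaction
conn = pyodbc.connect("DRIVER={SQL Server};SERVER=MYSERVER;Trusted_Connection=yes",
                      autocommit=True)
cursor = conn.cursor()

# user databases only (dbid 1-4 are master, tempdb, model, msdb)
cursor.execute("SELECT name FROM master.dbo.sysdatabases WHERE dbid > 4")
for (db_name,) in cursor.fetchall():
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    target = os.path.join(BACKUP_DIR, "%s_%s.bak" % (db_name, stamp))
    cursor.execute("BACKUP DATABASE [%s] TO DISK = N'%s'" % (db_name, target))
    while cursor.nextset():  # drain BACKUP's informational messages
        pass

cutoff = time.time() - RETENTION_DAYS * 24 * 60 * 60
for bak in glob.glob(os.path.join(BACKUP_DIR, "*.bak")):
    if os.path.getmtime(bak) < cutoff:
        os.remove(bak)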

Why did this happen? Because the guy spends all day, every day, installing SQL Server on new servers, installing service packs on old ones, setting up replication, and troubleshooting deadlocks. And he's probably been doing this for years. Even if he was once a great developer, skills (like brains) atrophy if unused. Which is not even his fault, because his job is designed to make him really good at the few things he does and not let him do anything else.

This causes problems not only for the individual, but for every team. I'm of the opinion that each team should be capable of functioning as an autonomous unit. If you need an outside expert occasionally for some piece of knowledge completely outside of your usual purview, fine. But a team with the skills to accomplish most dev tasks on its own can stay in high gear during development, instead of moving in fits and starts every time it has to beg somebody else to do something.

Getting back to the individual: of course a broad range of skills is a good thing! For me, learning new skills is one of the main things that keeps me interested in a job. Also, every time I interview somebody who has done one thing for his whole career, I have misgivings about hiring him, because

a) who knows if he can do anything else?
b) he has nothing extra to contribute to the team
c) having only specialized knowledge reduces his creativity when it comes to sticky problems

My favorite part about CapIQ was getting to work on everything, and in fact one of the reasons that I left was that I felt I was getting pigeon-holed into being just the DB hardware guy. Little did I know that other offices could be much more restrictive...

Wednesday, April 11, 2007

Infrastructure

I was going to post a comment on Allan Leinwand's GigaOM post, "Web 2.0 & Death of the Network Engineer," but there were too many comments already and I didn't feel like reading them all. So I'm writing a quick response over here instead.

The CTO cited in the article knew nothing about the infrastructure details supporting his Web 2.0 venture. I wish him luck, but I think he's on the path to trouble.

The difference between good and great software engineers (or any professional, really) is knowledge of and attention to details. Anyone can write a stored procedure or a C# app, but a great developer's version will be 10% faster and more stable. Over time this can translate into huge savings, especially if that piece of code is run 10,000 times (or in some shops, 10,000 times A DAY).

Infrastructure works similarly. When it's specified, set up, and maintained properly, a company can squeeze maximum performance and capacity out of it. This may not mean much when we're talking about one $10,000 web server here or there, but let's look at another piece of infrastructure that can make a much bigger difference: database hardware.

Database servers and disks have been by far the most expensive pieces of hardware owned by each team I've worked on, and the most susceptible to underutilization. For example, you can buy a quad-core Clovertown server with cycles out the wazoo for under $30k, but if you only put $10k into the disks for it, it's not going to do much. And even if you do put more money into it, you need to know where it's going - will you go with fast, cheap, inflexible SAS storage, a soon-outgrown entry-level SAN like a CX3-10, or a monster DMX3000 with no budget left over for spindles? Then when you buy it, who's going to have the knowledge to set it up so that it performs properly? If you dump all your DBs on one set of disks, you're potentially giving up a lot of performance, which means you'll have to keep buying more disks as you scale up your user base.

A CTO who doesn't know these things is bad enough, but one who doesn't care is potentially leading his company into a financial quagmire. If you don't know what you're doing, it's easy to spend millions on infrastructure when you could have spent thousands. And the difference is knowing what you need and knowing how to use it.