NYC DBA: Commandments for scraping public data sources

I've used a few different flavors of publicly available data as data sources, and I can tell you there's a good reason firms pay for clean data. Not that even vendor-processed data is usually 100% clean, but that's a different story. Anyway, I've learned a few things the hard way about scraping public data:

1. Use a resilient HTML parsing engine. BeautifulSoup is great and makes it very easy to explore HTML structures, but you'll almost certainly want the lxml backend to avoid blowing up on unclosed tags and nonstandard nesting.
2. Always call strip(). I've had a couple of import busts in which things that really looked like they should match didn't. 90% of the time it was because random whitespace had infiltrated the actual data. This also leads me to...
3. Anticipate spacing changes. ESPECIALLY with hand-entered data, even if there's a template, there can always be an extra line between the logo and the header, a blank line between data rows, or new spacing for the file date. Whenever possible, search for a reference value that points the way to the data, and throw away those newlines rather than expecting the data to sit at a fixed position.
4. Keep original copies. The first thing you should do with any source you parse is make a raw copy of it (if you can) so that you can refer back to it when #5 happens.
5. Expect change! No warranty is granted, implied, or hinted at for most public data sources, and sometimes the exact opposite is the case: public data "providers" happily change formats, addresses, and available datasets to stymie anyone making systematic use of them. Make sure you've got good logging and debugging set up so you can figure out quickly where and how something changed.
6. Make things modular. Building on #5 above, if you've set up your importer so you can quickly swap out the downloader, parser, normalizer, or persister (for example) for any given source, and toggle those sources on and off easily, it'll be much easier to hack up a new one quickly when somebody changes something somewhere.
7. Handle multiple formats. I've parsed many Excel spreadsheets that usually have an XLDate in a column but, every so often, spit out a text string instead because of the way someone typed it or copied it in. If your date parser function knows about this, you don't have to think about it.
8. Don't hammer. When you're using someone else's data that they're not explicitly making available in a structured format, be nice: download one copy and build your import off of that. Otherwise you're putting undue load on their servers and exposing your IP to possible banning if they decide you're violating their ToS.
9. Inheritance is your friend. Most of the parsers I've written have a lot of common structure that needs a little tweaking for certain sources. If you've built a solid class hierarchy, you can easily override that save() method in the 2 subclasses that need it while only needing to write a basic one for the other dozen. Any time I find myself copying code, I generally try to move it up to the superclass.
10. Pad your dev time estimate! This is a general problem I have, but I always look at a source, pretty quickly shred the data out with bs4 or something similar, and go "yeah, this'll take 2 days." Sure, to code-complete. Then it'll take a week to figure out every stupid corner case, extra whitespace location, and placeholder for None. Trust me (and this goes double for you, future me, when you're re-reading this).
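
To make the strip(), spacing, and multiple-format points a bit more concrete, here's a minimal sketch. It assumes the sheet has already been read into a list of rows, each a list of cell values. The names clean_cell, find_data_rows, and parse_date, the sample sheet, and the two text date layouts are all made up for illustration. The serial-number conversion assumes Excel's 1900 date system and only holds for dates from March 1900 onward.

```python
from datetime import date, datetime, timedelta

# Day zero for Excel's 1900 date system. Using 1899-12-30 absorbs Excel's
# phantom 1900-02-29, so serials from 1900-03-01 onward convert correctly.
EXCEL_EPOCH = date(1899, 12, 30)

# Hypothetical text layouts; extend this as new ones turn up in the source.
TEXT_DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d")


def clean_cell(cell):
    """Strip whitespace from text cells; leave every other type alone."""
    return cell.strip() if isinstance(cell, str) else cell


def find_data_rows(rows, reference_value):
    """Return the non-blank rows that follow the row containing reference_value."""
    found = False
    data = []
    for row in rows:
        cells = [clean_cell(c) for c in row]
        if not found:
            # Skip everything (logo, title, blank lines) until the reference row.
            found = reference_value in cells
            continue
        if any(c not in ("", None) for c in cells):
            data.append(cells)
    if not found:
        raise ValueError(f"reference value {reference_value!r} not found")
    return data


def parse_date(value):
    """Accept an Excel serial number, a date/datetime, or a text date."""
    if value is None:
        return None
    if isinstance(value, datetime):
        return value.date()
    if isinstance(value, date):
        return value
    if isinstance(value, (int, float)):
        return EXCEL_EPOCH + timedelta(days=int(value))
    text = str(value).strip()
    if not text:
        return None
    for fmt in TEXT_DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date value: {value!r}")


# Made-up sheet with stray blank rows, padded cells, and mixed date types.
sheet = [
    ["  Example County Filings ", None],
    [None, None],
    [" Business Name", "Filed "],
    ["", ""],
    [" Joe's Deli ", 41481.0],
    [None, None],
    ["Bob's Bikes", " 07/26/2013 "],
]
records = [
    (name, parse_date(filed))
    for name, filed in find_data_rows(sheet, "Business Name")
]
```

Both rows end up with the same date even though one arrived as a serial number and the other as typed text. A missing reference value raises instead of quietly returning nothing, which is usually what you want when a layout changes underneath you.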

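The points about keeping original copies, not hammering, modularity, and inheritance can be sketched together as one small class hierarchy. This is only an illustration under assumptions: the Source and PipeDelimitedSource classes, the run_enabled helper, and the injected fetch callable are hypothetical names, and save() is a stand-in for whatever persistence you actually use.

```python
import logging
from pathlib import Path

log = logging.getLogger(__name__)


class Source:
    """Base importer: download once, keep the raw copy, parse, normalize, save."""

    name = "base"
    enabled = True  # flip to False to switch a source off

    def __init__(self, raw_dir, fetch):
        self.raw_dir = Path(raw_dir)
        self.fetch = fetch  # any callable that returns the raw bytes

    def raw_path(self):
        return self.raw_dir / f"{self.name}.raw"

    def download(self):
        """Fetch only if there is no raw copy yet, then read from the copy."""
        path = self.raw_path()
        if path.exists():
            log.info("%s: reusing raw copy at %s", self.name, path)
        else:
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(self.fetch())
            log.info("%s: saved raw copy to %s", self.name, path)
        return path.read_bytes()

    def parse(self, raw):
        return raw.decode("utf-8").splitlines()

    def normalize(self, rows):
        # Strip every line and drop the blank ones.
        return [row.strip() for row in rows if row.strip()]

    def save(self, records):
        # Stand-in for a real database write.
        return list(records)

    def run(self):
        return self.save(self.normalize(self.parse(self.download())))


class PipeDelimitedSource(Source):
    """Only the step that differs is overridden; the rest is inherited."""

    name = "pipe_delimited"

    def normalize(self, rows):
        return [
            tuple(cell.strip() for cell in row.split("|"))
            for row in super().normalize(rows)
        ]


def run_enabled(sources):
    """Run every source that is switched on, keyed by source name."""
    return {s.name: s.run() for s in sources if s.enabled}
```

Because the fetch step is just a callable passed in, you can point a source at a saved file while debugging and the remote server never sees a second request. Overriding save() in the couple of subclasses that need it works the same way as the normalize() override shown here.
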
Friday, July 26, 2013
