This server is pretty beefy, a 4-socket quad-core box with Xeon 7350s (16 cores in all), but tempdb only had 8 data files. I know that the rule of thumb of 1 file per core is no longer quite so hard and fast, but I figured it probably wouldn't hurt here. I created an extra 8 files, resized the existing ones so that all 16 were the same size, and had the user kick off the process again.
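For anyone wanting to script the same change, here is a rough sketch in Python that generates the `ALTER DATABASE tempdb` statements rather than running them. The logical file names, the `T:\tempdb` path, and the size and growth values are all made up for illustration; none of them come from this server, so substitute your own.

```python
def tempdb_file_statements(existing_names, total_files, size_mb, growth_mb, data_dir):
    """Build T-SQL that equalizes existing tempdb data files and adds new ones.

    All arguments are illustrative assumptions: logical file names, target
    file count, per-file size/growth in MB, and the folder for new files.
    """
    statements = []

    # Resize every existing file to the common size. SQL Server rejects a
    # MODIFY FILE whose SIZE is not larger than the current size, so pick a
    # target at least as large as the biggest existing file.
    for name in existing_names:
        statements.append(
            f"ALTER DATABASE tempdb MODIFY FILE "
            f"(NAME = N'{name}', SIZE = {size_mb}MB, FILEGROWTH = {growth_mb}MB);"
        )

    # Add new files, at the same size, until the target count is reached.
    for i in range(len(existing_names) + 1, total_files + 1):
        name = f"tempdev{i}"  # hypothetical naming scheme
        statements.append(
            f"ALTER DATABASE tempdb ADD FILE "
            f"(NAME = N'{name}', FILENAME = N'{data_dir}\\{name}.ndf', "
            f"SIZE = {size_mb}MB, FILEGROWTH = {growth_mb}MB);"
        )

    return statements


if __name__ == "__main__":
    # 8 existing files, going to 16 (one per core on a 4-socket quad-core box).
    existing = ["tempdev"] + [f"tempdev{i}" for i in range(2, 9)]
    for stmt in tempdb_file_statements(existing, 16, 4096, 512, r"T:\tempdb"):
        print(stmt)
```

Review the output before running it; keeping every file at the same size and growth increment is what lets SQL Server spread allocations evenly across them.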
No waits! Or at least, no PAGELATCH_UP waits. Some SOS_SCHEDULER_YIELDs and CXPACKETs, but I took that to mean that we had successfully shifted the bottleneck off of allocations and onto CPU, where it should be.
It's rare that 5 minutes of configuration change can effect a significant gain in process speed, but it's pretty satisfying when it happens.