I've spent the week processing some gnarly text files -- some of them in the hundred-million-row range.
I've used Python to open, parse, transform, and output these files. I've been running the jobs in parallel, often 6-8 at a time, on a massive 8-processor, 16-core EC2 instance backed by SSD.
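
To give a sense of the shape of each job, here is a simplified sketch (not the actual code -- the transform is elided, the file paths come from the command line, and in this sketch each job reads and writes its own files):

    import sys

    def process_file(in_path, out_path):
        # Read the pipe-delimited input line by line, transform each record,
        # and write the transformed record back out pipe-delimited.
        with open(in_path, "r") as fin, open(out_path, "w") as fout:
            for line in fin:
                fields = line.rstrip("\n").split("|")
                # ... per-record transformation happens here ...
                fout.write("|".join(fields) + "\n")

    if __name__ == "__main__":
        process_file(sys.argv[1], sys.argv[2])
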
I'd say the output comes out bad on roughly 0.001% of writes, like:
Expected output: |1107|2013-01-01 00:00:00|PS|Johnson|etc.
Actual output: |11072013-01-01 00:00:00|PS|Johnson|etc.
or |1107|2013-01-01 :00:00|PS|Johnson
Almost always, the problem is not GIGO (the input records are fine), but rather that Python has failed to write a separator or part of a date field. So I assume I'm overloading the SSD with these jobs, or rather that the machine is failing to throttle Python's writes when there's contention for the drive.
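
One mitigation that comes to mind is forcing each record out of Python's buffers and onto the drive explicitly, though that seems likely to defeat the point speed-wise. A rough sketch of what I mean (the helper name and the example record are invented):

    import os

    def write_record(fout, fields):
        # Build the full line (leading separator included, to match the
        # format above) and hand it to a single write() call, then flush
        # Python's buffer and ask the OS to push it to the drive.
        # fsync on every record will be slow; this is only to illustrate
        # the idea, not something I've verified helps.
        fout.write("|" + "|".join(fields) + "\n")
        fout.flush()
        os.fsync(fout.fileno())

    with open("example_out.txt", "w") as fout:
        write_record(fout, ["1107", "2013-01-01 00:00:00", "PS", "Johnson"])
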
My question is this: how do I get the fastest processing out of this box without inducing these kinds of "write" errors?