
I've spent the week processing some gnarly text files -- some in the hundred million row range.

I've used Python to open, parse, transform, and output these files. I've been running the jobs in parallel, often 6-8 at a time, on a massive 8-processor, 16-core EC2 instance with SSD storage.

And I would say that the output is bad on 0.001% of writes, like:

 Expected output:  |1107|2013-01-01 00:00:00|PS|Johnson|etc.

 Actual output:    |11072013-01-01 00:00:00|PS|Johnson|etc.
               or  |1107|2013-01-01 :00:00|PS|Johnson

Almost always, the problem is not GIGO, but rather that Python has failed to write a separator or part of a date field. So I assume that I'm overloading the SSD with these jobs, or rather that the machine is failing to throttle Python when there is write contention for the drive.

My question is this: how do I get the fastest processing out of this box without inducing these kinds of "write" errors?

  • Aren't those write errors due to your way of doing things? It's unlikely that the SSD is overloaded. Do you write to the same file from different processes/threads? Commented Jul 26, 2013 at 12:29
  • Please elaborate on "your way of doing things" -- my way is to write files using csv.writer with a pipe delimiter. The SSD is handling 50K record writes a second. Commented Jul 26, 2013 at 12:35
  • What I meant was, are you writing concurrently to the same file? Because if so, then this would likely be the cause of your problem. Commented Jul 26, 2013 at 12:37
  • Splitting the master file by lines into smaller pieces, processing each piece in parallel, with output going to a corresponding, independent file -- zero files in common for either reads or writes (see the sketch just after these comments). That did need clarification -- thank you for prompting. Commented Jul 26, 2013 at 12:43
  • It's possible you're just getting race conditions if you're only using pure threads. Commented Jul 26, 2013 at 17:55
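
For concreteness, here is a minimal sketch of the setup described in the comments above: one chunk of the master file in, one independent pipe-delimited output file out, written with csv.writer. The function names and the transform step are illustrative placeholders, not code from the question.

    import csv

    def transform(line):
        # Stand-in for the real parse/transform step: split on the input
        # delimiter and strip surrounding whitespace from each field.
        return [field.strip() for field in line.split('|')]

    def process_chunk(in_path, out_path):
        # Each job reads one chunk of the master file and writes its output
        # to its own file -- no file handle is shared between jobs.
        with open(in_path, newline='') as src, open(out_path, 'w', newline='') as dst:
            writer = csv.writer(dst, delimiter='|')
            for line in src:
                writer.writerow(transform(line))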

1 Answer


Are you using the multiprocessing module (separate processes) or just threads for the parallel processing? (A minimal sketch of the multiprocessing approach follows at the end of this answer.)

I doubt very much that the SSD is the problem. Or Python. But maybe the csv module has a race condition and isn't thread-safe?

Also, check your code. And the inputs. Are the "bad" writes consistent? Can you reproduce them? You mention GIGO, but don't really rule it out ("Almost always, ...").
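
For illustration only (not part of the original answer): a minimal sketch of the multiprocessing approach asked about above. It assumes the hypothetical process_chunk(in_path, out_path) worker sketched under the question's comments lives in a module called chunk_worker; chunk_paths and the ".out" suffix are placeholders as well.

    from multiprocessing import Pool

    from chunk_worker import process_chunk  # hypothetical module holding the worker sketched earlier

    def run_all(chunk_paths, workers=6):
        # One separate *process* per chunk (at most `workers` at a time), so no
        # interpreter state, csv.writer, or file handle is shared between jobs.
        jobs = [(path, path + '.out') for path in chunk_paths]
        with Pool(processes=workers) as pool:
            pool.starmap(process_chunk, jobs)

    if __name__ == '__main__':
        run_all(['chunk_000.txt', 'chunk_001.txt'], workers=6)

Because each worker is a separate process with its own interpreter and its own open file, there is nothing for two jobs to race on, which is the point of the question above.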


4 Comments

Not using the multiprocessing module -- will investigate, thanks. I just want to rule some things out; you and I both "doubt" that the SSD could be the issue, but I'm looking for your experience here -- will contention for the block device pause processing? Will writes be queued? Under heavy load, will characters simply fail to be written? The csv writer is suspect, as is the chance that there is some garbage in the file that is disrupting normal processing. Once I reasonably rule everything else out, GIGO must be the conclusion.
If the device cannot keep up, then the write will block. So characters won't fail to be written; the process will just wait until the disk is free, and then write.
The benefit of using the csv module here is fairly low, so I would probably replace it with self-written code that I make sure is thread-safe (a sketch of that idea follows these comments). The problem would then probably be gone. The next task would be to find the cause in csv, fix it, and give the fix back to the community ;-)
Xaqq -- that is very helpful, particularly in the category of "eliminating paranoia."
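
For what it's worth, here is a minimal sketch of the "self-written, thread-safe" writer suggested in the comment above: a hypothetical LockedPipeWriter that builds each pipe-delimited line in full and writes it under a lock, so concurrent threads can never interleave partial rows or drop separators. It is illustrative only, not code from this thread.

    import threading

    class LockedPipeWriter:
        # Hypothetical drop-in for csv.writer: one lock per output file.
        def __init__(self, path):
            self._fh = open(path, 'w')
            self._lock = threading.Lock()

        def writerow(self, fields):
            line = '|'.join(str(f) for f in fields) + '\n'
            with self._lock:  # the whole line goes out in one guarded call
                self._fh.write(line)

        def close(self):
            with self._lock:
                self._fh.close()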
