I have a very large CSV file, ~800,000 lines. I would like to process this file in parallel to speed up my script.
How does one use Ruby to break a file into n smaller pieces?
Breaking the CSV file into chunks is the right approach, but keep in mind that each chunk needs to keep the first line, the CSV header!
So the Unix 'split' command will not cut it!
You'll have to write your own little Ruby script that reads the header line and stores it in a variable, then distributes the next N lines to a new partial CSV file, first copying the CSV header line into it, and so on.
After creating each file with the header and a chunk of lines, you could then use Resque to enqueue those files for parallel processing by a Resque worker.
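A minimal sketch of that script (the file names `input.csv` and `part_N.csv` are assumptions, not part of the answer): read the header once, then copy it into every part file followed by up to `chunk_size` data lines.

```ruby
# Split a CSV file into part files of at most chunk_size data lines each,
# repeating the header line at the top of every part. Returns the part paths.
def split_csv_with_header(path, chunk_size)
  parts = []
  File.open(path) do |input|
    header = input.gets              # first line is the CSV header
    part = 0
    until input.eof?
      part_path = "part_#{part}.csv"
      File.open(part_path, "w") do |out|
        out.write(header)            # every chunk keeps the header
        chunk_size.times do
          line = input.gets or break # stop at end of input
          out.write(line)
        end
      end
      parts << part_path
      part += 1
    end
  end
  parts
end

# Example: split_csv_with_header("input.csv", 200_000)
```

Each returned path could then be handed to `Resque.enqueue` for a worker to pick up.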
For CSV files, you can do this:
some_array = []
some_file = File.open("smaller_piece.csv", "w")

File.foreach("your_file.csv") do |line|
  # do your stuff here, like splitting the line into fields
  fields = line.split(",")
  # or store the lines in an array (memory)
  some_array << line
  # or write them back out to a smaller file
  some_file << line
end

some_file.close
By storing lines (or split lines) in an array (memory) or in files, you can break a large file into smaller pieces. After that, threads can be used to process each piece:
threads = []
1.upto(5) { |i| threads << Thread.new { process(pieces[i]) } } # process is your own method
threads.each(&:join)
Note that you are responsible for thread safety.
Hope this helps!
Update:
Following pguardiario's advice, we can use the CSV class from the standard library instead of opening the file directly.
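A brief sketch of that suggestion: the standard library's CSV parser handles quoted fields containing commas, which a bare `line.split(",")` does not.

```ruby
require "csv"

# CSV.parse_line parses one CSV record into an array of fields,
# respecting quoting rules that a naive split(",") would break on.
row = CSV.parse_line('1,"hello, world",3')
# row is ["1", "hello, world", "3"]
```

For a whole file, `CSV.foreach("your_file.csv") { |row| ... }` streams one parsed row at a time without loading the file into memory.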
I would use the Linux 'split' command to split this file into many smaller files, then process those smaller parts.
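A sketch of this approach, shelling out from Ruby and assuming the POSIX `split` utility is on the PATH (note that, as the first answer points out, this does not copy the CSV header into each piece):

```ruby
# Cut a file into pieces of lines_per_piece lines each using split(1).
# The pieces are named xaa, xab, xac, ... in the current directory.
def split_file(path, lines_per_piece)
  system("split", "-l", lines_per_piece.to_s, path) or raise "split failed"
  Dir.glob("xa*").sort
end

# Example: split_file("your_file.csv", 100_000)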