I have a very large CSV file, ~800,000 lines. I would like to process this file in parallel to speed up my script.
How does one use Ruby to break a file into n smaller pieces?
Breaking the CSV file into chunks is the right approach, but keep in mind that each chunk needs to keep the first line, the CSV header!
So the Unix 'split' command will not cut it!
You'll have to write your own little Ruby script that reads the header line and stores it in a variable, then distributes the next N lines to a new partial CSV file, first copying the CSV header line into it, and so on.
After creating each file with the header and a chunk of lines, you could then use Resque to enqueue those files for parallel processing by a Resque worker.
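A minimal sketch of that script (the file names `input.csv` and `part_N.csv` are assumptions, not part of the answer): read the header once, then copy it into every part file followed by up to `chunk_size` data lines.

```ruby
# Split a CSV file into part files of at most chunk_size data lines each,
# repeating the header line at the top of every part. Returns the part paths.
def split_csv_with_header(path, chunk_size)
  parts = []
  File.open(path) do |input|
    header = input.gets              # first line is the CSV header
    part = 0
    until input.eof?
      part_path = "part_#{part}.csv"
      File.open(part_path, "w") do |out|
        out.write(header)            # every chunk keeps the header
        chunk_size.times do
          line = input.gets or break # stop at end of input
          out.write(line)
        end
      end
      parts << part_path
      part += 1
    end
  end
  parts
end

# Example: split_csv_with_header("input.csv", 200_000)
```

Each returned path could then be handed to `Resque.enqueue` for a worker to pick up.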
For CSV files, you can do this:
some_array = []
some_file = File.open("smaller_piece.csv", "w")

File.foreach("your_file.csv") do |line|
  # do your stuff here, like splitting the line into fields
  fields = line.split(",")
  # or store the lines in an array (memory)
  some_array << line
  # or write them back out to a smaller file
  some_file << line
end

some_file.close
By storing lines (or split lines) in an array (memory) or in files, you can break a large file into smaller pieces. After that, threads can be used to process each piece:
threads = []
1.upto(5) { |i| threads << Thread.new { process(pieces[i]) } } # process is your own method
threads.each(&:join)
Note that you are responsible for thread safety.
Hope this helps!
Update:
Following pguardiario's advice, we can use the CSV class from the standard library instead of opening the file directly.
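A brief sketch of that suggestion: the standard library's CSV parser handles quoted fields containing commas, which a bare `line.split(",")` does not.

```ruby
require "csv"

# CSV.parse_line parses one CSV record into an array of fields,
# respecting quoting rules that a naive split(",") would break on.
row = CSV.parse_line('1,"hello, world",3')
# row is ["1", "hello, world", "3"]
```

For a whole file, `CSV.foreach("your_file.csv") { |row| ... }` streams one parsed row at a time without loading the file into memory.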
I would use the Linux 'split' command to split this file into many smaller files, then process those smaller parts.
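A sketch of this approach, shelling out from Ruby and assuming the POSIX `split` utility is on the PATH (note that, as the first answer points out, this does not copy the CSV header into each piece):

```ruby
# Cut a file into pieces of lines_per_piece lines each using split(1).
# The pieces are named xaa, xab, xac, ... in the current directory.
def split_file(path, lines_per_piece)
  system("split", "-l", lines_per_piece.to_s, path) or raise "split failed"
  Dir.glob("xa*").sort
end

# Example: split_file("your_file.csv", 100_000)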