
Let's say I can read a CSV with the built-in CSV parser, like this:

require 'csv'

CSV.foreach(file_path, quote_char: '"', col_sep: ',', row_sep: :auto, headers: true) { |line|
  # some code here
}

This code reads and parses the whole CSV from start to end.

So, my question: is it possible (in some non-lame way) to read the CSV in parallel, i.e. one part of the script reads the CSV from the start to the half and a second part from the half to the end, just by accessing the file on disk?

Important: without reading the CSV into an array or otherwise loading it into memory.

Ruby pseudocode (assuming the total number of lines in the file is known):

threads = []

threads << Thread.new do
  csvread(start_row_index, half_row_index) # first half
end

threads << Thread.new do
  csvread(half_row_index + 1, end_row_index) # second half
end

threads.each(&:join)
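In other words, something like this sketch that splits on byte offsets instead of row indexes (read_chunk is a hypothetical helper, and this assumes no quoted field contains a newline, since a row spanning physical lines would break the resync):

require 'csv'

# Hypothetical helper: parse only the rows between two byte offsets.
def read_chunk(file_path, start_byte, end_byte)
  File.open(file_path) do |io|
    io.seek(start_byte)
    io.gets # skip the header row (at offset 0) or resync past a partial line
    while io.pos <= end_byte && (line = io.gets)
      row = CSV.parse_line(line)
      # some code here
    end
  end
end

half = File.size(file_path) / 2

threads = []
threads << Thread.new { read_chunk(file_path, 0, half) }
threads << Thread.new { read_chunk(file_path, half, File.size(file_path)) }
threads.each(&:join)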
  • No, CSV reads the file sequentially. What you could do is read the CSV and distribute lines lazily to several different worker threads as they arrive (see the sketch after these comments). Commented Apr 22, 2015 at 6:44
  • Hmm, that is actually a very good idea, thank you, I will try that! Commented Apr 22, 2015 at 6:50
  • Hmm, I must be doing something wrong... it is slower than before!!! :D :( Help, please. Commented Apr 22, 2015 at 7:04
  • See the comment at the end of my answer. You cannot speed up anything using threads in MRI except blocked IO, since MRI runs only one thread at a time. If you are after speedups, either go with JRuby or Rubinius, or use processes instead of threads. Commented Apr 22, 2015 at 7:06
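
A rough sketch of that lazy-distribution suggestion, using only the standard library (the pool size and queue bound are arbitrary; the SizedQueue keeps the reader from outpacing the workers, so memory stays bounded):

require 'csv'

queue = SizedQueue.new(1000) # reader blocks when the queue is full

workers = 4.times.map do
  Thread.new do
    while (row = queue.pop) # a nil sentinel ends the loop
      # some code here -- process one row
    end
  end
end

CSV.foreach(file_path, headers: true) { |row| queue << row }
workers.size.times { queue << nil } # one sentinel per worker
workers.each(&:join)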

1 Answer


What I said in the comments - for example, using the peach gem:

require 'csv'
require 'peach'

CSV.foreach("a.csv").peach(2) do |row|
  row.map(&:to_i).reduce(&:+)
end

If you are using MRI, you will suffer from the GIL; if the workers are doing some heavy lifting, this code should be a bit slower than the non-threaded version. If your slowness is CPU-bound, switch to JRuby or Rubinius, as they don't have a GIL. If it is caused by blocking IO, then this should help even on MRI.
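For completeness, a rough sketch of the process route mentioned in the comments (MRI only; Process.fork is not available on Windows). Each child streams the whole file but parses only every Nth line, so the CPU-bound parsing runs in separate processes with no GIL contention; as with any line-based split, rows containing quoted newlines would break it:

require 'csv'

workers = 2
pids = workers.times.map do |i|
  Process.fork do
    File.foreach("a.csv").with_index do |line, lineno|
      next if lineno.zero?              # skip the header row
      next unless lineno % workers == i # stripe the lines across children
      CSV.parse_line(line).map(&:to_i).reduce(&:+)
    end
  end
end
pids.each { |pid| Process.wait(pid) }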


7 Comments

Well, I tried that, but probably the file is just too big, or I have a small amount of RAM (8 GB), so no success. I will probably stick with my old slow but reliable solution. Anyway, big thanks for your ideas and help!
I am not sure what you mean - why does RAM matter? The file should be loaded one row at a time. What kind of error are you getting?
I am getting no error. When I open the Task Manager in Windows, I just see that Ruby has consumed all the free RAM and nothing is happening. My CSV file has about 3.8 million rows and is about 2.5 GB.
Okay, I found the issue - my deepest apologies, it seems that parallel doesn't work with iterators properly (it coerces the collection with to_a unless it is a lambda, which made it load the whole CSV into memory; see the illustration after these comments). Changing to peach should solve that issue.
No, just peach. There are a bunch of different libraries for threaded master-worker pattern (peach, parallel, workers, parallel-each, producer-consumer...), they all do more or less the same thing. I already changed the sample code in the answer to work with peach instead of parallel. (Forgot to change the link, fixed that too)
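
To illustrate the to_a point from the comment above (the file name is just an example):

require 'csv'

rows = CSV.foreach("a.csv") # an Enumerator -- nothing is read yet
rows.to_a                   # materializes every row in memory at once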