
Let's say I can read a CSV with the built-in CSV parser, like this:

require 'csv'

CSV.foreach(file_path, quote_char: '"', col_sep: ',', row_sep: :auto, headers: true) { |line|
  # some code here
}

This code reads and parses the whole CSV from start to end.

So, my question: is it possible (in some non-lame way) to read the CSV in parallel, i.e. one part of the script reads the CSV from the start to the half and a second part from the half to the end, just by accessing the file on disk?

Important: without reading the CSV into an array or otherwise loading it into memory.

Ruby pseudocode (assuming the total number of lines in the file is known):

threads = []

threads << Thread.new do
  csvread(start_row_index, half_row_index) # first half
end

threads << Thread.new do
  csvread(half_row_index + 1, end_row_index) # second half
end

threads.each(&:join)
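In other words, something like this sketch that splits on byte offsets instead of row indexes (read_chunk is a hypothetical helper, and this assumes no quoted field contains a newline, since a row spanning physical lines would break the resync):

require 'csv'

# Hypothetical helper: parse only the rows between two byte offsets.
def read_chunk(file_path, start_byte, end_byte)
  File.open(file_path) do |io|
    io.seek(start_byte)
    io.gets # skip the header row (at offset 0) or resync past a partial line
    while io.pos <= end_byte && (line = io.gets)
      row = CSV.parse_line(line)
      # some code here
    end
  end
end

half = File.size(file_path) / 2

threads = []
threads << Thread.new { read_chunk(file_path, 0, half) }
threads << Thread.new { read_chunk(file_path, half, File.size(file_path)) }
threads.each(&:join)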
  • No, CSV reads the file sequentially. What you could do is read the CSV and distribute lines lazily to several different worker threads as they arrive (see the sketch after these comments). Commented Apr 22, 2015 at 6:44
  • Hmm, that is actually a very good idea, thank you, I will try that! Commented Apr 22, 2015 at 6:50
  • Hmm, I must be doing something wrong... it is slower than before!!! :D :( Help, please. Commented Apr 22, 2015 at 7:04
  • See the comment at the end of my answer. You cannot speed up anything using threads in MRI except blocked IO, since MRI runs only one thread at a time. If you are after speedups, either go with JRuby or Rubinius, or use processes instead of threads. Commented Apr 22, 2015 at 7:06
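
A rough sketch of that lazy-distribution suggestion, using only the standard library (the pool size and queue bound are arbitrary; the SizedQueue keeps the reader from outpacing the workers, so memory stays bounded):

require 'csv'

queue = SizedQueue.new(1000) # reader blocks when the queue is full

workers = 4.times.map do
  Thread.new do
    while (row = queue.pop) # a nil sentinel ends the loop
      # some code here -- process one row
    end
  end
end

CSV.foreach(file_path, headers: true) { |row| queue << row }
workers.size.times { queue << nil } # one sentinel per worker
workers.each(&:join)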

1 Answer


What I said in the comments - for example, using the peach gem:

require 'csv'
require 'peach'

CSV.foreach("a.csv").peach(2) do |row|
  row.map(&:to_i).reduce(&:+)
end

If you are using MRI, you will suffer from the GIL; if the workers are doing some heavy lifting, this code should be a bit slower than the non-threaded version. If your slowness is CPU-bound, switch to JRuby or Rubinius, as they don't have a GIL. If it is caused by blocking IO, then this should help even on MRI.
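For completeness, a rough sketch of the process route mentioned in the comments (MRI only; Process.fork is not available on Windows). Each child streams the whole file but parses only every Nth line, so the CPU-bound parsing runs in separate processes with no GIL contention; as with any line-based split, rows containing quoted newlines would break it:

require 'csv'

workers = 2
pids = workers.times.map do |i|
  Process.fork do
    File.foreach("a.csv").with_index do |line, lineno|
      next if lineno.zero?              # skip the header row
      next unless lineno % workers == i # stripe the lines across children
      CSV.parse_line(line).map(&:to_i).reduce(&:+)
    end
  end
end
pids.each { |pid| Process.wait(pid) }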


7 Comments

Well, I tried that, but probably the file is just too big, or I have a small amount of RAM (8 GB), so no success. I will probably stick with my old slow but reliable solution. Anyway, big thanks for your ideas and help!
I am not sure what you mean - why does RAM matter? The file should be loaded one row at a time. What kind of error are you getting?
I am getting no error. When I open the Task Manager in Windows, I just see that Ruby has consumed all the free RAM and nothing is happening. My CSV file has about 3.8 million rows and is about 2.5 GB.
Okay, I found the issue - my deepest apologies, it seems that parallel doesn't work with iterators properly (it coerces the collection with to_a unless it is a lambda, which made it load the whole CSV into memory; see the illustration after these comments). Changing to peach should solve that issue.
No, just peach. There are a bunch of different libraries for threaded master-worker pattern (peach, parallel, workers, parallel-each, producer-consumer...), they all do more or less the same thing. I already changed the sample code in the answer to work with peach instead of parallel. (Forgot to change the link, fixed that too)
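
To illustrate the to_a point from the comment above (the file name is just an example):

require 'csv'

rows = CSV.foreach("a.csv") # an Enumerator -- nothing is read yet
rows.to_a                   # materializes every row in memory at once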