2

I have an application I wrote in PHP (on Symfony) that imports large CSV files (up to 100,000 lines). It has a real memory usage problem. Once it gets through about 15,000 rows, it grinds to a halt.

I know there are measures I could take within PHP but I'm kind of done with PHP, anyway.

If I wanted to write an app that imports CSV files, do you think there would be any significant difference between Ruby and Python? Is either one of them geared to more import-related tasks? I realize I'm asking a question based on very little information. Feel free to ask me to clarify things, or just speak really generally.

If it makes any difference, I really like Lisp and I would prefer the Lispier of the two languages, if possible.

15
  • 2
    When you have a CSV file that you need to import (talking about databases), why not use the "csvimport" functionality that most DBs provide? Commented Dec 31, 2010 at 16:26
  • 1
    You should not depend on a framework for a task like this. Have you ever written a simple script to parse/import the large CSV? Commented Dec 31, 2010 at 16:35
  • 4
    Changing the language won't help. You need to fix your bad habits first. Commented Dec 31, 2010 at 16:44
  • 1
    @ajreal: you do at least want a library to handle CSV. Not everything about CSV is as simple as row.split(",") (you have to be able to deal with escaping and quoting so that you can have commas inside of cells) Commented Dec 31, 2010 at 17:14
  • 1
    Ugh. I asked my question in a retarded way. It was supposed to be mainly about the memory usage. Commented Dec 31, 2010 at 20:06

4 Answers 4

10

What are you importing the CSV file into? Couldn't you parse the CSV file in a way that doesn't load the whole thing into memory at once (i.e. work with one line at a time)?

If so, then you can use Python's standard csv library to do something like the following:

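import csv

# newline='' lets the csv module handle line endings itself (Python 3)
with open('csvfile.csv', newline='') as source:
    rdr = csv.reader(source)
    for row in rdr:
        # do whatever with row; only one row is held in memory at a time
        print(row)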

Now don't take this answer as an immediate reason to switch to Python. I'd be very surprised if PHP didn't have a similar functionality in its CSV library, etc.
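To make the "what are you importing into?" point concrete, here is a rough sketch of streaming a CSV into a database in fixed-size batches. It assumes SQLite as the target, a header row in the file, and a made-up `people (name, age)` table; `import_csv` and `batch_size` are names invented for this example, not anything from the question.

```python
import csv
import itertools
import sqlite3


def import_csv(csv_path, conn, batch_size=1000):
    """Stream rows from csv_path into a hypothetical `people` table in batches."""
    conn.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, age INTEGER)")
    total = 0
    with open(csv_path, newline="") as source:
        reader = csv.reader(source)
        next(reader, None)  # skip the header row (assumed to be present)
        while True:
            # Pull at most batch_size rows; only this batch is held in memory.
            batch = list(itertools.islice(reader, batch_size))
            if not batch:
                break
            conn.executemany("INSERT INTO people (name, age) VALUES (?, ?)", batch)
            conn.commit()
            total += len(batch)
    return total
```

Called as `import_csv("csvfile.csv", sqlite3.connect("import.db"))`, this should keep memory use roughly proportional to `batch_size` rather than to the size of the file. The same batching idea would apply in PHP or Ruby.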


3 Comments

@Ken Bloom: It was brilliant. I couldn't decide whether to add Python to your answer or clone the answer. In retrospect, I think I should have added the Python code to the Ruby answer, because then you'd get the credit you deserve for saying "don't take this answer as an immediate reason to switch".
I should have been clearer in my question, sorry: it's not the actual reading of the file that's necessarily causing my memory problems; it's the slicing and dicing I do afterward.
@Jason Swett: "I should have been clearer in my question,". Then update your question to be clearer. The comment here is about useless. Please fix the question to state your real problem.
10

What are you importing the CSV file into? Couldn't you parse the CSV file in a way that doesn't load the whole thing into memory at once (i.e. work with one line at a time)?

If so, then you can use Ruby's standard CSV library to do something like the following:

require 'csv'

CSV.foreach('csvfile.csv') do |row|
  # executes once for each row
  p row
end

Now don't take this answer as an immediate reason to switch to Ruby. I'd be very surprised if PHP didn't have a similar functionality in its CSV library, so you should investigate PHP more thoroughly before deciding that you need to switch languages.

4 Comments

Python also has that same capability. I see no reason why PHP wouldn't.
The fact that the language features of Ruby make the processing a little more natural (in my opinion) might constitute a reason to switch to Ruby. But it's certainly possible to get PHP to process CSV files reasonably fast. For instance, I just wrote a small PHP script that uses fgetcsv to read each row of a 1,000,000-line CSV file. (That's all it does; there's no additional processing.) On my Mac laptop, that operation takes 5 seconds. That's not terrible. The equivalent Ruby script (like the above) takes quite a bit longer, on the same machine (with Ruby 1.8).
PHP does have the same capability (fgetcsv) and I am using it. I should have been clearer in my question, sorry: it's not the actual reading of the file that's necessarily causing my memory problems; it's the slicing and dicing I do afterward.
@Jason, then ask a (new) question about the specific slicing and dicing you're doing. (Also consider running a profiler to see where the bottlenecks are.)
3

The equivalent in python (wait for it):

import csv

# newline="" is what the csv module expects for files in Python 3
reader = csv.reader(open("some.csv", newline=""))
for row in reader:
    print(row)

This code does not load the entire csv file in memory first but, instead, parses it line by line with iterators. I bet your problem is happening "after" the line is read, where you are somehow buffering the data (by storing it in a dictionary or array of some sort).

When dealing with big data, you need to discard the data as fast as you can and buffer as little as possible. In the example above, "print" does just that: it performs an operation on the row but doesn't store any of it, so Python can free each row as soon as the loop moves on to the next one.
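If the "slicing and dicing" is some kind of aggregation, the same idea still applies: keep only the running result, not the rows. Here is a small sketch, assuming a file with `category` and `amount` columns (both column names and the `total_by_category` function are made up for illustration):

```python
import csv
from collections import defaultdict


def total_by_category(csv_path):
    """Sum the `amount` column per `category` without keeping any row around."""
    totals = defaultdict(float)
    with open(csv_path, newline="") as source:
        for row in csv.DictReader(source):
            # Keep only the running total; the row itself can be freed
            # as soon as the next iteration rebinds `row`.
            totals[row["category"]] += float(row["amount"])
    return dict(totals)
```

Memory use here grows with the number of distinct categories, not with the number of rows, which is usually what you want for a 100,000-line import.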

I hope this helps.

1 Comment

Please use the with statement when you open files.
1

I think the problem is that you are loading the whole CSV into memory at once. If that is the case, then I am sure Python or Ruby is going to blow up on you too. I am a big fan of Python, but that is just a personal opinion.

2 Comments

I'm not sure about Ruby, but Python can parse a CSV file line by line without having to load the entire thing into memory all at once. This is the behavior in the two Python examples posted above.
@Chris I know that. PHP can do that too, but if you load it at once you are getting yourself into a problem.
