
I wish to do the following as fast as possible with Python:

  • read rows i to j of a csv file
  • create the concatenation of all the strings in csv[row=(loop i to j)][column=3]

My first code was a loop (i to j) of the following:

import csv
import itertools

with open('Train.csv', 'rt') as f:
    # skip ahead to the one row I need, then take column 3 (the tags)
    row = next(itertools.islice(csv.reader(f), row_number, row_number + 1))
    tags = row[3]  # (on Python 2 this was row[3].decode('utf8'))
return tags

but my code above reads the csv one row at a time, re-scanning the file from the start on every iteration, and is slow.

How can I read all rows in one call and concatenate fast?
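
Roughly, I imagine the "one call" version would look something like this (untested sketch; I'm assuming i and j are 0-based row indices and the file is comma-delimited):

import csv
import itertools

with open('Train.csv', 'rt') as f:
    reader = csv.reader(f)
    # one pass over the file: slice out rows i..j and join their column 3
    tags = " ".join(row[3] for row in itertools.islice(reader, i, j + 1))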


Edit for additional information:

The csv file size is 7 GB and I have only 4 GB of RAM, on Windows XP; but I don't need to read all columns (only about 1% of the 7 GB would be enough, I think).

4 Answers


Since I know which data you are interested in, I can speak from experience:

import csv
with open('Train.csv', 'rt') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in reader:
        row[0]  # ID
        row[1]  # title
        row[2]  # body
        row[3]  # tags

You can of course pick out whatever you want from each row, and store it however you like.

By using an iterator variable, you can decide which rows to collect:

import csv
with open('Train.csv', 'rt') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    tags = []      # you can preallocate memory to this list if you want though.
    for linenum, row in enumerate(reader):
        if 1000 < linenum < 2000:
            tags.append(row[3])    # tags
        if linenum == 2000:
            break   # so it won't read the next 3 million rows

Another good thing about this is that it uses very little memory, since you read the file line by line.

As mentioned, if you want rows further down the file, it still has to parse the data to get there (this is inevitable, since there are newlines inside the text fields, so you can't simply seek to a certain row). Personally, I just roughly used Linux's split to cut the file into chunks, and then edited them to make sure each chunk starts at an ID (and ends with a tag).

Then I used:

import pandas
train = pandas.io.parsers.read_csv(file, quotechar="\"")

To quickly read in the split files.
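
For example, reading one such chunk and joining its tag column could look roughly like this (chunk_aa is a hypothetical file name produced by split; header=None because the chunks were trimmed to start at an ID rather than at the csv header line):

import pandas

train = pandas.read_csv("chunk_aa", quotechar='"', header=None)
tags = " ".join(train[3].astype(str))   # column index 3 holds the tags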


1 Comment

@tucson I even wrote my own reader in C++, because nothing in C++ perfectly addressed this file (it being really big, I also asked a question about it on Stack Overflow).

If the file is not HUGE (hundreds of megabytes) and you actually need to read a lot of rows, then probably just

tags = " ".join(x.split("\t")[3]
                for x in open("Train.csv").readlines()[from_row:to_row+1])

is going to be the fastest way.

If the file is instead very big, the only thing you can do is iterate over all the lines, because CSV unfortunately uses (in general) variable-sized records.

If by chance the specific CSV uses a fixed-size record format (not uncommon for large files) then directly seeking into the file may be an option.

If the file uses variable-sized records and the search must be done several times with different ranges, then creating a simple external index just once (e.g. line -> file offset for all line numbers that are a multiple of 1000) can be a good idea.
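
A minimal sketch of that external-index idea, assuming one record per line (the names build_index and read_range are illustrative, not an existing API):

def build_index(path, step=1000):
    """Map every step-th line number to its byte offset, in a single pass."""
    index = {0: 0}
    with open(path, "rb") as f:
        line_no = 0
        while f.readline():
            line_no += 1
            if line_no % step == 0:
                index[line_no] = f.tell()
    return index

def read_range(path, index, from_row, to_row, step=1000):
    """Yield lines from_row..to_row (0-based), seeking to the nearest indexed offset."""
    base = (from_row // step) * step
    with open(path, "rb") as f:
        f.seek(index[base])
        for line_no in range(base, to_row + 1):
            line = f.readline()
            if not line:
                break            # reached end of file
            if line_no >= from_row:
                yield line.decode("utf8")

Building the index costs one full scan, but every later range query then reads at most step - 1 unwanted lines before the requested range.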

4 Comments

The file is actually huge: 7 GB, and I have only 4 GB of RAM, on Windows XP; but I don't need to read all columns (only 1% of the 7 GB would be good, I think).
@tucson Facebook/Stackoverflow competition training data :P? It must be.
@tucson My answer describes how I dealt with the data.
@6502 I am getting a memory error when working with a 15MB csv file (" for x in open("Train.csv").readlines()[from_row:to_row+1]) MemoryError". Is the file too big for this command?

Your question does not contain enough information, probably because you don't see some of the existing complexity: most CSV files contain one record per line. In that case it is simple to skip the rows you're not interested in. But in CSV, records can span lines, so a general solution (like the CSV reader from the standard library) has to parse the records in order to skip lines. It's up to you to decide which optimization is acceptable in your use case.

The next problem is that you don't know which part of the code you posted is too slow. Measure it. Your code will never run faster than the time needed to read the file from disk. Have you checked that? Or have you just guessed which part is too slow?

If you want to do fast transformations of CSV data that fits into memory, I would propose using/learning Pandas. So it would probably be a good idea to split your code into two steps (a rough sketch follows the list):

  1. Reduce file to the required data.
  2. Transform the remaining data.
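
A rough illustration of the two steps (Train.csv, column 3 and the row bounds come from the question; tags_only.csv is a hypothetical name for the reduced file):

import csv
import itertools
import pandas

i, j = 1000, 2000    # row bounds from the question (0-based)

# Step 1: reduce -- stream the huge file once, keeping only column 3 of rows i..j
with open('Train.csv', 'rt') as src, open('tags_only.csv', 'wt', newline='') as dst:
    writer = csv.writer(dst)
    for row in itertools.islice(csv.reader(src), i, j + 1):
        writer.writerow([row[3]])

# Step 2: transform -- the reduced file easily fits into memory
tags_frame = pandas.read_csv('tags_only.csv', header=None)
tags = " ".join(tags_frame[0].astype(str))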

1 Comment

Thank you for this generic fundamental information. Very useful.

sed is designed for the task 'read rows i to j of a csv file'.

If the solution does not have to be pure Python, I think preprocessing the csv file with sed (sed -n 'i,jp'), then parsing the output with Python, would be simple and quick.
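
For example, a sketch that shells out to sed from Python (assuming sed is on the PATH, and that i and j are 1-based line numbers the way sed counts them):

import csv
import io
import subprocess

i, j = 1000, 2000   # 1-based line numbers, as sed counts them
out = subprocess.run(
    ["sed", "-n", f"{i},{j}p", "Train.csv"],
    capture_output=True, text=True, check=True,
)
tags = " ".join(row[3] for row in csv.reader(io.StringIO(out.stdout)))

Note that, as other answers point out, a pure line range can cut a record in half if any field contains embedded newlines.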

