
I wish to do the following as fast as possible with Python:

  • read rows i to j of a csv file
  • create the concatenation of all the strings in csv[row=(loop i to j)][column=3]

My first code was a loop (i to j) of the following:

import csv
import itertools

with open('Train.csv', 'rt') as f:
    # skip ahead to the one row I need, then take column 3 (the tags)
    row = next(itertools.islice(csv.reader(f), row_number, row_number + 1))
    tags = row[3]  # (on Python 2 this was row[3].decode('utf8'))
return tags

but my code above reads the csv one row at a time, re-scanning the file from the start on every iteration, and is slow.

How can I read all rows in one call and concatenate fast?
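
Roughly, I imagine the "one call" version would look something like this (untested sketch; I'm assuming i and j are 0-based row indices and the file is comma-delimited):

import csv
import itertools

with open('Train.csv', 'rt') as f:
    reader = csv.reader(f)
    # one pass over the file: slice out rows i..j and join their column 3
    tags = " ".join(row[3] for row in itertools.islice(reader, i, j + 1))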


Edit for additional information:

The csv file size is 7 GB and I have only 4 GB of RAM, on Windows XP; but I don't need to read all columns (only about 1% of the 7 GB would be enough, I think).

4 Answers


Since I know which data you are interested in, I can speak from experience:

import csv
with open('Train.csv', 'rt') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for row in reader:
        row[0]  # ID
        row[1]  # title
        row[2]  # body
        row[3]  # tags

You can of course pick out whatever you want from each row, and store it however you like.

By using an iterator variable, you can decide which rows to collect:

import csv
with open('Train.csv', 'rt') as csvfile:
    reader = csv.reader(csvfile, delimiter=' ', quotechar='|')
    tags = []      # you can preallocate memory to this list if you want though.
    for linenum, row in enumerate(reader):
        if 1000 < linenum < 2000:
            tags.append(row[3])    # tags
        if linenum == 2000:
            break   # so it won't read the next 3 million rows

Another good thing about this is that it uses very little memory, since you read the file line by line.

As mentioned, if you want rows further down the file, it still has to parse the data to get there (this is inevitable, since there are newlines inside the text fields, so you can't simply seek to a certain row). Personally, I just roughly used Linux's split to cut the file into chunks, and then edited them to make sure each chunk starts at an ID (and ends with a tag).

Then I used:

import pandas
train = pandas.io.parsers.read_csv(file, quotechar="\"")

To quickly read in the split files.
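
For example, reading one such chunk and joining its tag column could look roughly like this (chunk_aa is a hypothetical file name produced by split; header=None because the chunks were trimmed to start at an ID rather than at the csv header line):

import pandas

train = pandas.read_csv("chunk_aa", quotechar='"', header=None)
tags = " ".join(train[3].astype(str))   # column index 3 holds the tags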


1 Comment

@tucson I even wrote my own reader in C++, because nothing in C++ perfectly addressed this file (it being really big, I also asked a question about it on Stack Overflow).

If the file is not HUGE (hundreds of megabytes) and you actually need to read a lot of rows, then probably just

tags = " ".join(x.split("\t")[3]
                for x in open("Train.csv").readlines()[from_row:to_row+1])

is going to be the fastest way.

If the file is instead very big, the only thing you can do is iterate over all the lines, because CSV unfortunately uses (in general) variable-sized records.

If by chance the specific CSV uses a fixed-size record format (not uncommon for large files) then directly seeking into the file may be an option.

If the file uses variable-sized records and the search must be done several times with different ranges, then creating a simple external index just once (e.g. line -> file offset for all line numbers that are a multiple of 1000) can be a good idea.
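
A minimal sketch of that external-index idea, assuming one record per line (the names build_index and read_range are illustrative, not an existing API):

def build_index(path, step=1000):
    """Map every step-th line number to its byte offset, in a single pass."""
    index = {0: 0}
    with open(path, "rb") as f:
        line_no = 0
        while f.readline():
            line_no += 1
            if line_no % step == 0:
                index[line_no] = f.tell()
    return index

def read_range(path, index, from_row, to_row, step=1000):
    """Yield lines from_row..to_row (0-based), seeking to the nearest indexed offset."""
    base = (from_row // step) * step
    with open(path, "rb") as f:
        f.seek(index[base])
        for line_no in range(base, to_row + 1):
            line = f.readline()
            if not line:
                break            # reached end of file
            if line_no >= from_row:
                yield line.decode("utf8")

Building the index costs one full scan, but every later range query then reads at most step - 1 unwanted lines before the requested range.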

4 Comments

The file is actually huge: 7 GB, and I have only 4 GB of RAM, on Windows XP; but I don't need to read all columns (only 1% of the 7 GB would be good, I think).
@tucson Facebook/Stackoverflow competition training data :P? It must be.
@tucson My answer describes how I dealt with the data.
@6502 I am getting a memory error when working with a 15MB csv file (" for x in open("Train.csv").readlines()[from_row:to_row+1]) MemoryError". Is the file too big for this command?

Your question does not contain enough information, probably because you don't see some of the existing complexity: most CSV files contain one record per line. In that case it is simple to skip the rows you're not interested in. But in CSV, records can span lines, so a general solution (like the CSV reader from the standard library) has to parse the records in order to skip lines. It's up to you to decide which optimization is acceptable in your use case.

The next problem is that you don't know which part of the code you posted is too slow. Measure it. Your code will never run faster than the time needed to read the file from disk. Have you checked that? Or have you just guessed which part is too slow?

If you want to do fast transformations of CSV data that fits into memory, I would propose using/learning Pandas. So it would probably be a good idea to split your code into two steps (a rough sketch follows the list):

  1. Reduce file to the required data.
  2. Transform the remaining data.
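
A rough illustration of the two steps (Train.csv, column 3 and the row bounds come from the question; tags_only.csv is a hypothetical name for the reduced file):

import csv
import itertools
import pandas

i, j = 1000, 2000    # row bounds from the question (0-based)

# Step 1: reduce -- stream the huge file once, keeping only column 3 of rows i..j
with open('Train.csv', 'rt') as src, open('tags_only.csv', 'wt', newline='') as dst:
    writer = csv.writer(dst)
    for row in itertools.islice(csv.reader(src), i, j + 1):
        writer.writerow([row[3]])

# Step 2: transform -- the reduced file easily fits into memory
tags_frame = pandas.read_csv('tags_only.csv', header=None)
tags = " ".join(tags_frame[0].astype(str))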

1 Comment

Thank you for this generic fundamental information. Very useful.

sed is designed for the task 'read rows i to j of a csv file'.

If the solution does not have to be pure Python, I think preprocessing the csv file with sed (sed -n 'i,jp'), then parsing the output with Python, would be simple and quick.
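
For example, a sketch that shells out to sed from Python (assuming sed is on the PATH, and that i and j are 1-based line numbers the way sed counts them):

import csv
import io
import subprocess

i, j = 1000, 2000   # 1-based line numbers, as sed counts them
out = subprocess.run(
    ["sed", "-n", f"{i},{j}p", "Train.csv"],
    capture_output=True, text=True, check=True,
)
tags = " ".join(row[3] for row in csv.reader(io.StringIO(out.stdout)))

Note that, as other answers point out, a pure line range can cut a record in half if any field contains embedded newlines.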

