
I have a CSV file with many millions of rows. I want to start iterating from the 10,000,000th row. At the moment I have the code:

    import csv

    with open(csv_file, encoding='UTF-8') as f:
        r = csv.reader(f)
        for row_number, row in enumerate(r):
            if row_number < 10000000:
                continue
            else:
                process_row(row)

This works, however it takes several seconds to run before the rows of interest appear. Presumably all the unrequired rows are loaded into Python unnecessarily, slowing it down. Is there a way of starting the iteration process on a certain row, i.e. without reading in the start of the data?

  • Any reason you can't use tail to skip the first N lines and pipe that to your python script? (A sketch of this follows the comments.) Commented Jun 27, 2016 at 22:47
  • Side-note: You want to pass newline='' to the open call; the csv module expects you to leave newline handling to it, you don't want open performing line-ending conversions. Commented Jun 27, 2016 at 22:57
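
A minimal sketch of the tail approach from the first comment; the script name is hypothetical, process_row is only a placeholder, and tail counts physical lines, so this assumes no quoted field contains an embedded newline:

    # process_from_stdin.py -- hypothetical helper script, invoked e.g. as:
    #   tail -n +10000001 big.csv | python process_from_stdin.py
    # tail -n +N starts output at line N, so +10000001 skips the first
    # 10,000,000 physical lines before Python ever sees them.
    import csv
    import sys

    def process_row(row):
        ...  # placeholder for the processing in the question

    for row in csv.reader(sys.stdin):
        process_row(row)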

1 Answer


You could use islice:

    import csv
    from itertools import islice

    with open(csv_file, encoding='UTF-8') as f:
        r = csv.reader(f)
        for row in islice(r, 10000000, None):
            process_row(row)

It still iterates over all the rows but does it a lot more efficiently.
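
If you want to check the difference on your own data, a rough timing sketch along these lines should do; 'big.csv' and the offset of 1,000,000 rows are placeholders for your own file and starting row:

    # Compare skipping with enumerate against skipping with islice.
    import csv
    import time
    from itertools import islice

    def skip_with_enumerate(path, n):
        with open(path, encoding='UTF-8', newline='') as f:
            r = csv.reader(f)
            for row_number, row in enumerate(r):
                if row_number >= n:
                    return row   # first row of interest

    def skip_with_islice(path, n):
        with open(path, encoding='UTF-8', newline='') as f:
            r = csv.reader(f)
            return next(islice(r, n, None), None)

    for fn in (skip_with_enumerate, skip_with_islice):
        start = time.perf_counter()
        fn('big.csv', 1000000)
        print(fn.__name__, time.perf_counter() - start)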

You could also use the consume recipe from the itertools documentation, which advances an iterator using functions that run at C speed. Calling it on the file object before you pass it to csv.reader means you also avoid needlessly parsing the skipped lines with the reader:

    import collections
    import csv
    from itertools import islice

    def consume(iterator, n):
        "Advance the iterator n steps ahead. If n is None, consume entirely."
        # Use functions that consume iterators at C speed.
        if n is None:
            # feed the entire iterator into a zero-length deque
            collections.deque(iterator, maxlen=0)
        else:
            # advance to the empty slice starting at position n
            next(islice(iterator, n, n), None)


    with open(csv_file, encoding='UTF-8') as f:
        # skip the first 10,000,000 physical lines before wrapping in a reader
        consume(f, 10000000)
        r = csv.reader(f)
        for row in r:
            process_row(row)

As ShadowRanger commented, if the file could contain embedded newlines then you would have to consume the reader (and pass newline='' to open); if that is not the case, do consume the file object, as the performance difference will be considerable, especially if you have a lot of columns.
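
For completeness, a minimal sketch of that safer variant, reusing the consume() function defined above (csv_file and process_row are taken from the question): the reader itself is consumed, so a quoted field containing a newline still counts as part of a single row:

    import csv

    with open(csv_file, encoding='UTF-8', newline='') as f:
        r = csv.reader(f)
        consume(r, 10000000)   # skips 10,000,000 CSV records, not physical lines
        for row in r:
            process_row(row)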


2 Comments

You shouldn't run the consume on the raw file handle if there is a chance that a field could contain embedded newlines (legal in most if not all CSV dialects). Skipping before the csv.reader wrapping means you'll incorrectly interpret field-embedded newlines as record separators. (A small demonstration follows the comments.)
@ShadowRanger, true, I added a note.
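
A small, self-contained illustration of the failure mode described above (the data here is made up): the quoted field spans two physical lines, so skipping raw lines would split one record, while csv.reader sees it as a single row:

    import csv
    import io

    data = 'id,note\n1,"first line\nsecond line"\n2,plain\n'
    rows = list(csv.reader(io.StringIO(data)))
    print(len(rows))           # 3 CSV records (header plus two data rows)
    print(data.count('\n'))    # 4 physical lines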
