
I have a CSV file with many millions of rows. I want to start iterating from the 10,000,000th row. At the moment I have the code:

    import csv

    with open(csv_file, encoding='UTF-8') as f:
        r = csv.reader(f)
        for row_number, row in enumerate(r):
            if row_number < 10000000:
                continue
            else:
                process_row(row)

This works, however it takes several seconds to run before the rows of interest appear. Presumably all the unrequired rows are loaded into Python unnecessarily, slowing it down. Is there a way of starting the iteration process on a certain row, i.e. without reading in the start of the data?

  • Any reason you can't use tail to skip the first N lines and pipe that to your python script? (A sketch of this follows the comments.) Commented Jun 27, 2016 at 22:47
  • Side-note: You want to pass newline='' to the open call; the csv module expects you to leave newline handling to it, you don't want open performing line-ending conversions. Commented Jun 27, 2016 at 22:57
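
A minimal sketch of the tail approach from the first comment; the script name is hypothetical, process_row is only a placeholder, and tail counts physical lines, so this assumes no quoted field contains an embedded newline:

    # process_from_stdin.py -- hypothetical helper script, invoked e.g. as:
    #   tail -n +10000001 big.csv | python process_from_stdin.py
    # tail -n +N starts output at line N, so +10000001 skips the first
    # 10,000,000 physical lines before Python ever sees them.
    import csv
    import sys

    def process_row(row):
        ...  # placeholder for the processing in the question

    for row in csv.reader(sys.stdin):
        process_row(row)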

1 Answer


You could use islice:

    import csv
    from itertools import islice

    with open(csv_file, encoding='UTF-8') as f:
        r = csv.reader(f)
        for row in islice(r, 10000000, None):
            process_row(row)

It still iterates over all the rows but does it a lot more efficiently.
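
If you want to check the difference on your own data, a rough timing sketch along these lines should do; 'big.csv' and the offset of 1,000,000 rows are placeholders for your own file and starting row:

    # Compare skipping with enumerate against skipping with islice.
    import csv
    import time
    from itertools import islice

    def skip_with_enumerate(path, n):
        with open(path, encoding='UTF-8', newline='') as f:
            r = csv.reader(f)
            for row_number, row in enumerate(r):
                if row_number >= n:
                    return row   # first row of interest

    def skip_with_islice(path, n):
        with open(path, encoding='UTF-8', newline='') as f:
            r = csv.reader(f)
            return next(islice(r, n, None), None)

    for fn in (skip_with_enumerate, skip_with_islice):
        start = time.perf_counter()
        fn('big.csv', 1000000)
        print(fn.__name__, time.perf_counter() - start)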

You could also use the consume recipe from the itertools documentation, which advances an iterator using functions that run at C speed. Calling it on the file object before you pass it to csv.reader means you also avoid needlessly parsing the skipped lines with the reader:

    import collections
    import csv
    from itertools import islice

    def consume(iterator, n):
        "Advance the iterator n steps ahead. If n is None, consume entirely."
        # Use functions that consume iterators at C speed.
        if n is None:
            # feed the entire iterator into a zero-length deque
            collections.deque(iterator, maxlen=0)
        else:
            # advance to the empty slice starting at position n
            next(islice(iterator, n, n), None)


    with open(csv_file, encoding='UTF-8') as f:
        # skip the first 10,000,000 physical lines before wrapping in a reader
        consume(f, 10000000)
        r = csv.reader(f)
        for row in r:
            process_row(row)

As ShadowRanger commented, if the file could contain embedded newlines then you would have to consume the reader (and pass newline='' to open); if that is not the case, do consume the file object, as the performance difference will be considerable, especially if you have a lot of columns.
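
For completeness, a minimal sketch of that safer variant, reusing the consume() function defined above (csv_file and process_row are taken from the question): the reader itself is consumed, so a quoted field containing a newline still counts as part of a single row:

    import csv

    with open(csv_file, encoding='UTF-8', newline='') as f:
        r = csv.reader(f)
        consume(r, 10000000)   # skips 10,000,000 CSV records, not physical lines
        for row in r:
            process_row(row)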


2 Comments

You shouldn't run the consume on the raw file handle if there is a chance that a field could contain embedded newlines (legal in most if not all CSV dialects). Skipping before the csv.reader wrapping means you'll incorrectly interpret field-embedded newlines as record separators. (A small demonstration follows the comments.)
@ShadowRanger, true, I added a note.
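
A small, self-contained illustration of the failure mode described above (the data here is made up): the quoted field spans two physical lines, so skipping raw lines would split one record, while csv.reader sees it as a single row:

    import csv
    import io

    data = 'id,note\n1,"first line\nsecond line"\n2,plain\n'
    rows = list(csv.reader(io.StringIO(data)))
    print(len(rows))           # 3 CSV records (header plus two data rows)
    print(data.count('\n'))    # 4 physical lines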
