
I've got a few thousand huge CSV files (some running into GBs, others into MBs). However, my interest is only in the last n rows (say 50 records) of each of these files. My question is a general one about speed and efficiency: would it be faster to read_csv all files using skiprows, or slower, or would it make no difference? Thanks.

  • Why ask such a question when you can time it yourself and see how it works? Commented Jul 19, 2024 at 2:43

1 Answer


You can use the timeit module to measure how long your code takes to run. It looks like read_csv() is slightly faster if you use skiprows.

import timeit
import pandas as pd

def test():
    # Baseline: parse the entire file.
    df = pd.read_csv('large.csv')

def test2():
    # Same file, but skip the first 10,000 rows while parsing.
    df = pd.read_csv('large.csv', skiprows=range(0, 10000))

if __name__ == "__main__":
    print(timeit.timeit("test()", globals=globals(), number=500))
    print(timeit.timeit("test2()", globals=globals(), number=500))
Results (seconds):

iterations    without skiprows      with skiprows
100           4.880708541997592     4.318660000004456
500           23.931738541999948    21.48539920800249
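Since only the last 50 rows are wanted, another option (which still parses every line, as the comments below point out, but keeps memory flat) is to stream the file with the csv module and hold only a bounded buffer. A minimal sketch, reusing the 'large.csv' name from the benchmark above; tail_rows is an illustrative helper, and every column comes back as a string, so dtypes may need converting afterwards:

import csv
from collections import deque

import pandas as pd

def tail_rows(path, n=50):
    # Stream the file once, keeping only the last n parsed rows in memory;
    # no full DataFrame is ever built.
    with open(path, newline='') as f:
        reader = csv.reader(f)
        header = next(reader)           # assumes the first line is a header
        last = deque(reader, maxlen=n)  # discards everything but the last n rows
    return pd.DataFrame(last, columns=header)

df = tail_rows('large.csv', n=50)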

2 Comments

Marginally faster. Thanks, Gates. Just curious: would you also know how skiprows actually works under the hood? That is, is the entire file read and the unnecessary data then dropped, or does the read pick up the relevant data straight off?
Since the CSV has to be parsed, you can't magically jump to the end of the file; skiprows still reads and parses the whole file.
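If the files have no embedded newlines inside quoted fields, one way to avoid parsing the whole file is to seek near the end and parse only the final chunk of bytes. A rough sketch under that assumption; tail_csv and the 1 MiB chunk size are illustrative choices, not a pandas feature, and the chunk must be large enough to hold n complete rows:

import io
import os

import pandas as pd

def tail_csv(path, n=50, chunk=1 << 20):
    # Read only the last `chunk` bytes, then parse the final n complete
    # lines. Assumes no newlines inside quoted fields.
    with open(path, 'rb') as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - chunk))
        data = f.read()
    lines = data.splitlines()
    # Drop the first line: either a partial row cut mid-chunk, or the
    # header when the whole file fit in one chunk.
    lines = lines[1:]
    tail = b'\n'.join(lines[-n:])
    # Column names come from a header-only read of the original file.
    header = pd.read_csv(path, nrows=0).columns
    return pd.read_csv(io.BytesIO(tail), names=list(header))

df = tail_csv('large.csv', n=50)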
