I have a few thousand huge CSV files (some running into GBs, others into MBs), but I'm only interested in the last n rows (say, 50 records) of each file. My question is a general one about speed and efficiency: would reading all the files with read_csv be faster if I use skiprows, slower, or would it make no difference?
1 Answer
You can use the timeit module to measure how long your code takes to run. It looks like read_csv() is slightly faster if you use skiprows.
```python
import timeit

import pandas as pd

def test():
    # Baseline: read the whole file.
    df = pd.read_csv('large.csv')

def test2():
    # Skip the first 10,000 rows; note that range(0, 10000)
    # also skips row 0, i.e. the header row.
    df = pd.read_csv('large.csv', skiprows=range(0, 10000))

if __name__ == "__main__":
    print(timeit.timeit("test()", globals=globals(), number=500))
    print(timeit.timeit("test2()", globals=globals(), number=500))
```
| # iterations | without skiprows (s) | with skiprows (s) |
|---|---|---|
| 100 | 4.880708541997592 | 4.318660000004456 |
| 500 | 23.931738541999948 | 21.48539920800249 |
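If the goal is specifically the last 50 rows, note that skiprows counts positions from the start of the file, so you first need the total row count. Here is a minimal sketch, assuming the file has a header on row 0 (the file name is a placeholder; skiprows also accepts a callable, which avoids materializing a huge range):

```python
import pandas as pd

n = 50
path = 'large.csv'

# One cheap pass to count data rows (excluding the header).
with open(path) as f:
    total = sum(1 for _ in f) - 1

# Keep row 0 (the header) and only the last n data rows;
# the callable returns True for every row index to skip.
df = pd.read_csv(path, skiprows=lambda i: 0 < i <= total - n)
```

The counting pass still reads the whole file once, but counting lines is much cheaper than parsing them as CSV.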
2 Comments
JubG
Marginally faster. Thanks, Gates. Just curious - would you also know how skiprows actually works under the hood? i.e., does it read the entire file and then discard the unnecessary data, or does the read pick up only the relevant data straight off?
mozway
Since the CSV has to be parsed, you can't magically jump to the end of the file: skiprows still reads and parses the whole file.
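That said, if you can assume that no field contains an embedded (quoted) newline, you don't have to hand the whole file to the parser: you can seek near the end yourself, keep only the last n lines, and parse just those plus the header. A rough sketch under that assumption (the helper tail_csv and the chunk size are illustrative, not pandas API):

```python
import io
import os

import pandas as pd

def tail_csv(path, n=50, chunk=1 << 20):
    """Parse only the last n rows, assuming no quoted newlines in fields."""
    with open(path, 'rb') as f:
        # Grab the header row for the column names.
        header = f.readline()
        # Read a chunk from the end of the file; grow it backwards
        # until it contains at least n complete lines.
        size = f.seek(0, os.SEEK_END)
        offset = max(len(header), size - chunk)
        f.seek(offset)
        data = f.read()
        while data.count(b'\n') <= n and offset > len(header):
            offset = max(len(header), offset - chunk)
            f.seek(offset)
            data = f.read()
    # The first line of the chunk may be partial; taking the
    # last n lines avoids it.
    lines = data.splitlines(keepends=True)[-n:]
    return pd.read_csv(io.BytesIO(header + b''.join(lines)))
```

This reads only a bounded amount of data from the end of each file regardless of its size, which is the behaviour the question is really after; it just isn't something skiprows can do, for the reason above.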