Filter out rows from CSV before loading to pandas dataframe

Question

I have a large CSV file that I cannot load into a DataFrame using read_csv() due to memory issues. However, the first column of the CSV contains a {0,1} flag, and I only need to load the rows with a '1', which will easily be small enough to fit in a DataFrame. Is there any way to load the data with a condition, or to manipulate the CSV prior to loading it (similar to grep)?

You could easily make a new csv filtered on that column, no?
– juanpa.arrivillaga, Commented Apr 17, 2017 at 23:31
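
The pre-filtering idea from the comment above could be sketched with the standard library's csv module, which streams one row at a time and so never holds the whole file in memory. This is only an illustration: the helper name filter_csv_rows and its parameters are hypothetical, and it assumes the file has a header row and that the flag is the literal string '1' in the first column.

```python
import csv
import io


def filter_csv_rows(src, dst, column=0, keep="1"):
    """Copy the header plus every row whose `column` equals `keep` from src to dst.

    Hypothetical helper for illustration; returns the number of data rows kept.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, lineterminator="\n")
    header = next(reader, None)
    if header is None:  # empty input: nothing to copy
        return 0
    writer.writerow(header)
    kept = 0
    for row in reader:
        # Skip blank lines and rows whose flag does not match.
        if row and row[column] == keep:
            writer.writerow(row)
            kept += 1
    return kept


# Demo on in-memory text; with real files, pass handles opened with newline="".
src = io.StringIO("col1,col2\n1,a\n0,b\n1,c\n0,d\n")
dst = io.StringIO()
kept = filter_csv_rows(src, dst)
filtered_text = dst.getvalue()
```

The filtered output can then be passed to pd.read_csv as usual.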

piRSquared · Accepted Answer · 2017-04-18 14:58:00Z


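You can use pd.read_csv's comment parameter and set it to '0'. Any line that begins with the comment character is skipped entirely, so the rows flagged 0 are never parsed.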

import pandas as pd
from io import StringIO

txt = """col1,col2
1,a
0,b
1,c
0,d"""

pd.read_csv(StringIO(txt), comment='0')

   col1 col2
0     1    a
1     1    c
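
One caveat worth checking before relying on comment='0': the comment character applies anywhere in a line, not only at the start, so the rest of a line after the first '0' is dropped as well. The sketch below uses made-up data (values 10, 20, 30 in col2) to show the effect. With this input the kept rows should come back with col2 cut short at the '0', while a plain read followed by a boolean mask keeps the values intact.

```python
import pandas as pd
from io import StringIO

# Made-up data: the second column contains the character '0'.
txt = "col1,col2\n1,10\n0,20\n1,30\n"

# comment='0' skips the row starting with '0', but it also cuts the
# remaining rows at their first '0', so '10' and '30' lose a digit.
truncated = pd.read_csv(StringIO(txt), comment="0")

# Reference result: read everything, then filter with a boolean mask.
full = pd.read_csv(StringIO(txt))
expected = full[full["col1"] == 1]
```

So the comment trick seems safe only when '0' cannot appear anywhere else in the rows you want to keep.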

You can also use chunksize to turn pd.read_csv into an iterator, filter each chunk with query, and combine the results with pd.concat.
NOTE: As the OP pointed out, a chunk size of 1 isn't realistic. I used it for demonstration purposes only. Please increase it to suit your needs.

pd.concat([df.query('col1 == 1') for df in pd.read_csv(StringIO(txt), chunksize=1)])
# Equivalent to, but slower than, the boolean-mask version below; use the commented line for better performance
# pd.concat([df[df.col1 == 1] for df in pd.read_csv(StringIO(txt), chunksize=1)])

   col1 col2
0     1    a
2     1    c
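
For a real file, the chunked approach might be wrapped in a small helper like the one below. This is a sketch, not part of the original answer: the name read_csv_filtered, the keep callback, and the default chunk size of 100,000 rows are all assumptions to adjust to the data and available memory.

```python
import pandas as pd
from io import StringIO


def read_csv_filtered(source, keep, chunksize=100_000, **read_csv_kwargs):
    """Read `source` chunk by chunk, keeping rows where keep(chunk) is True.

    Hypothetical helper: `keep` takes a DataFrame chunk and returns a boolean mask.
    """
    chunks = pd.read_csv(source, chunksize=chunksize, **read_csv_kwargs)
    # Only the filtered part of each chunk is retained in memory.
    parts = [chunk[keep(chunk)] for chunk in chunks]
    # ignore_index=True renumbers the rows 0..n-1 instead of keeping file positions.
    return pd.concat(parts, ignore_index=True)


# Demo on the answer's sample data; chunksize=2 is only for illustration.
txt = "col1,col2\n1,a\n0,b\n1,c\n0,d\n"
result = read_csv_filtered(StringIO(txt), lambda df: df["col1"] == 1, chunksize=2)
```

Passing ignore_index=True is a design choice; leave it out if the original row positions (0 and 2 in the output above) are useful.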

edited Apr 18, 2017 at 14:58

answered Apr 17, 2017 at 23:33

piRSquared



3 Comments

piRSquared Over a year ago

@juanpa.arrivillaga what are you talking about, just as it was intended to be used :-)

ibav Over a year ago

I tried to generalize the problem, but in reality the filter I'm using is a string in the middle of the document. The comment parameter answered my question exactly, but it is limited in that it only works on a single character at the beginning of the line. The concat solution works perfectly, although I increased the chunksize, since 1 was too slow. I also needed to add low_memory=False to get around some data type issues: mtms = pd.concat([df.query('Pool=="FX"') for df in pd.read_csv(mtms, chunksize=1000, low_memory=False)])

piRSquared Over a year ago

@ibav Yes! Please increase the chunk size. 1 was for demonstration purposes.
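
On the data type issues ibav mentions: pandas' mixed-type warning suggests either setting low_memory=False or specifying dtype on import. A minimal sketch of the dtype route follows, assuming the filter column is called Pool as in the comment. The Value column and the sample rows are made up for illustration.

```python
import pandas as pd
from io import StringIO

# Made-up sample; only the column name 'Pool' comes from the comment above.
txt = "Pool,Value\nFX,1\nIR,2\nFX,3\n"

# Declaring the dtype up front means pandas does not have to infer it per chunk.
chunks = pd.read_csv(StringIO(txt), chunksize=2, dtype={"Pool": str})
mtms_df = pd.concat([df[df["Pool"] == "FX"] for df in chunks], ignore_index=True)
```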
