
I have an input file like this:

a
1,100
2,200
3,300
b
1,100,200
2,200,300
3,300,400
c
...

I want to read the file into multiple data frames, with code like this (to simplify the problem, we assume the number of rows for each table is fixed):

import pandas as pd

with open("file.csv", "r") as f:
    while True:
        table_name = f.readline()
        if table_name:
            table_df = pd.read_csv(f, nrows=3)
            # Do other stuff
        else:
            break

My initial expectation was that pd.read_csv(f, nrows=3) would consume only a limited number of rows from the input stream, and that the next f.readline() call would continue from there. However, it turns out that after the first read_csv call, the stream position of f is set to the end of the file and I can no longer read from the same stream f. My pandas version is 0.25.0. Is this a bug or expected behaviour? Is there any way to reuse the same input stream to read multiple data frames?

  • Have you tried this answer: stackoverflow.com/questions/23853553/… Commented Aug 13, 2019 at 14:02
  • Why would you want to reuse the same input stream? Is your file huge? If not, you could do lines = f.readlines() and simply play with slicing and range(start, [stop, [step]]) Commented Aug 13, 2019 at 14:04
  • @RootTwo Thanks. IIUC, each read_csv call in that solution creates a new stream and reader, which leads to two problems: (1) it would not work if the stream can only be consumed once (e.g. sys.stdin); (2) even with skiprows, every time the reader needs to consume the stream from the beginning to count the number of rows to skip, which would lead to quite some overhead. Commented Aug 13, 2019 at 14:25
  • @IMCoins The input is not very small but can still fit into the memory for now. It is just that some extra manipulation is needed (as shown in an answer below) to create wrapped streams to be fed into the pd.read_csv method. I was trying to see if there is a better way to do that. Commented Aug 13, 2019 at 14:55
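For reference, the skiprows pattern from the linked answer looks roughly like this (a sketch against the sample file above, not from the thread; each call reopens the file and rescans it from the top, which is exactly the overhead described in the comment):

import pandas as pd

# Each read_csv call starts a fresh reader over the whole file and
# skips rows from the beginning, so the scanning cost grows per table.
df_a = pd.read_csv("file.csv", skiprows=1, nrows=3, header=None)  # table "a"
df_b = pd.read_csv("file.csv", skiprows=5, nrows=3, header=None)  # table "b"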

3 Answers


pandas.read_csv creates a file reader object from the filepath_or_buffer argument once, up front; the nrows= parameter only takes a slice from that reader and does not limit how much of the underlying stream is consumed (pandas cannot re-instantiate a new reader from the same file object). As the pandas docs describe the argument:

By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via builtin open function) or StringIO.
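A quick way to see this behaviour (a minimal sketch against the sample file from the question, matching what the question reports for pandas 0.25, where the C engine buffers the whole stream):

import pandas as pd

with open("file.csv", "r") as f:
    f.readline()                          # skip the table name "a"
    df = pd.read_csv(f, nrows=3, header=None)
    print(df.shape)                       # (3, 2) -- only 3 rows were parsed...
    print(repr(f.readline()))             # '' -- ...yet the stream is already at EOF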

Based on your input file format, let's assume a table_name is a row containing a single string without the , separator (i.e. a, b). You can achieve the needed result by manually passing a slice of rows to read_csv:

import pandas as pd
import io
from itertools import islice

with open("file.csv", "r") as f:
    dfs = []
    while True:
        table_name = f.readline().strip()
        # a table name is a single token without the ',' separator;
        # an empty read (EOF) ends the loop
        if table_name and ',' not in table_name:
            # take exactly the next 3 lines off the stream, leaving f
            # positioned at the following table name
            data = ''.join(islice(f, 3)).strip()
            table_df = pd.read_csv(io.StringIO(data), sep=',', header=None)
            dfs.append([table_name, table_df])
        else:
            break

# check results
for t_name, df in dfs:
    print('---', t_name)
    print(df)

Sample output:

--- a
   0    1
0  1  100
1  2  200
2  3  300
--- b
   0    1    2
0  1  100  200
1  2  200  300
2  3  300  400

3 Comments

pandas.read_csv will consume all the content from filepath_or_buffer argument and nrows= param just gives an ability to get a slice from entirely read data. Is this documented? I can't seem to find it in pandas documentation.
@GZ0 don't worry, I've updated the explanation; but the issue with the initial approach is still that read_csv cannot re-instantiate a new reader from the same file object
Is there any other convenient way to feed the islice iterable to read_csv without creating a full concatenation -- which doubles the amount of memory needed for loading a table?
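One possible answer to that last comment (a hypothetical sketch, not from the original thread): wrap the slice in a minimal object exposing read(), which read_csv accepts as a file-like buffer, so lines are handed to the parser one at a time instead of being joined into one big string first. This relies on the parser tolerating short reads and treating an empty string as EOF, which the C engine does in the versions I'm aware of:

import pandas as pd
from itertools import islice

class LineSlice:
    """Hypothetical helper: exposes the next n lines of f via read(),
    so read_csv can consume them without a fully joined copy."""
    def __init__(self, f, n):
        self._it = islice(f, n)

    def read(self, size=-1):
        # Serve one buffered line per call; an empty string signals EOF.
        return next(self._it, '')

with open("file.csv", "r") as f:
    f.readline()  # skip the table name "a"
    df = pd.read_csv(LineSlice(f, 3), sep=',', header=None)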

Use the csv module from the Python standard library and call send() on a generator function to indicate how many rows you want:

import csv
import pandas as pd

def csvreader(filename):
    with open(filename) as csvfile:
        reader = csv.DictReader(csvfile)

        # prime the generator; each send(count) yields the next count rows
        count = yield

        while True:
            rows = []
            # enumerate from 1 so exactly `count` rows are collected
            for n, row in enumerate(reader, 1):
                rows.append(row)
                if n == count:
                    break

            count = yield pd.DataFrame(rows)

testfile.csv:

i, j, k
1, 2, 4
2, 4, 8
3, 6, 12
4, 8, 16
. . . 

Set up the generator:

x = csvreader("testfile.csv")
next(x)  # prime the generator

Request next 2 rows:

x.send(2)
               #returned DataFrame
                    i   j   k
               0    1   2   4
               1    2   4   8

Request next 3 rows:

x.send(3)
               #returned DataFrame

                    i   j   k
               0    3   6   12
               1    4   8   16
               2    5   10  20

Note the index starts over each time. This could be fixed by specifying a column as the index (add a running counter to each row if needed):

count = yield pd.DataFrame(rows).set_index(<some column name>)

1 Comment

Thanks a lot for the reply. But parsing the input using the csv module directly would lose the rich preprocessing functionalities provided by pd.read_csv. I prefer a solution that can reuse pd.read_csv or similar pandas methods.

Don't know why I didn't think of this earlier. Passing iterator=True makes read_csv return a TextFileReader over the csv file; then use get_chunk(n) to read the next n rows:

reader = pd.read_csv(f, iterator=True)

reader.get_chunk(2)
#    i   j   k
# 0  1   2   4
# 1  2   4   8

reader.get_chunk(3)
#    i   j   k
# 2  3   6   12
# 3  4   8   16
# 4  5   10  20

1 Comment

IIUC, this seems to work only if the input can be seen as a single homogeneous table. The input in my main post was heterogeneous and there are also other information (e.g. table names) between the tables.
