
I have an input file like this:

a
1,100
2,200
3,300
b
1,100,200
2,200,300
3,300,400
c
...

I want to read the file into multiple data frames, with code like this (to simplify the problem, we assume the number of rows for each table is fixed):

import pandas as pd

with open("file.csv", "r") as f:
    while True:
        table_name = f.readline()
        if table_name:
            table_df = pd.read_csv(f, nrows=3)
            # Do other stuff
        else:
            break

My initial expectation was that pd.read_csv(f, nrows=3) would consume only a limited number of rows from the input stream, and that the next f.readline() call would continue from there. However, it turns out that after the first read_csv call, the stream position of f is set to the end of the file and I can no longer read from the same stream f. My pandas version is 0.25.0. Is this a bug or expected behaviour? Is there any way to reuse the same input stream to read multiple data frames?

  • Have you tried this answer: stackoverflow.com/questions/23853553/… Commented Aug 13, 2019 at 14:02
  • Why would you want to reuse the same input stream? Is your file huge? If not, you could do lines = f.readlines() and simply play with slicing and range(start, [stop, [step]]) Commented Aug 13, 2019 at 14:04
  • @RootTwo Thanks. IIUC, each read_csv call in that solution creates a new stream and reader, which leads to two problems: (1) it would not work if the stream can only be consumed once (e.g. sys.stdin); (2) even with skiprows, every time the reader needs to consume the stream from the beginning to count the number of rows to skip, which would lead to quite some overhead. Commented Aug 13, 2019 at 14:25
  • @IMCoins The input is not very small but can still fit into the memory for now. It is just that some extra manipulation is needed (as shown in an answer below) to create wrapped streams to be fed into the pd.read_csv method. I was trying to see if there is a better way to do that. Commented Aug 13, 2019 at 14:55
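For reference, the skiprows pattern from the linked answer looks roughly like this (a sketch against the sample file above, not from the thread; each call reopens the file and rescans it from the top, which is exactly the overhead described in the comment):

import pandas as pd

# Each read_csv call starts a fresh reader over the whole file and
# skips rows from the beginning, so the scanning cost grows per table.
df_a = pd.read_csv("file.csv", skiprows=1, nrows=3, header=None)  # table "a"
df_b = pd.read_csv("file.csv", skiprows=5, nrows=3, header=None)  # table "b"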

3 Answers


pandas.read_csv creates a file reader object from the filepath_or_buffer argument once, up front; the nrows= parameter only takes a slice from that reader and does not limit how much of the underlying stream is consumed (pandas cannot re-instantiate a new reader from the same file object). As the pandas docs describe the argument:

By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via builtin open function) or StringIO.
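A quick way to see this behaviour (a minimal sketch against the sample file from the question, matching what the question reports for pandas 0.25, where the C engine buffers the whole stream):

import pandas as pd

with open("file.csv", "r") as f:
    f.readline()                          # skip the table name "a"
    df = pd.read_csv(f, nrows=3, header=None)
    print(df.shape)                       # (3, 2) -- only 3 rows were parsed...
    print(repr(f.readline()))             # '' -- ...yet the stream is already at EOF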

Based on your input file format, let's assume a table_name is a row containing a single string without the , separator (i.e. a, b). You can achieve the needed result by manually passing a slice of rows to read_csv:

import pandas as pd
import io
from itertools import islice

with open("file.csv", "r") as f:
    dfs = []
    while True:
        table_name = f.readline().strip()
        # a table name is a single token without the ',' separator;
        # an empty read (EOF) ends the loop
        if table_name and ',' not in table_name:
            # take exactly the next 3 lines off the stream, leaving f
            # positioned at the following table name
            data = ''.join(islice(f, 3)).strip()
            table_df = pd.read_csv(io.StringIO(data), sep=',', header=None)
            dfs.append([table_name, table_df])
        else:
            break

# check results
for t_name, df in dfs:
    print('---', t_name)
    print(df)

Sample output:

--- a
   0    1
0  1  100
1  2  200
2  3  300
--- b
   0    1    2
0  1  100  200
1  2  200  300
2  3  300  400

3 Comments

pandas.read_csv will consume all the content from filepath_or_buffer argument and nrows= param just gives an ability to get a slice from entirely read data. Is this documented? I can't seem to find it in pandas documentation.
@GZ0 don't worry, I've updated the explanation; but the issue with the initial approach is still that read_csv cannot re-instantiate a new reader from the same file object
Is there any other convenient way to feed the islice iterable to read_csv without creating a full concatenation -- which doubles the amount of memory needed for loading a table?
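One possible answer to that last comment (a hypothetical sketch, not from the original thread): wrap the slice in a minimal object exposing read(), which read_csv accepts as a file-like buffer, so lines are handed to the parser one at a time instead of being joined into one big string first. This relies on the parser tolerating short reads and treating an empty string as EOF, which the C engine does in the versions I'm aware of:

import pandas as pd
from itertools import islice

class LineSlice:
    """Hypothetical helper: exposes the next n lines of f via read(),
    so read_csv can consume them without a fully joined copy."""
    def __init__(self, f, n):
        self._it = islice(f, n)

    def read(self, size=-1):
        # Serve one buffered line per call; an empty string signals EOF.
        return next(self._it, '')

with open("file.csv", "r") as f:
    f.readline()  # skip the table name "a"
    df = pd.read_csv(LineSlice(f, 3), sep=',', header=None)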

Use the csv module from the Python standard library and call send() on a generator function to indicate how many rows you want:

import csv
import pandas as pd

def csvreader(filename):
    with open(filename) as csvfile:
        reader = csv.DictReader(csvfile)

        # prime the generator; each send(count) yields the next count rows
        count = yield

        while True:
            rows = []
            # enumerate from 1 so exactly `count` rows are collected
            for n, row in enumerate(reader, 1):
                rows.append(row)
                if n == count:
                    break

            count = yield pd.DataFrame(rows)

testfile.csv:

i, j, k
1, 2, 4
2, 4, 8
3, 6, 12
4, 8, 16
. . . 

Set up the generator:

x = csvreader("testfile.csv")
next(x)  # prime the generator

Request next 2 rows:

x.send(2)
               #returned DataFrame
                    i   j   k
               0    1   2   4
               1    2   4   8

Request next 3 rows:

x.send(3)
               #returned DataFrame

                    i   j   k
               0    3   6   12
               1    4   8   16
               2    5   10  20

Note the index starts over each time. This could be fixed by specifying a column as the index (add a running counter to each row if needed):

count = yield pd.DataFrame(rows).set_index(<some column name>)

1 Comment

Thanks a lot for the reply. But parsing the input using the csv module directly would lose the rich preprocessing functionalities provided by pd.read_csv. I prefer a solution that can reuse pd.read_csv or similar pandas methods.

Don't know why I didn't think of this earlier. Passing iterator=True makes read_csv return a TextFileReader over the csv file; then use get_chunk(n) to read the next n rows:

reader = pd.read_csv(f, iterator=True)

reader.get_chunk(2)
#    i   j   k
# 0  1   2   4
# 1  2   4   8

reader.get_chunk(3)
#    i   j   k
# 2  3   6   12
# 3  4   8   16
# 4  5   10  20

1 Comment

IIUC, this seems to work only if the input can be seen as a single homogeneous table. The input in my main post was heterogeneous and there are also other information (e.g. table names) between the tables.
