Handling Variable Number of Columns with Pandas - Python

Question

I have a data set that looks like this (at most 5 columns, but some rows have fewer):

1,2,3
1,2,3,4
1,2,3,4,5
1,2
1,2,3,4
....

I am trying to use pandas read_table to read this into a 5-column DataFrame, and I would like to do so without any additional preprocessing.

If I try

import pandas as pd
my_cols=['A','B','C','D','E']
my_df=pd.read_table(path,sep=',',header=None,names=my_cols)

I get an error - "column names have 5 fields, data has 3 fields".

Is there any way to make pandas fill in NaN for the missing columns while reading the data?

DSM · Accepted Answer · 2013-03-06 15:55:05Z

84

One way which seems to work (at least in 0.10.1 and 0.11.0.dev-fc8de6d):

>>> !cat ragged.csv
1,2,3
1,2,3,4
1,2,3,4,5
1,2
1,2,3,4
>>> my_cols = ["A", "B", "C", "D", "E"]
>>> pd.read_csv("ragged.csv", names=my_cols, engine='python')
   A  B   C   D   E
0  1  2   3 NaN NaN
1  1  2   3   4 NaN
2  1  2   3   4   5
3  1  2 NaN NaN NaN
4  1  2   3   4 NaN

Note that this approach requires you to supply names for all the columns you want. It is not as general as some other approaches, but it works well enough when it applies.
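
A minimal self-contained sketch of the same call, assuming a reasonably recent pandas release. The in-memory buffer named ragged is a stand-in for ragged.csv, introduced here only so the snippet runs without a file on disk:

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for ragged.csv (same rows as in the answer).
ragged = io.StringIO("1,2,3\n1,2,3,4\n1,2,3,4,5\n1,2\n1,2,3,4\n")

# Supplying all five names tells the parser how wide the frame should be.
my_cols = ["A", "B", "C", "D", "E"]

# header=None is spelled out for clarity: the data has no header row.
df = pd.read_csv(ragged, header=None, names=my_cols, engine="python")

print(df)
```

Columns that end up with missing values are stored as floats by default, since NaN is itself a float.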

5 Comments

Jackie Shephard Over a year ago

Thank you! This worked. The engine='python' argument seems to be the key: adding it makes both read_table and read_csv work.

Wes McKinney Over a year ago

This seems pretty warty to me. Adding a github issue: github.com/pydata/pandas/issues/2981

EliadL Over a year ago

What fixed it for me was names=my_cols where my_cols was at least as long as the line with the most fields. If the maximum number of fields isn't known in advance, you can determine it by reading the file beforehand:

with open('my.csv') as f:
    num_cols = max(len(line.split(',')) for line in f)
    f.seek(0)
    df = pd.read_csv(f, names=range(num_cols))

The downside is that the file is read twice.
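
A variant of this two-pass idea, sketched with the standard-library csv module doing the counting so that quoted fields containing the separator are counted as one field. The helper name read_ragged and the sample data are made up for illustration, and the buffer is assumed to be seekable:

```python
import csv
import io

import pandas as pd


def read_ragged(buf, sep=","):
    """Hypothetical helper: read a ragged CSV from a seekable text buffer."""
    # First pass: count fields per row; csv.reader respects quoting.
    num_cols = max(len(row) for row in csv.reader(buf, delimiter=sep))
    # Rewind so pandas can parse the same buffer from the start.
    buf.seek(0)
    # Second pass: integer column labels 0..num_cols-1, short rows get NaN.
    return pd.read_csv(buf, sep=sep, header=None, names=list(range(num_cols)))


# Made-up sample: the second row has a quoted field that contains a comma.
sample = io.StringIO('1,2,3\n1,"x, y",3,4\n1,2\n')
df = read_ragged(sample)
print(df)
```

The file is still read twice, so this trades speed for not having to know the width up front.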

Luca Over a year ago

With pandas 0.23.4, pd.read_csv(file, names=my_cols) works even if len(my_cols) is less than the number of fields on one or more lines. The extra fields just get discarded.

Gena Kukartsev Over a year ago

With pandas 0.25.3, it fails if some row has more fields than my_cols.

herrfz · Answer · 2013-03-06 09:58:12Z

19

I'd also be interested to know if this is possible; from the docs, it doesn't seem to be. What you could probably do is read the file line by line and concatenate each row onto a DataFrame:

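import pandas as pd

filepath = 'ragged.csv'  # placeholder: path to the input file

df = pd.DataFrame()

with open(filepath, 'r') as f:
    for line in f:
        # Turn the line into a one-row DataFrame and append it;
        # pd.concat aligns the columns and fills the gaps with NaN.
        row = pd.DataFrame([tuple(line.strip().split(','))])
        df = pd.concat([df, row], ignore_index=True)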

It works but not in the most elegant way, I guess...

1 Comment

lowzhao Over a year ago

This method may not work, because a comma may appear inside a quoted field (e.g. "hello, world").
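
One way around the quoting problem, sketched under the assumption that the file follows ordinary CSV quoting rules, is to let the standard-library csv module split the lines and then build the frame in one step instead of concatenating row by row. The sample data is made up for illustration:

```python
import csv
import io

import pandas as pd

# Made-up sample: the first row has a quoted field that contains a comma.
sample = io.StringIO('a,"hello, world",c\n1,2\n')

# csv.reader handles the quoting; each row becomes a list of strings.
rows = list(csv.reader(sample))

# Rows of unequal length are padded with missing values by the constructor.
df = pd.DataFrame(rows)

print(df)
```

Every value stays a string here, since nothing performs type inference; convert columns afterwards if numbers are needed.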

Jackie Shephard · Answer · 2013-03-06 15:40:49Z

1

OK. I'm not sure how efficient this is, but here is what I have done. I would love to hear if there is a better way to do this. Thanks!

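from pandas import DataFrame

list_of_dicts = []
labels = ['A', 'B', 'C', 'D', 'E']
with open(path) as f:  # path: the input file, as in the question
    for line in f:
        line = line.rstrip('\n')
        # zip stops at the shorter sequence, so short rows omit the trailing labels
        list_of_dicts.append(dict(zip(labels, line.split(','))))
# Labels missing from a row's dict become NaN in the resulting frame
frame = DataFrame(list_of_dicts)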
