84

I have a data set that looks like this (at most 5 columns - but can be less)

1,2,3
1,2,3,4
1,2,3,4,5
1,2
1,2,3,4
....

I am trying to use pandas read_table to read this into a 5 column data frame. I would like to read this in without additional massaging.

If I try

import pandas as pd
my_cols=['A','B','C','D','E']
my_df=pd.read_table(path,sep=',',header=None,names=my_cols)

I get an error - "column names have 5 fields, data has 3 fields".

Is there any way to make pandas fill in NaN for the missing columns while reading the data?

3 Answers 3

84

One way which seems to work (at least in 0.10.1 and 0.11.0.dev-fc8de6d):

>>> !cat ragged.csv
1,2,3
1,2,3,4
1,2,3,4,5
1,2
1,2,3,4
>>> my_cols = ["A", "B", "C", "D", "E"]
>>> pd.read_csv("ragged.csv", names=my_cols, engine='python')
   A  B   C   D   E
0  1  2   3 NaN NaN
1  1  2   3   4 NaN
2  1  2   3   4   5
3  1  2 NaN NaN NaN
4  1  2   3   4 NaN

Note that this approach requires that you give names to the columns you want, though. Not as general as some other ways, but works well enough when it applies.

Sign up to request clarification or add additional context in comments.

5 Comments

Thank You! This worked - the engine='python' seems to be key. Adding this attribute makes both read_table and read_csv work.
This seems pretty warty to me. Adding a github issue: github.com/pydata/pandas/issues/2981
What fixed it for me was names=my_cols where my_cols was at least as long as the line with the most fields. If the max number of fields isn't known in advance, you can dynamically extract by reading the file beforehand via with open('my.csv') as f: num_cols = max(len(line.split(',')) for line in f); f.seek(0); df = pd.read_csv(f, names=range(num_cols)) but the down side is that the file is read twice.
with Pandas version 0.23.4 pd.read_csv(file, names=my_cols) works even if len(my_cols) is less than the number of fields on one or more lines. The extra fields just get discarded.
with pandas 0.25.3 it fails if some column is longer than my_cols
19

I'd also be interested to know if this is possible, from the doc it doesn't seem to be the case. What you could probably do is read the file line by line, and concatenate each reading to a DataFrame:

import pandas as pd

df = pd.DataFrame()

with open(filepath, 'r') as f:
    for line in f:
        df = pd.concat( [df, pd.DataFrame([tuple(line.strip().split(','))])], ignore_index=True )

It works but not in the most elegant way, I guess...

1 Comment

This method may not work because comma maybe escaped by quotes(ie. "hello, world")
1

Ok. Not sure how efficient this is - but here is what I have done. Would love to hear if there is a better way to do this. Thanks !

from pandas import DataFrame

list_of_dicts=[]
labels=['A','B','C','D','E']
for line in file:
    line=line.rstrip('\n')
    list_of_dicts.append(dict(zip(labels,line.split(','))))
frame=DataFrame(list_of_dicts)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.