Add more columns to a Pandas Dataframe

Question

I'm using the pandas function pd.read_csv to import a txt file delimited by |. The header row has 419 fields, so pandas expects 419 columns. Some of the rows have more than 419 fields, though.

How can I make the DataFrame grow to accommodate extra columns as needed, or just add X extra columns up front to allow for wider rows further down?

Example:

How would I account for "F"?

A B C D E
A B C D E
A B C D E F

This is the error I'm receiving. I'm using Python 3 in a Jupyter notebook.

ParserError: Error tokenizing data. C error: Expected 419 fields in line 7945, saw 424

This is the code I'm trying to use

data = pd.read_csv('filepath.txt', sep="|",skip_blank_lines=True, encoding = 'latin-1', header= None)

Why does the input file have unpredictable numbers of items in the rows?
– lxop, Commented Nov 27, 2018 at 16:05
There are extra delimiters in some rows, which shift the remaining fields to the right. The widest row has 480 columns. Part of the reason I'd like to import it as a DataFrame is to look for patterns, rather than importing the rows and fixing them individually.
– Xderic, Commented Nov 27, 2018 at 16:15
@Xderic if you know the width of the widest row, just add the names param when you read: names=range(0,480)
– It_is_Chris, Commented Nov 27, 2018 at 16:55
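
The names suggestion in the last comment can be sketched as follows. This is a minimal example on toy data, not the asker's real file: the variable names raw, max_fields and wide are made up for illustration, and the first pass simply counts | characters, which assumes no field contains a quoted |.

```python
from io import StringIO

import pandas as pd

# Toy stand-in for the real file; the widest row has 6 fields.
raw = "A|B|C|D|E\nA|B|C|D|E\nA|B|C|D|E|F"

# First pass: find the widest row (skip this if the maximum is already known,
# e.g. 480 in the question). Assumes no quoted field contains a literal "|".
max_fields = max(line.count("|") + 1 for line in raw.splitlines())

# Second pass: passing `names` tells the parser how many columns to expect,
# so shorter rows are padded with NaN instead of raising a ParserError.
wide = pd.read_csv(StringIO(raw), sep="|", header=None, names=range(max_fields))
print(wide)
```

If the maximum width is already known, the first pass can be dropped and names=range(480) passed directly.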

jpp · Accepted Answer · 2018-11-27 16:08:54Z

With your setup, you won't know the number of columns until you've read every single row, so a single-pass read_csv can't size the frame up front. One way around this is to read the data into a list of lists, pad the shorter rows with NaN values up to the length of the longest row, and then feed the result to the pd.DataFrame constructor.

Here's an example:

from io import StringIO
import csv
import numpy as np
import pandas as pd

x = StringIO("""A|B|C|D|E
A|B|C|D|E
A|B|C|D|E|F""")

# replace x with open('file.csv', 'r')
with x as fin:
    data = list(csv.reader(fin, delimiter='|'))

# Pad every row with NaN up to the length of the longest row
num = max(map(len, data))
data = [row + [np.nan] * (num - len(row)) for row in data]
df = pd.DataFrame(data)

print(df)

   0  1  2  3  4    5
0  A  B  C  D  E  NaN
1  A  B  C  D  E  NaN
2  A  B  C  D  E    F

2 Comments

ALollz Over a year ago

I wonder, is it horribly inefficient to read it as a .csv with the separator as '\n' and then just df[0].str.split('|', expand=True)?

jpp Over a year ago

@ALollz, that should work, but in my experience the pandas str accessor performs poorly compared with a list comprehension. Feel free to post that solution, though; it seems viable.
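
A rough sketch of the idea in these two comments, on toy data. One swap is made here: instead of calling read_csv with '\n' as the separator (which recent pandas versions reject), the lines are read in plain Python and wrapped in a Series. The names lines and split_df are illustrative only.

```python
from io import StringIO

import pandas as pd

x = StringIO("A|B|C|D|E\nA|B|C|D|E\nA|B|C|D|E|F")

# One string per line; replace x with open('filepath.txt', encoding='latin-1')
with x as fin:
    lines = pd.Series(fin.read().splitlines())

# expand=True spreads the pieces across columns; rows with fewer
# pieces are padded with missing values on the right.
split_df = lines.str.split("|", expand=True)
print(split_df)
```

Whether this is faster or slower than the list-of-lists approach on a large file would need to be measured.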

boot-scootin · Accepted Answer · 2018-11-27 16:53:18Z

A solution using pure pandas:

>>> import pandas as pd
>>> data = pd.read_csv('filepath.txt', sep="|",skip_blank_lines=True, encoding = 'latin-1', header= None)
>>> data
             0
0    A B C D E
1    A B C D E
2  A B C D E F

We're able to split each row on whitespace since the delimiter we specified above doesn't exist (AFAIK) in the dataset and therefore creates just one column:

>>> s = data[0].apply(lambda x: x.split())
>>> s
0       [A, B, C, D, E]
1       [A, B, C, D, E]
2    [A, B, C, D, E, F]
Name: 0, dtype: object

Iterate over the list in each row, building a dictionary that maps column name to value, for later use with the pd.DataFrame constructor:

>>> s = s.apply(lambda x: {'col_' + str(i): v for i, v in enumerate(x)})
>>> s
0    {'col_0': 'A', 'col_1': 'B', 'col_2': 'C', 'co...
1    {'col_0': 'A', 'col_1': 'B', 'col_2': 'C', 'co...
2    {'col_0': 'A', 'col_1': 'B', 'col_2': 'C', 'co...
Name: 0, dtype: object

We'll use the pd.DataFrame.from_records method, which accepts a list of dictionaries like this one:

>>> s = s.values.tolist()
>>> s
[{'col_0': 'A', 'col_1': 'B', 'col_2': 'C', 'col_3': 'D', 'col_4': 'E'}, {'col_0': 'A', 'col_1': 'B', 'col_2': 'C', 'col_3': 'D', 'col_4': 'E'}, {'col_0': 'A', 'col_1': 'B', 'col_2': 'C', 'col_3': 'D', 'col_4': 'E', 'col_5': 'F'}]
>>> df = pd.DataFrame.from_records(s)
>>> df
  col_0 col_1 col_2 col_3 col_4 col_5
0     A     B     C     D     E   NaN
1     A     B     C     D     E   NaN
2     A     B     C     D     E     F
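
For a file that really is pipe-delimited, as in the question, the same dict-based idea might be condensed as follows. This is a sketch on toy data that splits on | rather than whitespace; the names records and records_df are made up for illustration, and the col_ prefix is carried over from the steps above.

```python
from io import StringIO

import pandas as pd

x = StringIO("A|B|C|D|E\nA|B|C|D|E\nA|B|C|D|E|F")

# Build one dict per row, keyed col_0, col_1, ...
# Replace x with open('filepath.txt', encoding='latin-1')
with x as fin:
    records = [
        {f"col_{i}": value for i, value in enumerate(line.rstrip("\n").split("|"))}
        for line in fin
    ]

# Keys that are missing from a row's dict become NaN in that row.
records_df = pd.DataFrame.from_records(records)
print(records_df)
```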
