I'm using the pandas function pd.read_csv to import a txt file delimited by |. The header row has 419 fields, so pandas expects 419 fields in every row. Some of the rows have more than 419 fields, though.

How would I make the DataFrame grow to accommodate extra columns as needed, or just add X extra columns to allow for wider rows further down the file?

Example:

How would I account for "F"?

A B C D E
A B C D E
A B C D E F

This is the error I'm receiving. I'm using Python 3 in a Jupyter notebook.

ParserError: Error tokenizing data. C error: Expected 419 fields in line 7945, saw 424

This is the code I'm trying to use:

data = pd.read_csv('filepath.txt', sep="|", skip_blank_lines=True, encoding='latin-1', header=None)
  • Why does the input file have unpredictable numbers of items in the rows? Commented Nov 27, 2018 at 16:05
  • There are extra delimiters in some rows, which cause the values to shift to the right. The widest row has 480 columns. Part of the reason I'd like to import it as a DataFrame is to see if there are any patterns, rather than importing the rows and fixing them individually. Commented Nov 27, 2018 at 16:15
  • @Xderic if you know the largest row just add the names param when you read: names=range(0,480) Commented Nov 27, 2018 at 16:55
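
The suggestion in the last comment can be sketched as follows, using the three-row sample from the question in place of the real file. It assumes the maximum row width is known in advance (6 here, 480 for the real file); `sample` and `max_cols` are illustrative names, not part of the original post.

```python
from io import StringIO

import pandas as pd

# Stand-in for the real file: the third row has one extra field.
sample = StringIO("A|B|C|D|E\nA|B|C|D|E\nA|B|C|D|E|F")

max_cols = 6  # assumed known in advance; the comment above uses 480

# Supplying `names` tells the parser how many columns to expect,
# so shorter rows are padded with NaN instead of raising ParserError.
df = pd.read_csv(sample, sep="|", header=None, names=range(max_cols))
print(df)
```

If the real maximum is not known, a value that turns out to be too small will not help, so this only fits cases where the widest row has already been measured.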

2 Answers

With your setup, the final number of columns isn't known until every row has been read, so the file can't be parsed efficiently in a single pass. One way around this is to read the data into a list of lists, pad the shorter rows with NaN values up to the length of the longest row, and then feed the result to the pd.DataFrame constructor.

Here's an example:

from io import StringIO
import csv

import numpy as np
import pandas as pd

x = StringIO("""A|B|C|D|E
A|B|C|D|E
A|B|C|D|E|F""")

# replace x with open('file.csv', 'r', newline='')
with x as fin:
    data = list(csv.reader(fin, delimiter='|'))

# pad every row with NaN up to the length of the longest row
num = max(map(len, data))
data = [row + [np.nan] * (num - len(row)) for row in data]
df = pd.DataFrame(data)

print(df)

   0  1  2  3  4    5
0  A  B  C  D  E  NaN
1  A  B  C  D  E  NaN
2  A  B  C  D  E    F

2 Comments

I wonder, is it horribly inefficient to read it as a .csv with the separator as '\n' and then just df[0].str.split('|', expand=True)?
@ALollz, that should work, but in my experience the pandas str accessor performs poorly compared with a list comprehension. Feel free to post that solution, though; it seems viable.
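
A minimal sketch of the idea in these comments, under one assumption: instead of calling read_csv with '\n' as the separator (which I believe newer pandas versions reject), the lines are loaded into a Series directly and then split. The names `raw`, `lines`, and `wide` are illustrative.

```python
import pandas as pd

# Stand-in for the file contents; with a real file, something like
# open('filepath.txt', encoding='latin-1').read() could supply this string.
raw = "A|B|C|D|E\nA|B|C|D|E\nA|B|C|D|E|F"

# One string per row, with no parsing yet.
lines = pd.Series(raw.splitlines())

# expand=True spreads the pieces across columns; shorter rows are
# filled with missing values on the right.
wide = lines.str.split("|", expand=True)
print(wide)
```

Whether this is faster or slower than the list-of-lists approach depends on the data, as the comment above notes, so it is worth timing both on the real file.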

A solution using pure pandas:

>>> import pandas as pd
>>> data = pd.read_csv('filepath.txt', sep="|", skip_blank_lines=True, encoding='latin-1', header=None)
>>> data
             0
0    A B C D E
1    A B C D E
2  A B C D E F

Because the delimiter specified above doesn't appear in this sample data (AFAIK), each row is read into a single column, which we can then split on whitespace:

>>> s = data[0].apply(lambda x: x.split())
>>> s
0       [A, B, C, D, E]
1       [A, B, C, D, E]
2    [A, B, C, D, E, F]
Name: 0, dtype: object

Convert the list in each row into a dictionary mapping column name to value, for later use when building the DataFrame:

>>> s = s.apply(lambda x: {'col_' + str(i): v for i, v in enumerate(x)})
>>> s
0    {'col_0': 'A', 'col_1': 'B', 'col_2': 'C', 'co...
1    {'col_0': 'A', 'col_1': 'B', 'col_2': 'C', 'co...
2    {'col_0': 'A', 'col_1': 'B', 'col_2': 'C', 'co...
Name: 0, dtype: object

We'll use the pd.DataFrame.from_records method, which accepts a list of dictionaries in the following format:

>>> s = s.values.tolist()
>>> s
[{'col_0': 'A', 'col_1': 'B', 'col_2': 'C', 'col_3': 'D', 'col_4': 'E'}, {'col_0': 'A', 'col_1': 'B', 'col_2': 'C', 'col_3': 'D', 'col_4': 'E'}, {'col_0': 'A', 'col_1': 'B', 'col_2': 'C', 'col_3': 'D', 'col_4': 'E', 'col_5': 'F'}]
>>> df = pd.DataFrame.from_records(s)
>>> df
  col_0 col_1 col_2 col_3 col_4 col_5
0     A     B     C     D     E   NaN
1     A     B     C     D     E   NaN
2     A     B     C     D     E     F

1 Comment

Or data[0].str.split() ?
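
The shortcut suggested in this comment might look like the sketch below. It assumes the same single-column, whitespace-separated frame as in the answer, rebuilt inline here so the example runs on its own; `parts` is an illustrative name, and the resulting columns are numbered 0 to 5 rather than named col_0 to col_5.

```python
import pandas as pd

# Rebuild the single-column frame from the answer above.
data = pd.DataFrame({0: ["A B C D E", "A B C D E", "A B C D E F"]})

# .str.split() with no arguments splits on whitespace, like str.split().
parts = data[0].str.split()

# A list of unequal-length lists is padded with missing values.
df = pd.DataFrame(parts.tolist())
print(df)
```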
