Parse a text file in Python and convert it to a pandas DataFrame

Question

I am trying to parse a text file and convert it into a pandas DataFrame. The file (including the blank lines):

HEADING1
value 1

HEADING2
value 2

HEADING1
value 11

HEADING2
value 12

should be converted into a dataframe:

HEADING1, HEADING2
value 1, value 2
value 11, value 12

I have tried the following code, but I am not sure whether converters could work here:

df = pd.read_table(textfile, header=None, skip_blank_lines=True, delimiter='\n',
                   # converters= 'what should I use?',
                   names= 'HEADING1, HEADING2'.split() )

You could try importing the whole thing as a Series and then finding the rows that belong to header one as series[series == "header1"].index.tolist() + 1
– Adam, Commented May 31, 2017 at 15:17
@Adam I am not sure I understood your suggestion (other than in principle). What would the code look like?
– Andreuccio, Commented May 31, 2017 at 15:21
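
One possible reading of Adam's suggestion, as a rough sketch rather than his actual code: load the non-blank lines into a Series, locate each heading, and take the line that follows it. The helper name `values_after` and the inline sample text are assumptions made for illustration, and the sketch assumes every heading line is immediately followed by its value.

```python
import pandas as pd

# sample text standing in for the file contents (assumed layout)
txt = "HEADING1\nvalue 1\n\nHEADING2\nvalue 2\n\nHEADING1\nvalue 11\n\nHEADING2\nvalue 12\n"

# keep only the non-blank lines, one per Series element
lines = pd.Series([ln for ln in txt.splitlines() if ln.strip()])

def values_after(series, heading):
    """Hypothetical helper: return the lines that directly follow `heading`."""
    # positions of the heading lines, shifted by one to land on the values
    positions = series[series == heading].index + 1
    return series.loc[positions].reset_index(drop=True)

df = pd.DataFrame({h: values_after(lines, h) for h in ["HEADING1", "HEADING2"]})
print(df)
```

Note that the shift is applied to the index itself (`.index + 1`); adding 1 to the result of `.tolist()` would fail, because a Python list cannot be added to an integer.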

piRSquared · Accepted Answer · 2017-05-31 15:31:24Z

You can parse the text yourself by splitting on '\n\n':

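# split the file on `'\n\n'` to get one block per heading/value pair
# split each block on `'\n'` to separate the heading from its value
# `zip` to get convenient tuples of headers and values
# `strip` guards against trailing blank lines; `with` closes the file
with open(textfile) as f:
    cols, vals = zip(
        *[block.split('\n') for block in f.read().strip().split('\n\n')]
    )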

# construct a `pd.Series`
# note: your index contained in the `cols` list will not be unique
s = pd.Series(vals, cols)

# we'll need to enumerate the duplicated index values so that we can unstack
# we do this by creating a `pd.MultiIndex` with `cumcount` then the header values
s.index = [s.groupby(level=0).cumcount(), s.index]

# finally, `unstack`
s.unstack()

   HEADING1  HEADING2
0   value 1   value 2
1  value 11  value 12

Breakdown

list comprehension

[line.split('\n') for line in StringIO(txt).read().split('\n\n')]

[['HEADING1', 'value 1'],
 ['HEADING2', 'value 2'],
 ['HEADING1', 'value 11'],
 ['HEADING2', 'value 12']]

with zip

list(zip(*[line.split('\n') for line in StringIO(txt).read().split('\n\n')]))

[('HEADING1', 'HEADING2', 'HEADING1', 'HEADING2'),
 ('value 1', 'value 2', 'value 11', 'value 12')]

setting cols and vals

cols, vals = zip(*[line.split('\n') for line in StringIO(txt).read().split('\n\n')])

print(cols)
print()
print(vals)

('HEADING1', 'HEADING2', 'HEADING1', 'HEADING2')

('value 1', 'value 2', 'value 11', 'value 12')

Making a series

s = pd.Series(vals, cols)
s

HEADING1     value 1
HEADING2     value 2
HEADING1    value 11
HEADING2    value 12
dtype: object

Enumerating the index values

s.index = [s.groupby(level=0).cumcount(), s.index]
s

0  HEADING1     value 1
   HEADING2     value 2
1  HEADING1    value 11
   HEADING2    value 12
dtype: object
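
To see what the enumeration step contributes on its own, here is a small sketch that prints only the `cumcount` result. The Series is rebuilt inline from the sample values so the snippet runs by itself; the variable name `occurrence` is an assumption for illustration.

```python
import pandas as pd

s = pd.Series(
    ["value 1", "value 2", "value 11", "value 12"],
    index=["HEADING1", "HEADING2", "HEADING1", "HEADING2"],
)

# number each occurrence of an index label, starting from 0 within its group
occurrence = s.groupby(level=0).cumcount()
print(occurrence.tolist())  # [0, 0, 1, 1]
```

The first HEADING1 and the first HEADING2 both get 0, the second of each gets 1, and those numbers become the row labels once the Series is unstacked.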

unstack

s.unstack()

   HEADING1  HEADING2
0   value 1   value 2
1  value 11  value 12

Full Demo

import pandas as pd
from io import StringIO

txt = """HEADING1
value 1

HEADING2
value 2

HEADING1
value 11

HEADING2
value 12"""

cols, vals = zip(*[line.split('\n') for line in StringIO(txt).read().split('\n\n')])

s = pd.Series(vals, cols)
s.index = [s.groupby(level=0).cumcount(), s.index]

s.unstack()

   HEADING1  HEADING2
0   value 1   value 2
1  value 11  value 12
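
The same steps can be wrapped in a function that reads from a path instead of a string. This is only a sketch: `parse_heading_file` is a hypothetical name, the temporary file exists purely to keep the example self-contained, and the `strip()` call is an added assumption to tolerate blank lines at the end of the file.

```python
import os
import tempfile

import pandas as pd

def parse_heading_file(path):
    """Hypothetical helper: read heading/value blocks from `path` into a DataFrame."""
    with open(path) as f:
        # strip() drops trailing blank lines that would yield an empty block
        blocks = f.read().strip().split('\n\n')
    cols, vals = zip(*[block.split('\n') for block in blocks])
    s = pd.Series(vals, cols)
    # enumerate repeated headings so that unstack gets a unique MultiIndex
    s.index = [s.groupby(level=0).cumcount(), s.index]
    return s.unstack()

# write the sample to a temporary file, including a trailing blank line
sample = "HEADING1\nvalue 1\n\nHEADING2\nvalue 2\n\nHEADING1\nvalue 11\n\nHEADING2\nvalue 12\n\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)

df = parse_heading_file(tmp.name)
os.remove(tmp.name)
print(df)
```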

piRSquared · Accepted Answer · 2017-05-31 19:17:25Z


Using defaultdict

from collections import defaultdict
from io import StringIO
import pandas as pd

txt = """HEADING1
value 1

HEADING2
value 2

HEADING1
value 11

HEADING2
value 12"""

d = defaultdict(list)
# collect each value under its heading, in file order
for block in StringIO(txt).read().split('\n\n'):
    k, v = block.split('\n')
    d[k].append(v)

pd.DataFrame(d)

   HEADING1  HEADING2
0   value 1   value 2
1  value 11  value 12
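
`pd.DataFrame(d)` needs every list in `d` to have the same length, so it will not accept a file in which one heading appears fewer times than another. A minimal sketch of one workaround, assuming a sample where the last HEADING2 entry is missing: wrap each list in a `pd.Series` so pandas pads the shorter column with NaN.

```python
from collections import defaultdict

import pandas as pd

# assumed input: HEADING2 is missing from the second record
txt = "HEADING1\nvalue 1\n\nHEADING2\nvalue 2\n\nHEADING1\nvalue 11"

d = defaultdict(list)
for block in txt.split('\n\n'):
    k, v = block.split('\n')
    d[k].append(v)

# a dict of Series is aligned on position, so shorter columns are padded with NaN
df = pd.DataFrame({k: pd.Series(v) for k, v in d.items()})
print(df)
```

Values are paired purely by order of appearance under each heading, so this only gives sensible rows when the missing entries come at the end.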


Incredibly simple solution!
– Chad Juliano, Commented over a year ago
