Python pandas different separators for header and data

Question

I wrote a nice script that parses into a pandas dataframe some monolitic output for easy plotting and manipulation.

I can read different sections of the lines and turn them into a bigger dataframe almost always

I can combine the following two:

  E*/MeV   224Pu     225Pu     226Pu     227Pu     228Pu     229Pu     230Pu  
    48.0 0.192E-11 0.106E-05 0.498E-05 0.354E-06 0.000E+00 0.000E+00 0.000E+00
...
...

and

  E*/MeV   224Pu     225Pu     226Pu     227Pu     228Pu     229Pu     230Pu  
    48.0 0.192E-11 0.106E-05 0.498E-05 0.354E-06 0.000E+00 0.000E+00 0.000E+00

...
...

by making them a dataframe with some crafty one liner pd.read_csv(filename, encoding='latin1', sep='\s+', skip_blank_lines=True, skiprows=lambda x: x not in rows_to_keep) once they are dataframes I can concatenate them.

HOWEVER, my problem arises when the headers suddenly also have one nasty space like bellow (notice the differce between 224Pu and 222 U). This last sections completely screws up my read_cvs.

  E*/MeV   222 U     223 U     224 U     225 U     226 U     227 U     228 U  
    48.0 0.722E-06 0.102E-03 0.306E-04 0.188E-06 0.671E-06 0.976E-06 0.977E-08
...
...

Is there any way where I can apply different "sep" settings for header and data?

Thank you in advance.

I tried different versions of sep, including sep=\s{2,} and it reads correctly the headers but now obviously it misses the data.

I do not see how adding more separators would help me as they are both spaces, or maybe I don't know enough about regexp.

Can you provide the data? Or an MRE - stackoverflow.com/help/minimal-reproducible-example — Cam
– Cam, Commented Jan 17, 2024 at 17:35
It is a fixed width size produced by (drum roll) fortran. I didn't know about this. I already found the answer by treating the header one way and the data in another but this is certainly worth looking into. Thanks. — donpicoro
– donpicoro, Commented Jan 18, 2024 at 10:55

Corralien · Accepted Answer · 2024-01-17 17:47:39Z

1

You can handle headers and data separately:

import re

filename = 'data.csv'
with open(filename) as fp:
    while True:
        row = fp.readline().strip()
        if row:
            headers = re.split(r'\s{2,}', row)
            break
    df = pd.read_csv(fp, sep='\s+', header=None, names=headers, ...)

First case:

>>> df
   E*/MeV         224Pu     225Pu     226Pu         227Pu  228Pu  229Pu  230Pu
0    48.0  1.920000e-12  0.000001  0.000005  3.540000e-07    0.0    0.0    0.0

Second case:

>>> df
   E*/MeV         222 U     223 U     224 U         225 U         226 U         227 U         228 U
0    48.0  7.220000e-07  0.000102  0.000031  1.880000e-07  6.710000e-07  9.760000e-07  9.770000e-09

answered Jan 17, 2024 at 17:47

Corralien

121k8 gold badges44 silver badges69 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

ndclt · Accepted Answer · 2024-01-17 17:43:42Z

Here's my try:

from io import StringIO

import pandas as pd

to_load = """  E*/MeV   222 U     223 U     224 U     225 U     226 U     227 U     228 U  
    48.0 0.722E-06 0.102E-03 0.306E-04 0.188E-06 0.671E-06 0.976E-06 0.977E-08"""

df = pd.read_csv(StringIO(to_load), sep="\s+", skiprows=1, header=None)
columns = pd.Index(pd.read_csv(StringIO(to_load), sep="\s{2,}", nrows=1, header=None).loc[0,:])
df.columns = columns
print(df)

prints:

0  E*/MeV         222 U     223 U  ...         226 U         227 U         228 U
0    48.0  7.220000e-07  0.000102  ...  6.710000e-07  9.760000e-07  9.770000e-09

I thanks to skirows and nrows argument, I can read only the header column with one seperator and before the content with another.

mozway · Accepted Answer · 2024-01-17 17:59:58Z

1

In his particular case, you could use spaces that are followed by a digit as separator:

df = pd.read_csv('tricky.csv', sep=r'\s+(?=\d)', engine='python')

Output:

   E*/MeV         222 U     223 U     224 U         225 U         226 U         227 U         228 U
0    48.0  7.220000e-07  0.000102  0.000031  1.880000e-07  6.710000e-07  9.760000e-07  9.770000e-09

answered Jan 17, 2024 at 17:59

mozway

267k13 gold badges56 silver badges106 bronze badges

2 Comments

donpicoro Over a year ago

This is probably also a valid answer but I think treating the headers separately gives the flexibility to face new unforseen formats without being a regexp guru.

mozway Over a year ago

@donpicoro of course, I just provided this as a handy workaround ;)

Collectives™ on Stack Overflow

Python pandas different separators for header and data

3 Answers 3

Comments

Comments

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related