3

I wrote a nice script that parses into a pandas dataframe some monolitic output for easy plotting and manipulation.

I can read different sections of the lines and turn them into a bigger dataframe almost always

I can combine the following two:

  E*/MeV   224Pu     225Pu     226Pu     227Pu     228Pu     229Pu     230Pu  
    48.0 0.192E-11 0.106E-05 0.498E-05 0.354E-06 0.000E+00 0.000E+00 0.000E+00
...
...

and

  E*/MeV   224Pu     225Pu     226Pu     227Pu     228Pu     229Pu     230Pu  
    48.0 0.192E-11 0.106E-05 0.498E-05 0.354E-06 0.000E+00 0.000E+00 0.000E+00

...
...

by making them a dataframe with some crafty one liner pd.read_csv(filename, encoding='latin1', sep='\s+', skip_blank_lines=True, skiprows=lambda x: x not in rows_to_keep) once they are dataframes I can concatenate them.

HOWEVER, my problem arises when the headers suddenly also have one nasty space like bellow (notice the differce between 224Pu and 222 U). This last sections completely screws up my read_cvs.

  E*/MeV   222 U     223 U     224 U     225 U     226 U     227 U     228 U  
    48.0 0.722E-06 0.102E-03 0.306E-04 0.188E-06 0.671E-06 0.976E-06 0.977E-08
...
...

Is there any way where I can apply different "sep" settings for header and data?

Thank you in advance.

I tried different versions of sep, including sep=\s{2,} and it reads correctly the headers but now obviously it misses the data.

I do not see how adding more separators would help me as they are both spaces, or maybe I don't know enough about regexp.

3
  • Can you provide the data? Or an MRE - stackoverflow.com/help/minimal-reproducible-example Commented Jan 17, 2024 at 17:35
  • Looks like a fixed width file, use read_fwf Commented Jan 17, 2024 at 17:54
  • It is a fixed width size produced by (drum roll) fortran. I didn't know about this. I already found the answer by treating the header one way and the data in another but this is certainly worth looking into. Thanks. Commented Jan 18, 2024 at 10:55

3 Answers 3

1

You can handle headers and data separately:

import re

filename = 'data.csv'
with open(filename) as fp:
    while True:
        row = fp.readline().strip()
        if row:
            headers = re.split(r'\s{2,}', row)
            break
    df = pd.read_csv(fp, sep='\s+', header=None, names=headers, ...)

First case:

>>> df
   E*/MeV         224Pu     225Pu     226Pu         227Pu  228Pu  229Pu  230Pu
0    48.0  1.920000e-12  0.000001  0.000005  3.540000e-07    0.0    0.0    0.0

Second case:

>>> df
   E*/MeV         222 U     223 U     224 U         225 U         226 U         227 U         228 U
0    48.0  7.220000e-07  0.000102  0.000031  1.880000e-07  6.710000e-07  9.760000e-07  9.770000e-09
Sign up to request clarification or add additional context in comments.

Comments

1

Here's my try:

from io import StringIO

import pandas as pd

to_load = """  E*/MeV   222 U     223 U     224 U     225 U     226 U     227 U     228 U  
    48.0 0.722E-06 0.102E-03 0.306E-04 0.188E-06 0.671E-06 0.976E-06 0.977E-08"""

df = pd.read_csv(StringIO(to_load), sep="\s+", skiprows=1, header=None)
columns = pd.Index(pd.read_csv(StringIO(to_load), sep="\s{2,}", nrows=1, header=None).loc[0,:])
df.columns = columns
print(df)

prints:

0  E*/MeV         222 U     223 U  ...         226 U         227 U         228 U
0    48.0  7.220000e-07  0.000102  ...  6.710000e-07  9.760000e-07  9.770000e-09

I thanks to skirows and nrows argument, I can read only the header column with one seperator and before the content with another.

Comments

1

In his particular case, you could use spaces that are followed by a digit as separator:

df = pd.read_csv('tricky.csv', sep=r'\s+(?=\d)', engine='python')

Output:

   E*/MeV         222 U     223 U     224 U         225 U         226 U         227 U         228 U
0    48.0  7.220000e-07  0.000102  0.000031  1.880000e-07  6.710000e-07  9.760000e-07  9.770000e-09

2 Comments

This is probably also a valid answer but I think treating the headers separately gives the flexibility to face new unforseen formats without being a regexp guru.
@donpicoro of course, I just provided this as a handy workaround ;)

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.