Using pandas to read a text file with the headers in multiple rows

Question

I have the following file:

        OBJ             OBJ     OBJ                                                                                           
       DATE        OBJ  CLASS  DATE        OBJ                                       OBJ                         OBJ          
CLASS  ENTERED1    TAG  REPL   ENTERED2    NAME                                      TSTMP                       USERID       
-----  ----------  ---  -----  ----------  ----------------------------------------  --------------------------  --------     
EZ4    2002-02-06  D           2002-02-06  abc random                                2002-02-06-10.05.30.242768  2342342      
000    2001-09-27  D           2001-09-27  none                                      2001-09-27-10.23.31.121606  2343242      
011    1974-07-09              0001-01-01  board COMPENSATON                         1997-01-22-09.10.23.145853

The column boundaries are marked by a row of dashes, and the header text is split across multiple rows. I'd like to read this into a pandas DataFrame so that the column headers stay intact and the row of dashes is dropped. How can I get the DataFrame into that format?

piRSquared · Accepted Answer · 2017-09-26 22:59:20Z

2

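I'd use pd.read_fwf for the fixed-width layout, but specify a separator of r'\s{2,}'.

import pandas as pd

# Blank out the placeholder labels pandas generates for empty header cells.
def g(x):
    return '' if x.startswith('Unnamed') else x

pd.read_fwf(
    'file.txt',
    sep=r'\s{2,}',       # raw string avoids an invalid-escape warning
    header=[0, 1, 2],    # the three header rows become a column MultiIndex
    skiprows=[3]         # drop the row of dashes
).rename(columns=g)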

                OBJ        OBJ                                                                      
               DATE  OBJ CLASS        DATE                OBJ                                       
  CLASS    ENTERED1  TAG  REPL    ENTERED2               NAME                       TSTMP     USERID
0   EZ4  2002-02-06    D   NaN  2002-02-06         abc random  2002-02-06-10.05.30.242768  2342342.0
1   000  2001-09-27    D   NaN  2001-09-27               none  2001-09-27-10.23.31.121606  2343242.0
2   011  1974-07-09  NaN   NaN  0001-01-01  board COMPENSATON  1997-01-22-09.10.23.145853        NaN
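
If you'd rather end up with one row of plain-string column names than a MultiIndex, one option is to join the non-empty levels of each column. This is only a minimal sketch, assuming three-level columns like the ones shown above; flatten_header is a hypothetical helper name, and the small frame is built by hand here rather than read from the file.

```python
import pandas as pd


def flatten_header(columns):
    """Join the non-empty levels of each column tuple into one label."""
    flat = []
    for col in columns:
        parts = [
            str(part) for part in col
            if part and not str(part).startswith("Unnamed")
        ]
        flat.append(" ".join(parts))
    return flat


# Hand-built stand-in for the three-level header (an assumption for illustration).
columns = pd.MultiIndex.from_tuples([
    ("", "", "CLASS"),
    ("OBJ", "DATE", "ENTERED1"),
    ("", "OBJ", "TAG"),
])
df = pd.DataFrame([["EZ4", "2002-02-06", "D"]], columns=columns)

df.columns = flatten_header(df.columns)
print(list(df.columns))  # ['CLASS', 'OBJ DATE ENTERED1', 'OBJ TAG']
```

Because the helper also skips labels starting with "Unnamed", it should work whether or not the rename step has already been applied.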

cs95 · Accepted Answer · 2017-09-26 22:06:53Z

2

It looks like the first two rows and the fourth (the row of dashes) are unnecessary, so skip them with the skiprows parameter. That leaves only the third header row as the column names.

import pandas as pd
df = pd.read_csv('file.txt', skiprows=(0, 1, 3), sep=r'\s{2,}', engine='python')

Ken Schumack · Accepted Answer · 2017-09-26 22:05:58Z

0

Store the header text until you reach the row of dashes. When you reach it, save the start and end positions of each "column", then go back and concatenate the header text that falls within each span. Consider writing this as a reusable function; you will need to do something like this again someday :-)
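
A rough sketch of that idea, in case it helps. It assumes the header rows are aligned with the dashed row and that no data value spills outside its dashed span; read_dashed_table is a hypothetical name, and the sample text is a cut-down version of the file in the question.

```python
import io
import re

import pandas as pd


def read_dashed_table(text):
    """Read a fixed-width table whose column spans are given by a row of dashes."""
    lines = text.splitlines()
    # The rule row is the first non-empty line made up only of dashes and spaces.
    rule_idx = next(
        i for i, line in enumerate(lines)
        if line.strip() and set(line.strip()) <= {"-", " "}
    )
    # Each run of dashes gives the (start, end) span of one column.
    colspecs = [m.span() for m in re.finditer(r"-+", lines[rule_idx])]
    # Stack the header fragments found above the rule within each span.
    names = []
    for start, end in colspecs:
        parts = [line[start:end].strip() for line in lines[:rule_idx]]
        names.append(" ".join(p for p in parts if p))
    body = "\n".join(lines[rule_idx + 1:])
    # dtype=str keeps values such as "011" from being parsed as numbers.
    return pd.read_fwf(
        io.StringIO(body), colspecs=colspecs, names=names, header=None, dtype=str
    )


sample = "\n".join([
    "       OBJ",
    "       DATE        OBJ",
    "CLASS  ENTERED1    TAG  NAME",
    "-----  ----------  ---  ----------",
    "EZ4    2002-02-06  D    abc random",
    "011    1974-07-09       board",
])
df = read_dashed_table(sample)
print(df)
```

Passing explicit colspecs means an empty cell stays in its own column as NaN instead of shifting the rest of the row to the left.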
