2

I have the following file:

        OBJ             OBJ     OBJ                                                                                           
       DATE        OBJ  CLASS  DATE        OBJ                                       OBJ                         OBJ          
CLASS  ENTERED1    TAG  REPL   ENTERED2    NAME                                      TSTMP                       USERID       
-----  ----------  ---  -----  ----------  ----------------------------------------  --------------------------  --------     
EZ4    2002-02-06  D           2002-02-06  abc random                                2002-02-06-10.05.30.242768  2342342      
000    2001-09-27  D           2001-09-27  none                                      2001-09-27-10.23.31.121606  2343242      
011    1974-07-09              0001-01-01  board COMPENSATON                         1997-01-22-09.10.23.145853  

The columns are defined with underscores and the header texts are broken up across multiple rows. I'd like to read this into a pandas dataframe such that the column texts are intact and the underscores are removed. What can I do to get the dataframe in the format I'm seeking?

3 Answers 3

2

I'd use pd.read_fwf for fixed width but specify a separator of '\s{2,}'

g = lambda x: '' if x.startswith('Unnamed') else x

pd.read_fwf(
    'file.txt',
    sep='\s{2,}',
    header=[0, 1, 2],
    skiprows=[3]
).rename(columns=g)

                OBJ        OBJ                                                                      
               DATE  OBJ CLASS        DATE                OBJ                                       
  CLASS    ENTERED1  TAG  REPL    ENTERED2               NAME                       TSTMP     USERID
0   EZ4  2002-02-06    D   NaN  2002-02-06         abc random  2002-02-06-10.05.30.242768  2342342.0
1   000  2001-09-27    D   NaN  2001-09-27               none  2001-09-27-10.23.31.121606  2343242.0
2   011  1974-07-09  NaN   NaN  0001-01-01  board COMPENSATON  1997-01-22-09.10.23.145853        NaN
Sign up to request clarification or add additional context in comments.

Comments

2

It looks like the first two rows and the third are unnecessary, so skip them by specifying a skiprows attribute.

df = pd.read_csv('file.txt', skiprows=(0, 1, 3), sep='\s{2,}', engine='python')

Comments

0

Store the text information before you get to the underscores. When you get to the underscores save the start and end positions of each "column" then go back and concatenate that text as appropriate. Consider writing as a reusable function ... you will need to do something like this again someday :-)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.