Convert text file with header to pandas dataframe

Question

I have been struggling to convert a text file to a pandas DataFrame, so that I can subsequently do calculations on the values and plot the coordinates.

The text file has the following format: a long header followed by many rows. An example with part of the header and the first few rows is shown below. I wrote a small script to find the first and last line of the table part of the text file that I'm interested in.

starfile_name:

# version 30001

data_particles

loop_ 
_rlnTomoParticleName #1 
_rlnTomoName #2 
_rlnNormCorrection #21 
_rlnLogLikeliContribution #22 
_rlnMaxValueProbDistribution #23 
_rlnNrOfSignificantSamples #24 
  TS_002/1     TS_002            1            2            1  1733.000000  3485.000000   938.000000     -1.08872     -1.08872     0.411277   131.760000    89.920000    97.200000 PseudoSubtomo/job052/Subtomograms/TS_002/1_data.mrc PseudoSubtomo/job052/Subtomograms/TS_002/1_weights.mrc            1    92.905599    28.438417    57.199867     1.000000 1.128367e+06     0.017733          224 
  TS_002/2     TS_002            1            1            1  1124.000000   693.000000  1096.000000     0.411277     -1.08872     -1.08872    79.270000    86.780000   100.730000 PseudoSubtomo/job052/Subtomograms/TS_002/2_data.mrc PseudoSubtomo/job052/Subtomograms/TS_002/2_weights.mrc            1   159.849821     4.120413   101.904501     1.000000 1.126854e+06     0.183934           37 
  TS_002/3     TS_002            1            2            1  1694.000000  2329.000000  1378.000000     5.955277     -6.63272     -1.08872   -140.62000    88.860000    99.000000 PseudoSubtomo/job052/Subtomograms/TS_002/3_data.mrc PseudoSubtomo/job052/Subtomograms/TS_002/3_weights.mrc            1   127.794678     4.085294   168.730698     1.000000 1.124178e+06     0.184649           18

I used the following lines to turn this into a DataFrame:

import pandas as pd

# skip is the line number where the header and irrelevant part of the table ends
# foot is the number of rows at the end of the table that I'm not interested in
pandas_table = pd.read_csv(starfile_name, engine='python', index_col=False, header=None, skiprows=int(skip), skipfooter=int(foot), sep="\t")
print(pandas_table)
df = pd.DataFrame(data=pandas_table)
df

It appears that the whole table is read as if it is just one column. I tried providing column tags, but they don't line up with the actual data. I also played around with the str.split() and squeeze() options, but I keep getting errors.

output:

                                                      0
0     TS_002/1     TS_002            1            2 ...
1     TS_002/2     TS_002            1            1 ...
2     TS_002/3     TS_002            1            2 ...
3     TS_002/4     TS_002            1            1 ...
4     TS_002/5     TS_002            1            2 ...
...                                                 ...
1423  TS_002/1424     TS_002            1           ...
1424  TS_002/1425     TS_002            1           ...
1425  TS_002/1426     TS_002            1           ...
1426  TS_002/1427     TS_002            1           ...
1427  TS_002/1428     TS_002            1           ...

[1428 rows x 1 columns]

    0
0   TS_002/1 TS_002 1 2 ...
1   TS_002/2 TS_002 1 1 ...
2   TS_002/3 TS_002 1 2 ...
3   TS_002/4 TS_002 1 1 ...
4   TS_002/5 TS_002 1 2 ...
...     ...
1423    TS_002/1424 TS_002 1 ...
1424    TS_002/1425 TS_002 1 ...
1425    TS_002/1426 TS_002 1 ...
1426    TS_002/1427 TS_002 1 ...
1427    TS_002/1428 TS_002 1 ...

1428 rows × 1 columns

Can you provide more lines from the input file, so we can understand the file structure?
– armamut, commented Jan 17, 2021 at 20:40

armamut · Accepted Answer · 2021-01-17 21:54:37Z


I think this will help you split the columns on variable-length whitespace: use sep=r'\s+' (a raw string, so the backslash reaches the regex engine unchanged).

df = pd.read_csv(starfile_name,  ...., sep=r'\s+')
print(df)
>>>
         0       1   2   3   4       5       6       7         8        9   \
0  TS_002/1  TS_002   1   2   1  1733.0  3485.0   938.0 -1.088720 -1.08872   
1  TS_002/2  TS_002   1   1   1  1124.0   693.0  1096.0  0.411277 -1.08872   
2  TS_002/3  TS_002   1   2   1  1694.0  2329.0  1378.0  5.955277 -6.63272   

   ...                                                 14  \
0  ...  PseudoSubtomo/job052/Subtomograms/TS_002/1_dat...   
1  ...  PseudoSubtomo/job052/Subtomograms/TS_002/2_dat...   
2  ...  PseudoSubtomo/job052/Subtomograms/TS_002/3_dat...   

                                                  15  16          17  \
0  PseudoSubtomo/job052/Subtomograms/TS_002/1_wei...   1   92.905599   
1  PseudoSubtomo/job052/Subtomograms/TS_002/2_wei...   1  159.849821   
2  PseudoSubtomo/job052/Subtomograms/TS_002/3_wei...   1  127.794678   

          18          19   20         21        22   23  
0  28.438417   57.199867  1.0  1128367.0  0.017733  224  
1   4.120413  101.904501  1.0  1126854.0  0.183934   37  
2   4.085294  168.730698  1.0  1124178.0  0.184649   18  

[3 rows x 24 columns]
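
The columns above are still numbered 0 to 23. As a possible next step, the `_rln... #N` labels in the `loop_` header could be collected and passed as column names alongside the whitespace separator. The following is only a minimal sketch under stated assumptions: `read_star_table` and the three-column `sample` text are hypothetical, the table is assumed to follow a single `loop_` block, and the data rows are assumed to end at the first blank line or at the end of the text. A real file would need one label per data column.

```python
import io

import pandas as pd

# Hypothetical miniature STAR-style text: a loop_ header followed by
# whitespace-aligned rows (shortened from the sample in the question).
sample = """\
# version 30001

data_particles

loop_
_rlnTomoParticleName #1
_rlnTomoName #2
_rlnNrOfSignificantSamples #3
  TS_002/1     TS_002          224
  TS_002/2     TS_002           37
  TS_002/3     TS_002           18
"""


def read_star_table(text):
    """Read the first loop_ table of a STAR-style text into a DataFrame."""
    lines = text.splitlines()
    names = []
    first_data_line = None
    in_loop = False
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped == "loop_":
            in_loop = True
        elif in_loop and stripped.startswith("_"):
            # "_rlnTomoName #2" -> "_rlnTomoName"
            names.append(stripped.split()[0])
        elif in_loop and stripped:
            # First non-blank, non-label line after loop_ starts the data.
            first_data_line = i
            break
    if first_data_line is None:
        raise ValueError("no data rows found after loop_")

    # Assumption: the table ends at the first blank line (or end of text).
    data_lines = []
    for line in lines[first_data_line:]:
        if not line.strip():
            break
        data_lines.append(line)

    # sep=r"\s+" splits on runs of whitespace; leading spaces are ignored.
    return pd.read_csv(
        io.StringIO("\n".join(data_lines)),
        sep=r"\s+",
        header=None,
        names=names,
    )


df = read_star_table(sample)
print(df)
```

For a file on disk, the same function could be called with the file's contents, for example `read_star_table(open(starfile_name).read())`. Passing only the data rows to `read_csv` avoids having to compute `skiprows` and `skipfooter` by hand.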


1 Comment

ChrisvHoorn Over a year ago

Thank you! that did it!!
