I have been struggling to convert a text file to a pandas DataFrame so that I can subsequently do calculations on the values and plot the coordinates.

The text file has a long header followed by many rows. Part of the header and a few rows are shown below. I wrote a small script to find the first and last lines of the table section I'm interested in.

starfile_name:

# version 30001

data_particles

loop_ 
_rlnTomoParticleName #1 
_rlnTomoName #2 
_rlnNormCorrection #21 
_rlnLogLikeliContribution #22 
_rlnMaxValueProbDistribution #23 
_rlnNrOfSignificantSamples #24 
  TS_002/1     TS_002            1            2            1  1733.000000  3485.000000   938.000000     -1.08872     -1.08872     0.411277   131.760000    89.920000    97.200000 PseudoSubtomo/job052/Subtomograms/TS_002/1_data.mrc PseudoSubtomo/job052/Subtomograms/TS_002/1_weights.mrc            1    92.905599    28.438417    57.199867     1.000000 1.128367e+06     0.017733          224 
  TS_002/2     TS_002            1            1            1  1124.000000   693.000000  1096.000000     0.411277     -1.08872     -1.08872    79.270000    86.780000   100.730000 PseudoSubtomo/job052/Subtomograms/TS_002/2_data.mrc PseudoSubtomo/job052/Subtomograms/TS_002/2_weights.mrc            1   159.849821     4.120413   101.904501     1.000000 1.126854e+06     0.183934           37 
  TS_002/3     TS_002            1            2            1  1694.000000  2329.000000  1378.000000     5.955277     -6.63272     -1.08872   -140.62000    88.860000    99.000000 PseudoSubtomo/job052/Subtomograms/TS_002/3_data.mrc PseudoSubtomo/job052/Subtomograms/TS_002/3_weights.mrc            1   127.794678     4.085294   168.730698     1.000000 1.124178e+06     0.184649           18 

I used the following lines to turn this into a DataFrame:

#skip is the line number where the header and irrelevant part of the table ends 
#foot is the number of rows at the end of the table that I'm not interested in
pandas_table = pd.read_csv(starfile_name, engine='python', index_col=False, header=None,skiprows=int(skip), skipfooter=int(foot), sep="\t")
print(pandas_table)
df = pd.DataFrame(data=pandas_table)
df

It appears that the whole table is read as if it is just one column. I tried providing column tags, but they don't line up with the actual data. I also played around with the str.split() and squeeze() options, but I keep getting errors.

output:

                                                      0
0     TS_002/1     TS_002            1            2 ...
1     TS_002/2     TS_002            1            1 ...
2     TS_002/3     TS_002            1            2 ...
3     TS_002/4     TS_002            1            1 ...
4     TS_002/5     TS_002            1            2 ...
...                                                 ...
1423  TS_002/1424     TS_002            1           ...
1424  TS_002/1425     TS_002            1           ...
1425  TS_002/1426     TS_002            1           ...
1426  TS_002/1427     TS_002            1           ...
1427  TS_002/1428     TS_002            1           ...

[1428 rows x 1 columns]

    0
0   TS_002/1 TS_002 1 2 ...
1   TS_002/2 TS_002 1 1 ...
2   TS_002/3 TS_002 1 2 ...
3   TS_002/4 TS_002 1 1 ...
4   TS_002/5 TS_002 1 2 ...
...     ...
1423    TS_002/1424 TS_002 1 ...
1424    TS_002/1425 TS_002 1 ...
1425    TS_002/1426 TS_002 1 ...
1426    TS_002/1427 TS_002 1 ...
1427    TS_002/1428 TS_002 1 ...

1428 rows × 1 columns
Can you provide more lines from the input file, so we can understand the file structure? Commented Jan 17, 2021 at 20:40

1 Answer

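I think this should help. The columns in your sample are separated by a variable number of spaces rather than tabs, so sep="\t" never finds a delimiter and each line ends up in a single column. Use sep=r'\s+' to split on runs of whitespace instead (the raw string avoids Python's invalid-escape warning for '\s'):

df = pd.read_csv(starfile_name,  ...., sep=r'\s+')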
print(df)
>>>
         0       1   2   3   4       5       6       7         8        9   \
0  TS_002/1  TS_002   1   2   1  1733.0  3485.0   938.0 -1.088720 -1.08872   
1  TS_002/2  TS_002   1   1   1  1124.0   693.0  1096.0  0.411277 -1.08872   
2  TS_002/3  TS_002   1   2   1  1694.0  2329.0  1378.0  5.955277 -6.63272   

   ...                                                 14  \
0  ...  PseudoSubtomo/job052/Subtomograms/TS_002/1_dat...   
1  ...  PseudoSubtomo/job052/Subtomograms/TS_002/2_dat...   
2  ...  PseudoSubtomo/job052/Subtomograms/TS_002/3_dat...   

                                                  15  16          17  \
0  PseudoSubtomo/job052/Subtomograms/TS_002/1_wei...   1   92.905599   
1  PseudoSubtomo/job052/Subtomograms/TS_002/2_wei...   1  159.849821   
2  PseudoSubtomo/job052/Subtomograms/TS_002/3_wei...   1  127.794678   

          18          19   20         21        22   23  
0  28.438417   57.199867  1.0  1128367.0  0.017733  224  
1   4.120413  101.904501  1.0  1126854.0  0.183934   37  
2   4.085294  168.730698  1.0  1124178.0  0.184649   18  

[3 rows x 24 columns]
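
To see the difference between the two separators side by side, here is a minimal sketch. It assumes an in-memory sample that contains only the first six fields of the three rows from the question; the variable names (sample, one_col, split) are just for illustration.

```python
import io

import pandas as pd

# Abbreviated sample: only the first six fields of the three rows
# shown in the question, kept with their original spacing.
sample = (
    "  TS_002/1     TS_002            1            2            1  1733.000000 \n"
    "  TS_002/2     TS_002            1            1            1  1124.000000 \n"
    "  TS_002/3     TS_002            1            2            1  1694.000000 \n"
)

# There are no tab characters in the text, so each line becomes one field.
one_col = pd.read_csv(io.StringIO(sample), sep="\t", header=None)

# A regex that matches runs of whitespace splits the fields apart.
split = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None)

print(one_col.shape)  # one column per row
print(split.shape)    # six columns per row
print(split.dtypes)   # numeric fields are parsed as numbers, not strings
```

Because the numeric fields are parsed into numeric dtypes, calculations and plotting should work on the resulting columns without further conversion.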
1 Comment

Thank you! that did it!!
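
On the column labels that did not line up in the question: one possible approach is to take the names from the `_rln...` lines of the header itself, so their count matches the header. The sketch below is only an assumption about how that could be done. read_star_table is a hypothetical helper, not part of pandas, and the sample text is trimmed to the first two labels and the first two fields of each row from the question. It also assumes the table runs to the end of the text, so any footer rows would still need to be dropped.

```python
import io

import pandas as pd

# Trimmed sample: the header from the question reduced to its first two
# labels, and the rows reduced to their first two fields.
star_text = """# version 30001

data_particles

loop_ 
_rlnTomoParticleName #1 
_rlnTomoName #2 
  TS_002/1     TS_002 
  TS_002/2     TS_002 
  TS_002/3     TS_002 
"""


def read_star_table(text):
    """Hypothetical helper: read one loop_ table using its '_' label lines."""
    lines = text.splitlines()
    # Label lines start with an underscore; the label is the first token.
    label_idx = [i for i, ln in enumerate(lines) if ln.startswith("_")]
    names = [lines[i].split()[0] for i in label_idx]
    # Assumption: everything after the last label line is table data.
    data = "\n".join(lines[label_idx[-1] + 1:])
    return pd.read_csv(io.StringIO(data), sep=r"\s+", header=None, names=names)


df = read_star_table(star_text)
print(df)
```

For a real file, the text could come from open(starfile_name).read(). The labels will only match the data if the header lists one label per column.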
