
I have a .dat file of coordinates (x, y, and z), separated by a marker (an integer). Here's a snippet of it:

500
0.14166    0.09077      0
0.11918    0.08461      0
0.09838    0.07771      0
0.07937    0.07022      0
0.06223    0.06222      0
0.04705    0.05386      0
0.03388    0.04528      0
0.02281    0.03663      0
0.01391    0.02808      0
42
0.00733    0.01969      0
0.00297    0.01152      0
0.01809    -0.01422     0
0.03068    -0.01687     0
0.14166    0.09077      0
0.11918    0.08461      0
0.09838    0.07771      0
0.07937    0.07022      0
42
0.14166    0.09077      0
0.11918    0.08461      0
0.09838    0.07771      0
0.07937    0.07022      0

What's the best way to separate it into chunks (preferably one array per interval between markers)?

This is just a fraction of the data; in reality there are a few thousand points.

  • Read it line by line, adding each line to a chunk until there is a line that contains only one number. Then start a new chunk. Represent each chunk as a list (see the sketch after these comments). Commented Jan 16, 2023 at 20:30
  • @mkrieger1 is it really the only alternative? Does read time increase with file size? Commented Jan 16, 2023 at 20:31
  • No, there are infinitely many alternatives. Of course the time increases with the file size. Commented Jan 16, 2023 at 20:31
  • 1
    @LucasPelizzarim, are those chunks uniformly consist of 9 lines? Does the input file always start with a marker line? Commented Jan 16, 2023 at 20:35
  • No, they aren't always 9 lines. The file always starts with 500, and the blocks of coordinates start and end with 42, except for the last one. Commented Jan 16, 2023 at 20:46
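
For reference, here's a minimal pure-Python sketch of the line-by-line approach from the first comment (assuming the file is named test.dat and that marker lines contain a single number):

chunks = []
current = []
with open('test.dat') as f:
    for line in f:
        fields = line.split()
        if not fields:
            continue                      # skip blank lines
        if len(fields) == 1:              # marker line: start a new chunk
            if current:
                chunks.append(current)
            current = []
        else:
            current.append([float(v) for v in fields])
if current:                               # keep the trailing chunk too
    chunks.append(current)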

1 Answer


I would suggest applying the power of the pandas and numpy libraries.

We start by loading the input file into a dataframe, skipping the first row (skiprows=1) and explicitly specifying the number of columns via column names (names=['x','y','z']). This means that marker lines are treated as 1-column rows with NaN values (like 42.00000 NaN NaN):

import pandas as pd
import numpy as np

# skip the leading marker line ("500") and name 3 columns so that
# single-number marker lines become rows with NaN in 'y' and 'z'
coords = pd.read_table('test.dat', delim_whitespace=True, header=None,
                       engine='python', skiprows=1, names=['x','y','z'])

Then we find the positions of the marker lines, at which the coords dataframe will be split into chunks:

na_markers = coords.loc[coords['y'].isna()].index  # row indices of the marker lines

Finally, we split and get the needed numpy arrays:

# split at the marker rows, drop the NaN marker row from each chunk, convert to numpy
coords = [chunk.dropna().to_numpy() for chunk in np.split(coords, na_markers)]

That's it. Now coords contains a list of the needed coordinate "chunks":

[array([[0.14166, 0.09077, 0.     ],
       [0.11918, 0.08461, 0.     ],
       [0.09838, 0.07771, 0.     ],
       [0.07937, 0.07022, 0.     ],
       [0.06223, 0.06222, 0.     ],
       [0.04705, 0.05386, 0.     ],
       [0.03388, 0.04528, 0.     ],
       [0.02281, 0.03663, 0.     ],
       [0.01391, 0.02808, 0.     ]]), array([[ 0.00733,  0.01969,  0.     ],
       [ 0.00297,  0.01152,  0.     ],
       [ 0.01809, -0.01422,  0.     ],
       [ 0.03068, -0.01687,  0.     ],
       [ 0.14166,  0.09077,  0.     ],
       [ 0.11918,  0.08461,  0.     ],
       [ 0.09838,  0.07771,  0.     ],
       [ 0.07937,  0.07022,  0.     ]]), array([[0.14166, 0.09077, 0.     ],
       [0.11918, 0.08461, 0.     ],
       [0.09838, 0.07771, 0.     ],
       [0.07937, 0.07022, 0.     ]])]
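
Each chunk is an ordinary (n, 3) numpy array, so (as an illustrative sketch) you can inspect the sizes or pull out individual columns directly:

# illustrative only: print each chunk's shape and take its x/y columns
for i, chunk in enumerate(coords):
    print(i, chunk.shape)          # (9, 3), (8, 3), (4, 3) for the snippet above
    x, y = chunk[:, 0], chunk[:, 1]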

1 Comment

this worked flawlessly!
