Read, Slice and Re-structure data file block-by-block in Python

Question

A text file generated by a Fortran program contains "blocks" of data that need to be reformatted with a Python script.

Each "block" of data in this file corresponds to the "Time:" specified in the beginning of the block. All "blocks" have a fixed size and structure.

I need to extract the data from "Head" and "Moisture" columns corresponding to different "Depths" (0, -1, and -2) for each "Time:".

Note: The header at the beginning is not part of the repeating "blocks" of data.

Sample input file:

 ******* Program Simulation
 ******* 
 This is initial header information for Simulation                              
 Date:   1. 6.    Time:  15: 3:39
 Units: L = cm   , T = min  , M = mmol 

 Time:        0.0000

 Node      Depth      Head Moisture       K         
           [L]        [L]    [-]        [L/T]      

   1     0.0000     -37.743 0.0630   0.5090E-05  
   2    -1.0000     -36.123 0.0750   0.5090E-05  
   3    -2.0000     -33.002 0.0830   0.5090E-05  
end

 Time:      360.0000

 Node      Depth      Head Moisture       K         
           [L]        [L]    [-]        [L/T]     

   1     0.0000 -0.1000E+07 0.0450   0.1941E-32  
   2    -1.0000    -253.971 0.0457   0.4376E-10  
   3    -2.0000     -64.510 0.0525   0.2264E-06  
end

 Time:      720.0000

 Node      Depth      Head Moisture       K         
           [L]        [L]    [-]        [L/T]     

   1     0.0000 -0.1000E+07 0.0550   0.1941E-32  
   2    -1.0000    -282.591 0.0456   0.2613E-10 
   3    -2.0000     -71.829 0.0513   0.1229E-06  
end

Desired output:

Time        Head(Depth=0)   Head(Depth=-1)  Head(Depth=-2)  Moisture(Depth=0)   Moisture(Depth=-1)  Moisture(Depth=-2)
0.0000      -37.743         -36.123         -33.002         0.0630              0.0750              0.0830
360.0000    -0.1000E+07     -253.971        -64.510         0.0450              0.0457              0.0525
720.0000    -0.1000E+07     -282.591        -71.829         0.0550              0.0456              0.0513

How can I read the input file block-by-block, from each "Time:" keyword to the following "end" keyword, and reformat it into the desired output?

Is this output (reformatted plain text) what you actually want/need, or would a more structured text (XML, JSON) or data object (numpy.array, list of lists, dictionary) be preferred?
– heltonbiker, Commented Jun 5, 2012 at 16:40
Reformatted plain text or a CSV should be fine. I am going to load it into Excel for further analysis.
– akashwani, Commented Jun 5, 2012 at 16:44
You should pay attention to how Excel interprets/reformats numbers in scientific notation, since it does not always handle them correctly.
– heltonbiker, Commented Jun 5, 2012 at 16:58
@heltonbiker thanks for the tip. In case Excel can't import it properly, I'll reformat the scientific notation using Python before writing the output file.
– akashwani, Commented Jun 5, 2012 at 20:31
@akashwani I don't think we need a complex structure when doing ETL-style work; just keep the structure as simple as possible and then load the data (maybe use C).
– zinking, Commented Jun 6, 2012 at 2:05

Hugh Bothwell · Accepted Answer · 2012-06-06 02:14:38Z

Edit: I have made a couple of changes so it actually runs.

from itertools import chain

def get_lines(f, n=1):
    # read and return the next n lines of file object f
    return [next(f) for _ in range(n)]

class BlockReader:
    """Iterate over a file object in chunks of n lines."""
    def __init__(self, f, n=1):
        self.f = f
        self.n = n
    def __iter__(self):
        return self
    def __next__(self):
        # StopIteration raised by next(self.f) at end of file ends the loop;
        # an incomplete trailing chunk is dropped
        return [next(self.f) for _ in range(self.n)]

fmt = "{:<12}" + "{:<16}"*6 + "\n"
cols = [
    "Time",
    "Head(Depth=0)",
    "Head(Depth=-1)",
    "Head(Depth=-2)",
    "Moisture(Depth=0)",
    "Moisture(Depth=-1)",
    "Moisture(Depth=-2)"
]

def main():
    with open("simulation.txt") as inf, open("result.txt", "w") as outf:
        # throw away input header
        get_lines(inf, 5)
        # write output header
        outf.write(fmt.format(*cols))

        # read input file in ten-line chunks
        for block in BlockReader(inf, 10):
            # grab time value (second line of the chunk is " Time: <value>")
            time = float(block[1].split()[1])

            # grab head and moisture columns from the three data lines
            data = (line.split()[2:4] for line in block[6:9])
            values = ([float(x) for x in dat] for dat in data)
            h, m = zip(*values)

            # write data to output file
            outf.write(fmt.format(*chain([time], h, m)))

if __name__ == "__main__":
    main()

Output is

Time        Head(Depth=0)   Head(Depth=-1)  Head(Depth=-2)  Moisture(Depth=0)Moisture(Depth=-1)Moisture(Depth=-2)
0.0         -37.743         -36.123         -33.002         0.063           0.075           0.083           
360.0       -1000000.0      -253.971        -64.51          0.045           0.0457          0.0525          
720.0       -1000000.0      -282.591        -71.829         0.055           0.0456          0.0513
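Note that going through float() changes how the numbers are printed (0.0000 becomes 0.0, -0.1000E+07 becomes -1000000.0). If the original text should be preserved, or a CSV for Excel is preferred, one option is to keep the fields as strings and let the csv module write them. This is only a sketch under the same assumptions as the code above (a 5-line file header and 10-line blocks); the names block_to_row and convert, and the file names in the usage note, are made up for illustration.

```python
import csv
from itertools import islice

HEADER_LINES = 5   # assumption: same 5-line file header as in the sample
BLOCK_LINES = 10   # assumption: every block is exactly 10 lines long

COLUMNS = [
    "Time",
    "Head(Depth=0)", "Head(Depth=-1)", "Head(Depth=-2)",
    "Moisture(Depth=0)", "Moisture(Depth=-1)", "Moisture(Depth=-2)",
]

def block_to_row(block):
    """Turn one 10-line block into [time, heads..., moistures...] as strings."""
    time = block[1].split()[1]                       # " Time: <value>"
    fields = [line.split() for line in block[6:9]]   # the three data lines
    heads = [f[2] for f in fields]
    moistures = [f[3] for f in fields]
    return [time] + heads + moistures

def convert(in_path, out_path):
    """Write one CSV row per block, keeping the numbers exactly as in the input."""
    with open(in_path) as inf, open(out_path, "w", newline="") as outf:
        writer = csv.writer(outf)
        writer.writerow(COLUMNS)
        # skip the file header
        for _ in islice(inf, HEADER_LINES):
            pass
        while True:
            block = list(islice(inf, BLOCK_LINES))
            if len(block) < BLOCK_LINES:
                break  # end of file (or an incomplete trailing block)
            writer.writerow(block_to_row(block))
```

It could be called as convert("simulation.txt", "result.csv"). How Excel then displays values such as -0.1000E+07 still needs to be checked after import.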

edited Jun 6, 2012 at 2:14

answered Jun 5, 2012 at 16:36

4 Comments

akashwani Over a year ago

I am trying to debug your code. The line time = float(block[1].split()[1]) gives this error IndexError: string index out of range. Any idea? @hugh-bothwell

Hugh Bothwell Over a year ago

@akashwani: yes, my error: I was thinking in terms of a file iterator returning ten-line lists, when what I actually wrote returned a ten-line list and then operated on each line. I have made a quick fix, and am now looking at a better (more Pythonic) version.

Hugh Bothwell Over a year ago

@akashwani: I have created a BlockReader class which operates as I had originally intended, returning ten-line chunks of the input file. Hope this helps!

akashwani Over a year ago

Thank you, I am using your code. It's definitely more Pythonic and well structured. Thanks again!

georg · Accepted Answer · 2012-06-05 16:33:08Z

Here's the parsing part:

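import re

data = []

# xxxx stands for the path of the input file
with open(xxxx) as f:
    for line in f:
        # a " Time: <value>" line starts a new block
        m = re.match(r'^\s+Time:\s+([\d.]+)', line)
        if m:
            data.append([float(m.group(1))])
        # a line starting with a node number is a data row of the current block
        elif re.match(r'^\s+\d+', line):
            data[-1].append([float(x) for x in line.split()])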

produces:

[[0.0,
  [1.0, 0.0, -37.743, 0.063, 5.09e-06],
  [2.0, -1.0, -36.123, 0.075, 5.09e-06],
  [3.0, -2.0, -33.002, 0.083, 5.09e-06]],
 [360.0,
  [1.0, 0.0, -1000000.0, 0.045, 1.941e-33],
  [2.0, -1.0, -253.971, 0.0457, 4.376e-11],
  [3.0, -2.0, -64.51, 0.0525, 2.264e-07]],
 [720.0,
  [1.0, 0.0, -1000000.0, 0.055, 1.941e-33],
  [2.0, -1.0, -282.591, 0.0456, 2.613e-11],
  [3.0, -2.0, -71.829, 0.0513, 1.229e-07]]]

It should be easy to print the desired table from this.
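As a rough sketch of that last step, the nested list can be turned into table lines as below. The function name format_rows and the column widths are choices made here for illustration, and the rows are looked up by their depth value, which assumes every block contains the depths 0, -1 and -2.

```python
def format_rows(data, depths=(0.0, -1.0, -2.0)):
    """Yield one formatted text line per time step, header line first.

    Each entry of data is [time, node_row, node_row, ...] and each node_row
    is [node, depth, head, moisture, K], as built by the parsing loop.
    """
    names = ["Time"]
    names += ["Head(Depth=%g)" % d for d in depths]
    names += ["Moisture(Depth=%g)" % d for d in depths]
    fmt = "{:<12}" + "{:<20}" * (len(names) - 1)
    yield fmt.format(*names).rstrip()

    for time, *rows in data:
        # index the node rows of this block by their depth value
        by_depth = {row[1]: row for row in rows}
        heads = [by_depth[d][2] for d in depths]
        moistures = [by_depth[d][3] for d in depths]
        yield fmt.format(time, *heads, *moistures).rstrip()


if __name__ == "__main__":
    # first block of the parsed result, used as sample input
    sample = [[0.0,
               [1.0, 0.0, -37.743, 0.063, 5.09e-06],
               [2.0, -1.0, -36.123, 0.075, 5.09e-06],
               [3.0, -2.0, -33.002, 0.083, 5.09e-06]]]
    for line in format_rows(sample):
        print(line)
```

Since the values were converted with float(), they are printed in Python's default float format rather than in the original text form.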

Justin Blank · Accepted Answer · 2012-06-05 16:04:54Z

If the file isn't too large, you can do:

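with open('somefile') as f:
    text = f.read()

# split at the lines that start a block; splitting on 'Time:' alone would also
# cut the header line "Date: ... Time: 15: 3:39" and produce a bogus first block
blocks = text.split('\n Time:')[1:]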

answered Jun 5, 2012 at 16:04

1 Comment

akashwani Over a year ago

The input file is large in size. How do I restructure each block into the desired output format.
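For a large file, one alternative to reading everything at once is to stream the lines and hold only one block in memory at a time. The sketch below follows the "Time:" to "end" structure described in the question; iter_blocks is a hypothetical helper introduced here, and it assumes data rows are the only lines inside a block that start with an integer node number.

```python
def iter_blocks(lines):
    """Yield (time, rows) for each block running from a 'Time:' line to 'end'.

    time is the time value as a string; rows is a list of whitespace-split
    data lines, i.e. [node, depth, head, moisture, K] as strings.
    """
    time = None
    rows = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue                      # blank line
        if parts[0] == "Time:":
            time, rows = parts[1], []     # a new block starts
        elif parts[0] == "end":
            if time is not None:
                yield time, rows          # block is complete
            time = None
        elif time is not None and parts[0].isdigit():
            rows.append(parts)            # data row of the current block


def block_to_fields(time, rows):
    """Return [time, heads..., moistures...] for one block, as strings."""
    return [time] + [r[2] for r in rows] + [r[3] for r in rows]
```

Because a file object is itself an iterator over lines, it can be passed in directly, for example `for time, rows in iter_blocks(open_file): print(*block_to_fields(time, rows))`. The header line containing "Date: ... Time: 15: 3:39" is ignored because its first token is "Date:".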
