Parsing formatted text file into CSV

Question

I have a few hundred of these job metric definitions in a single file, and I'm trying to parse them into a formatted .csv document:

Job Name                                                         Last Start           Last End             ST Run     Pri/Xit
________________________________________________________________ ____________________ ____________________ __ _______ ___
B9043CC_APP_DMLD_025_FR_xpabbdu1_D                               03/12/2014 18:21:32  03/12/2014 18:22:07  SU 49744331/3

  Status/[Event]  Time                 Ntry ES  ProcessTime           Machine
  --------------  --------------------- --  --  --------------------- ----------------------------------------
  [FORCE_STARTJOB]  03/12/2014 17:30:52    0  PD  03/12/2014 17:30:53
    < >
  STARTING        03/12/2014 17:30:53    1  PD  03/12/2014 17:30:53   ab-shared-batch
  RUNNING         03/12/2014 17:31:06    1  PD  03/12/2014 17:31:07   ab-shared-batch
  SUCCESS         03/12/2014 17:31:46    1  PD  03/12/2014 17:31:47
  [FORCE_STARTJOB]  03/12/2014 18:16:06    0  PD  03/12/2014 18:16:07
    < >
  STARTING        03/12/2014 18:16:07    2  PD  03/12/2014 18:16:07   ab-shared-batch-
  RUNNING         03/12/2014 18:16:19    2  PD  03/12/2014 18:16:20   ab-shared-batch-
  FAILURE         03/12/2014 18:17:02    2  PD  03/12/2014 18:17:03
  [*** ALARM ***]
    JOBFAILURE    03/12/2014 18:17:03    2  PD  03/12/2014 18:17:04
  [FORCE_STARTJOB]  03/12/2014 18:21:18    0  PD  03/12/2014 18:21:19
    < >
  STARTING        03/12/2014 18:21:19    3  PD  03/12/2014 18:21:19   ab-shared-batch-
  RUNNING         03/12/2014 18:21:32    3  PD  03/12/2014 18:21:32   ab-shared-batch-
  SUCCESS         03/12/2014 18:22:07    3  PD  03/12/2014 18:22:08

I would like my output to look like this:

System Number  Job Name                           Target Machine     Status     Actual Start Date     Actual Start Time      Actual End Date    Actual End Time
9043           B9043CC_APP_DMLD_025_FR_xpabbdu1_D ab-shared-batch    SUCCESS       03/12/2014               17:30:53            03/12/2014         17:31:47
9043           B9043CC_APP_DMLD_025_FR_xpabbdu1_D ab-shared-batch    FAILURE       03/12/2014               18:16:07            03/12/2014         18:17:03
9043           B9043CC_APP_DMLD_025_FR_xpabbdu1_D ab-shared-batch    SUCCESS       03/12/2014               18:21:19            03/12/2014         18:22:08

The actual start/end times and actual start/end dates come from the "ProcessTime" column. I only want the data shown above, and I don't want any of the other text (including the "----" rulers) to appear anywhere in the .csv file. As mentioned above, I have a few hundred of those definitions in a single file.

I know Python has a built-in csv module, which I am using to write the column labels:

import csv
import sys

infile = '/home/n5acc7/test/output/testtest.csv'
f = open(infile, 'wt')
try:
    writer = csv.writer(f)
    writer.writerow(('System Number', 'Job Name', 'Target Machine', 'Status', 'Actual Start Date', 'Actual Start Time', 'Actual End Date', 'Actual End Time'))
finally:
    f.close()

But from the parsing perspective, I'm not sure where to start. I'm running Python 2.4.3.

The csv module can read as well as write. Have you tried using the other portion of it?
– Two-Bit Alchemist, Commented Mar 20, 2014 at 16:39

Hugh Bothwell · Accepted Answer · 2014-03-21 15:05:34Z

2

Parsing this looks pretty straightforward.

General logic:

read six lines (header)
get system number and batch name

until end of file:
    read five lines
    get machine name, status, start and end dates and times
    if status is FAILURE
        read two lines (clear error message)

And some actual code (targeted at Python 2.7; the with statement, str.format() and the next() built-in do not exist in Python 2.4, so you'll have to do some back-porting or switch to a more up-to-date Python):

INPUT = "/home/n5acc7/test/input/batch1.log"
OUTPUT = "/home/n5acc7/test/output/testtest.csv"

LINE = "{:<6} {:34} {:18} {:10} {:10} {:10} {:10} {:10}\n"

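def get_lines(n, inf):
    # read the next n lines; raises StopIteration at end of file
    return [next(inf) for _ in xrange(n)]

def read_header(inf):
    # head[2] is the job line; its first field is the job name
    head = get_lines(6, inf)
    job_name = head[2].split(None, 1)[0]
    # characters 1-4 of the job name are the system number, e.g. "9043"
    system_num = job_name[1:5]
    return system_num, job_name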

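def read_record(inf):
    # one run = FORCE_STARTJOB, "< >", STARTING, RUNNING, final status
    record    = get_lines(5, inf)
    startline = record[2].split()
    # fields 5-7 of the STARTING line: ProcessTime date, time, machine
    sd, st, name = startline[5:8]
    endline   = record[4].split()
    status    = endline[0]
    # fields 5-6 of the final line: ProcessTime date and time
    ed, et    = endline[5:7]
    # skip failure message
    if status == "FAILURE":
        get_lines(2, inf)
    return name, status, sd, st, ed, et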

def parse_jobfile(fname):
    with open(fname) as inf:
        try:
            batch = read_header(inf)
            while True:
                job = read_record(inf)
                yield batch + job
        except StopIteration:
            # end of file
            pass

def main():
    with open(OUTPUT, "w") as outf:
        outf.write(LINE.format("SysNum", "Job Name", "Target Machine", "Status", "Start Date", "Start Time", "End Date", "End Time"))
        for result in parse_jobfile(INPUT):
            outf.write(LINE.format(*result))

if __name__ == "__main__":
    main()
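
The code above reads a single job header and writes fixed-width columns. If the file holds many job blocks back to back, a line-by-line scan may be more forgiving than fixed line counts. The following is only a minimal sketch under stated assumptions: every job block is assumed to start with an underscore ruler followed by the job line, only SUCCESS and FAILURE are treated as final statuses, and the names parse_jobs, write_csv, HEADER and SAMPLE are made up for illustration. It is written for Python 3, not 2.4, and uses the csv module so the output is actually comma-separated.

```python
import csv
import sys

HEADER = ("System Number", "Job Name", "Target Machine", "Status",
          "Actual Start Date", "Actual Start Time",
          "Actual End Date", "Actual End Time")

def parse_jobs(lines):
    """Yield one tuple per run, in HEADER order."""
    job_name = None
    expect_job = False   # True right after an underscore ruler line
    start = None         # (date, time, machine) of the latest STARTING line
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("____"):
            expect_job = True
        elif expect_job:
            job_name = fields[0]
            expect_job = False
        elif fields[0] == "STARTING" and len(fields) >= 8:
            # fields 5-7: ProcessTime date, ProcessTime time, machine
            start = (fields[5], fields[6], fields[7])
        elif (fields[0] in ("SUCCESS", "FAILURE") and job_name and start
              and len(fields) >= 7):
            sd, st, machine = start
            yield (job_name[1:5], job_name, machine, fields[0],
                   sd, st, fields[5], fields[6])
            start = None

def write_csv(rows, outf):
    """Write the header row and all parsed rows as CSV."""
    writer = csv.writer(outf)
    writer.writerow(HEADER)
    writer.writerows(rows)

# shortened sample in the same layout as the question's input
SAMPLE = """\
Job Name                            Last Start           Last End             ST
___________________________________ ____________________ ____________________ __
B9043CC_APP_DMLD_025_FR_xpabbdu1_D  03/12/2014 18:21:32  03/12/2014 18:22:07  SU

  Status/[Event]  Time                 Ntry ES  ProcessTime           Machine
  --------------  --------------------- --  --  --------------------- ----------
  [FORCE_STARTJOB]  03/12/2014 18:16:06    0  PD  03/12/2014 18:16:07
    < >
  STARTING        03/12/2014 18:16:07    2  PD  03/12/2014 18:16:07   ab-shared-batch
  RUNNING         03/12/2014 18:16:19    2  PD  03/12/2014 18:16:20   ab-shared-batch
  FAILURE         03/12/2014 18:17:02    2  PD  03/12/2014 18:17:03
  [*** ALARM ***]
    JOBFAILURE    03/12/2014 18:17:03    2  PD  03/12/2014 18:17:04
"""

if __name__ == "__main__":
    write_csv(parse_jobs(SAMPLE.splitlines()), sys.stdout)
```

Because each row is emitted when a final status line is seen, the ALARM and JOBFAILURE lines need no special handling; they simply match none of the branches.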

edited Mar 21, 2014 at 15:05

answered Mar 20, 2014 at 17:51

Hugh Bothwell


7 Comments

Matt Over a year ago

Thanks! What exactly is get_header, though? @Hugh Bothwell

Hugh Bothwell Over a year ago

@Matt: it's a mistake I made when cleaning up the function names :-/ should have been read_header, fixed now.

Matt Over a year ago

Okay, that's what I thought. Thanks! @Hugh Bothwell

Matt Over a year ago

Also, I believe there is supposed to be another value in "sd, st, name = startline[5:8]". @Hugh Bothwell

Hugh Bothwell Over a year ago

@Matt: um, no? startline[5:8] gives you items 5, 6, and 7, which are the startdate, starttime, and machine name. Python slice syntax is a bit like range(), the last item (8) is not included.
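
To see the slice in action, here is a small illustrative check using one STARTING line copied from the question's sample (the variable names follow the answer's code):

```python
# a STARTING line from the question's sample input
line = "  STARTING        03/12/2014 18:16:07    2  PD  03/12/2014 18:16:07   ab-shared-batch"
startline = line.split()

# indices: 0 status, 1-2 event date/time, 3 Ntry, 4 ES,
#          5-6 ProcessTime date/time, 7 machine
sd, st, name = startline[5:8]   # items 5, 6 and 7; index 8 is excluded

print(sd, st, name)
```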


user176692 · Accepted Answer · 2014-03-20 17:16:03Z

1

How are you with regular expressions? Python supports them through the re module, and Perl is excellent for file processing. CSV files can be tab- or comma-delimited (the format has some variance), so once you have a file handle it's an easy format to write to. You don't have to restrict yourself to a language's CSV capabilities, as long as you are comfortable with the language and it is efficient for parsing. As far as regular expressions go, here are some introductory links (if you run into more specific parsing scenarios once you settle on an approach, I can update this to address them):

Python re

perlreref

There are more Perl ones, such as:

perlre

Understand basic Regex
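
As a rough illustration of the regular-expression route in Python, the sketch below matches one status line from the question's sample. The pattern and the names STATUS_RE and parse_status_line are assumptions made for this example, based only on the column layout shown in the question; real files may need a looser pattern.

```python
import re

# One status line: keyword, event date/time, Ntry, ES, ProcessTime, machine
STATUS_RE = re.compile(
    r"^\s*(?P<status>STARTING|RUNNING|SUCCESS|FAILURE)\s+"
    r"(?P<event_date>\d{2}/\d{2}/\d{4})\s+(?P<event_time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<ntry>\d+)\s+(?P<es>\w+)\s+"
    r"(?P<proc_date>\d{2}/\d{2}/\d{4})\s+(?P<proc_time>\d{2}:\d{2}:\d{2})"
    r"(?:\s+(?P<machine>\S+))?\s*$"   # machine column is empty on some lines
)

def parse_status_line(line):
    """Return a dict of the named fields, or None if the line does not match."""
    m = STATUS_RE.match(line)
    return m.groupdict() if m else None

if __name__ == "__main__":
    print(parse_status_line(
        "  STARTING        03/12/2014 18:21:19    3  PD  03/12/2014 18:21:19   ab-shared-batch-"))
```

Lines such as [FORCE_STARTJOB], the "< >" marker, the rulers and the JOBFAILURE alarm do not start with one of the four keywords, so parse_status_line returns None for them and they can simply be skipped.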

edited Mar 20, 2014 at 17:16

answered Mar 20, 2014 at 16:55

user176692

