0

I have a data file that is essentially a fixed-width text file. There are a variable number of spaces and positioning in the text. I'm trying to parse the file into a list with python and can't figure out the appropriate regex (of course am open to non-regex options too).

Date              Run    By           Level 1        Level 2        Level 3        Level 4        Level 5        Level 6        Level 7        Level 8        Level 9


11-15-2014 12:27:43 AM 1    **                          259.0                                                                                                         
11-15-2014 7:47:09 AM 1    **            98.0                                                                                                                        
11-15-2014 3:45:07 PM 1    **                                         785.0                                                                                          
11-16-2014 12:27:43 AM 1    **                          245.0                                                                                                         
11-16-2014 7:51:36 AM 1    **            96.0                                                                                                                        
11-16-2014 3:43:12 PM 1    **                                         788.0                                                                                          
11-17-2014 12:27:43 AM 1    **                          248.0                                                                                                         
11-17-2014 7:51:21 AM 1    **           104.0                                                                                                                        
11-17-2014 12:45:57 PM 1    **            97.0          257.0          793.0                                                                                          
11-17-2014 3:46:33 PM 1    **                                         792.0                                                                                          
11-18-2014 12:32:31 AM 1    **                          253.0                                                                                                         
11-18-2014 7:50:31 AM 1    **           104.0                                                                                                                        
11-18-2014 3:48:43 PM 1    **                                         781.0                                                                                          
11-19-2014 12:30:36 AM 1    **                          260.0                                                                                                         
11-19-2014 8:40:26 AM 1    **           102.0                                                                                                                        
11-19-2014 3:47:45 PM 1    **                                         803.0                                                                                          
11-20-2014 12:28:40 AM 1    **                          243.0                                                                                                         
11-20-2014 7:53:38 AM 1    **           107.0                                                                                                                        
11-20-2014 3:43:55 PM 1    **                                         787.0                                                                                          
11-21-2014 1:03:45 AM 0    PS                          245.0                                                                                                         
11-21-2014 7:52:55 AM 1    **           101.0                                                                                                                        
11-21-2014 3:44:09 PM 1    **                                         789.0                                                                                          
11-22-2014 12:37:26 AM 1    **                          250.0                                                                                                         
11-22-2014 7:49:55 AM 1    **           103.0                                                                                                                        

Thus far I've tried:

for line in f:
    line = re.split(r' (?=[A-Z])| (?=[0-9])|            ',line)

However, I don't get even alignment of the columns. I need them to line up for use downstream.

The desired output is (sorry for the limited number of rows, parsing it manually is deadly!).

['Date', '', 'Run', 'By', 'Level 1', 'Level 2', 'Level 3', 'Level 4', 'Level 5', 'Level 6', 'Level 7', 'Level 8', 'Level 9','\r\n']
['\r\n']
['\r\n']
['11-15-2014', '12:27:43', 'AM 1', '**', '', '259.0', '', '', '', '', '', '', '', '\r\n']
['11-15-2014',  '7:47:09', 'AM 1', '**', '98.0', '', '', '', '', '', '', '', '', '\r\n']
['11-15-2014',  '3:45:07', 'PM 1', '**', '', '', '785.0', '', '', '', '', '', '', '\r\n']
...
...
['11-17-2014', '12:45:57', 'PM 1',  '**', '97.0', '257.0', '793.0', '', '', '', '', '', '', '\r\n']

In essence 13 items followed by a line break; combining date and time into a single field would be fine, mostly I need the dates and three levels to line up properly; there are only values for Level 1, Level 2, and Level 3. Values are usually a single level/row, but occasionally there are all three (as shown).

6
  • what is d expected output Commented Jan 15, 2015 at 16:41
  • 1
    I strongly recommend changing that file's layout, if possible. It's a mess, and splitting/parsing it will inevitably be hacky. Commented Jan 15, 2015 at 16:55
  • whats the logic behind your splitting ? show a desire output Commented Jan 15, 2015 at 16:58
  • Are there any tabs in the file? It looks like it's tab separated. Commented Jan 15, 2015 at 17:09
  • OP confirms these are spaces. The right answer is change file format to some standard format. But I'm sure some hacks are coming our way :) Commented Jan 15, 2015 at 17:12

4 Answers 4

1

Seems like the only part that has variable width is the date/time. I would suggest this:

m = re.match(r'(\d+-\d+-\d+ \d+:\d+:\d+) (.{4})(.{6})(.{16})(.{15})(.{15})', line)
if m:
    print [x.strip() for x in m.groups()]

Output:

['11-15-2014 12:27:43', 'AM 1', '**', '', '259.0', '']
['11-15-2014 7:47:09', 'AM 1', '**', '98.0', '', '']
['11-15-2014 3:45:07', 'PM 1', '**', '', '', '785.0']
['11-16-2014 12:27:43', 'AM 1', '**', '', '245.0', '']
['11-16-2014 7:51:36', 'AM 1', '**', '96.0', '', '']
['11-16-2014 3:43:12', 'PM 1', '**', '', '', '788.0']
['11-17-2014 12:27:43', 'AM 1', '**', '', '248.0', '']
['11-17-2014 7:51:21', 'AM 1', '**', '104.0', '', '']
['11-17-2014 12:45:57', 'PM 1', '**', '97.0', '257.0', '793.0']
['11-17-2014 3:46:33', 'PM 1', '**', '', '', '792.0']
['11-18-2014 12:32:31', 'AM 1', '**', '', '253.0', '']
['11-18-2014 7:50:31', 'AM 1', '**', '104.0', '', '']
['11-18-2014 3:48:43', 'PM 1', '**', '', '', '781.0']
['11-19-2014 12:30:36', 'AM 1', '**', '', '260.0', '']
['11-19-2014 8:40:26', 'AM 1', '**', '102.0', '', '']
['11-19-2014 3:47:45', 'PM 1', '**', '', '', '803.0']
['11-20-2014 12:28:40', 'AM 1', '**', '', '243.0', '']
['11-20-2014 7:53:38', 'AM 1', '**', '107.0', '', '']
['11-20-2014 3:43:55', 'PM 1', '**', '', '', '787.0']
['11-21-2014 1:03:45', 'AM 0', 'PS', '', '245.0', '']
['11-21-2014 7:52:55', 'AM 1', '**', '101.0', '', '']
['11-21-2014 3:44:09', 'PM 1', '**', '', '', '789.0']
['11-22-2014 12:37:26', 'AM 1', '**', '', '250.0', '']
['11-22-2014 7:49:55', 'AM 1', '**', '103.0', '', '']

(While it would be more typical to group the AM/PM with the time, I'm trying to follow the description of desired output)

Sign up to request clarification or add additional context in comments.

1 Comment

Ended up using this one. Works perfectly with everything I've thrown at it so far. Thanks!
1

I cannot say how reliable this is in a production environment, but it works on the example data.

Given:

txt='''\
Date              Run    By           Level 1        Level 2        Level 3        Level 4        Level 5        Level 6        Level 7        Level 8        Level 9


11-15-2014 12:27:43 AM 1    **                          259.0                                                                                                         
11-15-2014 7:47:09 AM 1    **            98.0                                                                                                                        
11-15-2014 3:45:07 PM 1    **                                         785.0                                                                                          
11-16-2014 12:27:43 AM 1    **                          245.0                                                                                                         
11-16-2014 7:51:36 AM 1    **            96.0                                                                                                                        
11-16-2014 3:43:12 PM 1    **                                         788.0                                                                                          
11-17-2014 12:27:43 AM 1    **                          248.0                                                                                                         
11-17-2014 7:51:21 AM 1    **           104.0                                                                                                                        
11-17-2014 12:45:57 PM 1    **            97.0          257.0          793.0                                                                                          
11-17-2014 3:46:33 PM 1    **                                         792.0                                                                                          
11-18-2014 12:32:31 AM 1    **                          253.0                                                                                                         
11-18-2014 7:50:31 AM 1    **           104.0                                                                                                                        
11-18-2014 3:48:43 PM 1    **                                         781.0                                                                                          
11-19-2014 12:30:36 AM 1    **                          260.0                                                                                                         
11-19-2014 8:40:26 AM 1    **           102.0                                                                                                                        
11-19-2014 3:47:45 PM 1    **                                         803.0                                                                                          
11-20-2014 12:28:40 AM 1    **                          243.0                                                                                                         
11-20-2014 7:53:38 AM 1    **           107.0                                                                                                                        
11-20-2014 3:43:55 PM 1    **                                         787.0                                                                                          
11-21-2014 1:03:45 AM 0    PS                          245.0                                                                                                         
11-21-2014 7:52:55 AM 1    **           101.0                                                                                                                        
11-21-2014 3:44:09 PM 1    **                                         789.0                                                                                          
11-22-2014 12:37:26 AM 1    **                          250.0                                                                                                         
11-22-2014 7:49:55 AM 1    **           103.0                     '''

Try:

import re

data=txt.splitlines()
header=data.pop(0)

for line in data:

    m=re.search(r'^([\d\-\s:]+)(AM|PM)\s+(\d)\s+(..)([\s\d\.]+)$', line)
    if m:
        l=[]
        l.append(m.group(1)+m.group(2))
        l.append(m.group(3))
        l.append(m.group(4))
        l.append([e.strip() for e in re.findall(r'(\s{15,16}|\s*\d+\.\d)', m.group(5))])
        print l

Prints:

['11-15-2014 12:27:43 AM', '1', '**', ['', '259.0', '', '', '', '', '', '']]
['11-15-2014 7:47:09 AM', '1', '**', ['98.0', '', '', '', '', '', '', '']]
['11-15-2014 3:45:07 PM', '1', '**', ['', '', '785.0', '', '', '', '', '']]
['11-16-2014 12:27:43 AM', '1', '**', ['', '245.0', '', '', '', '', '', '']]
['11-16-2014 7:51:36 AM', '1', '**', ['96.0', '', '', '', '', '', '', '']]
['11-16-2014 3:43:12 PM', '1', '**', ['', '', '788.0', '', '', '', '', '']]
['11-17-2014 12:27:43 AM', '1', '**', ['', '248.0', '', '', '', '', '', '']]
['11-17-2014 7:51:21 AM', '1', '**', ['104.0', '', '', '', '', '', '', '']]
['11-17-2014 12:45:57 PM', '1', '**', ['97.0', '257.0', '793.0', '', '', '', '', '']]
['11-17-2014 3:46:33 PM', '1', '**', ['', '', '792.0', '', '', '', '', '']]
['11-18-2014 12:32:31 AM', '1', '**', ['', '253.0', '', '', '', '', '', '']]
['11-18-2014 7:50:31 AM', '1', '**', ['104.0', '', '', '', '', '', '', '']]
['11-18-2014 3:48:43 PM', '1', '**', ['', '', '781.0', '', '', '', '', '']]
['11-19-2014 12:30:36 AM', '1', '**', ['', '260.0', '', '', '', '', '', '']]
['11-19-2014 8:40:26 AM', '1', '**', ['102.0', '', '', '', '', '', '', '']]
['11-19-2014 3:47:45 PM', '1', '**', ['', '', '803.0', '', '', '', '', '']]
['11-20-2014 12:28:40 AM', '1', '**', ['', '243.0', '', '', '', '', '', '']]
['11-20-2014 7:53:38 AM', '1', '**', ['107.0', '', '', '', '', '', '', '']]
['11-20-2014 3:43:55 PM', '1', '**', ['', '', '787.0', '', '', '', '', '']]
['11-21-2014 1:03:45 AM', '0', 'PS', ['', '245.0', '', '', '', '', '', '']]
['11-21-2014 7:52:55 AM', '1', '**', ['101.0', '', '', '', '', '', '', '']]
['11-21-2014 3:44:09 PM', '1', '**', ['', '', '789.0', '', '', '', '', '']]
['11-22-2014 12:37:26 AM', '1', '**', ['', '250.0', '', '', '', '', '', '']]
['11-22-2014 7:49:55 AM', '1', '**', ['103.0', '']]

Comments

0

This looks like a tsv format, or tab separated values. Try splitting lines on tabs:

for line in f:
    print line.split('\t')

If this is the case, you can use the csv module, setting tab as the separator.

Edit:

OP confirms this isn't a tsv. This is my suggestion:

headers = None
for line in input_file:
    splits = line.split('    ')
    if headers:
        print zip(headers, [v.strip() for v in splitter.split(line)])
        continue
    headers = splits

5 Comments

Sadly it is not delimited with tabs, those are spaces as shown.
@Minnow removed regex. Using four spaces as split criteria seems to do the trick.
Appreciate your efforts, but NameError: name 'headers' is not defined.
@Minnow Fixed. simply add headers = None, it's just to keep the headers. Use csv module to do this if it's possible to set the separator to four spaces.
The levels are aligned properly. e.g. top row of data, level 2=259.
0

I wouldn't use re.split(), but instead re.findall() with something like (\d{2}-\d{2}-\d{4})\s+(\d{,2}:\d{2}:\d{2})\s(\wM \d)\s+\*\*\s{10,15}([0-9.]*)\s{10,15}([0-9.]*)\s{10,15}([0-9.]*).

I recognize this is dirty, but since this doesn't seem to be a fixed length of spaces as delimiters, this might do the trick. Will cease to work if the numbers get large.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.