
I'm trying to do the following in Python, with some bash scripting, unless there is an easier way to do it all in Python.

I have a log file with data that looks like the following:

16:14:59.027003 - WARN - Cancel Latency: 100ms - OrderId: 311yrsbj - On Venue: ABCD
16:14:59.027010 - WARN - Ack Latency: 25ms - OrderId: 311yrsbl - On Venue: EFGH
16:14:59.027201 - WARN - Ack Latency: 22ms - OrderId: 311yrsbn - On Venue: IJKL
16:14:59.027235 - WARN - Cancel Latency: 137ms - OrderId: 311yrsbp - On Venue: MNOP
16:14:59.027256 - WARN - Cancel Latency: 220ms - OrderId: 311yrsbr - On Venue: QRST
16:14:59.027293 - WARN - Ack Latency: 142ms - OrderId: 311yrsbt - On Venue: UVWX
16:14:59.027329 - WARN - Cancel Latency: 134ms - OrderId: 311yrsbv - On Venue: YZ  
16:14:59.027359 - WARN - Ack Latency: 75ms - OrderId: 311yrsbx - On Venue: ABCD
16:14:59.027401 - WARN - Cancel Latency: 66ms - OrderId: 311yrsbz - On Venue: ABCD
16:14:59.027426 - WARN - Cancel Latency: 212ms - OrderId: 311yrsc1 - On Venue: EFGH
16:14:59.027470 - WARN - Cancel Latency: 89ms - OrderId: 311yrsf7 - On Venue: IJKL  
16:14:59.027495 - WARN - Cancel Latency: 97ms - OrderId: 311yrsay - On Venue: IJKL

I need to extract the last field from each line, then take each unique value, search for every line it appears in, and export the results to a .csv file.

I've used the following bash one-liner to get each unique entry: cat LogFile_$(date +%Y%m%d).msg.log | awk '{print $14}' | sort | uniq

Based on the above data in the log file, the bash script would return the following results:

ABCD
EFGH
IJKL
MNOP
QRST
UVWX
YZ
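For comparison, the awk/sort/uniq pipeline above can be sketched in pure Python (the sample lines here are a stand-in for the real log file, and "last field is the venue" is an assumption based on the log format shown):

```python
sample_lines = [  # stand-in for the open log file
    "16:14:59.027003 - WARN - Cancel Latency: 100ms - OrderId: 311yrsbj - On Venue: ABCD",
    "16:14:59.027010 - WARN - Ack Latency: 25ms - OrderId: 311yrsbl - On Venue: EFGH",
    "16:14:59.027401 - WARN - Cancel Latency: 66ms - OrderId: 311yrsbz - On Venue: ABCD",
]

def unique_venues(lines):
    # The venue is the last whitespace-separated field of each non-blank line;
    # a set de-duplicates (uniq) and sorted() orders the result (sort).
    return sorted({line.split()[-1] for line in lines if line.strip()})

print(unique_venues(sample_lines))  # with a real file: unique_venues(open(path))
```

With a real file you would pass the open file object instead of the list, since iterating a file yields its lines.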

Now I would like to search (or grep) for each of those results in the same log file and return the top ten results. I have another bash script to do this; however, how do I do this using a for loop? So, for x, where x is each entry above:

grep x LogFile_$(date +%Y%m%d).msg.log | awk '{print $7}' | sort -nr | uniq | head -10

Then return the results into a .csv file. The results would look like this (each field in a separate column):

Column-A  Column-B  Column-C  Column-D
ABCD      2sxrb6ab  Cancel    46ms
ABCD      2sxrb6af  Cancel    45ms
ABCD      2sxrb6i2  Cancel    63ms
ABCD      2sxrb6i3  Cancel    103ms
EFGH      2sxrb6i4  Cancel    60ms
EFGH      2sxrb6i7  Cancel    60ms
IJKL      2sxrb6ie  Ack       74ms
IJKL      2sxrb6if  Ack       74ms
IJKL      2sxrb76s  Cancel    46ms
MNOP      vcxrqrs5  Cancel    7651ms
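The grep/awk loop above can also be sketched in Python without any shell at all. The field positions are assumptions taken from the awk commands in the question ($7 is the latency, the last field $14 is the venue), and the sample lines stand in for the real log file:

```python
from collections import defaultdict

sample_lines = [  # stand-in for the real log file
    "16:14:59.027003 - WARN - Cancel Latency: 100ms - OrderId: 311yrsbj - On Venue: ABCD",
    "16:14:59.027401 - WARN - Cancel Latency: 66ms - OrderId: 311yrsbz - On Venue: ABCD",
    "16:14:59.027010 - WARN - Ack Latency: 25ms - OrderId: 311yrsbl - On Venue: EFGH",
]

by_venue = defaultdict(list)
for line in sample_lines:  # with a real file: for line in open(path)
    fields = line.split()
    # fields[6] is awk's $7 (the latency); the last field is the venue
    by_venue[fields[-1]].append(fields[6])

for venue in sorted(by_venue):
    # top ten unique latencies, numerically descending, like sort -nr | uniq | head -10
    # (the trailing "ms" must be stripped before comparing numerically)
    top = sorted(set(by_venue[venue]), key=lambda t: int(t[:-2]), reverse=True)[:10]
    print(venue, top)
```

Note the numeric key: sorting the raw strings would put "9ms" after "100ms", which is the same reason the bash version needs sort -nr rather than plain sort.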

I'm a beginner in Python and haven't done much coding since college (13 years ago). Any help would be greatly appreciated. Thanks.

  • How does your output correspond to your input? Commented Apr 2, 2013 at 8:30

2 Answers


Say you've opened your file. What you want to do is record how many times each individual entry is in there, which is to say, each entry will result in one or more timings:

from collections import defaultdict

entries = defaultdict(list)
for line in your_file:
    # Parse the line and return the 'ABCD' part and time
    column_a, timing = parse(line)
    entries[column_a].append(timing)
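parse above is left as a stub; a hypothetical implementation for the log format shown in the question (the exact regex is an assumption) could be:

```python
import re

# Matches lines like:
# 16:14:59.027003 - WARN - Cancel Latency: 100ms - OrderId: 311yrsbj - On Venue: ABCD
LINE_RE = re.compile(r"(?:Cancel|Ack) Latency: (\d+ms) - OrderId: \S+ - On Venue: (\S+)")

def parse(line):
    match = LINE_RE.search(line)
    if match is None:
        raise ValueError("unrecognised line: %r" % line)
    timing, venue = match.groups()
    return venue, timing  # the 'ABCD' part and the time, as used above
```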

When you're done, you have a dictionary like so:

{ 'ABCD': ['30ms', '25ms', '12ms'],
  'EFGH': ['12ms'],
  'IJKL': ['2ms', '14ms'] }

What you'll want to do now is transform this dictionary into another data structure ordered by len of its value (which is a list). Example:

In [15]: sorted(((k, v) for k, v in entries.items()), 
                key=lambda i: len(i[1]), reverse=True)
Out[15]: 
[('ABCD', ['30ms', '25ms', '12ms']),
 ('IJKL', ['2ms', '14ms']),
 ('EFGH', ['12ms'])]

Of course this is only illustrative and you might want to collect some more data in the original for loop.



Maybe not as concise as you might hope... but I think this can solve your problem. I added some try...except blocks to better handle real data.

import re
import os
import csv
import collections

# get all logfiles under the current directory; of course this pattern can be
# more sophisticated, but that's not our focus here, is it?
log_pattern = re.compile(r"LogFile_date[0-9]{8}.msg.log")
logfiles = [f for f in os.listdir('./') if log_pattern.match(f)]

# top n
nhead = 10
# used to parse useful fields
extract_pattern = re.compile(
    r'.*Cancel Latency: ([0-9]+ms) - OrderId: ([0-9a-z]+) - On Venue: ([A-Z]+)')
# container for final results
res = collections.defaultdict(list)

# parse out all interesting fields
for logfile in logfiles:
    with open(logfile, 'r') as logf:
        for line in logf:
            try:  # in case of blank line or line with no such fields.
                latency, orderid, venue = extract_pattern.match(line).groups()
            except AttributeError:
                continue
            res[venue].append((orderid, latency))

# write to csv
with open('res.csv', 'w') as resf:
    resc = csv.writer(resf, delimiter=' ')
    for venue in sorted(res):  # sort by Venue
        entries = res[venue]
        entries.sort()  # sort by OrderId
        for i in range(0, nhead):
            try:
                resc.writerow([venue, entries[i][0], 'Cancel ' + entries[i][1]])
            except IndexError:  # nhead can not be satisfied
                break

3 Comments

Probably something simple but I'm getting error: with open(logfile, 'r') as logf: ^ SyntaxError: invalid syntax
Thank you Francis Chan for the help. This worked nicely. Is there a way to write each field to a separate column in the .csv file with a corresponding heading for each column? Right now it is writing all 4 fields to the same column (column A). Also, I was looking to sort alphabetically by Venue and then sort by the 4th field descending (63ms, 64ms, 63ms, 62ms ...etc)? Again your help is really appreciated.
Also, I should have used a better example of my log file. There are two different types of "Latency", however I only showed one type which is "Cancel." It will actually be "Cancel" or "Ack". How can I include the correct preceding word before Latency? 16:14:59.027003 - WARN - Ack Latency: 22ms - OrderId: 311yrsbj - On Venue: ABCD 16:14:59.027010 - WARN - Cancel Latency: 22ms - OrderId: 311yrsbl - On Venue: EFGH 16:14:59.027201 - WARN - Ack Latency: 22ms - OrderId: 311yrsbn - On Venue: IJKL 16:14:59.027235 - WARN - Cancel Latency: 22ms - OrderId: 311yrsbp - On Venue: MNOP
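A sketch addressing both comments (the header names and ordering are assumptions about the desired output): capture the Cancel/Ack word as its own regex group, write with the default comma delimiter so each field lands in its own spreadsheet column, emit a header row, and sort each venue's entries by latency descending:

```python
import csv
import re
from collections import defaultdict

# Capture the word before "Latency" (Cancel or Ack) as its own group,
# and the numeric latency separately from the "ms" suffix for sorting.
extract_pattern = re.compile(
    r".* - (Cancel|Ack) Latency: ([0-9]+)ms - OrderId: ([0-9a-z]+) - On Venue: ([A-Z]+)")

sample_lines = [  # stand-in for the real log file
    "16:14:59.027003 - WARN - Ack Latency: 22ms - OrderId: 311yrsbj - On Venue: ABCD",
    "16:14:59.027010 - WARN - Cancel Latency: 66ms - OrderId: 311yrsbl - On Venue: ABCD",
]

res = defaultdict(list)
for line in sample_lines:
    m = extract_pattern.match(line)
    if m:
        kind, latency, orderid, venue = m.groups()
        res[venue].append((int(latency), orderid, kind))

nhead = 10
with open("res.csv", "w", newline="") as resf:
    resc = csv.writer(resf)  # default comma delimiter: one field per column
    resc.writerow(["Venue", "OrderId", "Type", "Latency"])  # header row
    for venue in sorted(res):  # venues alphabetically
        # latency is the first tuple element, so reverse=True sorts it descending
        for latency, orderid, kind in sorted(res[venue], reverse=True)[:nhead]:
            resc.writerow([venue, orderid, kind, "%dms" % latency])
```

This is Python 3 syntax (newline="" on open); on Python 2 you would open the file in "wb" mode instead.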
