Analysing a text file in Python

Question

I have a text file that needs to be analysed. Each line in the file is of this form:

7:06:32 (slbfd) IN: "lq_viz_server" aqeela@nabltas1  

7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS   ) Albahraj@nabwmps3  (License server system does not support this feature. (-18,327))

7:08:21 (slbfd) OUT: "OFM32" Albahraj@nabwmps3

I need to skip the timestamp and the (slbfd) and only keep a count of the lines with the IN and OUT. Further, depending on the name in quotes, I need to increase a variable count for different variables if a line starts with OUT and decrease the variable count otherwise. How would I go about doing this in Python?

You can parse the line using .split(). Do you have any code attempting to parse a single line? Once you can parse one line, you should be able to parse them all. After that, it's just a matter of checking the right elements in each line with some logic.
– Paul Seeb, Commented Jun 22, 2012 at 14:22

Hooked · Accepted Answer · 2012-06-22 14:48:56Z

The other answers with regex and splitting the line will get the job done, but if you want a fully maintainable solution that will grow with you, you should build a grammar. I love pyparsing for this:

S ='''
7:06:32 (slbfd) IN: "lq_viz_server" aqeela@nabltas1  
7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS   ) Albahraj@nabwmps3  (License server system does not support this feature. (-18,327))
7:08:21 (slbfd) OUT: "OFM32" Albahraj@nabwmps3'''

from pyparsing import *
from collections import defaultdict

# Define the grammar
num = Word(nums)
marker = Literal(":").suppress()
timestamp = Group(num + marker + num + marker + num)
label = Literal("(slbfd)")
flag = Word(alphas)("flag") + marker
name = QuotedString(quoteChar='"')("name")

line    = timestamp + label + flag + name + restOfLine
grammar = OneOrMore(Group(line))

# Now parsing is a piece of cake!  
P = grammar.parseString(S)
counts = defaultdict(int)

for x in P:
    if x.flag=="IN": counts[x.name] += 1
    if x.flag=="OUT": counts[x.name] -= 1

for key in counts:
    print(key, counts[key])

This gives as output:

lq_viz_server 1
OFM32 -1

This would look more impressive if your sample log file were longer. The beauty of a pyparsing solution is that it can adapt to more complex queries later (e.g. grabbing and parsing the timestamp, pulling out the email address, parsing error codes...). The idea is that you write the grammar independently of the query: you simply convert the raw text to a computer-friendly format, separating the parsing implementation from its usage.
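To illustrate the "parse once, query later" idea without installing anything, here is a rough sketch that swaps pyparsing for the standard library's re module. It is not the grammar above. The Event class, the EVENT_RE pattern, and parse_events are hypothetical names, and the pattern assumes every line looks like the three samples in the question.

```python
import re
from dataclasses import dataclass
from datetime import time

SAMPLE = '''
7:06:32 (slbfd) IN: "lq_viz_server" aqeela@nabltas1
7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS   ) Albahraj@nabwmps3  (License server system does not support this feature. (-18,327))
7:08:21 (slbfd) OUT: "OFM32" Albahraj@nabwmps3'''

@dataclass
class Event:
    """One parsed log line (hypothetical record type for this sketch)."""
    when: time
    flag: str
    name: str
    user: str
    host: str

# Assumed layout: H:MM:SS (slbfd) FLAG: "name" [optional (...) group] user@host
EVENT_RE = re.compile(
    r'(\d+):(\d+):(\d+) \(slbfd\) (\w+): "([^"]*)"\s+'
    r'(?:\([^)]*\)\s+)?'
    r'(\w+)@(\w+)'
)

def parse_events(text):
    """Convert raw log text into Event records, skipping lines that do not match."""
    events = []
    for line in text.splitlines():
        m = EVENT_RE.match(line)
        if m is None:
            continue
        h, mi, s, flag, name, user, host = m.groups()
        events.append(Event(time(int(h), int(mi), int(s)), flag, name, user, host))
    return events

# Parsing happens once; each query below is plain Python over the records.
events = parse_events(SAMPLE)
users_with_checkouts = {e.user for e in events if e.flag == "OUT"}
after_0707 = [e.name for e in events if e.when > time(7, 7)]
print(users_with_checkouts)
print(after_0707)
```

The two queries at the end never touch the raw text, which is the separation the answer describes.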

Pigueiras · Accepted Answer · 2012-06-22 14:29:10Z

1

Assuming the file is divided into lines (I don't know whether that's true), you can apply the split() method to each line. For the first sample line you will get this:

['7:06:32', '(slbfd)', 'IN:', '"lq_viz_server"', 'aqeela@nabltas1']

Note that the name keeps its double quotes, so strip them before using it. From there you should be able to apply whatever logic you need by comparing those values.
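A minimal sketch of that approach, assuming names never contain spaces (otherwise split() would break them apart) and following the question's rule of increasing on OUT and decreasing on IN. The function name count_by_name is made up for illustration.

```python
from collections import defaultdict

def count_by_name(lines):
    """Tally names: +1 for OUT lines, -1 for IN lines, as the question asks."""
    counts = defaultdict(int)
    for line in lines:
        fields = line.split()
        # Expect at least: timestamp, "(slbfd)", flag, quoted name
        if len(fields) < 4:
            continue
        flag = fields[2].rstrip(":")
        if flag not in ("IN", "OUT"):
            continue  # skip UNSUPPORTED and any other flag
        name = fields[3].strip('"')  # split() leaves the quotes on the name
        counts[name] += 1 if flag == "OUT" else -1
    return dict(counts)

sample = [
    '7:06:32 (slbfd) IN: "lq_viz_server" aqeela@nabltas1',
    '7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS   ) Albahraj@nabwmps3',
    '7:08:21 (slbfd) OUT: "OFM32" Albahraj@nabwmps3',
]
print(count_by_name(sample))  # {'lq_viz_server': -1, 'OFM32': 1}
```

For a real file, passing the open file object (which iterates line by line) in place of the sample list should work the same way.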

answered Jun 22, 2012 at 14:29


Aprillion · Accepted Answer · 2012-06-22 14:31:49Z

1

I made some wild assumptions about your specification; here is some sample code to help you get started:

objects = {}
with open("data.txt") as data:
    for line in data:
        if "IN:" in line or "OUT:" in line:
            try:
                name = line.split("\"")[1]
            except IndexError:
                print("No double quoted name on line: {}".format(line))
                name = "PARSING_ERRORS"
            if "OUT:" in line:
                diff = 1
            else:
                diff = -1
            try:
                objects[name] += diff
            except KeyError:
                objects[name] = diff
print(objects) # for debug only, not advisable to print huge number of names

answered Jun 22, 2012 at 14:31


snies · Accepted Answer · 2012-06-22 14:35:03Z

1

You have two options:

Use the .split() function of the string (as pointed out in the comments)
Use the re module for regular expressions.

I would suggest using the re module and creating a pattern with named groups.

Recipe:

First, create a pattern with re.compile() containing named groups.
Loop over the file with a for loop to get the lines.
Use .match() of the created pattern object on each line.
Use .groupdict() of the returned match object to access your values of interest.
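The recipe above could be sketched roughly as follows. The group names (time, flag, name) and the function name tally are my own choices, and the pattern assumes the line layout shown in the question. The sign convention follows the question: OUT increases the count and IN decreases it.

```python
import re
from collections import Counter

# Named groups: (?P<group_name>...) makes each piece available by name.
LINE_RE = re.compile(
    r'(?P<time>\d+:\d+:\d+)\s+\(slbfd\)\s+(?P<flag>\w+):\s+"(?P<name>[^"]+)"'
)

def tally(lines):
    """Return a Counter of names: +1 per OUT line, -1 per IN line."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # line does not follow the expected layout
        fields = m.groupdict()  # e.g. {'time': ..., 'flag': ..., 'name': ...}
        if fields["flag"] == "OUT":
            counts[fields["name"]] += 1
        elif fields["flag"] == "IN":
            counts[fields["name"]] -= 1
    return counts

sample = [
    '7:06:32 (slbfd) IN: "lq_viz_server" aqeela@nabltas1',
    '7:08:21 (slbfd) UNSUPPORTED: "Slb_Internal_vlsodc" (PORT_AT_HOST_PLUS   ) Albahraj@nabwmps3',
    '7:08:21 (slbfd) OUT: "OFM32" Albahraj@nabwmps3',
]
print(dict(tally(sample)))  # {'lq_viz_server': -1, 'OFM32': 1}
```

Compared with splitting on whitespace, the pattern also tolerates names that contain spaces, since it reads everything between the double quotes.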

edited Jun 22, 2012 at 14:35

answered Jun 22, 2012 at 14:29


the wolf · Accepted Answer · 2012-06-22 15:34:34Z

0

In the mode of just get 'er done with the standard distribution, this works:

import re
from collections import Counter
# open your file as inF...
count=Counter()
for line in inF:
    match=re.match(r'\d+:\d+:\d+ \(slbfd\) (\w+): "(\w+)"', line)
    if match:
        if match.group(1) == 'IN': count[match.group(2)]+=1
        elif match.group(1) == 'OUT': count[match.group(2)]-=1

print(count)

Prints:

Counter({'lq_viz_server': 1, 'OFM32': -1})

edited Jun 22, 2012 at 15:34

answered Jun 22, 2012 at 15:29
