Parsing specific lines in a log (in Python)

Question

I have this log text file:

omer| (stmt : 0) | adminT|  Connection id - 0
omer| (stmt : 0) | adminT|  Start Time - 2018-11-06 16:52:01
omer| (stmt : 0) | adminT|  Statement create or replace table amit (x date);
omer| (stmt : 0)| adminT|  Connection id - 0 - Executing - create or replace table amit (x date);
omer| (stmt : 0) | adminT|  Connection id - 0
omer| (stmt : 0) | adminT|  End Time - 2018-11-06 16:52:01
omer| (stmt : 0) | adminT|  SQL - create or replace table amit (x date);
omer| (stmt : 0) | adminT|  Success
admin| (stmt : 1) | adminT|  Connection id - 0
admin| (stmt : 1) | adminT|  Start Time - 2018-11-06 16:52:14
admin| (stmt : 1) | adminT|  Statement create or replace table amit (x int, y int);
admin| (stmt : 1)| adminT|  Connection id - 0 - Executing - create or replace table amit (x int, y int);
admin| (stmt : 1) | adminT|  Connection id - 0
admin| (stmt : 1) | adminT|  End Time - 2018-11-06 16:52:15
admin| (stmt : 2) | adminT|  Connection id - 0
admin| (stmt : 2) | adminT|  Start Time - 2018-11-06 16:52:19
admin| (stmt : 2) | adminT|  Statement create table amit (x int, y int);
admin| (stmt : 2) | adminT|  Connection id - 0
admin| (stmt : 2) | adminT|  End Time - 2018-11-06 16:52:22
admin| (stmt : 2) | adminT|  SQL - Can't create table 'public.amit' - a table with the same name already exists
admin| (stmt : 2) | adminT|  Failed

Now I want to compute the delta between the start time and the end time of each statement (the timestamps at the end of the "Start Time" and "End Time" lines), and I also want to know whether the statement succeeded or not (marked by Failed or Success). This is the code I have implemented so far:

import datetime
import os


def parse_log_file(log_file):
    my_path = os.path.abspath(os.path.dirname(__file__))
    path = os.path.join(my_path, log_file)
    max_delta = datetime.timedelta(0)
    start_date = None

    with open(path, 'r') as f:
        lines = f.readlines()[1:]  # skip the first line of the file

    for line in lines:
        # split into columns and strip the surrounding spaces
        elements = [t.strip() for t in line.split('|')]
        # indexes 6 and 8 assume the full log format, which has more columns
        # than the excerpt above: 6 is "(stmt : N)", 8 is the message
        statement_id = elements[6]
        if "Start Time" in elements[8]:
            start_date = get_date_parsed(elements[8])
        if "End Time" in elements[8] and start_date is not None:
            end_date = get_date_parsed(elements[8])
            date_time_start_obj = datetime.datetime.strptime(start_date, '%Y-%m-%d %H:%M:%S')
            date_time_end_obj = datetime.datetime.strptime(end_date, '%Y-%m-%d %H:%M:%S')
            delta = date_time_end_obj - date_time_start_obj
            # compare a timedelta with a timedelta, not with an int
            if delta > max_delta:
                max_delta = delta
                print(statement_id, max_delta)


def get_date_parsed(date_str):
    # "Start Time - 2018-11-06 16:52:01" -> "2018-11-06 16:52:01"
    parts = date_str.split(' ')
    return parts[3] + ' ' + parts[4]

Now I want to know if there is a way to know if the next lines contain 'Success' so the date calculation would be valid.

And statement 1 neither succeeded nor failed. What does that mean?
– Roy2012, Commented Jun 15, 2020 at 14:18
@Roy2012 I want the slowest successful statement, where X = statement id. Expected result: "statement X was the slowest". I ignore statements that did not succeed.
– tupac shakur, Commented Jun 15, 2020 at 14:19
Kind of out of the scope of this question - have you considered using the slow-log of the database?
– Roy2012, Commented Jun 15, 2020 at 14:20
I am given this log file; I am not getting it straight from the database.
– tupac shakur, Commented Jun 15, 2020 at 14:24

jupiterbjy · Accepted Answer · 2020-06-26 00:36:09Z

1

Updated the code to match your full log format, which looks like this:

2018-11-06 16:54:43.350| on thread[140447603222272 c23]| IP[192.168.0.214:5000]| master| 192.168.0.244| sqream| (stmt : 30) | sqream|  Connection id - 23

Code:

import re
from datetime import datetime


class Event:
    """
    A class isn't strictly necessary, and this is longer than other solutions,
    but a class gives you more control when expansion / changes are needed.
    """
    # __slots__ limits which attributes an instance can have. Normally each instance stores
    # its attributes in its own __dict__; with __slots__ they live in fixed slots instead,
    # which saves memory when there are lots of instances.
    __slots__ = ('start', 'statement', 'end', 'success', 'stmt')

    def __init__(self, start, statement, end, success):
        self.start = start.split('- ')[-1]
        self.statement = statement.split('Statement ')[-1].strip(';')
        self.end = end.split('- ')[-1]
        self.success = success.split()[-1]
        self.stmt = re.search(r"(?<=stmt : )[^)]*", statement).group(0)

    def __str__(self):
        """
        When str() or print() is called on an instance of this class, this will be the output.
        """
        return f"Event starting at {self.start}, Took {self.delta_time} sec."

    def __repr__(self):
        """
        repr should ideally return a string with enough data to recreate the instance;
        here it returns a readable multi-line summary instead.
        """
        output = [f"stmt     : {self.stmt}",
                  f"Took     : {self.delta_time} sec",
                  f"Statement: {self.statement}",
                  f"Status   : {self.success}",
                  f"Start    : {self.start}",
                  f"End      : {self.end}"]

        return '\n'.join(output)

    @property
    def delta_time(self):
        """
        Converting string to datetime object to perform delta time calculation.
        """
        date_format = "%Y-%m-%d %H:%M:%S"
        start = datetime.strptime(self.start, date_format)
        end = datetime.strptime(self.end, date_format)
        return (end - start).total_seconds()


def generate_events(file):
    def line_yield():
        """
        Yields the lines of the file one at a time, without loading the whole file into memory.
        A generator is one-shot and remembers its position, which makes it simple
        to pause and resume the line iteration.
        """
        for line_ in file:
            yield line_.strip("\n")

    find_list = ("Start Time", "Statement", "End Time")
    generator = line_yield()

    while True:
        group = []
        for target in find_list:
            for line in generator:  # our generator keeps state after the loop.
                if target in line:  # 'in' finds faster than regex.
                    group.append(line)
                    break

        for line in generator:  # now find whether the statement was successful or not.
            if "Success" in line or "Failed" in line:
                group.append(line)
                break

        try:
            yield Event(*group)
        except TypeError:
            # the file ran out before a full group of 4 lines was collected
            return


def find_slowest(log_file):
    formed = list(generate_events(log_file))
    sorted_output = sorted(formed, key=lambda event_: event_.delta_time)

    print("Recorded Events:")
    for output in sorted_output:
        print(output)

    late_runner = sorted_output[-1]

    print('\n< Slowest >')
    print(repr(late_runner))


with open("logfile.log", 'r') as log:
    find_slowest(log)

Results with full log file:

Recorded Events:
Event starting at 2018-11-06 16:52:01, Took 0.0 sec.
Event starting at 2018-11-06 16:52:19, Took 0.0 sec.
Event starting at 2018-11-06 16:52:27, Took 0.0 sec.
Event starting at 2018-11-06 16:52:28, Took 0.0 sec.
Event starting at 2018-11-06 16:52:30, Took 0.0 sec.
Event starting at 2018-11-06 16:52:33, Took 0.0 sec.
Event starting at 2018-11-06 16:52:38, Took 0.0 sec.
Event starting at 2018-11-06 16:52:54, Took 0.0 sec.
Event starting at 2018-11-06 16:53:04, Took 0.0 sec.
Event starting at 2018-11-06 16:53:05, Took 0.0 sec.
Event starting at 2018-11-06 16:53:18, Took 0.0 sec.
Event starting at 2018-11-06 16:53:32, Took 0.0 sec.
Event starting at 2018-11-06 16:53:36, Took 0.0 sec.
Event starting at 2018-11-06 16:53:51, Took 0.0 sec.
Event starting at 2018-11-06 16:53:55, Took 0.0 sec.
Event starting at 2018-11-06 16:53:56, Took 0.0 sec.
Event starting at 2018-11-06 16:54:03, Took 0.0 sec.
Event starting at 2018-11-06 16:54:07, Took 0.0 sec.
Event starting at 2018-11-06 16:54:27, Took 0.0 sec.
Event starting at 2018-11-06 16:54:36, Took 0.0 sec.
Event starting at 2018-11-06 16:52:14, Took 1.0 sec.
Event starting at 2018-11-06 16:53:25, Took 1.0 sec.
Event starting at 2018-11-06 16:53:40, Took 1.0 sec.
Event starting at 2018-11-06 16:54:21, Took 1.0 sec.

< Slowest >
stmt     : 27
Took     : 1.0 sec
Statement: drop table tati
Status   : Success
Start    : 2018-11-06 16:54:21
End      : 2018-11-06 16:54:22

Process finished with exit code 0
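
One caveat with pairing lines purely by their order: a statement that never logs Success or Failed (stmt 1 in the question's excerpt) can pick up the status line of a later statement. A minimal sketch of an alternative, assuming every relevant line carries a `(stmt : N)` tag and the message is the text after the last `|`, groups lines by statement id and keeps only the successful ones. The names `slowest_successful` and `SAMPLE` are made up for this illustration, and the sample is a shortened copy of the question's excerpt.

```python
import re
from datetime import datetime

STMT_RE = re.compile(r"\(stmt : (\d+)\)")
TIME_RE = re.compile(r"(Start|End) Time - (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def slowest_successful(lines):
    """Return (stmt_id, seconds) for the slowest successful statement, or None."""
    records = {}  # stmt id -> {"Start": datetime, "End": datetime, "ok": True}
    for line in lines:
        stmt = STMT_RE.search(line)
        if stmt is None:
            continue  # line without a statement tag
        record = records.setdefault(stmt.group(1), {})
        message = line.rsplit("|", 1)[-1].strip()  # assumed: message follows the last "|"
        time_match = TIME_RE.match(message)
        if time_match:
            record[time_match.group(1)] = datetime.strptime(time_match.group(2), DATE_FORMAT)
        elif message == "Success":
            record["ok"] = True

    # keep only statements that succeeded and have both timestamps
    durations = {
        stmt_id: (rec["End"] - rec["Start"]).total_seconds()
        for stmt_id, rec in records.items()
        if rec.get("ok") and "Start" in rec and "End" in rec
    }
    if not durations:
        return None
    stmt_id = max(durations, key=durations.get)
    return stmt_id, durations[stmt_id]


SAMPLE = """\
omer| (stmt : 0) | adminT|  Start Time - 2018-11-06 16:52:01
omer| (stmt : 0) | adminT|  End Time - 2018-11-06 16:52:01
omer| (stmt : 0) | adminT|  Success
admin| (stmt : 1) | adminT|  Start Time - 2018-11-06 16:52:14
admin| (stmt : 1) | adminT|  End Time - 2018-11-06 16:52:15
admin| (stmt : 2) | adminT|  Start Time - 2018-11-06 16:52:19
admin| (stmt : 2) | adminT|  End Time - 2018-11-06 16:52:22
admin| (stmt : 2) | adminT|  Failed
""".splitlines()

print(slowest_successful(SAMPLE))  # ('0', 0.0)
```

Statements 1 and 2 are longer, but neither is marked Success, so only statement 0 is considered. Because the grouping is keyed on the id, a missing status line affects only its own statement.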

edited Jun 26, 2020 at 0:36

answered Jun 15, 2020 at 15:02

jupiterbjy


10 Comments

tupac shakur Over a year ago

Just one thing: I don't care about the statement text. I want to extract the id, for example `(stmt : 1)`.

jupiterbjy Over a year ago

Sure, in that case try re.search(r"(?<=\()[^)]*", string).group(0). This returns the text inside the first pair of parentheses in the given string, which will be "stmt : x".
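
A quick sketch of that lookbehind, with the opening parenthesis escaped as `\(` so the pattern compiles. The sample line is taken from the question's log; the variable names are chosen only for this example.

```python
import re

line = "admin| (stmt : 1) | adminT|  Start Time - 2018-11-06 16:52:14"

# (?<=\() is a lookbehind: the match must be preceded by "(" but does not include it.
# [^)]* then consumes everything up to (not including) the closing ")".
inside_parens = re.search(r"(?<=\()[^)]*", line).group(0)
print(inside_parens)  # stmt : 1

# To get only the number, capture the digits with a group instead.
stmt_id = re.search(r"\(stmt : (\d+)\)", line).group(1)
print(stmt_id)  # 1
```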

jupiterbjy Over a year ago

In this example the generator function 'lineYield' strips those parts because I thought they weren't necessary, so you can't find the text preceding the '|' bars in this code. I will update it accordingly.

tupac shakur Over a year ago

I've changed it according to your suggestion, and this is what I got: `statement stmt : 13 was the slowest Took : 64.0 sec`. The lines I changed were: self.statement = re.search(r"(?<=\()[^)]*", statement).group(0) and find_list = ["Start Time", "stmt", "End Time"]

jupiterbjy Over a year ago

'find_list' holds the keywords that are searched for, in order, to fill out the 'Event' class. The 'Statement' keyword appears only ONCE after "Start Time", but 'stmt' occurs on EVERY log line, so the search simply picks up whichever line comes right after the 'Start Time' line. find_list is not meant to be edited.
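
To see why swapping 'Statement' for 'stmt' breaks the grouping, here is a small sketch of the same scanning idea: each keyword is searched on one shared iterator, so every search resumes where the previous one stopped. The helper `first_group` is hypothetical and only mirrors the inner loops of generate_events, and the line order in `LOG` is invented for illustration (an extra line sits between Start Time and Statement); it is not taken from the real log.

```python
# Invented ordering for illustration: a "Connection id" line between Start Time and Statement.
LOG = """\
admin| (stmt : 2) | adminT|  Connection id - 0
admin| (stmt : 2) | adminT|  Start Time - 2018-11-06 16:52:19
admin| (stmt : 2) | adminT|  Connection id - 0
admin| (stmt : 2) | adminT|  Statement create table amit (x int, y int);
admin| (stmt : 2) | adminT|  End Time - 2018-11-06 16:52:22
""".splitlines()


def first_group(lines, keywords):
    """Collect the first line matching each keyword, scanning one shared iterator."""
    iterator = iter(lines)  # shared: each inner loop resumes where the last one broke off
    group = []
    for keyword in keywords:
        for line in iterator:
            if keyword in line:
                group.append(line.rsplit("|", 1)[-1].strip())  # keep only the message part
                break
    return group


print(first_group(LOG, ("Start Time", "Statement", "End Time")))
# ['Start Time - 2018-11-06 16:52:19', 'Statement create table amit (x int, y int);',
#  'End Time - 2018-11-06 16:52:22']

print(first_group(LOG, ("Start Time", "stmt", "End Time")))
# ['Start Time - 2018-11-06 16:52:19', 'Connection id - 0', 'End Time - 2018-11-06 16:52:22']
```

With 'stmt' as the second keyword, the very next line matches because every line contains "(stmt : N)", so the wrong line ends up in the slot meant for the statement text.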


Roy2012 · Accepted Answer · 2020-06-15 14:46:34Z

1

Here's a solution that is based on a set of regular expressions - one for each pattern you're looking for. In the end, I'm storing all the data in a pandas dataframe for analysis.

import re

import pandas as pd

statement_id_re = re.compile(r"\(stmt : (\d+)\)")
end_re = re.compile(r"End Time - (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})$")
start_re = re.compile(r"Start Time - (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})$")
success_re = re.compile(r"\|\s+Success$")

all_statements = []

current_statement = {}
with open("logfile.log") as file:  # file name is an assumption; use your own path
    for line in file:
        id_match = statement_id_re.search(line)
        if id_match is None:
            continue  # skip lines without a statement id
        start = start_re.search(line)
        end = end_re.search(line)
        success = success_re.search(line)
        if start:
            current_statement = {
                "id": id_match.group(1),
                "start": start.group(1),
                "status": None,
            }
        elif success:
            # the Success line comes after End Time; the dict was already appended,
            # so updating it here updates the stored record as well
            current_statement["status"] = "success"
        elif end:
            current_statement["end"] = end.group(1)
            all_statements.append(current_statement)

df = pd.DataFrame(all_statements)
df["start"] = pd.to_datetime(df["start"])
df["end"] = pd.to_datetime(df["end"])
df["duration"] = df["end"] - df["start"]

# only successful statements count; idxmax picks the longest duration
successful = df[df["status"] == "success"]
slowest = successful.loc[successful["duration"].idxmax()]
print(f"The slowest statement is {slowest['id']} and it took {slowest['duration']}")

The result for your data is:

The slowest statement is 0 and it took 0 days 00:00:00
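
To make the role of each pattern concrete, this sketch runs the four regular expressions from the code above over individual lines from the question's log and reports what each one captures. The `describe` helper is hypothetical and exists only for this illustration. Note that in Python `$` also matches just before a trailing newline, so lines read straight from a file do not need to be stripped first.

```python
import re

statement_id_re = re.compile(r"\(stmt : (\d+)\)")
end_re = re.compile(r"End Time - (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})$")
start_re = re.compile(r"Start Time - (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})$")
success_re = re.compile(r"\|\s+Success$")


def describe(line):
    """Return (statement id, kind of line, captured timestamp or None)."""
    statement_id = statement_id_re.search(line).group(1)
    for kind, pattern in (("start", start_re), ("end", end_re)):
        match = pattern.search(line)
        if match:
            return statement_id, kind, match.group(1)
    if success_re.search(line):
        return statement_id, "success", None
    return statement_id, "other", None


# Lines keep their trailing newline, as they would when iterating over a file.
print(describe("admin| (stmt : 1) | adminT|  Start Time - 2018-11-06 16:52:14\n"))
# ('1', 'start', '2018-11-06 16:52:14')
print(describe("admin| (stmt : 1) | adminT|  End Time - 2018-11-06 16:52:15\n"))
# ('1', 'end', '2018-11-06 16:52:15')
print(describe("omer| (stmt : 0) | adminT|  Success\n"))
# ('0', 'success', None)
print(describe("admin| (stmt : 2) | adminT|  Failed\n"))
# ('2', 'other', None)
```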

answered Jun 15, 2020 at 14:46

Roy2012

12.7k3 gold badges28 silver badges38 bronze badges

1 Comment

tupac shakur Over a year ago

You have a mistake over there: the result should be that the slowest statement is 0 and it took 1 second (00:00:01).
