I have this log text file:

omer| (stmt : 0) | adminT|  Connection id - 0
omer| (stmt : 0) | adminT|  Start Time - 2018-11-06 16:52:01
omer| (stmt : 0) | adminT|  Statement create or replace table amit (x date);
omer| (stmt : 0)| adminT|  Connection id - 0 - Executing - create or replace table amit (x date);
omer| (stmt : 0) | adminT|  Connection id - 0
omer| (stmt : 0) | adminT|  End Time - 2018-11-06 16:52:01
omer| (stmt : 0) | adminT|  SQL - create or replace table amit (x date);
omer| (stmt : 0) | adminT|  Success
admin| (stmt : 1) | adminT|  Connection id - 0
admin| (stmt : 1) | adminT|  Start Time - 2018-11-06 16:52:14
admin| (stmt : 1) | adminT|  Statement create or replace table amit (x int, y int);
admin| (stmt : 1)| adminT|  Connection id - 0 - Executing - create or replace table amit (x int, y int);
admin| (stmt : 1) | adminT|  Connection id - 0
admin| (stmt : 1) | adminT|  End Time - 2018-11-06 16:52:15
admin| (stmt : 2) | adminT|  Connection id - 0
admin| (stmt : 2) | adminT|  Start Time - 2018-11-06 16:52:19
admin| (stmt : 2) | adminT|  Statement create table amit (x int, y int);
admin| (stmt : 2) | adminT|  Connection id - 0
admin| (stmt : 2) | adminT|  End Time - 2018-11-06 16:52:22
admin| (stmt : 2) | adminT|  SQL - Can't create table 'public.amit' - a table with the same name already exists
admin| (stmt : 2) | adminT|  Failed

Now I want to compute the delta between the start time and the end time of each statement (the timestamps at the end of the line), and to know whether the statement succeeded (marked by Failed or Success). This is the code I implemented:

import datetime
import os


def parse_log_file(log_file):
    my_path = os.path.abspath(os.path.dirname(__file__))
    path = os.path.join(my_path, log_file)
    max_delta = 0

    with open(path, 'r') as f:
        lines = f.readlines()[1:]

        for line in lines:
            elements = line.split('|')
            # strip the lines of surrounding spaces
            elements = [t.strip() for t in elements]
            statement_id = elements[6]
            if "Start Time" in elements[8] and statement_id in elements[6]:
                start_date = get_date_parsed(elements[8])
            if "End Time" in elements[8] and statement_id in elements[6]:
                end_date = get_date_parsed(elements[8])
                date_time_start_obj = datetime.datetime.strptime(start_date, '%Y-%m-%d %H:%M:%S')
                date_time_end_obj = datetime.datetime.strptime(end_date, '%Y-%m-%d %H:%M:%S')
                delta = date_time_end_obj - date_time_start_obj
                if delta.seconds > max_delta:
                    # store seconds (an int) so the comparison above stays valid
                    max_delta = delta.seconds
                    print(max_delta)


def get_date_parsed(date_str):
    res = date_str.split(' ')[3] + ' ' + date_str.split(' ')[4]
    return res

Now I want to know if there is a way to know if the next lines contain 'Success' so the date calculation would be valid.

  • What's the expected output? Commented Jun 15, 2020 at 14:17
  • And statement 1 neither succeeded nor failed. What does that mean? Commented Jun 15, 2020 at 14:18
  • @Roy2012 The expected result is the slowest successful statement, i.e. "statement X was the slowest", where X is the statement id. Statements that did not succeed are ignored. Commented Jun 15, 2020 at 14:19
  • Kind of out of the scope of this question - have you considered using the slow-log of the database? Commented Jun 15, 2020 at 14:20
  • I am only given this log file; I am not getting it straight from the database. Commented Jun 15, 2020 at 14:24

2 Answers

Updated the code to match your full log format, which looks like this:

2018-11-06 16:54:43.350| on thread[140447603222272 c23]| IP[192.168.0.214:5000]| master| 192.168.0.244| sqream| (stmt : 30) | sqream|  Connection id - 23

Code:

import re
from datetime import datetime


class Event:
    """
    A class isn't strictly necessary, and it is longer than other solutions,
    but it gives you more control when expansion / changes are needed.
    """
    # __slots__ restricts which attributes an instance can have. By default every
    # instance stores its attributes in its own dict; with __slots__ they live in
    # fixed slots instead, which saves memory when there are lots of instances.
    __slots__ = ('start', 'statement', 'end', 'success', 'stmt')
    def __init__(self, start, statement, end, success):
        self.start = start.split('- ')[-1]
        self.statement = statement.split('Statement ')[-1].strip(';')
        self.end = end.split('- ')[-1]
        self.success = success.split()[-1]
        self.stmt = re.search(r"(?<=stmt : )[^)]*", statement).group(0)

    def __str__(self):
        """
        When str() or print() is called on this class instances - this will be output.
        """
        return f"Event starting at {self.start}, Took {self.delta_time} sec."

    def __repr__(self):
        """
        repr() conventionally returns a string with enough data to recreate the
        instance; here it returns a multi-line, human-readable summary instead.
        """
        output = [f"stmt     : {self.stmt}",
                  f"Took     : {self.delta_time} sec",
                  f"Statement: {self.statement}",
                  f"Status   : {self.success}",
                  f"Start    : {self.start}",
                  f"End      : {self.end}"]

        return '\n'.join(output)

    @property
    def delta_time(self):
        """
        Converting string to datetime object to perform delta time calculation.
        """
        date_format = "%Y-%m-%d %H:%M:%S"
        start = datetime.strptime(self.start, date_format)
        end = datetime.strptime(self.end, date_format)
        return (end - start).total_seconds()


def generate_events(file):
    def line_yield():
        """
        Yields the file's lines one at a time, without loading the whole file
        into memory. A generator keeps its position between loops, which makes
        it easy to pause and resume the line iteration.
        """
        for line_ in file:
            yield line_.strip("\n")

    find_list = ("Start Time", "Statement", "End Time")
    generator = line_yield()

    while True:
        group = []
        for target in find_list:
            for line in generator:  # our generator keeps state after the loop.
                if target in line:  # 'in' finds faster than regex.
                    group.append(line)
                    break

        # Now find whether the statement succeeded or failed. Note: if a statement
        # logs neither, this scan runs on into the next statement's lines.
        for line in generator:
            if "Success" in line or "Failed" in line:
                group.append(line)
                break

        try:
            yield Event(*group)
        except TypeError:
            return


def find_slowest(log_file):
    formed = list(generate_events(log_file))
    sorted_output = sorted(formed, key=lambda event_: event_.delta_time)

    print("Recorded Events:")
    for output in sorted_output:
        print(output)

    late_runner = sorted_output[-1]

    print('\n< Slowest >')
    print(repr(late_runner))


with open("logfile.log", 'r') as log:
    find_slowest(log)
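The nested loops in generate_events rely on one iterator being shared by all the searches, so each search resumes where the previous one stopped. Here is a minimal, standalone sketch of just that idea; find_in_order and the shortened sample lines are hypothetical names and data for illustration, not part of the code above:

```python
def find_in_order(lines, targets):
    """Return the first line containing each target, searching in order."""
    stream = iter(lines)  # one shared iterator; it remembers its position
    found = []
    for target in targets:
        for line in stream:  # resumes right after the previous match
            if target in line:
                found.append(line)
                break
    return found


# Shortened, made-up messages in the same order as one statement in the log.
lines = [
    "Connection id - 0",
    "Start Time - 2018-11-06 16:52:01",
    "Statement create table t (x int)",
    "Connection id - 0",
    "End Time - 2018-11-06 16:52:01",
]
print(find_in_order(lines, ("Start Time", "Statement", "End Time")))
```

Because the iterator is never rewound, the keywords have to appear in the log in the same order as in the targets tuple, and any line skipped during a search is gone for good.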

Results with full log file:

Recorded Events:
Event starting at 2018-11-06 16:52:01, Took 0.0 sec.
Event starting at 2018-11-06 16:52:19, Took 0.0 sec.
Event starting at 2018-11-06 16:52:27, Took 0.0 sec.
Event starting at 2018-11-06 16:52:28, Took 0.0 sec.
Event starting at 2018-11-06 16:52:30, Took 0.0 sec.
Event starting at 2018-11-06 16:52:33, Took 0.0 sec.
Event starting at 2018-11-06 16:52:38, Took 0.0 sec.
Event starting at 2018-11-06 16:52:54, Took 0.0 sec.
Event starting at 2018-11-06 16:53:04, Took 0.0 sec.
Event starting at 2018-11-06 16:53:05, Took 0.0 sec.
Event starting at 2018-11-06 16:53:18, Took 0.0 sec.
Event starting at 2018-11-06 16:53:32, Took 0.0 sec.
Event starting at 2018-11-06 16:53:36, Took 0.0 sec.
Event starting at 2018-11-06 16:53:51, Took 0.0 sec.
Event starting at 2018-11-06 16:53:55, Took 0.0 sec.
Event starting at 2018-11-06 16:53:56, Took 0.0 sec.
Event starting at 2018-11-06 16:54:03, Took 0.0 sec.
Event starting at 2018-11-06 16:54:07, Took 0.0 sec.
Event starting at 2018-11-06 16:54:27, Took 0.0 sec.
Event starting at 2018-11-06 16:54:36, Took 0.0 sec.
Event starting at 2018-11-06 16:52:14, Took 1.0 sec.
Event starting at 2018-11-06 16:53:25, Took 1.0 sec.
Event starting at 2018-11-06 16:53:40, Took 1.0 sec.
Event starting at 2018-11-06 16:54:21, Took 1.0 sec.

< Slowest >
stmt     : 27
Took     : 1.0 sec
Statement: drop table tati
Status   : Success
Start    : 2018-11-06 16:54:21
End      : 2018-11-06 16:54:22

Process finished with exit code 0
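The listing above ranks every recorded event, whatever its status. If only successful statements should count, as the question asks, one possible variation is to group lines by statement id and keep a statement only when a "Success" line was seen for that id. This is a rough sketch, not a drop-in replacement: slowest_successful and the two regexes are names made up for this example, and it assumes statement ids are unique within the file and that the message is the last '|'-separated field.

```python
import re
from datetime import datetime

STMT_RE = re.compile(r"\(stmt : (\d+)\)")
TIME_RE = re.compile(r"(Start|End) Time - (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
DATE_FORMAT = "%Y-%m-%d %H:%M:%S"


def slowest_successful(lines):
    """Return (stmt_id, seconds) for the slowest successful statement, or None."""
    records = {}  # stmt id -> {"Start": datetime, "End": datetime, "ok": True}
    for line in lines:
        stmt = STMT_RE.search(line)
        if not stmt:
            continue  # blank line or a line without a statement id
        record = records.setdefault(stmt.group(1), {})
        message = line.rsplit("|", 1)[-1].strip()  # assumed: message is the last field
        time_match = TIME_RE.match(message)
        if time_match:
            record[time_match.group(1)] = datetime.strptime(time_match.group(2), DATE_FORMAT)
        elif message == "Success":
            record["ok"] = True

    # Keep only statements that succeeded and have both timestamps.
    durations = {
        stmt_id: (rec["End"] - rec["Start"]).total_seconds()
        for stmt_id, rec in records.items()
        if rec.get("ok") and "Start" in rec and "End" in rec
    }
    if not durations:
        return None
    slowest_id = max(durations, key=durations.get)
    return slowest_id, durations[slowest_id]


# Time and status lines taken from the sample in the question.
sample = [
    "omer| (stmt : 0) | adminT|  Start Time - 2018-11-06 16:52:01",
    "omer| (stmt : 0) | adminT|  End Time - 2018-11-06 16:52:01",
    "omer| (stmt : 0) | adminT|  Success",
    "admin| (stmt : 1) | adminT|  Start Time - 2018-11-06 16:52:14",
    "admin| (stmt : 1) | adminT|  End Time - 2018-11-06 16:52:15",
    "admin| (stmt : 2) | adminT|  Start Time - 2018-11-06 16:52:19",
    "admin| (stmt : 2) | adminT|  End Time - 2018-11-06 16:52:22",
    "admin| (stmt : 2) | adminT|  Failed",
]
print(slowest_successful(sample))  # ('0', 0.0)
```

Keying on the statement id means a statement with no status line (statement 1 in the sample) is simply dropped instead of borrowing the status of the statement that follows it.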

10 Comments

Just one thing: I don't care about the statement text. I want to extract the statement id, for example "(stmt : 1)".
Sure, in that case try re.search(r"(?<=\()[^)]*", string).group(0). This returns the text inside the first pair of parentheses in the given string, which will be "stmt : x".
In this example the generator function 'line_yield' strips those parts because I thought they weren't necessary, so you can't find the strings preceding the '|' bars in this code. I will update it accordingly.
I changed it according to your suggestion and this is what I got: "statement stmt : 13 was the slowest Took : 64.0 sec". The lines I changed were: self.statement = re.search(r"(?<=\()[^)]*", statement).group(0) and find_list = ["Start Time", "stmt", "End Time"]
'find_list' holds the keywords used to fill out the 'Event' class. The 'Statement' keyword appears only ONCE, AFTER "Start Time", whereas 'stmt' occurs on EVERY log line, so with your change it picks up whichever line comes right after the 'Start Time' line. find_list is not meant to be edited.
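To make the regex from these comments concrete, here is a small sketch run against one line from the question's sample. The second pattern, which captures only the number, is the one used in the other answer; the variable names are only for illustration.

```python
import re

line = "admin| (stmt : 1) | adminT|  Start Time - 2018-11-06 16:52:14"

# Lookbehind form: everything after the first "(" up to the next ")".
inside_parens = re.search(r"(?<=\()[^)]*", line).group(0)

# Capture-group form: only the digits of the statement id.
stmt_number = re.search(r"\(stmt : (\d+)\)", line).group(1)

print(inside_parens)  # stmt : 1
print(stmt_number)    # 1
```

Both calls return strings, so stmt_number needs int() before any numeric comparison.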

Here's a solution that is based on a set of regular expressions - one for each pattern you're looking for. In the end, I'm storing all the data in a pandas dataframe for analysis.

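import re

import pandas as pd

statement_id_re = re.compile(r"\(stmt : (\d+)\)")
end_re = re.compile(r"End Time - (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})$")
start_re = re.compile(r"Start Time - (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})$")
success_re = re.compile(r"\|\s+Success$")

all_statements = []

current_statement = {}
with open("logfile.log") as file:  # assumed file name; point this at your log
    for line in file:
        id_match = statement_id_re.search(line)
        if not id_match:
            continue  # skip blank lines and lines without a statement id
        statement_id = id_match.group(1)
        start = start_re.search(line)
        end = end_re.search(line)
        success = success_re.search(line)
        if start:
            current_statement = {
                "id": statement_id,
                "start": start.group(1),
                "status": None,
            }
        elif success:
            # "Success" is logged after "End Time". The dict is already in
            # all_statements, so this updates the stored record in place.
            current_statement["status"] = "success"
        elif end:
            current_statement["end"] = end.group(1)
            all_statements.append(current_statement)

df = pd.DataFrame(all_statements)
df["start"] = pd.to_datetime(df["start"])
df["end"] = pd.to_datetime(df["end"])
df["duration"] = df["end"] - df["start"]

# Only successful statements count; the slowest one has the longest duration.
successful = df[df["status"] == "success"]
slowest = successful.loc[successful["duration"].idxmax()]
print(f"The slowest statement is {slowest['id']} and it took {slowest['duration']}")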

The result for your data is:

The slowest statement is 0 and it took 0 days 00:00:00

1 Comment

You have a mistake there: the result should be that the slowest statement is 0 and it took 1 second (00:00:01).
