Say I have a file that looks like this:

'2021-06-23T08:02:08Z UTC [ db=dev LOG: BEGIN;
'2021-06-23T08:02:08Z UTC [ db=dev LOG: SET datestyle TO ISO;
'2021-06-23T08:02:08Z UTC [ db=dev LOG: SET TRANSACTION READ ONLY;
'2021-06-23T08:02:08Z UTC [ db=dev LOG: SET STATEMENT_TIMEOUT TO 300000;
'2021-06-23T08:02:08Z UTC [ db=dev LOG: /* hash: 8d9692aa66628f2ea5b0b9de8e4ea59b */

SELECT action,
       status,
       COUNT(*) AS num_req
FROM stl_datashare_changes_consumer
WHERE actiontime > getdate() - INTERVAL '1 day'
GROUP BY 1,2;
'2021-06-23T08:02:08Z UTC [ db=dev LOG: SELECT pg_catalog.stll_datashare_changes_consumer.action AS action, pg_catalog.stll_datashare_changes_consumer.status AS status, COUNT(*) AS num_req FROM pg_catalog.stll_datashare_changes_consumer WHERE pg_catalog.stll_datashare_changes_consumer.actiontime > getdate() - interval '1 day'::Interval GROUP BY 1, 2;
'2021-06-23T08:02:08Z UTC [ db=dev LOG: COMMIT;
'2021-06-23T08:02:08Z UTC [ db=dev LOG: SET query_group to ''
'2021-06-23T08:02:22Z UTC [ db=dev LOG: SELECT 1
'2021-06-23T08:02:30Z UTC [ db=dev LOG: /* hash: 64f5dca78e917617f51632257854cb2f */
WITH per_commit_info AS
(
         SELECT   date_trunc('day', startwork) AS day,
                  c.xid,
                  SUM(num_metadata_blocks_retained) AS sum_retained,
                  SUM(total_metadata_blocks)        AS sum_total,
                  AVG(num_metadata_blocks_retained) AS avg_retained,
                  AVG(total_metadata_blocks)        AS avg_total
         FROM     stl_commit_stats c,
                  stl_commit_internal_stats i
         WHERE    c.xid = i.xid
         < ...even more sql >;
'2021-06-23T08:02:30Z UTC [ db=dev LOG: SELECT per_commit_info.day AS day, COUNT(*) AS commits,

and I want to eventually get a data store that looks like this:

[
{
    'timestamp': '2021-06-23T08:02:08Z UTC',
    'db': 'dev',
    'query': 'LOG: BEGIN;',
},
{
    'timestamp': '2021-06-23T08:02:08Z UTC',
    'db': 'dev',
    'query': 'LOG: <Extremely long query string>',
},
]

One of the problems here is that the queries can be multiline, so a newline is not necessarily the end of a log entry.

So I have a regex pattern that looks like this:

"(?P<query_date>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z UTC) \[ db=(?P<db>\w*) LOG:(?P<query_text>.*)",

which I think is close to right. How do I use this to capture all of the matching groups in this file? Can anyone help with this code?

Is the code something like this:

import re

pattern = re.compile(r"<my pattern>")  # the regex shown above
entries = []  # one dictionary per match

with open("<my file>") as f:
    for line in f:
        for match in pattern.finditer(line):
            # groupdict() maps each named group to the text it captured
            entries.append(match.groupdict())

Is it something like that? One thing to note is that some of the queries do not end in a semi-colon!

  • I have, and the problem is how to handle the newlines in the query. Your comment is a bit unhelpful. Commented Jun 24, 2021 at 14:11
  • Can you explain a bit more? I just don't follow what that means. Commented Jun 24, 2021 at 14:53
  • So a lot of my score is from asking good questions and people upvoting them? Just because I have a high score doesn't mean I know the answer to the questions I ask. This is meant to help me along, right? Commented Jun 24, 2021 at 14:58
  • simplified the question, removed some columns Commented Jun 24, 2021 at 15:12
  • so you want those entries without db information to be excluded? Commented Jun 24, 2021 at 15:28

1 Answer

Assuming that queries end with a semicolon, you can change the query_text part of the regex as follows:

(?P<query_text>[\w\W]*?;)
  • [\w\W] matches any character, including newlines: \w covers word characters and \W covers everything else.
  • The *? makes the match lazy, so it stops at the first semicolon it encounters.
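
As a rough sketch of how that lazy group behaves on a multiline entry, the snippet below combines the pattern from the question with the query_text change above. The sample string is a shortened, made-up excerpt in the same shape as the log, not real data.

```python
import re

# Pattern from the question, with the lazy query_text group from this answer.
pattern = re.compile(
    r"(?P<query_date>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z UTC) "
    r"\[ db=(?P<db>\w*) LOG:(?P<query_text>[\w\W]*?;)"
)

# Shortened, made-up sample in the same shape as the log in the question.
sample = (
    "'2021-06-23T08:02:08Z UTC [ db=dev LOG: BEGIN;\n"
    "'2021-06-23T08:02:08Z UTC [ db=dev LOG: SELECT action,\n"
    "       status\n"
    "FROM stl_datashare_changes_consumer;\n"
)

# finditer scans the whole string, so the second entry spans three lines.
matches = [m.groupdict() for m in pattern.finditer(sample)]
for entry in matches:
    print(entry)
```

One caveat with this approach: a semicolon inside a string literal in the SQL would also end the match early.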

See https://regex101.com/r/5URVDX/1

If you want to also match those entries without db, make that part optional:

(db=(?P<db>\w+) )?

https://regex101.com/r/9fYedt/1
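
A minimal sketch of the optional group in use. The shape of a line without db information is an assumption here (the sample in the question always has db=dev), so the second test string is hypothetical.

```python
import re

# Same pattern as before, but the "db=... " part is wrapped in an optional group.
pattern = re.compile(
    r"(?P<query_date>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z UTC) "
    r"\[ (db=(?P<db>\w+) )?LOG:(?P<query_text>[\w\W]*?;)"
)

with_db = pattern.search("'2021-06-23T08:02:08Z UTC [ db=dev LOG: COMMIT;")
# Hypothetical line shape: assumes the db part is simply absent.
without_db = pattern.search("'2021-06-23T08:02:08Z UTC [ LOG: COMMIT;")

print(with_db.group("db"))     # the captured database name
print(without_db.group("db"))  # None when the optional group did not take part
```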

If a query can span multiple lines, you can’t iterate the file line by line, so you have to read the whole file into memory:

with open(<my file>) as f:
    text = f.read()  # whole file at once, since a query can span lines
for match in re.finditer(pattern, text):
    #do your stuff
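
Putting the pieces together, here is one possible sketch that reads a whole file and builds the list of dictionaries asked for in the question. The helper name parse_log, the temporary file, and the sample text are made up for illustration. The groups are renamed to timestamp, db and query to match the desired keys, and this version still assumes every query ends with a semicolon. Note that "LOG:" sits outside the query group, so it is not part of the stored query.

```python
import re
import tempfile
from pathlib import Path

# Groups renamed to match the keys of the desired output (an assumption).
PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z UTC) "
    r"\[ db=(?P<db>\w*) LOG:(?P<query>[\w\W]*?;)"
)


def parse_log(path):
    """Hypothetical helper: return one dict per matched log entry."""
    text = Path(path).read_text()  # whole file in memory, as described above
    return [
        {key: value.strip() for key, value in m.groupdict().items()}
        for m in PATTERN.finditer(text)
    ]


# Made-up sample written to a temporary file so the sketch is self-contained.
sample = (
    "'2021-06-23T08:02:08Z UTC [ db=dev LOG: BEGIN;\n"
    "'2021-06-23T08:02:08Z UTC [ db=dev LOG: SELECT action,\n"
    "       status\n"
    "FROM stl_datashare_changes_consumer;\n"
)
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "sample.log"
    log_path.write_text(sample)
    entries = parse_log(log_path)

for entry in entries:
    print(entry)
```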

That said, I can see in your example that there are queries not ending with a semicolon. You need to define a terminating character and adjust your file/regex accordingly.


6 Comments

sorry the db=dev is always there
the non terminating character is annoying
I think the reason some queries have no terminating semicolon is that some text editors allow queries to end without one. This is going to cause a problem here. Is there any way around this? What else can I use as a regex?
You could use a positive lookahead that matches the date in the next line (which of course doesn’t match the last line in the file then): (?=\n'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z UTC) regex101.com/r/Qh603J/1
If it's incomplete and leaves the last line out, that won't work
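
One way the lookahead from the comment above might be extended so the last entry is not lost: add an alternative that also accepts the end of the input. The `|\s*\Z` branch is an addition, not part of the linked regex, and the sample is made up. This is a sketch under the assumption that every entry starts on a new line with a quote followed by the timestamp.

```python
import re

# An entry ends just before the next "'<timestamp> UTC" line, or at end of input.
pattern = re.compile(
    r"(?P<query_date>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z UTC) "
    r"\[ db=(?P<db>\w*) LOG:(?P<query_text>[\w\W]*?)"
    r"(?=\n'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z UTC|\s*\Z)"
)

# Made-up sample: no query here ends with a semicolon.
sample = (
    "'2021-06-23T08:02:08Z UTC [ db=dev LOG: SET query_group to ''\n"
    "'2021-06-23T08:02:22Z UTC [ db=dev LOG: SELECT 1\n"
    "'2021-06-23T08:02:30Z UTC [ db=dev LOG: SELECT a,\n"
    "       b\n"
    "FROM t\n"
)

entries = [m.groupdict() for m in pattern.finditer(sample)]
for entry in entries:
    print(entry)
```

No terminating character is needed with this variant, but a query line that itself begins with a quote and a timestamp in the same format would be split into two entries.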