How to get all the matching groups in a file using regex in Python

Question

Say I have a file that looks like this:

'2021-06-23T08:02:08Z UTC [ db=dev LOG: BEGIN;
'2021-06-23T08:02:08Z UTC [ db=dev LOG: SET datestyle TO ISO;
'2021-06-23T08:02:08Z UTC [ db=dev LOG: SET TRANSACTION READ ONLY;
'2021-06-23T08:02:08Z UTC [ db=dev LOG: SET STATEMENT_TIMEOUT TO 300000;
'2021-06-23T08:02:08Z UTC [ db=dev LOG: /* hash: 8d9692aa66628f2ea5b0b9de8e4ea59b */

SELECT action,
       status,
       COUNT(*) AS num_req
FROM stl_datashare_changes_consumer
WHERE actiontime > getdate() - INTERVAL '1 day'
GROUP BY 1,2;
'2021-06-23T08:02:08Z UTC [ db=dev LOG: SELECT pg_catalog.stll_datashare_changes_consumer.action AS action, pg_catalog.stll_datashare_changes_consumer.status AS status, COUNT(*) AS num_req FROM pg_catalog.stll_datashare_changes_consumer WHERE pg_catalog.stll_datashare_changes_consumer.actiontime > getdate() - interval '1 day'::Interval GROUP BY 1, 2;
'2021-06-23T08:02:08Z UTC [ db=dev LOG: COMMIT;
'2021-06-23T08:02:08Z UTC [ db=dev LOG: SET query_group to ''
'2021-06-23T08:02:22Z UTC [ db=dev LOG: SELECT 1
'2021-06-23T08:02:30Z UTC [ db=dev LOG: /* hash: 64f5dca78e917617f51632257854cb2f */
WITH per_commit_info AS
(
         SELECT   date_trunc('day', startwork) AS day,
                  c.xid,
                  SUM(num_metadata_blocks_retained) AS sum_retained,
                  SUM(total_metadata_blocks)        AS sum_total,
                  AVG(num_metadata_blocks_retained) AS avg_retained,
                  AVG(total_metadata_blocks)        AS avg_total
         FROM     stl_commit_stats c,
                  stl_commit_internal_stats i
         WHERE    c.xid = i.xid
         < ...even more sql >;
'2021-06-23T08:02:30Z UTC [ db=dev LOG: SELECT per_commit_info.day AS day, COUNT(*) AS commits,

and I want to eventually get a data store that looks like this:

[
{
   'timestamp': '2021-06-23T08:02:08Z UTC',
    'db': 'dev',
   'query': 'LOG: BEGIN;',
},
{
   'timestamp': '2021-06-23T08:02:08Z UTC',
    'db': 'dev',
   'query': 'LOG: <Extremely long query string',
},

]

Some of the problems here are that the queries can be multiline and so newlines are not nec

So I have a regex pattern that looks like this:

"(?P<query_date>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z UTC) \[ db=(?P<db>\w*) LOG:(?P<query_text>.*)",

which I think is close to right. How do I use this to capture all of the matching groups in this file? Can anyone help with this code?

Is the code something like this:

import re
pattern = re.compile(<my pattern>)

for i, line in enumerate(open(<my file>)):
    for match in re.finditer(pattern, line):
       <add matching group to empty array after making a dictionary>

Is it something like that? One thing to note is that some of the queries do not end in a semi-colon!

I have, and the problem is how to handle the newlines in the query. Your comment is a bit unhelpful.
– Jwan622, Commented Jun 24, 2021 at 14:11
Can you explain a bit more? I just don't follow what that means.
– Jwan622, Commented Jun 24, 2021 at 14:53
So a lot of my score comes from asking good questions and people upvoting them? Just because I have a high score doesn't mean I know the answer to the questions I ask. This is meant to help me along, right?
– Jwan622, Commented Jun 24, 2021 at 14:58
So you want those entries without db information to be excluded?
– x squared, Commented Jun 24, 2021 at 15:28

x squared · Accepted Answer · 2021-06-24 16:19:15Z

Assuming that queries end with a semicolon, you can change the query_text part of the regex as follows:

(?P<query_text>[\w|\W]*?;)

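Inside the character class, \w and \W together match any character, including newlines. The | there is a literal pipe rather than alternation, so [\w\W] is equivalent.
The *? makes the match lazy, so it stops at the first semicolon it encounters.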

See https://regex101.com/r/5URVDX/1
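
As a rough illustration of the lazy-match idea, here is a minimal sketch that plugs the semicolon-terminated group into the pattern from the question. The sample text is made up to mirror the log format, and the names PATTERN, SAMPLE and records are only illustrative. It assumes every query ends with a semicolon.

```python
import re

# Pattern from the question, with query_text replaced by a lazy
# "anything up to the first semicolon" group. [\w\W] matches any
# character, newlines included.
PATTERN = re.compile(
    r"(?P<query_date>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z UTC) "
    r"\[ db=(?P<db>\w*) LOG:(?P<query_text>[\w\W]*?;)"
)

# Made-up sample in the same shape as the log shown in the question.
SAMPLE = (
    "'2021-06-23T08:02:08Z UTC [ db=dev LOG: BEGIN;\n"
    "'2021-06-23T08:02:08Z UTC [ db=dev LOG: SELECT action,\n"
    "       status\n"
    "FROM t;\n"
)

# One dict per match, keyed by the named groups.
records = [m.groupdict() for m in PATTERN.finditer(SAMPLE)]
for record in records:
    print(record)
```

Note that the captured text keeps the leading space after "LOG:"; call .strip() on it if that is unwanted.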

If you want to also match those entries without db, make that part optional:

(db=(?P<db>\w+) )?

https://regex101.com/r/9fYedt/1
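
A small sketch of how the optional group behaves, assuming a hypothetical entry that has no db= part (the second sample line is invented for illustration). When the optional group does not participate in the match, the named group comes back as None.

```python
import re

# Same pattern as in the question, but the "db=... " part is wrapped in an
# optional group so entries without it still match.
PATTERN = re.compile(
    r"(?P<query_date>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z UTC) "
    r"\[ (db=(?P<db>\w+) )?LOG:(?P<query_text>[\w\W]*?;)"
)

# Invented sample: the second entry deliberately lacks the db= part.
SAMPLE = (
    "'2021-06-23T08:02:08Z UTC [ db=dev LOG: BEGIN;\n"
    "'2021-06-23T08:02:09Z UTC [ LOG: SELECT 1;\n"
)

matches = list(PATTERN.finditer(SAMPLE))
dbs = [m.group("db") for m in matches]
print(dbs)
```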

If a query can span multiple lines, you can’t iterate the file line by line, so you have to read the whole file into memory:

with open(<my file>) as f:
    text = f.read()
for match in re.finditer(pattern, text):
    # do your stuff
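
Putting the pieces together, here is one possible sketch that reads a whole file and builds the list of dicts described in the question. The group names are renamed to timestamp, db and query so that groupdict() yields the desired keys directly; that renaming, the helper name parse_log, and the temporary sample file are assumptions made for illustration. It still relies on every query ending in a semicolon.

```python
import re
import tempfile
from pathlib import Path

# Group names chosen to match the keys wanted in the question. The "LOG:"
# prefix is kept inside the query group, as in the desired output.
PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z UTC) "
    r"\[ db=(?P<db>\w*) (?P<query>LOG:[\w\W]*?;)"
)


def parse_log(path):
    """Read the whole file at once and return one dict per matched entry."""
    text = Path(path).read_text()
    return [m.groupdict() for m in PATTERN.finditer(text)]


# Write a throwaway sample file so the sketch can be run as-is.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "sample.log"
    log_path.write_text(
        "'2021-06-23T08:02:08Z UTC [ db=dev LOG: BEGIN;\n"
        "'2021-06-23T08:02:08Z UTC [ db=dev LOG: SELECT 1\n"
        "FROM t;\n"
    )
    entries = parse_log(log_path)

for entry in entries:
    print(entry)
```

Reading the file in one go is fine for logs that fit in memory; a very large log would need a different approach, such as accumulating lines until the next timestamp appears.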

That said, I can see in your example that there are queries not ending with a semicolon. You need to define a terminating character and adjust your file/regex accordingly.

edited Jun 24, 2021 at 16:19

answered Jun 24, 2021 at 15:37

x squared


6 Comments

Jwan622 Over a year ago

sorry the db=dev is always there

Jwan622 Over a year ago

the non terminating character is annoying

Jwan622 Over a year ago

I think the reason some queries have no terminating character is that some text editors allow queries to end without a semicolon. This is going to cause a problem here. Is there any way around this? What else can I use as a regex?

x squared Over a year ago

You could use a positive lookahead that matches the date in the next line (which of course doesn’t match the last line in the file then): (?=\n'\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z UTC) regex101.com/r/Qh603J/1

Jwan622 Over a year ago

If it's incomplete and leaves the last line out, that won't work
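
One way to address the last-entry concern raised in the comments, sketched here as a possibility rather than a tested solution for the real log: extend the lookahead with an alternative that also accepts the end of the text (\Z), so the final entry is captured as well. The sample text and the names TIMESTAMP, PATTERN and records are illustrative.

```python
import re

TIMESTAMP = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z UTC"

# The query runs lazily until either the next entry (a newline followed by
# a quoted timestamp) or the end of the text, ignoring trailing whitespace.
PATTERN = re.compile(
    r"(?P<query_date>" + TIMESTAMP + r") \[ db=(?P<db>\w*) "
    r"LOG:(?P<query_text>[\w\W]*?)"
    r"(?=\n'" + TIMESTAMP + r"|\s*\Z)"
)

# Made-up sample: none of these queries ends with a semicolon.
SAMPLE = (
    "'2021-06-23T08:02:08Z UTC [ db=dev LOG: SET query_group to ''\n"
    "'2021-06-23T08:02:22Z UTC [ db=dev LOG: SELECT 1\n"
    "'2021-06-23T08:02:30Z UTC [ db=dev LOG: SELECT a,\n"
    "       b\n"
    "FROM t\n"
)

records = [
    {
        "timestamp": m.group("query_date"),
        "db": m.group("db"),
        "query": m.group("query_text").strip(),
    }
    for m in PATTERN.finditer(SAMPLE)
]
for record in records:
    print(record)
```

A caveat: if a query body itself contains a line that starts with a quote followed by a timestamp in this format, the entry would be split early.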
