Parse log entry with regex into multiple pieces

Question

I have this line of text here, which will always be the same (except the message at the end):

2021-12-08T18:18:38+00:00 INFO Produktbestand erfolgreich von Collmex abgerufen | "STOCK_AVAILABLE;23;1;363;PCE;-1\r\nMESSAGE;S;204020;Daten?bertragung erfolgreich. Es wurden 1 Datens?tze verarbeitet.\r\n"

I have 3 functions which should return parts of the log entry:

public function get_log_file_entry_time( string $entry ): string {
    
}

public function get_log_file_entry_level( string $entry ): string {

}

public function get_log_file_entry_message( string $entry ): string {

}

I first tried explode() with a space as the delimiter, which works, but it is not the best approach since the log message can be very long in some cases.

I'm no regex expert, but I found the following pattern, which matches the first two pieces: ([^\s]+) ([A-Z]+)

This gives me the timestamp and the level. Now I'm struggling to capture the message after the second group; maybe my grouping is off. Any advice would be appreciated!

Notice

The message will start after the first whitespace after the logging level. For example:

Produktbestand erfolgreich von Collmex abgerufen | "STOCK_AVAILABLE;23;1;363;PCE;-1\r\nMESSAGE;S;204020;Daten?bertragung erfolgreich. Es wurden 1 Datens?tze verarbeitet.\r\n"

If the message is the part before the pipe char, then perhaps like this: ^(\S+)\h([A-Z]+)\h([^|]+) regex101.com/r/CyMiDJ/1
– The fourth bird, Commented Dec 8, 2021 at 22:16
The pipe is part of the message! The message will begin after the log level.
– Mr. Jo, Commented Dec 8, 2021 at 22:18
So you want to match the rest of the string, including the newlines? Is this the only string or are there more strings with the same format? Matching the rest of the string can be like this: (?s)^(\S+)\h+([A-Z]+)\h+(.+) regex101.com/r/WkuRgY/1 but if there are more lines that start with a date and time, it will overmatch.
– The fourth bird, Commented Dec 8, 2021 at 22:22
You might use a pattern like ^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\+\d{2}:\d{2})\h+([A-Z]+)\h+(.*(?:\R(?!(?1)).*)*) for multiple lines regex101.com/r/V8wUYy/1
– The fourth bird, Commented Dec 8, 2021 at 22:27
Are tab characters used as delimiters? Or are all of the parts separated by a single space? [^\s] is more elegantly written as \S, but if all delimiters are single spaces, then [^ ] is also appropriate.
– mickmackusa ♦, Commented Dec 9, 2021 at 1:03

mickmackusa · Accepted Answer · 2021-12-09 01:05:24Z

You can use 3 capture groups, where the 3rd group contains the rest of the line, followed by all lines that do not start with a date-and-time-like pattern.

You can make the pattern a bit more specific for group 1. To match the following lines that do not start with the group 1 pattern, you can recurse the first subpattern using (?1)

^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\+\d{2}:\d{2})\h+([A-Z]+)\h+(.*(?:\R(?!(?1)).*)*)

In parts, the pattern matches:

^ Start of string (or start of each line when the m flag is set, as in the example below)
(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\+\d{2}:\d{2}) Capture group 1, match a date-and-time-like pattern
\h+ Match 1+ horizontal whitespace chars
([A-Z]+) Capture group 2, match 1+ uppercase chars A-Z
\h+ Match 1+ horizontal whitespace chars
( Capture group 3
- .* Match the rest of the line
- (?:\R(?!(?1)).*)* Optionally repeat matching a newline and the rest of that line, asserting that what is directly to the right of the newline does not match subpattern 1 (the pattern of group 1)
) Close group 3
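The breakdown above can also be written into the pattern itself by adding PCRE's x (extended) flag, which ignores unescaped whitespace in the pattern and treats # as the start of a comment. This is only a sketch of the same pattern with a made-up two-entry sample string; the comments must not contain the / delimiter or a single quote.

```php
<?php
// Same pattern as above, laid out with the x flag so every part carries a comment.
// The m flag makes ^ match at the start of every line, not only at the start of the string.
$re = '/
    ^                                                    # start of a line
    (\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\+\d{2}:\d{2})   # group 1: date and time
    \h+                                                  # horizontal whitespace
    ([A-Z]+)                                             # group 2: log level
    \h+                                                  # horizontal whitespace
    (                                                    # group 3: the message
        .*                                               #   rest of the first line
        (?:\R(?!(?1)).*)*                                #   further lines that do not start like group 1
    )
/mx';

// Hypothetical sample: a two-line INFO entry followed by a one-line ERROR entry.
$log = "2021-12-08T18:18:38+00:00 INFO first line\nsecond line\n"
     . "2021-12-08T18:19:00+00:00 ERROR next entry";

preg_match_all($re, $log, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
    echo $match[2], ': ', json_encode($match[3]), PHP_EOL;
}
// INFO: "first line\nsecond line"
// ERROR: "next entry"
```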

See a regex demo and a PHP demo.

For example, with 2 entries that both start with the same pattern:

$re = '/^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\+\d{2}:\d{2})\h+([A-Z]+)\h+(.*(?:\R(?!(?1)).*)*)/m';
$str = '2021-12-08T18:18:38+00:00 INFO Produktbestand erfolgreich von Collmex abgerufen | "STOCK_AVAILABLE;23;1;363;PCE;-1
MESSAGE;S;204020;Daten?bertragung erfolgreich. Es wurden 1 Datens?tze verarbeitet.
"
2021-12-08T18:18:38+00:00 INFO Produktbestand erfolgreich von Collmex abgerufen | "STOCK_AVAILABLE;23;1;363;PCE;-1
MESSAGE;S;204020;Daten?bertragung erfolgreich. Es wurden 1 Datens?tze verarbeitet.
"';

preg_match_all($re, $str, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
    print_r($match);
}

Output

Array
(
    [0] => 2021-12-08T18:18:38+00:00 INFO Produktbestand erfolgreich von Collmex abgerufen | "STOCK_AVAILABLE;23;1;363;PCE;-1
MESSAGE;S;204020;Daten?bertragung erfolgreich. Es wurden 1 Datens?tze verarbeitet.
"
    [1] => 2021-12-08T18:18:38+00:00
    [2] => INFO
    [3] => Produktbestand erfolgreich von Collmex abgerufen | "STOCK_AVAILABLE;23;1;363;PCE;-1
MESSAGE;S;204020;Daten?bertragung erfolgreich. Es wurden 1 Datens?tze verarbeitet.
"
)
Array
(
    [0] => 2021-12-08T18:18:38+00:00 INFO Produktbestand erfolgreich von Collmex abgerufen | "STOCK_AVAILABLE;23;1;363;PCE;-1
MESSAGE;S;204020;Daten?bertragung erfolgreich. Es wurden 1 Datens?tze verarbeitet.
"
    [1] => 2021-12-08T18:18:38+00:00
    [2] => INFO
    [3] => Produktbestand erfolgreich von Collmex abgerufen | "STOCK_AVAILABLE;23;1;363;PCE;-1
MESSAGE;S;204020;Daten?bertragung erfolgreich. Es wurden 1 Datens?tze verarbeitet.
"
)
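For the three functions in the question, which each receive a single entry, one possible shape is a shared private helper that runs the pattern once and returns the three groups. This is a sketch, not part of the original answer: the class name LogEntryParser, the helper parse() and the choice to throw on a non-matching entry are assumptions, and the sample entry is shortened.

```php
<?php
// Hypothetical wrapper class; only the three public method names come from the question.
final class LogEntryParser
{
    // Same pattern as above, without the m flag because each call receives one entry.
    private const ENTRY_PATTERN =
        '/^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\+\d{2}:\d{2})\h+([A-Z]+)\h+(.*(?:\R(?!(?1)).*)*)/';

    // Splits one entry into its three parts or throws if the format is not recognised.
    private function parse( string $entry ): array {
        if ( preg_match( self::ENTRY_PATTERN, $entry, $m ) !== 1 ) {
            throw new InvalidArgumentException( 'Unrecognised log entry format.' );
        }

        return [ 'time' => $m[1], 'level' => $m[2], 'message' => $m[3] ];
    }

    public function get_log_file_entry_time( string $entry ): string {
        return $this->parse( $entry )['time'];
    }

    public function get_log_file_entry_level( string $entry ): string {
        return $this->parse( $entry )['level'];
    }

    public function get_log_file_entry_message( string $entry ): string {
        return $this->parse( $entry )['message'];
    }
}

$parser = new LogEntryParser();
// Shortened sample entry with real CRLF line breaks inside the message.
$entry = "2021-12-08T18:18:38+00:00 INFO Stock fetched | \"A;1\r\nB;2\r\n\"";

echo $parser->get_log_file_entry_time( $entry ), PHP_EOL;  // 2021-12-08T18:18:38+00:00
echo $parser->get_log_file_entry_level( $entry ), PHP_EOL; // INFO
echo $parser->get_log_file_entry_message( $entry ), PHP_EOL;
```

Throwing on a mismatch is one option; returning an empty string would also satisfy the string return types if a malformed entry should not interrupt processing.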

drew010 · Answer · 2021-12-08 22:43:04Z

Here's a simple method with explode() and its limit parameter.

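// $str holds the log entry from the question.
// A limit of 3 returns at most three elements; the last one keeps the rest of the string, spaces included.
list($date, $severity, $message) = explode(' ', $str, 3);

var_dump($date, $severity, $message);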
/*
string(25) "2021-12-08T18:18:38+00:00"
string(4) "INFO"
string(170) "Produktbestand erfolgreich von Collmex abgerufen | "STOCK_AVAILABLE;23;1;363;PCE;-1 MESSAGE;S;204020;Daten?bertragung erfolgreich. Es wurden 1 Datens?tze verarbeitet.""
*/

As long as the fields before the message are always separated by exactly one space, and neither of them can contain a space, this will work. If a field before the message sometimes contains spaces, the split will not be consistent.
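A small sketch of that caveat, using a made-up helper name (split_log_entry is not from the answer) and shortened sample entries: line breaks in the message survive because the third element is simply the untouched remainder, while a doubled space between the first two fields shifts every part.

```php
<?php
// Hypothetical helper around explode() with a limit of 3.
function split_log_entry( string $entry ): array {
    // At most three elements; the third is the remainder of the string, untouched.
    $parts = explode( ' ', $entry, 3 );

    if ( count( $parts ) !== 3 ) {
        throw new InvalidArgumentException( 'Expected "<time> <level> <message>".' );
    }

    return $parts;
}

// Single spaces between the fields: the message keeps its pipe, quotes and line breaks.
[ $date, $severity, $message ] = split_log_entry(
    "2021-12-08T18:18:38+00:00 INFO Here is a message | \"more\r\ntext\""
);
var_dump( $severity === 'INFO' );                                       // bool(true)
var_dump( $message === "Here is a message | \"more\r\ntext\"" );        // bool(true)

// Two spaces after the timestamp: the level becomes an empty string
// and the real level ends up at the front of the message.
[ $date2, $severity2, $message2 ] = split_log_entry(
    '2021-12-08T18:18:38+00:00  INFO Here is a message'
);
var_dump( $severity2 === '' );                                          // bool(true)
var_dump( $message2 === 'INFO Here is a message' );                     // bool(true)
```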

5 Comments

Mr. Jo Over a year ago

Love the idea! Since I might add some whitespace in the future, I will choose the RegEx one, but from the current point of view this is genius!

mickmackusa Over a year ago

How would whitespace damage this clean, direct technique?

drew010 Over a year ago

@mickmackusa Only extra whitespace in the first 2 fields could affect it. E.g. "DATE INFO Here is a message" is fine, but if the second "field" could consist of two words, like "DATE INFO EXTRA Here is another message", then the extra word would be combined with the message. The regex could optionally capture the middle part, or fail if it doesn't match, whereas this will always return something, but the elements might not be what you expect if extra spaces are introduced.

mickmackusa Over a year ago

So you haven't represented the complexity of your project data in your question?

Mr. Jo Over a year ago

@mickmackusa I'm talking about ideas that may be used in the future, but my question reflects the current status and not just ideas.
