
I have log lines in the following format. The format is always the same; only the message at the end varies:

2021-12-08T18:18:38+00:00 INFO Produktbestand erfolgreich von Collmex abgerufen | "STOCK_AVAILABLE;23;1;363;PCE;-1\r\nMESSAGE;S;204020;Daten?bertragung erfolgreich. Es wurden 1 Datens?tze verarbeitet.\r\n"

I have 3 functions which should return parts of the log entry:

public function get_log_file_entry_time( string $entry ): string {
    
}

public function get_log_file_entry_level( string $entry ): string {

}

public function get_log_file_entry_message( string $entry ): string {

}

I first tried explode() with a space as the delimiter. That works, but it is not ideal because the log message can be very long in some cases.

I'm no regex expert, but I found the following pattern, which matches the first two pieces: ([^\s]+) ([A-Z]+)

This gives me the timestamp and the level. Now I'm struggling to capture the message after the second group; maybe my grouping is off. Any advice would make me happy!

Note

The message will start after the first whitespace after the logging level. For example:

Produktbestand erfolgreich von Collmex abgerufen | "STOCK_AVAILABLE;23;1;363;PCE;-1\r\nMESSAGE;S;204020;Daten?bertragung erfolgreich. Es wurden 1 Datens?tze verarbeitet.\r\n"

  • If the message is the part before the pipe char, then perhaps like this: ^(\S+)\h([A-Z]+)\h([^|]+) regex101.com/r/CyMiDJ/1 Commented Dec 8, 2021 at 22:16
  • The pipe is part of the message! The message will begin after the log level. Commented Dec 8, 2021 at 22:18
  • So you want to match the rest of the string, including the newlines? Is this the only string, or are there more strings with the same format? Matching the rest of the string can be done like this: (?s)^(\S+)\h+([A-Z]+)\h+(.+) regex101.com/r/WkuRgY/1 but if there are more lines that start with a date and time, it will overmatch. Commented Dec 8, 2021 at 22:22
  • For multiple lines, you might use a pattern like ^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\+\d{2}:\d{2})\h+([A-Z]+)\h+(.*(?:\R(?!(?1)).*)*) regex101.com/r/V8wUYy/1 Commented Dec 8, 2021 at 22:27
  • Are tab characters used as delimiters, or are all of the parts separated by a single space? [^\s] is more elegantly written as \S, but if all delimiters are single spaces, then [^ ] is also appropriate. Commented Dec 9, 2021 at 1:03

2 Answers

You can use 3 capture groups, where the 3rd group contains the rest of the line plus all following lines that do not start with a date-and-time-like pattern.

You can make the pattern for group 1 a bit more specific. To match the following lines that do not start with the group 1 pattern, you can recurse the first subpattern using (?1):

^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\+\d{2}:\d{2})\h+([A-Z]+)\h+(.*(?:\R(?!(?1)).*)*)

In parts, the pattern matches:

  • ^ Start of string (or start of each line when the m flag is set)
  • (\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\+\d{2}:\d{2}) Capture group 1, match a date and time like pattern
  • \h+ Match 1+ horizontal whitespace chars
  • ([A-Z]+) Capture group 2, match 1+ uppercase chars A-Z
  • \h+ Match 1+ horizontal whitespace chars
  • ( Capture group 3
    • .* Match the rest of the line
    • (?:\R(?!(?1)).*)* Optionally repeat: match a newline, assert that what is directly to the right of it does not match subpattern 1 (the pattern of group 1), then match the rest of that line
  • ) Close group 3

See a regex demo and a PHP demo.

For example with 2 lines, both starting with the same pattern:

$re = '/^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\+\d{2}:\d{2})\h+([A-Z]+)\h+(.*(?:\R(?!(?1)).*)*)/m';
$str = '2021-12-08T18:18:38+00:00 INFO Produktbestand erfolgreich von Collmex abgerufen | "STOCK_AVAILABLE;23;1;363;PCE;-1
MESSAGE;S;204020;Daten?bertragung erfolgreich. Es wurden 1 Datens?tze verarbeitet.
"
2021-12-08T18:18:38+00:00 INFO Produktbestand erfolgreich von Collmex abgerufen | "STOCK_AVAILABLE;23;1;363;PCE;-1
MESSAGE;S;204020;Daten?bertragung erfolgreich. Es wurden 1 Datens?tze verarbeitet.
"';

preg_match_all($re, $str, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
    print_r($match);
}

Output

Array
(
    [0] => 2021-12-08T18:18:38+00:00 INFO Produktbestand erfolgreich von Collmex abgerufen | "STOCK_AVAILABLE;23;1;363;PCE;-1
MESSAGE;S;204020;Daten?bertragung erfolgreich. Es wurden 1 Datens?tze verarbeitet.
"
    [1] => 2021-12-08T18:18:38+00:00
    [2] => INFO
    [3] => Produktbestand erfolgreich von Collmex abgerufen | "STOCK_AVAILABLE;23;1;363;PCE;-1
MESSAGE;S;204020;Daten?bertragung erfolgreich. Es wurden 1 Datens?tze verarbeitet.
"
)
Array
(
    [0] => 2021-12-08T18:18:38+00:00 INFO Produktbestand erfolgreich von Collmex abgerufen | "STOCK_AVAILABLE;23;1;363;PCE;-1
MESSAGE;S;204020;Daten?bertragung erfolgreich. Es wurden 1 Datens?tze verarbeitet.
"
    [1] => 2021-12-08T18:18:38+00:00
    [2] => INFO
    [3] => Produktbestand erfolgreich von Collmex abgerufen | "STOCK_AVAILABLE;23;1;363;PCE;-1
MESSAGE;S;204020;Daten?bertragung erfolgreich. Es wurden 1 Datens?tze verarbeitet.
"
)
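
As a rough sketch of how this pattern could back the three functions from the question: the helper name parse_log_entry() is hypothetical (it is not in the question or in this answer), and returning an empty string for a non-matching entry is an assumption about the desired behaviour. The functions are written as plain functions here; inside a class they would be methods. Note that the pattern only accepts a "+hh:mm" offset, as in the sample line.

```php
<?php
// Hypothetical helper (an assumption, not part of the original code):
// splits one log entry into time, level and message, or returns null
// when the entry does not match the expected format.
function parse_log_entry(string $entry): ?array
{
    $re = '/^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\+\d{2}:\d{2})\h+([A-Z]+)\h+(.*(?:\R(?!(?1)).*)*)/';

    if (preg_match($re, $entry, $m) !== 1) {
        return null;
    }

    return ['time' => $m[1], 'level' => $m[2], 'message' => $m[3]];
}

// Each getter falls back to an empty string when the entry does not match
// (assumed behaviour; throwing an exception would be another option).
function get_log_file_entry_time(string $entry): string
{
    return parse_log_entry($entry)['time'] ?? '';
}

function get_log_file_entry_level(string $entry): string
{
    return parse_log_entry($entry)['level'] ?? '';
}

function get_log_file_entry_message(string $entry): string
{
    return parse_log_entry($entry)['message'] ?? '';
}

// Usage with a shortened, made-up entry that spans two lines.
$entry = "2021-12-08T18:18:38+00:00 INFO First line\nsecond line";
var_dump(get_log_file_entry_time($entry));
var_dump(get_log_file_entry_level($entry));
var_dump(get_log_file_entry_message($entry));
```

Parsing once in a shared helper keeps the pattern in a single place, at the cost of running the match again for every getter call.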

Here's a simple method with explode() and its limit parameter.

list($date, $severity, $message) = explode(' ', $str, 3);

var_dump($date, $severity, $message);
/*
string(25) "2021-12-08T18:18:38+00:00"
string(4) "INFO"
string(170) "Produktbestand erfolgreich von Collmex abgerufen | "STOCK_AVAILABLE;23;1;363;PCE;-1 MESSAGE;S;204020;Daten?bertragung erfolgreich. Es wurden 1 Datens?tze verarbeitet.""
*/

This works as long as the timestamp and the level are each followed by exactly one space and neither of them can contain a space. If any part before the message sometimes contains spaces, the split will not be consistent.
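
To make that caveat concrete, here is a small sketch. The wrapper name split_log_entry() is hypothetical, and padding short input with empty strings is an assumption added so that destructuring never hits a missing index.

```php
<?php
// Hypothetical wrapper around explode() with a limit of 3.
function split_log_entry(string $entry): array
{
    // Split on the first two spaces only; the remainder stays in one piece.
    $parts = explode(' ', $entry, 3);

    // Pad to three elements so callers can always destructure the result.
    return array_pad($parts, 3, '');
}

// Expected layout: the message keeps its own spaces.
[$date, $severity, $message] = split_log_entry('DATE INFO Here is a message');
var_dump($message); // string(17) "Here is a message"

// An extra word before the message ends up inside the message.
[$date, $severity, $message] = split_log_entry('DATE INFO EXTRA Here is another message');
var_dump($message); // string(29) "EXTRA Here is another message"
```

Unlike a regex that fails to match, this always returns three strings, so a malformed entry has to be detected by checking the parts afterwards.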

5 Comments

Love the idea! Since I might add some whitespace in the future, I will choose the RegEx one, but from the current point of view this is genius!
How would whitespace damage this clean, direct technique?
@mickmackusa Only extra whitespace in the first 2 fields could affect it. e.g.: DATE INFO Here is a message (good), but if the second "field" might be more than 2 words like DATE INFO EXTRA Here is another message then the 3rd field would be combined with the message. The regex could optionally capture the middle, or fail if it doesn't match, whereas this will always return something, but each element might not match the expected if extra spaces are introduced.
So you haven't represented the complexity of your project data in your question?
@mickmackusa I'm talking about ideas that may be used in the future, but my question reflects the current status and not just ideas.
