preg_split on regex line start

Question

I'm trying to format the following file:

[30-05-2013 15:45:54] A A
[26-06-2013 14:44:44] B A
[26-06-2013 14:44:44] C A
[26-06-2013 14:43:16] Some lines are so large, they take multiple lines, so explode('\n') won't work because
I need the complete message
[26-06-2013 14:44:44] E A
[26-06-2013 14:44:44] F A
[26-06-2013 14:44:44] G A

Expected output:

Array
(
    [0] => [30-05-2013 15:45:54] A A
    [1] => [26-06-2013 14:44:44] B A
    [2] => [26-06-2013 14:44:44] C A
    [3] => [26-06-2013 14:43:16] Some lines are so large, they take multiple lines, so 
            explode('\n') won't work because
            I need the complete message
    [4] => [26-06-2013 14:44:44] E A
    ...
)

Based on "How do I include the split delimiter in results for preg_split()?", I tried to use a positive lookbehind to keep the timestamps and came up with this regex (tested on Regex101):

(?<=\[)(.+)(?<=\])(.+)

It is used in the following PHP code:

#!/usr/bin/env php
<?php

    class Chat {

        function __construct() {

            // Read chat file
            $this->f = file_get_contents(__DIR__ . '/testchat.txt');

            // Split on '[\d]'
            $r = "/(?<=\[)(.+)(?<=\])(.+)/";
            $l = preg_split($r, $this->f, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

            var_dump(count($l));
            var_dump($l);
        }
    }
$c = new Chat();

This gives me the following output:

array(22) {
  [0]=>
  string(1) "["
  [1]=>
  string(20) "30-05-2013 15:45:54]"
  [2]=>
  string(4) " A A"
  [3]=>
  string(2) "
["
  [4]=>
  string(20) "26-06-2013 14:44:44]"
  [5]=>
  string(4) " B A"
  [6]=>
  string(2) "
["
  [7]=>
  string(20) "26-06-2013 14:44:44]"
  [8]=>
  string(4) " C A"
  [9]=>
  string(2) "
["
  [10]=>
  string(20) "26-06-2013 14:43:16]"
  [11]=>
  string(87) " Some lines are so large, they take multiple lines, so explode('\n') won't work because"
  [12]=>
  string(30) "
I need the complete message
["

Question

Why is the first [ being ignored?
How should I change the regex to get the desired output?
Why are there still empty strings with PREG_SPLIT_NO_EMPTY?

Does this work for you - (\[.*?\])([^\[]+) ?

– Pedro Lobito, Commented Apr 26, 2020 at 18:15

Wiktor Stribiżew · Accepted Answer · 2020-04-26 18:22:08Z

2

With preg_split, you may use

'~\R+(?=\[\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}])~'

See the regex demo

Details

\R+ - 1+ line break sequences
(?=\[\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}]) - a positive lookahead that requires, immediately to the right of the current location:
- \[ - a [ char
- \d{2}-\d{2}-\d{4} - a date-like pattern: 2 digits, hyphen, 2 digits, hyphen, 4 digits
- (space) - a single space char
- \d{2}:\d{2}:\d{2}] - a time-like pattern (2 digits, :, 2 digits, :, 2 digits) followed by a ] char.

PHP demo:

$text = "[30-05-2013 15:45:54] A A
[26-06-2013 14:44:44] B A
[26-06-2013 14:44:44] C A
[26-06-2013 14:43:16] Some lines are so large, they take multiple lines, so explode('\n') won't work because
I need the complete message
[26-06-2013 14:44:44] E A
[26-06-2013 14:44:44] F A
[26-06-2013 14:44:44] G A";

print_r(preg_split('~\R+(?=\[\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}])~', $text));

Output:

Array
(
    [0] => [30-05-2013 15:45:54] A A
    [1] => [26-06-2013 14:44:44] B A
    [2] => [26-06-2013 14:44:44] C A
    [3] => [26-06-2013 14:43:16] Some lines are so large, they take multiple lines, so explode('
') won't work because
I need the complete message
    [4] => [26-06-2013 14:44:44] E A
    [5] => [26-06-2013 14:44:44] F A
    [6] => [26-06-2013 14:44:44] G A
)

If you need more than a plain split, for example the timestamp and the message as separate values, you may use a matching approach with

'~^\[(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})]\s*+(.*?)(?=\s*^\[(?1)]|\z)~ms'

See the regex demo, use it as

preg_match_all('~^\[(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})]\s*+(.*?)(?=\s*^\[(?1)]|\z)~ms', $text, $matches);

It will match

^ - start of a line
\[(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})] - the datetime details (captured into Group 1)
\s*+ - 0+ whitespaces (possessively)
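(.*?) - any 0+ chars, as few as possible (captured into Group 2), up to the first location where the following lookahead succeeds
(?=\s*^\[(?1)]|\z) - a positive lookahead that matches a location that is immediately followed with
- \s* - 0+ whitespaces
- ^ - start of a line
- \[(?1)] - [, the Group 1 pattern, ]
- | - or
- \z - the very end of the string.

The m flag makes ^ match at the start of each line, and the s flag lets . match line break chars.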
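
As a rough sketch of how this matching approach could be used to get each entry as a timestamp/message pair: the sample text, the $entries variable and the array keys below are illustrative assumptions, not part of the original answer, and the sample is assumed to use \n line endings with no trailing line break.

```php
<?php
// Sample log (assumed for illustration); nowdoc keeps the text literal.
$text = <<<'LOG'
[30-05-2013 15:45:54] A A
[26-06-2013 14:43:16] Some lines are long
and continue here
[26-06-2013 14:44:44] E A
LOG;

// Pattern from the answer: Group 1 = timestamp, Group 2 = message.
$pattern = '~^\[(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})]\s*+(.*?)(?=\s*^\[(?1)]|\z)~ms';

$entries = [];
// PREG_SET_ORDER yields one array per match: [full match, group 1, group 2].
if (preg_match_all($pattern, $text, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        $entries[] = ['timestamp' => $m[1], 'message' => $m[2]];
    }
}

print_r($entries);
```

Group 1 holds the timestamp without the surrounding brackets; $m[0] holds the whole entry, brackets included, if that form is preferred.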

edited Apr 26, 2020 at 18:22

answered Apr 26, 2020 at 18:15

Wiktor Stribiżew


4 Comments

0stone0 Over a year ago

Perfect! Many thanks for the details! Do you happen to know if it's possible to split the date and text into an array, like so: [0] => [ '[30-...]', 'A A' ]?

Wiktor Stribiżew Over a year ago

@0stone0 I think I have just answered this question. Just thought you would need that.

Wiktor Stribiżew Over a year ago

@0stone0 I have come up with a much more efficient version of the second regex, '~^\[(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})]\h*+(.*(?:\R(?!\[(?1)]).*)*)~m', see here

0stone0 Over a year ago

Works great with speed improvement! Thanks!
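
A minimal sketch comparing the two matching patterns from this thread on a small sample. The helper name extractPairs and the sample text are assumptions made for illustration; the sketch only checks that both patterns produce the same pairs on this input (assumed to use \n line endings) and does not measure speed.

```php
<?php
// Sample log (assumed for illustration), without a trailing line break.
$text = <<<'LOG'
[30-05-2013 15:45:54] A A
[26-06-2013 14:43:16] Some lines are long
and continue here
[26-06-2013 14:44:44] E A
LOG;

// Lazy pattern from the answer (needs the s flag so . can cross line breaks).
$lazy = '~^\[(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})]\s*+(.*?)(?=\s*^\[(?1)]|\z)~ms';

// Unrolled pattern from the comment: consume one line, then any following
// lines that do not start with a timestamp.
$unrolled = '~^\[(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})]\h*+(.*(?:\R(?!\[(?1)]).*)*)~m';

// Hypothetical helper: returns [timestamp, message] pairs for a pattern.
function extractPairs(string $pattern, string $text): array
{
    preg_match_all($pattern, $text, $matches, PREG_SET_ORDER);
    return array_map(function (array $m): array {
        return [$m[1], $m[2]];
    }, $matches);
}

$a = extractPairs($lazy, $text);
$b = extractPairs($unrolled, $text);

var_dump($a === $b); // compares the pairs from both patterns
print_r($b);
```

The two patterns may still differ on edge cases such as a trailing line break at the end of the input, so it is worth checking them against real data before swapping one for the other.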

Pedro Lobito · Answer · 2020-04-26 18:23:05Z

0

Late answer, but you can also use:

$text =  file_get_contents("testchat.txt");

preg_match_all('/(\[.*?\])([^\[]+)/im', $text, $matches, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($matches[0]); $i++) {
    $date = $matches[1][$i];
    $line = $matches[2][$i];
    print("$date $line");
}

answered Apr 26, 2020 at 18:23

Pedro Lobito


1 Comment

Wiktor Stribiżew Over a year ago

There is a potential problem with this solution: if the text after a timestamp can contain a [ char, the match will stop too early or the text will be split too much, see demo.
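
The issue described in this comment can be reproduced with a small sample. This is only a sketch: the sample text is made up for illustration, and the two patterns are the ones quoted in the answers above.

```php
<?php
// Sample where a message itself contains a bracketed token.
$text = <<<'LOG'
[26-06-2013 14:44:44] see item [1] in the list
[26-06-2013 14:45:00] ok
LOG;

// Pattern from this answer: the message group stops at the next "[".
preg_match_all('/(\[.*?\])([^\[]+)/im', $text, $matches, PREG_PATTERN_ORDER);
print_r($matches[1]); // "[1]" is captured as if it were a timestamp

// Split pattern from the other answer: it only breaks before a full
// dd-mm-yyyy hh:mm:ss timestamp, so "[1]" stays inside the message.
$parts = preg_split('~\R+(?=\[\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}])~', $text);
print_r($parts);
```

On this sample the first pattern yields three matches for two log entries, while the split keeps each entry whole.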
