I'm trying to parse the following log file into an array with one entry per message:

[30-05-2013 15:45:54] A A
[26-06-2013 14:44:44] B A
[26-06-2013 14:44:44] C A
[26-06-2013 14:43:16] Some lines are so large, they take multiple lines, so explode('\n') won't work because
I need the complete message
[26-06-2013 14:44:44] E A
[26-06-2013 14:44:44] F A
[26-06-2013 14:44:44] G A

Expected output:

Array
(
    [0] => [30-05-2013 15:45:54] A A
    [1] => [26-06-2013 14:44:44] B A
    [2] => [26-06-2013 14:44:44] C A
    [3] => [26-06-2013 14:43:16] Some lines are so large, they take multiple lines, so 
            explode('\n') won't work because
            I need the complete message
    [4] => [26-06-2013 14:44:44] E A
    ...
)


Based on "How do I include the split delimiter in results for preg_split()?", I tried to use a positive lookbehind to keep the timestamps and came up with this pattern (tested on Regex101):

(?<=\[)(.+)(?<=\])(.+)

It is used in the following PHP code:

#!/usr/bin/env php
<?php

    class Chat {

        function __construct() {

            // Read chat file
            $this->f = file_get_contents(__DIR__ . '/testchat.txt');

            // Split on '[\d]'
            $r = "/(?<=\[)(.+)(?<=\])(.+)/";
            $l = preg_split($r, $this->f, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

            var_dump(count($l));
            var_dump($l);
        }
    }
$c = new Chat();

This gives me the following output:

array(22) {
  [0]=>
  string(1) "["
  [1]=>
  string(20) "30-05-2013 15:45:54]"
  [2]=>
  string(4) " A A"
  [3]=>
  string(2) "
["
  [4]=>
  string(20) "26-06-2013 14:44:44]"
  [5]=>
  string(4) " B A"
  [6]=>
  string(2) "
["
  [7]=>
  string(20) "26-06-2013 14:44:44]"
  [8]=>
  string(4) " C A"
  [9]=>
  string(2) "
["
  [10]=>
  string(20) "26-06-2013 14:43:16]"
  [11]=>
  string(87) " Some lines are so large, they take multiple lines, so explode('\n') won't work because"
  [12]=>
  string(30) "
I need the complete message
["

Question

  1. Why is the first [ being ignored?
  2. How should I change the regex to get the desired output?
  3. Why are there still empty strings with PREG_SPLIT_NO_EMPTY?
  • Does this work for you - (\[.*?\])([^\[]+) ? Commented Apr 26, 2020 at 18:15

2 Answers

With preg_split, you may use

'~\R+(?=\[\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}])~'

See the regex demo

Details

  • \R+ - 1+ line break sequences
  • (?=\[\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}]) - a positive lookahead that requires, immediately to the right of the current location,
    • \[ - a [ char
    • \d{2}-\d{2}-\d{4} - a date-like pattern: 2 digits, hyphen, 2 digits, hyphen and 4 digits
    • a literal space
    • \d{2}:\d{2}:\d{2}] - a time-like pattern (2 digits, :, 2 digits, :, 2 digits) followed by a ] char.

Because the lookahead consumes no text, only the line breaks are removed and each timestamp stays at the start of the next item.

PHP demo:

$text = "[30-05-2013 15:45:54] A A
[26-06-2013 14:44:44] B A
[26-06-2013 14:44:44] C A
[26-06-2013 14:43:16] Some lines are so large, they take multiple lines, so explode('\n') won't work because
I need the complete message
[26-06-2013 14:44:44] E A
[26-06-2013 14:44:44] F A
[26-06-2013 14:44:44] G A";

print_r(preg_split('~\R+(?=\[\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}])~', $text));

Output:

Array
(
    [0] => [30-05-2013 15:45:54] A A
    [1] => [26-06-2013 14:44:44] B A
    [2] => [26-06-2013 14:44:44] C A
    [3] => [26-06-2013 14:43:16] Some lines are so large, they take multiple lines, so explode('
') won't work because
I need the complete message
    [4] => [26-06-2013 14:44:44] E A
    [5] => [26-06-2013 14:44:44] F A
    [6] => [26-06-2013 14:44:44] G A
)

If you need more than a plain split, for example the timestamp and the message as separate values, you may use a matching approach with

'~^\[(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})]\s*+(.*?)(?=\s*^\[(?1)]|\z)~ms'

See the regex demo. Use it as follows:

preg_match_all('~^\[(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})]\s*+(.*?)(?=\s*^\[(?1)]|\z)~ms', $text, $matches)

It will match

  • ^ - start of a line (the m flag makes ^ match at line starts)
  • \[(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})] - the datetime details in square brackets (the datetime itself is captured into Group 1)
  • \s*+ - 0+ whitespaces (possessively)
  • (.*?) - Group 2: any 0+ chars, as few as possible (the s flag lets . match line breaks), up to the first occurrence of
  • (?=\s*^\[(?1)]|\z) - a positive lookahead that matches a location that is immediately followed with
    • \s* - 0+ whitespaces
    • ^ - start of a line
    • \[(?1)] - [, the Group 1 pattern, ]
    • | - or
    • \z - the very end of string.
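To get one [timestamp, message] pair per entry, the matching pattern above can be combined with the PREG_SET_ORDER flag. The following is only a minimal sketch: the sample text and the $entries variable are made up for illustration, and the rtrim() call is an assumption about how a trailing newline at the end of the file should be handled.

```php
<?php
// Hypothetical sample input; in the question this would come from testchat.txt.
$text = "[30-05-2013 15:45:54] A A\n"
      . "[26-06-2013 14:43:16] first line\n"
      . "second line\n"
      . "[26-06-2013 14:44:44] E A";

$re = '~^\[(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})]\s*+(.*?)(?=\s*^\[(?1)]|\z)~ms';

// PREG_SET_ORDER groups the captures per match:
// $m[1] is the timestamp (without brackets), $m[2] is the message text.
preg_match_all($re, $text, $matches, PREG_SET_ORDER);

$entries = [];
foreach ($matches as $m) {
    // rtrim() guards against a trailing newline at the very end of the input.
    $entries[] = [$m[1], rtrim($m[2])];
}

print_r($entries);
```

Group 1 holds the timestamp without the square brackets; if the brackets are wanted, they can be added back when building each pair.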

4 Comments

Perfect! Many thanks for the details! Do you happen to know if it's possible to split the date and text into an array, like so? [0] => [ '[30-...]', 'A A' ]
@0stone0 I think I have just answered this question. Just thought you would need that.
@0stone0 I have come up with a much more efficient version of the second regex, '~^\[(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})]\h*+(.*(?:\R(?!\[(?1)]).*)*)~m', see here
Works great with speed improvement! Thanks!

Late answer, but you can also use:

$text =  file_get_contents("testchat.txt");

preg_match_all('/(\[.*?\])([^\[]+)/im', $text, $matches, PREG_PATTERN_ORDER);
for ($i = 0; $i < count($matches[0]); $i++) {
    $date = $matches[1][$i];
    $line = $matches[2][$i];
    print("$date $line");
}

1 Comment

There is a potential problem with this solution, namely, if the text after a timestamp can contain [ it will stop too early or will split too much, see demo.
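A small sketch of that failure mode, using a made-up two-message sample in which the first message contains a bracketed word. The variable names are illustrative only; the second pattern is the preg_split one from the other answer, shown for comparison.

```php
<?php
// Hypothetical sample: two messages, the first one contains "[note]".
$text = "[26-06-2013 14:44:44] see [note] here\n[26-06-2013 14:44:45] B";

// Loose pattern: any [...] starts a new entry, so "[note]" is treated as a timestamp.
preg_match_all('/(\[.*?\])([^\[]+)/', $text, $matches);
$looseCount = count($matches[0]);

// Strict pattern: only split before a full dd-mm-yyyy hh:mm:ss timestamp.
$parts = preg_split('~\R+(?=\[\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2}])~', $text);
$strictCount = count($parts);

echo "loose: $looseCount, strict: $strictCount\n"; // loose: 3, strict: 2
```

The loose pattern reports three entries for two messages, while the timestamp-specific lookahead keeps the bracketed word inside its message.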