A legacy PHP system reads a huge log file (~5 GB) directly into a variable in memory and does some processing on it.

EDIT: I know that reading 5 GB into memory is strongly discouraged, and I appreciate the other suggestions, but please trust that this part has to stay the same due to a legacy design that we cannot change.

Now I need to pass the data to another service that accepts at most 1000 lines per call.

I have tried the following two approaches, and both work.

1- Split the whole string at the newline character into an array, use array_chunk to split that array into sub-arrays, then implode each sub-array back into a string:

$logFileStr; // a variable that already contains 5gb file as string
$logLines = explode(PHP_EOL, $logFileStr);
$lineGroups = array_chunk($logLines, 1000);
foreach($lineGroups as $lineGroup)
{
    $linesChunk = implode(PHP_EOL, $lineGroup);

    $archiveService->store($linesChunk);
}

Pros: it is fast, as everything happens in memory

Cons: a lot of redundant work is involved, and it needs a lot of extra memory

2- First write the contents of the string variable to a local temp file, then use the exec function to split the file:

split -l 1000 localfile 

This produces a large number of files of 1000 lines each. I can then simply read the files one by one and process each file as a single string.

Pros: it is simpler and easier to maintain

Cons: disk I/O gets involved, which is slow, and there is a lot of write/read overhead

My question is: as I already have a variable with the whole string in memory, how can I read chunks of 1000 lines each from that variable in an iterable way, so that I avoid the overhead of writing to disk or of generating a new array and re-merging it?

2 Answers

One way to solve this problem would be using the following steps:

  1. Parse the string as a character array in a loop.
  2. Count the number of newline characters.
  3. For every 1000th newline, extract the substring that starts from where the previous substring ended and ends at the current newline.

I wrote some sample PHP code that follows the above steps:

<?php
$str = "line1\nline2\nline3\nline4\nline5\n"; // Sample string
$max_new_lines = 2; // Max number of lines. Replace this with 1000
$str_length = strlen($str);
$new_line_count = 0;
$str_chunk = "";
$start = 0;

// Loop through every character of the string
for ($i = 0; $i < $str_length; ++$i) {
  if ($str[$i] == "\n") {
    ++$new_line_count;

    // If we reached the max number of newlines, extract the substring
    if (($new_line_count % $max_new_lines) == 0) {
      $str_chunk = substr($str, $start, $i - $start);
      $start = $i + 1;
      // echo "\n\nchunk:\n" . $str_chunk;
    }
  }
}

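// Extract the remaining lines, if any. The guard avoids producing an
// empty chunk when the string ends exactly on a chunk boundary.
if ($start < $str_length) {
  $str_chunk = substr($str, $start, $i - $start);
  // echo "\n\nchunk:\n" . $str_chunk;
}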
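
The same steps can also be written as a generator, so the caller gets one chunk at a time without an intermediate array of lines. The following is only a sketch under a few assumptions: lines are separated by "\n", the function name iterateLineChunks is hypothetical, and (unlike the loop above) each chunk keeps its trailing newline, so the chunks concatenate back to the original string. It uses strpos with an offset to jump from newline to newline instead of inspecting every character.

```php
<?php
declare(strict_types=1);

/**
 * Yield consecutive chunks of at most $linesPerChunk lines from $text.
 * Hypothetical helper for illustration; assumes "\n" line endings.
 */
function iterateLineChunks(string $text, int $linesPerChunk): Generator
{
    if ($linesPerChunk < 1) {
        throw new InvalidArgumentException('linesPerChunk must be at least 1');
    }

    $length = strlen($text);
    $start = 0; // byte offset where the current chunk begins

    while ($start < $length) {
        $end = $start;
        for ($n = 0; $n < $linesPerChunk; ++$n) {
            $newline = strpos($text, "\n", $end);
            if ($newline === false) {
                $end = $length; // last, partial chunk: take the rest
                break;
            }
            $end = $newline + 1; // keep the newline inside the chunk
        }

        // Only this one chunk is copied out of the large string.
        yield substr($text, $start, $end - $start);
        $start = $end;
    }
}

// Small demonstration with 2 lines per chunk instead of 1000.
$sample = "line1\nline2\nline3\nline4\nline5";
foreach (iterateLineChunks($sample, 2) as $index => $chunk) {
    echo "chunk $index: " . json_encode($chunk) . "\n";
}
```

In the setting of the question, the loop body would presumably be the call to $archiveService->store($chunk) with 1000 as the chunk size. If the service must not receive the trailing newline, it can be stripped from each chunk before the call.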

After some more research I stumbled upon the question "php explode every third instance of character", and after some modification to the answer posted there (https://stackoverflow.com/a/1275110/7260022) I came up with this snippet, which for the time being works better than my previous approaches.

$logFileStr; // a variable that already contains 5gb file as string

$chunks = preg_split('/((?:[^\n]*\n){1000})/', $logFileStr, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

print_r($chunks);

On a test string the result looks like this (split every 3 lines instead of 1000):

Array
(
 [0] => 13923
        27846
        311769

 [1] => 831384
        935307
        1039230

 [2] => 1558845
        1662768
        1766691

 [3] => 1870614

)

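The regex works as follows:

(?: ... ) groups a sub-pattern without creating a capture group

[^\n] matches any character that is not a newline

the * quantifier matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)

the {1000} quantifier matches exactly 1000 times

the outer ( ... ) captures the whole 1000-line block. Because that block is the split delimiter, the flag PREG_SPLIT_DELIM_CAPTURE returns it in the result set, and PREG_SPLIT_NO_EMPTY drops the empty strings left between consecutive blocks. Any leftover of fewer than 1000 lines at the end is returned as an ordinary piece.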
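
To make the role of the two flags concrete, here is a small check with a chunk size of 2 instead of 1000. It is only an illustration on a made-up sample string; the variable names are not from the original snippet.

```php
<?php
$sample = "a\nb\nc\nd\ne\n";
$pattern = '/((?:[^\n]*\n){2})/'; // same regex as above, 2 lines per block

// With DELIM_CAPTURE the matched blocks (the delimiters) are returned,
// and NO_EMPTY removes the empty strings between them.
$chunks = preg_split($pattern, $sample, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

// Without DELIM_CAPTURE the blocks are consumed as delimiters and lost;
// only the leftover that could not fill a whole block remains.
$withoutCapture = preg_split($pattern, $sample, -1, PREG_SPLIT_NO_EMPTY);

var_export($chunks);         // three elements: "a\nb\n", "c\nd\n", "e\n"
echo "\n";
var_export($withoutCapture); // one element: "e\n"
echo "\n";
```

Note that preg_split builds the complete array of chunks at once, so the data is held in memory a second time, and it returns false on failure, which is worth checking before iterating over the result on very large input.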
