3

I am struggling with a particular problem. I am parsing a long text file, line by line, for a business app in PHP (a cron script run on the CLI). The file follows this format:

    HD: Some text here {text here too}

    DC: A description here
    DC: the description continues here
    DC: and it ends here.

    DT: 2012-08-01

    HD: Next header here {supplemental text}

    ... this repeats over and over for a few hundred megs

I have to read each line, pick out the HD: lines, and grab the text on each one. I then compare this text against data stored in a database. When a match is found, I want to record the DC: lines that follow the matched HD: line.

Pseudo code:

    while ( the_file_pointer_isnt_end_of_file) {
        line = getCurrentLineFromFile
        title = parseTitleFrom(line)
        matched = searchForMatchInDB(title)
        if ( matched ) {
            recordTheDCLines  // <- Best way to do this?
        }
    }

Because I am reading line by line, what is the best way to signal the script to start saving DC lines and then, once they end, save them to the database?

I have a vague idea, but have yet to properly implement it. I would love to hear the community's ideas and suggestions!

Thank you.

4
  • With loading and saving line by line you will have massive overhead. I would read/write in chunks. Commented Dec 8, 2012 at 20:49
  • I see your point. This is how I was told to do it, and so I have to implement it this way; however, thanks for the suggestion. I may work it in for optimization! Commented Dec 8, 2012 at 22:59
  • Use sed/awk to read and parse, then call your PHP script to check the database and update if needed. Commented Dec 9, 2012 at 1:44
  • What database are you using? Commented Dec 9, 2012 at 9:36

4 Answers

5

Separate the problem: one script plows through the file and stuffs the interesting parts into some sort of data store. A second script pulls from the data store and processes the records. I suspect this will be much faster than doing everything in one script, if only because the second script can run alongside the first and so effectively parallelizes the app.
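
A minimal sketch of that split, under some assumptions that are not in the question: the "data store" here is just a plain spool file, and the function names `spoolInterestingLines` and `processSpool` are hypothetical. In practice each function would live in its own script, and the callback passed to `processSpool` would do the database work.

```php
<?php
// Stage 1 (first script): copy only the HD:/DC: lines into a spool file.
// The spool file is an assumed hand-off format; a queue or table would also work.
function spoolInterestingLines(string $inputPath, string $spoolPath): int
{
    $in = fopen($inputPath, 'r');
    $out = fopen($spoolPath, 'w');
    if ($in === false || $out === false) {
        throw new RuntimeException('Cannot open input or spool file');
    }
    $kept = 0;
    while (($line = fgets($in)) !== false) {
        $line = trim($line);
        $tag = substr($line, 0, 3);
        if ($tag === 'HD:' || $tag === 'DC:') {
            fwrite($out, $line . "\n");
            $kept++;
        }
    }
    fclose($in);
    fclose($out);
    return $kept; // number of lines written to the spool
}

// Stage 2 (second script): read the spool and hand each line to a callback.
function processSpool(string $spoolPath, callable $handleLine): void
{
    $in = fopen($spoolPath, 'r');
    if ($in === false) {
        throw new RuntimeException('Cannot open spool file');
    }
    while (($line = fgets($in)) !== false) {
        $handleLine(rtrim($line, "\r\n"));
    }
    fclose($in);
}
```

The second stage only ever sees the lines it cares about, so the slow database lookups are kept apart from the raw file scan.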

0
2

Write two functions, or a class LineReader, with the following functions:

  • string GetNextLine() : reads next line from file
  • string PeekLine() : returns the next line from the file without moving the file pointer

(you can implement this easily by a line buffer consisting of a string variable holding one line in advance; GetNextLine has to make use of that buffer as well as PeekLine).

Then, the implementation of recordDCLines should be something like

 while(substr(PeekLine(),0,3)=="DC:")
 {
    $line=GetNextLine();
    // process line, append it to a buffer
 }
 // here, store the found DC block

EDIT: some pseudo code, I am not experienced in PHP, but I hope you get the general idea:

 void OpenFile()
 {
     // do stuff here to open file
     // ...
     $nextline = getNextLineFromFile();
     $endoffile = false;
 }

 string GetNextLine()
 {
      if(isset($nextline))
      {
         $result=$nextline;
         if(!noMoreLinesAvailable())
             $nextline = getNextLineFromFile();
         else
             unset($nextline);
      } 
      else 
      {
         $endoffile=true;
         $result ="";
      }
      return $result;
 }

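 string PeekLine()
 {
     // $nextline is unset once the file is exhausted, so return "" in that case
     return isset($nextline) ? $nextline : "";
 }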
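
For readers who want something closer to runnable PHP, here is one possible version of the same one-line lookahead idea. It is a sketch, not the answer's exact code: the method names are lower-camel-cased, `readDcBlock` is a hypothetical helper, and it assumes blank lines may sit between the HD: line and its DC: lines, as in the sample in the question.

```php
<?php
class LineReader
{
    /** @var resource open file handle */
    private $handle;
    /** @var string|null the line held one step ahead; null once the file is exhausted */
    private $nextLine;

    public function __construct(string $path)
    {
        $handle = fopen($path, 'r');
        if ($handle === false) {
            throw new RuntimeException("Cannot open $path");
        }
        $this->handle = $handle;
        $this->nextLine = $this->readLine(); // fill the lookahead buffer
    }

    private function readLine(): ?string
    {
        $line = fgets($this->handle);
        return $line === false ? null : rtrim($line, "\r\n");
    }

    // Returns the upcoming line without consuming it.
    public function peekLine(): ?string
    {
        return $this->nextLine;
    }

    // Returns the upcoming line and refills the buffer from the file.
    public function getNextLine(): ?string
    {
        $current = $this->nextLine;
        if ($current !== null) {
            $this->nextLine = $this->readLine();
        }
        return $current;
    }
}

// Hypothetical helper: skips blank lines, then collects consecutive DC: lines.
function readDcBlock(LineReader $reader): array
{
    while ($reader->peekLine() === '') {
        $reader->getNextLine();
    }
    $block = [];
    while (($peeked = $reader->peekLine()) !== null && strncmp($peeked, 'DC:', 3) === 0) {
        $block[] = trim(substr($reader->getNextLine(), 3));
    }
    return $block;
}
```

After reading an HD: line with `getNextLine()` and finding a match, calling `readDcBlock($reader)` returns the description lines and leaves the reader positioned on the first line that is not a DC: line.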
2
  • Thank you very much for your answer. This is a wonderful suggestion, and I will investigate it more. Could you explain this a bit more: (you can implement this easily by a line buffer consisting of a string variable holding one line in advance; GetNextLine has to make use of that buffer as well as PeekLine)? Commented Dec 8, 2012 at 22:57
  • @Jarrod; see my edit Commented Dec 9, 2012 at 14:01
1

Implement a basic state machine. As you read lines, note the last 'command' (DC, DT, etc.). When you get an 'HD', do your lookups. When you are in a DC state, you know to accumulate the message until the next item isn't a DC entry, at which point you do a write.
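
A rough sketch of such a state machine in PHP follows. The function name and the two callbacks are assumptions made for illustration: `$isMatch` stands in for the database lookup and `$save` for the database write. It also assumes that blank lines carry no meaning and that the whole text after `HD:` is what gets looked up.

```php
<?php
function collectMatchedDescriptions(iterable $lines, callable $isMatch, callable $save): void
{
    $title = null; // text of the currently matched HD: line, or null if the last HD: did not match
    $dc = [];      // DC: text accumulated for that title

    foreach ($lines as $rawLine) {
        $line = trim($rawLine);
        if ($line === '') {
            continue; // blank lines do not change the state
        }
        $tag = substr($line, 0, 3);
        $text = trim(substr($line, 3));

        if ($tag === 'DC:') {
            // DC state: accumulate only while a matched header is active.
            if ($title !== null) {
                $dc[] = $text;
            }
            continue;
        }

        // Any other tag ends the DC state, so write out what was collected.
        if ($dc !== []) {
            $save($title, implode(' ', $dc));
            $dc = [];
        }
        if ($tag === 'HD:') {
            $title = $isMatch($text) ? $text : null;
        }
    }

    // The file may end while still in the DC state.
    if ($dc !== []) {
        $save($title, implode(' ', $dc));
    }
}
```

Because `$lines` is any iterable, an array works for testing, and an `SplFileObject` opened on the real file should work for line-by-line reading without loading the whole file into memory.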

0
0

You could consider writing a PHP extension in C or C++ for that purpose; you could then use low-level but efficient syscalls (e.g. mmap(2), read(2) into a large buffer, readahead(2), etc.).

You could also delegate to a helper program written in C.

1
  • Thanks for the suggestion. Unfortunately this is delegated work and I have to implement it as I was told! Commented Dec 8, 2012 at 22:58
