1

I currently have a regex like this:

/^From: ((?!\n\n).)*\nSubject:.+/msu

The goal is to match a block that looks like this:

From: John Smith
Cc: Jane Smith
Subject: cat videos

(i.e., where the lines form a contiguous block) but not if a blank line breaks up the block, like this:

From: John Smith

Subject: cat videos

However, my PHP script that uses this regex sometimes segfaults. I was able to mitigate the segfaults by setting pcre.recursion_limit to a lower number (I used 8000), but it occurs to me that what I'm trying to do should be doable without a great deal of recursion. Am I using a horribly inefficient method of catching the \n\n?

3
  • Check this out: stackoverflow.com/questions/1722453/… Commented Aug 1, 2013 at 3:48
  • Yeah, that expression can do a ton of backtracking, and the bigger the input, the more backtracking. Have you tried splitting the string? Maybe split on [^\n]\nSubject or something similar. Commented Aug 1, 2013 at 3:50
  • Your regex is clearly not the problem. Commented Aug 1, 2013 at 3:50
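
The splitting idea from the comment above could look roughly like the sketch below. It splits on blank lines rather than on the commenter's pattern, it assumes \n line endings, and findContiguousBlocks is a hypothetical name, not an existing function.

```php
<?php
// Hypothetical helper: split the input into blank-line-delimited blocks and
// keep those where a From: line is followed by a Subject: line in the same block.
function findContiguousBlocks(string $text): array
{
    // Two or more consecutive newlines mark a blank line between blocks.
    $blocks = preg_split('/\n{2,}/', $text, -1, PREG_SPLIT_NO_EMPTY);

    $hits = [];
    foreach ($blocks as $block) {
        // Each block is small and contains no blank line, so a lazy
        // dot-all match has very little to scan.
        if (preg_match('/^From:.*?\nSubject:/ms', $block) === 1) {
            $hits[] = $block;
        }
    }
    return $hits;
}

$text = "From: John Smith\nCc: Jane Smith\nSubject: cat videos\n\n"
      . "From: John Smith\n\nSubject: cat videos\n";
$hits = findContiguousBlocks($text);
```

Because every block is matched on its own, the amount of work per preg_match call is bounded by the block size instead of the size of the whole input.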

3 Answers

2

This is just a terrible use for a single regex. In addition to the performance problems you're having, it's going to fail on straightforward cases such as messages where the "Subject:" line appears before "From:". If you want to parse an RFC 822 email header, then you really should be parsing it.

Find the empty line terminator of the header. Join lines beginning with whitespace to the previous line (i.e. replace newline-followed-by-whitespace with a space). Split each line at the first colon and snip leading and trailing whitespace from each side.

Or find an appropriate library to do that for you.
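
The manual steps described above can be sketched in PHP as follows. This is a minimal illustration, not a full RFC 822 parser: parseHeaders is a hypothetical name, header names are lower-cased for lookup, and a repeated header simply overwrites the earlier value.

```php
<?php
// Hypothetical helper, not part of any library: parse a header block into
// an array keyed by lower-cased header name.
function parseHeaders(string $message): array
{
    // Normalise CRLF so the blank-line search only has to handle "\n".
    $message = str_replace("\r\n", "\n", $message);

    // The header ends at the first empty line (or at end of input if none).
    $end = strpos($message, "\n\n");
    $headerBlock = $end === false ? $message : substr($message, 0, $end);

    // Unfold continuation lines: newline followed by whitespace becomes a space.
    $headerBlock = preg_replace('/\n[ \t]+/', ' ', $headerBlock);

    $headers = [];
    foreach (explode("\n", $headerBlock) as $line) {
        // Split at the first colon only; skip lines that have none.
        $parts = explode(':', $line, 2);
        if (count($parts) !== 2) {
            continue;
        }
        $headers[strtolower(trim($parts[0]))] = trim($parts[1]);
    }
    return $headers;
}

$headers = parseHeaders("Subject: cat\n videos\nFrom: John Smith\n\nBody text");
```

With this approach the order of the header lines no longer matters, since each one is looked up by name after parsing.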


1

You can't reliably parse a mail message with a regex. A better option for this task is PHP Mime Mail Parser. With it, the code is as simple as:

// Load the parser class.
require_once('MimeMailParser.class.php');

// Path to the raw message file.
$path = 'path/to/mail.txt';

$Parser = new MimeMailParser();
$Parser->setPath($path);

// Read individual headers by name.
$to       = $Parser->getHeader('to');
$from     = $Parser->getHeader('from');
$subject  = $Parser->getHeader('subject');

// Extract the plain-text and HTML bodies.
$textBody = $Parser->getMessageBody('text');
$htmlBody = $Parser->getMessageBody('html');


0

I would simply use "not a newline":

/^From:[^\n]*\nSubject:.+/msu
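
If other header-like lines (such as Cc:) may sit between From: and Subject:, one variant of the same idea is to allow any number of non-empty lines in between. This is only a sketch: it assumes \n line endings, it stops at the end of the Subject: line instead of consuming the rest of the input, and $pattern, $contiguous and $broken are illustrative variable names.

```php
<?php
// Sketch: a From: line, then zero or more non-empty lines, then a Subject: line.
// [^\n]+ cannot match an empty line, so a blank line breaks the block.
$pattern = '/^From:[^\n]*\n(?:[^\n]+\n)*?Subject:[^\n]*/mu';

$contiguous = "From: John Smith\nCc: Jane Smith\nSubject: cat videos\n";
$broken     = "From: John Smith\n\nSubject: cat videos\n";

var_dump(preg_match($pattern, $contiguous)); // int(1)
var_dump(preg_match($pattern, $broken));     // int(0)
```

The inner group advances a whole line per iteration rather than a single character, so the number of group repetitions is bounded by the number of lines in the block.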

1 Comment

In the end I didn't change my regex, as the recursion_limit is preventing any more crashes. If it hadn't, I probably would have used something similar to this simplification (though this appears to match only where From and Subject are on consecutive lines). Ultimately, I probably would have used /^From:.*\nSubject:.+/msu, which could match some things I wouldn't want, but pretty rarely. (The other answers assumed I was parsing actual headers, which I'm not.)
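
For anyone relying on the same mitigation, here is a rough sketch of lowering pcre.recursion_limit and checking whether a match was abandoned because of it. matchWithLimit is a hypothetical wrapper, and whether the limit is actually reached depends on the PHP and PCRE versions and on whether the PCRE JIT is enabled.

```php
<?php
// Hypothetical wrapper: run a match under a lowered recursion limit and
// report whether PCRE gave up instead of returning a real result.
function matchWithLimit(string $pattern, string $input, int $limit = 8000): string
{
    // Note: this setting stays in effect for the rest of the request.
    ini_set('pcre.recursion_limit', (string) $limit);

    $result = preg_match($pattern, $input);
    if ($result === false) {
        // preg_match returns false on error; preg_last_error() says why.
        return preg_last_error() === PREG_RECURSION_LIMIT_ERROR
            ? 'recursion limit hit'
            : 'other PCRE error';
    }
    return $result === 1 ? 'match' : 'no match';
}

$status = matchWithLimit(
    '/^From: ((?!\n\n).)*\nSubject:.+/msu',
    "From: John Smith\nCc: Jane Smith\nSubject: cat videos\n"
);
```

Checking for false matters here because a match that was cut short by the limit is otherwise easy to mistake for "no match".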
