simplify a regex to reduce recursion

Question

I currently have a regex like this:

/^From: ((?!\n\n).)*\nSubject:.+/msu

with the point of matching a block that looks like this:

From: John Smith
Cc: Jane Smith
Subject: cat videos

(i.e. where they're in a contiguous block) but not if there is a blank line breaking up the block, like this:

From: John Smith

Subject: cat videos

However, I've been finding that my PHP script that uses this sometimes segfaults. I was able to mitigate the segfaults by setting pcre.recursion_limit to a lower number (I used 8000), but it occurs to me that what I'm trying to do should be doable without a great deal of recursion. Am I using a horribly inefficient method of catching the \n\n?
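For reference, a minimal sketch of how the limit can be set at runtime and how a hit can be told apart from an ordinary non-match. The helper name matchHeaderBlock() is made up for illustration; 8000 is the value mentioned above. Whether the limit is actually reached depends on the PCRE version and on whether JIT is enabled, so treat this as an assumption-laden example rather than a fix.

```php
<?php
// matchHeaderBlock() is a hypothetical helper, not part of the original script.
function matchHeaderBlock(string $text): string
{
    // 8000 is the value from the question; the right number depends on stack size.
    ini_set('pcre.recursion_limit', '8000');

    $result = preg_match('/^From: ((?!\n\n).)*\nSubject:.+/msu', $text);

    if ($result === false) {
        // preg_match() returns false on error; preg_last_error() says which one.
        return preg_last_error() === PREG_RECURSION_LIMIT_ERROR
            ? 'recursion limit hit'
            : 'other PCRE error';
    }

    return $result === 1 ? 'match' : 'no match';
}
```

Checking for false matters because a pattern that hits the limit otherwise looks the same as one that simply did not match.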

Yeah, that expression can do a ton of backtracking, and the bigger the input, the more backtracking. Have you tried splitting the string? Maybe splitting on [^\n]\nSubject or something similar.
– acdcjunior, Commented Aug 1, 2013 at 3:50

Andy Ross · Accepted Answer · 2013-08-01 20:06:35Z

2

This is just a terrible use for a single regex. In addition to the performance problems you're having, it's going to fail on straightforward cases such as messages where the "Subject:" line appears before "From:". If you want to parse an RFC 822 email header, then you really should be parsing it.

Find the empty line terminator of the header. Join lines beginning with whitespace to the previous line (i.e. replace newline-followed-by-whitespace with a space). Split each line at the first colon and snip leading and trailing whitespace from each side.
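The steps above can be sketched roughly as follows. This is a minimal illustration, not a complete RFC 822 parser: the function name parseHeaders() is hypothetical, header names are lowercased for lookup, and a repeated header simply overwrites the earlier value.

```php
<?php
// parseHeaders() is a hypothetical name used only for this sketch.
function parseHeaders(string $message): array
{
    // Normalize CRLF to LF so the blank-line search stays simple.
    $message = str_replace("\r\n", "\n", $message);

    // 1. The header ends at the first empty line.
    $end = strpos($message, "\n\n");
    $headerText = $end === false ? $message : substr($message, 0, $end);

    // 2. Unfold continuation lines: newline followed by whitespace becomes a space.
    $headerText = preg_replace('/\n[ \t]+/', ' ', $headerText);

    // 3. Split each line at the first colon and trim both sides.
    $headers = [];
    foreach (explode("\n", $headerText) as $line) {
        if (strpos($line, ':') === false) {
            continue; // not a "Name: value" line; skipped in this sketch
        }
        [$name, $value] = explode(':', $line, 2);
        $headers[strtolower(trim($name))] = trim($value);
    }

    return $headers;
}
```

With the headers in an array, checking that both "from" and "subject" are present no longer depends on their order or on what sits between them.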

Or find an appropriate library to do that for you.

edited Aug 1, 2013 at 20:06

answered Aug 1, 2013 at 4:39

Andy Ross


anubhava · Accepted Answer · 2013-08-01 05:19:40Z

1

You should not rely on a regex to parse mail messages. It is better to use a PHP Mime Mail Parser for this task. With Mime Mail Parser, the code is as simple as:

require_once('MimeMailParser.class.php');

$path = 'path/to/mail.txt';

$Parser = new MimeMailParser();
$Parser->setPath($path);

$to       = $Parser->getHeader('to');
$from     = $Parser->getHeader('from');
$subject  = $Parser->getHeader('subject');
$textBody = $Parser->getMessageBody('text');
$htmlBody = $Parser->getMessageBody('html');

answered Aug 1, 2013 at 5:19

anubhava


Bohemian · Accepted Answer · 2013-08-01 04:26:24Z

0

I would use simply "not a newline":

/^From:[^\n]*\nSubject:.+/msu
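A small sketch of how this pattern behaves on the inputs from the question, next to a line-wise variant. The variant is an assumption of mine, not something from the question: it allows any number of non-empty lines between From: and Subject: and repeats once per line rather than once per character, which should need far less recursion on long inputs. The helper name matchesBlock() and the sample strings are made up for illustration.

```php
<?php
// matchesBlock() is a hypothetical helper for comparing patterns.
function matchesBlock(string $pattern, string $text): bool
{
    return preg_match($pattern, $text) === 1;
}

// Sample inputs modelled on the question's examples.
$adjacent   = "From: John Smith\nSubject: cat videos\n";
$contiguous = "From: John Smith\nCc: Jane Smith\nSubject: cat videos\n";
$broken     = "From: John Smith\n\nSubject: cat videos\n";

// The pattern from this answer: Subject: must be on the very next line.
$adjacentOnly = '/^From:[^\n]*\nSubject:.+/msu';

// Assumed variant: zero or more whole non-empty lines in between.
// [^\n]+ cannot match an empty line, so a blank line breaks the block.
$lineWise = '/^From:[^\n]*\n(?:[^\n]+\n)*?Subject:[^\n]+/mu';

var_dump(matchesBlock($adjacentOnly, $adjacent));   // bool(true)
var_dump(matchesBlock($adjacentOnly, $contiguous)); // bool(false)
var_dump(matchesBlock($lineWise, $contiguous));     // bool(true)
var_dump(matchesBlock($lineWise, $broken));         // bool(false)
```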

answered Aug 1, 2013 at 4:26

Bohemian♦


1 Comment

dlo Over a year ago

In the end I didn't change my regex, as the recursion_limit is preventing any more crashes. But if it hadn't, I probably would have used something similar to this simplification (though this appears to match only where From and Subject are on consecutive lines). Ultimately, I probably would have used /^From:.*\nSubject:.+/msu, which could match some things I wouldn't want, but pretty rarely. (The other answers assumed I was parsing actual headers, which I'm not.)
