I'm getting a list of inputs from the user of supposely valid perl regexp values. Examples could be:
\b[Bb]anana\b\s*Apples[BANANA]\s+
Is there a safe way to validate these strings?
First, consider how much you want to let users do with a pattern. A Perl regex can run arbitrary code.
But, to validate that you can use a string as a pattern without it causing a fatal error, you can use the qr// operator to compile the string and return the regex. If there's a problem, the qr gives you a fatal error that you can catch with eval:
my $pattern = eval { qr/$input/ };
If you get back undef, the pattern was not valid. And, despite the comments in the question, there are infinite ways to make invalid patterns. I know because I type them in by hand all the time and I haven't run out of ways to mess up :)
This does not apply the pattern to a string, but you can use $pattern to make the match:
if( $pattern ) {
$target =~ $pattern; # or $target =~ m/$pattern/
}
use re 'eval', so you're relatively safe... just have to worry about the DoS potential.perl -le '/(?{ print "Hello World" })/' outputs "Hello World". There wasn't anything else I needed to do to allow that.perl -le 'my $input = q{(?{print "Hello World"})}; $pat = qr/$input/; $_ =~ $pat;'(?{...}) or (??{...}) group that appears in a regex literal vs. one that was introduced from a variable.If you need to be completely paranoid about what's being given to you, you can use the Safe module to retrict the opcodes that are available to the eval() context.
You would add or subtract from permit_only() to suit your needs.
sub safestringeval ($) {
require Safe;
my $safe = Safe->new;
$safe->permit_only(qw/:base_core anonhash anonlist gvsv gv gelem padsv padav padhv padany/);
return $safe->reval($_[0], 1);
}
$regex = safestringeval('qr{'.$input.'}');
I couldn't actually remember the use case for this so I looked it up. :) It was to allow inputed strings to contain live escape sequences.
Well, validating a regular expression requires knowledge of the kind of inputs you are expecting. There's a direct relationship between the regexp operators and the sets of strings that are accepted by the automaton.
The problem here is that normally the set of strings is not well known or is badly specified, for example:
The essential set of operators in regex are the basic language character set (which provides the symbols to operate on) and the operators that make things complex: this are the alternative | (select one alternative or the other), concatenation (there's no symbol for this, as two regexps are just put together to mean the set of strings comming from one set, and followed by a string, this time coming from the second set) and closure, symbolized by a * (this last meaning allowing any repetition ---including none--- of strings comming from the previous set).
Absolutely all regular expressions can be handled as a (mostly, more complicated) expression that employs only these three operators and no more. For example, the + operator can be handled by repeating the regexp it is applied to, and adding the * to the second instance (surrounding by parenthesis to group it all) The ? optional suffix can be handled by the follwing rule (regexp)? == (regexp|) (the alternative of using it or not)
| means an alternative... you provide two sets of strings and the result set is the union of boths sets. This means that a string will be accepted if it belongs to one or any of the sets.This set of rules will give you the complete set of strings that make your regular expression. This can coincide (or not) to which you have in mind... but if what you have in mind is poorly defined, so it will be your regular expression.
So, as a conclussion, you are asking for a general procedure to test your own mind and how do you design your regular expressions. There's a theorem (called the Pumping theorem) that is used in the demonstration of the equivalence of regular expressions and finite state automata. This is a very important achievement, because it allows you to use regular expressions for efficient single pass, string recognition. If you dig on this, you will find that is possible to write a tool that, from a regular expression can build systematically the full set of strings that will be accepted by some regexp. This has a problem although, is that many of those regexp create infinite sets of strings, and this means the algorithm is not going to finish in a finite time.
As a final comment I can tell you that this makes regular expressions a very powerfull tool to select a string. You can detect with regular expressions complex things like being a string of digits that make a number that is multiple of 23 in it's decimal form, for example, or to validate the number of a credit card for transcription errors.
eval: perlmonks.org/bare/?node_id=146701eval { qr/$pat/ }