
I'm getting a list of inputs from the user that are supposedly valid Perl regexp values. Examples could be:

  • \b[Bb]anana\b
  • \s*Apples[BANANA]\s+

Is there a safe way to validate these strings?

  • Any character string can be interpreted as regex. So what kind of validation do you have in mind? Commented Nov 8, 2021 at 22:14
  • @PM77-1 There are some invalid regular expressions, for instance mismatched brackets, quantifiers with nothing before them, etc. Commented Nov 8, 2021 at 22:17
  • @Barmar yeah, but admittedly hardly anything. Most everything works out to a pattern that means something. Commented Nov 8, 2021 at 22:18
  • See how you can do it using eval: perlmonks.org/bare/?node_id=146701 Commented Nov 8, 2021 at 22:20
  • No idea why this was closed. Voted to reopen. The answer is: eval { qr/$pat/ } Commented Nov 8, 2021 at 23:05

3 Answers


First, consider how much you want to let users do with a pattern. A Perl regex can run arbitrary code.

But, to validate that you can use a string as a pattern without it causing a fatal error, you can use the qr// operator to compile the string and return the regex. If there's a problem, the qr gives you a fatal error that you can catch with eval:

my $pattern = eval { qr/$input/ };

If you get back undef, the pattern was not valid. And, despite the comments in the question, there are infinite ways to make invalid patterns. I know because I type them in by hand all the time and I haven't run out of ways to mess up :)

This does not apply the pattern to a string, but you can use $pattern to make the match:

if ( $pattern ) {
    $target =~ $pattern;  # or $target =~ m/$pattern/
}
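
Putting those pieces together, here is a minimal, self-contained sketch (the sample inputs and the $@ reporting are my additions, not part of the answer):

use strict;
use warnings;

my @inputs = ( '\b[Bb]anana\b', '\s*Apples[BANANA]\s+', '*oops(' );

for my $input (@inputs) {
    # qr// compiles the string; a bad pattern dies, so catch it with eval
    my $pattern = eval { qr/$input/ };
    if ( ! defined $pattern ) {
        # $@ holds the compilation error, e.g. "Quantifier follows nothing in regex"
        warn "Invalid pattern '$input': $@";
        next;
    }
    print "'$input' compiled fine\n";
    print "  ...and it matches the sample text\n" if 'a Banana split' =~ $pattern;
}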

5 Comments

Sure, the number of "invalid" patterns is infinite, but it still has density zero in the set of all patterns. As for the safety aspects, Perl won't allow the arbitrary-code stuff to be interpolated into a pattern from a variable unless you ask for that with use re 'eval', so you're relatively safe... just have to worry about the DoS potential.
For zero density, we sure run into a lot of invalid patterns in practice. No one really cares about the set of all patterns, but even then, for every valid pattern you show me, I can show you more than one unique invalid pattern. And, when it comes to user input, most of us know that given the chance, they will find the rarest of ways to do it wrong.
As for arbitrary code, perl -le '/(?{ print "Hello World" })/' outputs "Hello World". There wasn't anything else I needed to do to allow that.
Yes, but try perl -le 'my $input = q{(?{print "Hello World"})}; $pat = qr/$input/; $_ =~ $pat;'
It distinguishes between a (?{...}) or (??{...}) group that appears in a regex literal vs. one that was introduced from a variable.
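
A quick sketch of the behaviour the last two comments describe (my own demonstration, assuming an untainted $input):

use strict;
use warnings;

my $input = '(?{ print "Hello World" })';

# Interpolated from a variable, the (?{...}) code group is normally rejected...
my $pat = eval { qr/$input/ };
print "rejected: $@" unless defined $pat;    # "Eval-group not allowed at runtime..."

# ...unless the surrounding code opts in explicitly:
{
    use re 'eval';
    my $pat2 = eval { qr/$input/ };
    print "compiled under use re 'eval'\n" if defined $pat2;
}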

If you need to be completely paranoid about what's being given to you, you can use the Safe module to restrict the opcodes that are available to the eval() context.

You would add or subtract from permit_only() to suit your needs.


sub safestringeval ($) {
    require Safe;
    my $safe = Safe->new;
    # Allow only this minimal set of opcodes inside the compartment
    $safe->permit_only(qw/:base_core anonhash anonlist gvsv gv gelem padsv padav padhv padany/);
    # Evaluate the string inside the compartment; the second argument turns on strict
    return $safe->reval($_[0], 1);
}

$regex = safestringeval('qr{'.$input.'}');


I couldn't actually remember the use case for this so I looked it up. :) It was to allow input strings to contain live escape sequences.
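
For completeness, this is how I'd imagine calling it and checking the result (my own sketch; safestringeval() is the sub above, the rest is illustrative):

my $input = '\b[Bb]anana\b';
my $regex = safestringeval( 'qr{' . $input . '}' );

if ( ! defined $regex ) {
    # reval() returns undef and puts the error in $@ when the compartment rejects the code
    warn "Rejected pattern: $@";
}
elsif ( 'a Banana split' =~ $regex ) {
    print "matched\n";
}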



Well, validating a regular expression requires knowledge of the kind of inputs you are expecting. There's a direct relationship between the regexp operators and the sets of strings that are accepted by the automaton.

The problem here is that, normally, the set of strings is not well known or is badly specified. To see why, consider how a regular expression is built:

The essential building blocks of a regex are the basic language character set (which provides the symbols to operate on) and the operators that make things complex. These are: alternation | (select one alternative or the other); concatenation (there's no symbol for this, as two regexps are simply put together, meaning a string from the first set followed by a string from the second set); and closure, symbolized by * (meaning any repetition, including none, of strings coming from the previous set).

Absolutely all regular expressions can be rewritten as a (usually more complicated) expression that uses only these three operators and no more. For example, the + operator can be handled by repeating the regexp it is applied to and adding * to the second instance (surrounding it all with parentheses to group it). The ? optional suffix can be handled by the following rule: (regexp)? == (regexp|) (the alternative of using it or not).
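
To illustrate those rewrites concretely (a small sketch of my own, not from the answer):

use strict;
use warnings;

# 'a+' accepts exactly the same strings as 'aa*':
for my $s ( '', 'a', 'aaa' ) {
    my $plus = ( $s =~ /\Aa+\z/  ) ? 'yes' : 'no';
    my $star = ( $s =~ /\Aaa*\z/ ) ? 'yes' : 'no';
    print "'$s': a+ => $plus, aa* => $star\n";    # the two answers always agree
}

# '(x)?' accepts the same strings as '(x|)', the alternative of x or nothing:
for my $s ( '', 'x' ) {
    my $opt = ( $s =~ /\A(?:x)?\z/ ) ? 'yes' : 'no';
    my $alt = ( $s =~ /\A(?:x|)\z/ ) ? 'yes' : 'no';
    print "'$s': (x)? => $opt, (x|) => $alt\n";
}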

  • The | means an alternative: you provide two sets of strings and the result is the union of both sets. A string is accepted if it belongs to either of the sets.
  • Concatenation means the result set is built from the cartesian product of both sets: you form pairs of strings, taking the first from the first set and the second from the other set, and join each pair together.
  • Closure means building a string as a sequence of (possibly zero) instances from the same set: you can concatenate any number of strings from the set it is applied to.

This set of rules will give you the complete set of strings that your regular expression accepts. That set may or may not coincide with the one you have in mind... but if what you have in mind is poorly defined, so will be your regular expression.

So, as a conclusion, you are asking for a general procedure to test your own mind and how you design your regular expressions. There's a theorem (usually called the pumping lemma) that is used in demonstrating the equivalence of regular expressions and finite state automata. This is a very important result, because it allows you to use regular expressions for efficient, single-pass string recognition. If you dig into this, you will find that it is possible to write a tool that, starting from a regular expression, systematically builds the full set of strings accepted by it. There is a problem, though: many regexps describe infinite sets of strings, which means the algorithm will not finish in finite time.

As a final comment, I can tell you that this makes regular expressions a very powerful tool for selecting strings. With regular expressions you can detect complex things, such as a string of digits forming a number that is a multiple of 23 in its decimal form, or validate a credit card number against transcription errors.

