4

I want to store a series of pre-tested regexes in a config file, and read and apply them at runtime.

However, because they're commonly packed with escape characters, by the time I've loaded them up into memory, and populated them into a dict, they've been escaped to death.

How can I preserve the integrity of my regex definitions, so that they will re.compile?

Alternatively, given that many of the post-escape strings end up in a form with \x00 characters, how do I convert these back into a form that will be consumed correctly by re.compile?

For example, I have the regex \btest\b written in a file. If I hard-code it, I can make it work with re.compile(r"\btest\b"). However, I don't want to write this code by hand; I want to read the pattern from a file and process it as a variable (I have thousands of these to deal with).

There doesn't seem to be a way to apply the r prefix to a string variable, so I'm left trying to compile '\x08test\x08', which doesn't do what I want.

This must be a fairly regular issue - how do others deal with this problem?

  • 2
    If you write \btest\b as literal text in a file and then read it in, it will be equal to r'\btest\b'. Commented Oct 25, 2018 at 14:04
  • 1
    How about just opening the file, iterating through the lines, and passing each line to re.compile? Could you post an example of what your file looks like? Commented Oct 25, 2018 at 14:10
  • Being "escaped to death" is just a human problem though. The program would still be reading the files just fine as intended. e.g. \btest\b would be read as \\btest\\b and would re.compile() just fine. Commented Oct 25, 2018 at 14:12
  • That's not what I'm finding, perhaps because I'm reading the file via a csv library, json, or pandas. I don't intend to write the io-level code for what should be a simple config parser. It seems as though there are numerous ways that Python starts with "\btest\b" (which re.compile accepts), but once it ends up in the "\x08test\x08" form (which re.compile does not - at least, it doesn't interpret that the same way) there seems to be no simple way to perform the reverse operation. Commented Oct 25, 2018 at 14:13
  • @Idlehands if I could force the literal to be "escaped" into "\\btest\\b" then yes, that would work I guess - unfortunately, that doesn't seem to be what's happening. Commented Oct 25, 2018 at 14:16
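
A possible explanation for the behaviour described in the comments above, offered as a sketch rather than a diagnosis of the asker's actual files: in JSON, \b inside a string is itself an escape for the backspace character, so a parser such as json.loads will turn a single-backslash \btest\b into '\x08test\x08'. Doubling the backslashes in the JSON text keeps the pattern intact. The key name "word" and the two config strings below are hypothetical.

```python
import json
import re

# Hypothetical config text, written as raw strings so that each one shows
# exactly the characters that would sit in the file.
single = r'{"word": "\btest\b"}'     # one backslash: JSON reads \b as backspace
doubled = r'{"word": "\\btest\\b"}'  # two backslashes: JSON yields a literal \b

broken = json.loads(single)["word"]
intact = json.loads(doubled)["word"]

print(repr(broken))  # contains backspace characters
print(repr(intact))  # same text as r'\btest\b'

# Only the doubled form behaves as a word-boundary pattern.
print(re.compile(intact).search("a test here") is not None)
print(re.compile(broken).search("a test here") is not None)
```

If the patterns live in JSON, storing them with doubled backslashes (or writing the file with json.dump so the escaping is done for you) avoids the problem at the source.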

1 Answer

4

As the first comment says, there is no need to do anything special: escape sequences are only interpreted in string literals in Python source code, not in text read from a file.

Given a UTF-8 encoded text file named regexps.txt with one regex per line, you could build a list of compiled regexps with something like:

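import re
with open('regexps.txt', encoding='utf8') as f:
    # Strip the trailing newline so it is not compiled into the pattern; skip blank lines.
    compiled_regexps = [re.compile(line.rstrip('\n')) for line in f if line.strip()]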
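
To check this claim without touching real config files, a minimal round-trip sketch might look like the following. It assumes a plain text file with one pattern per line; the temporary directory, the file name, and the two sample patterns are illustrative only.

```python
import os
import re
import tempfile

# Sample patterns, written as raw strings so the backslashes are literal.
patterns = [r"\btest\b", r"\d{3}-\d{4}"]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "regexps.txt")

    # Write one pattern per line, exactly as it would be typed in an editor.
    with open(path, "w", encoding="utf8") as f:
        f.write("\n".join(patterns) + "\n")

    # Read the lines back, dropping the trailing newline and any blank lines.
    with open(path, encoding="utf8") as f:
        loaded = [line.rstrip("\n") for line in f if line.strip()]

compiled = [re.compile(p) for p in loaded]

print(loaded == patterns)                             # the text survives unchanged
print(compiled[0].search("a test here") is not None)  # word boundary is honoured
print(compiled[0].search("attested") is not None)     # no match inside a longer word
```

Plain file reads return the characters as stored, so no unescaping step is needed; problems of the kind described in the question are more likely to come from a format-aware parser that applies its own escape rules.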