2

I'm writing a quick Python script to do a bit of inspection on some of our Hibernate mapping files. I'm trying to use this bit of Python to get the table name of a POJO, whether or not its class path is fully defined:

searchObj = re.search(r'<class name="(.*\\.|)' + pojo + '".*table="(.*?)"', contents)

However - say pojo is 'MyObject' - the regex is not matching it to this line:

<class name="com.place.package.MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">

If I print the string (while stopped in Pdb) I'm searching with, I see this:

'<class name="(.*\\\\.|)MyObject".*table="(.*?)"'

I'm quite confused as to what's going wrong here. For one, I was under the impression that the 'r' prefix made it so that the backslashes wouldn't be escaped. Even so, if I remove one of the backslashes such that my search string is this:

searchObj = re.search(r'<class name="(.*\.|)' + pojo + '".*table="(.*?)"', contents)

And the string searched becomes

'<class name="(.*\\.|)MyObject".*table="(.*?)"'

It still doesn't return a match. The regular expression I intend to use works on regex101.com (with just one backslash in the apparently problematic area). Any idea what is going wrong here?

1
  • @Keatinge - Why isn't it matching, then, when I just have one backslash? The regex should be good based on that site. Also - if it's just MyObject, it matches, but if it's com.place.package.MyObject, it doesn't. Commented Jun 10, 2016 at 19:24

2 Answers

3

Given this:

re.search(r'<class name="(.*\\.|)' + pojo + '".*table="(.*?)"', contents)

The first part of the pattern is interpreted like this:

1. <class name="   a literal string beginning with < and ending with "
2. (               the beginning of a group
3.   .*                zero or more of any characters
4.   \\                a single literal backslash
5.   .                 any single character
6. OR
7.                     nothing
8. )               end of the group

Since the string you're searching for does not have a literal backslash, it won't match.

If what you intend is for \\. to mean "a literal period", you need a single backslash since it is inside a raw string: \.
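To see the difference concretely, here is a small check comparing the two-backslash and one-backslash forms inside a raw string. The sample line is shortened from the question, and the variable names are made up for illustration:

```python
import re

line = '<class name="com.place.package.MyObject" table="my_cool_object">'

# r'\\.' is the regex "literal backslash, then any character".
# The line contains no backslash, so this cannot match.
two_backslashes = re.search(r'name="(.*\\.|)MyObject"', line)

# r'\.' is the regex "literal period", which is what was intended.
one_backslash = re.search(r'name="(.*\.|)MyObject"', line)

print(two_backslashes)         # None
print(one_backslash.group(1))  # com.place.package.
```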

Also, ending the group with a pipe seems weird. I'm not sure what you think that's accomplishing. If you mean to say "any number of characters ending in a dot, or nothing", you can do that with (.*\.)?, since the ? means "zero or one of the preceding match".
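As a quick sanity check, the sketch below runs both spellings against a qualified and an unqualified name (the shortened sample lines are assumptions, not taken from a real mapping file). Both spellings match in both cases; the only difference I'm aware of is what the group holds when there is no package prefix:

```python
import re

qualified = '<class name="com.place.package.MyObject" table="t">'
bare = '<class name="MyObject" table="t">'

with_pipe = r'<class name="(.*\.|)MyObject"'
with_optional = r'<class name="(.*\.)?MyObject"'

# Both forms match the qualified and the unqualified line.
results = {
    (pattern, text): re.search(pattern, text)
    for pattern in (with_pipe, with_optional)
    for text in (qualified, bare)
}
print(all(m is not None for m in results.values()))  # True

# With no prefix, the empty alternative captures '' while
# the optional group does not participate and yields None.
pipe_group = results[(with_pipe, bare)].group(1)
optional_group = results[(with_optional, bare)].group(1)
print(repr(pipe_group), repr(optional_group))  # '' None
```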

This seems to work for me:

import re
contents1 = '''<class name="com.place.package.MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">'''
contents2 = '''<class name="MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">'''
pojo="MyObject"

pattern = r'<class name="(.*\.)?' + pojo + '".*table="(.*?)"'

assert(re.search(pattern, contents1))
assert(re.search(pattern, contents2))
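
The asserts above only confirm that something matched. To actually pull out the table name, a minimal sketch could wrap the same idea in a helper. The name find_table_name is hypothetical, and re.escape is an addition of mine so that a class name containing regex metacharacters is treated literally:

```python
import re

def find_table_name(pojo, contents):
    """Return the table mapped to `pojo`, or None if no <class> line matches."""
    # (?:...) makes the package prefix non-capturing, so the table is group 1.
    pattern = r'<class name="(?:.*\.)?' + re.escape(pojo) + r'".*table="(.*?)"'
    match = re.search(pattern, contents)
    return match.group(1) if match else None

qualified = '<class name="com.place.package.MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">'
relative = '<class name="MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">'

print(find_table_name('MyObject', qualified))     # my_cool_object
print(find_table_name('MyObject', relative))      # my_cool_object
print(find_table_name('OtherObject', qualified))  # None
```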

4 Comments

Yeah, ending it with a pipe and doing what you said function the same way. As for the rest of your comment, in the second half of my post I showed that it also does not work with just r'\.' either, so that's not the problem I'm running into.
@TrevorThackston: the backslash is definitely at least part of the problem. I've updated my answer with a working example, showing that the pattern matches both a fully qualified name and a relative name.
You're right - I just mean that I said I had already tried it. Anyways, I figured out the problem it was having on my real dataset. A few of the tables have 'mutable="false"' after the table, so it was getting confused and showing errors. Figured out how to fix that, though.
@TrevorThackston: that's one of the many reasons why trying to parse HTML with regular expressions is a bad idea.
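
The comments do not say exactly what went wrong with mutable="false", so the following is only one plausible explanation: the greedy .* before table=" backtracks no further than the last table=" on the line, and the attribute name mutable itself ends in "table". The sketch reproduces that on a made-up line, then shows two hedged alternatives: a tighter regex, and the standard-library xml.etree.ElementTree parser. The helper name table_for and the simplified mapping document (no DOCTYPE) are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

line = '<class name="com.place.package.MyObject" table="my_cool_object" mutable="false">'

# Greedy ".*" stops at the LAST 'table="', which sits inside 'mutable="'.
greedy = re.search(r'<class name="(?:.*\.)?MyObject".*table="(.*?)"', line)
print(greedy.group(1))   # false

# Requiring whitespace before "table" and staying inside the tag avoids that.
tighter = re.search(r'<class name="(?:[^"]*\.)?MyObject"[^>]*?\stable="([^"]*)"', line)
print(tighter.group(1))  # my_cool_object

# Hypothetical, simplified mapping document for the parser-based approach.
mapping = '''<hibernate-mapping>
  <class name="com.place.package.MyObject" table="my_cool_object" mutable="false"/>
  <class name="Other" table="other_table"/>
</hibernate-mapping>'''

def table_for(pojo, xml_text):
    """Return the table attribute of the <class> named pojo, qualified or not."""
    root = ET.fromstring(xml_text)
    for cls in root.iter('class'):
        name = cls.get('name', '')
        if name == pojo or name.endswith('.' + pojo):
            return cls.get('table')
    return None

print(table_for('MyObject', mapping))  # my_cool_object
```

With a parser, attribute order and extra attributes stop mattering, which is the point the last comment is making.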
2

On Pythex, I tried this regex:

<class name="(.*)\.MyObject" table="([^"]*)"

on this string:

<class name="com.place.package.MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">

and got these two match captures:

  1. com.place.package
  2. my_cool_object

So I think in your case, this line

searchObj = re.search(r'<class name="(.*)\.' + pojo + '" table="([^"]*)"', contents)

will produce the result you want.


About the confusing backslashes (you add two and then four show up): the Python documentation for the re module (section 7.2, Regular expression operations) explains that r'' is “raw string notation”, which turns off Python’s usual backslash escaping in string literals. So:

  • '\\' means “a string composed of one backslash”, since the first backslash in the string escapes the second backslash. Python sees the first backslash and thinks, ‘the next character is a special one’; then it sees the second and says, ‘the special character is an actual backslash’. It’s stored as a single character \. If you ask Python for its repr (which is what Pdb and the interactive prompt display), it will escape the output and show you '\\'; print() writes the single \.
  • r'\\' means “a string composed of two actual backslashes”. It’s stored as character \ followed by character \. Its repr escapes each one and shows you '\\\\'.
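
The two bullets above can be checked directly. This is a small sketch with made-up variable names; it relies only on built-in len, repr and print:

```python
escaped = '\\'   # normal literal: the two typed characters become ONE backslash
raw = r'\\'      # raw literal: both typed backslashes are kept

print(len(escaped), len(raw))  # 1 2
print(escaped)                 # prints a single backslash
print(repr(escaped))           # shows the escaped form with two backslashes
print(repr(raw))               # shows the escaped form with four backslashes

# Mixing a raw and a non-raw literal is fine; "raw" only affects how the
# literal is read, not the resulting str object.
pattern = r'(.*\.|)' + 'MyObject'
print(repr(pattern))           # '(.*\\.|)MyObject'
```

So the doubled backslashes seen in Pdb are just the repr of a string that really contains half as many.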

3 Comments

The problem is with the first part of the regex, not the second, and you changed the first part such that it works on qualified path names only.
@TrevorThackston I don’t know what you mean by ‘qualified path names’, so please clarify. Give me a couple examples of things you want the first part to match.
I would like it to match these two: <class name="com.place.package.MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true"> and <class name="MyObject" table="my_cool_object" dynamic-insert="true" dynamic-update="true">
