I have a Selenium/Python project that uses a regex match to find HTML elements. The element attributes sometimes include the Danish/Norwegian characters ÆØÅ. The problem is in the snippet below:

if (re.match(regexp_expression, compare_string)):
    result = True
else :
    result = False

Both regexp_expression and compare_string are manipulated before the regex match is executed. If I print them just before the snippet above runs, and also print the result, I get the following output:

Regex_expression: [^log på$]
compare string: [log på]
result = false

I added the brackets to make sure there was no stray whitespace. They are only part of the print statement, not part of the string variables.

However, if I try to reproduce the problem in a separate script, like this:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

regexp_expression  = "^log på$"
compare_string = "log på"

if (re.match(regexp_expression, compare_string)):
    print("result true")
    result = True
else :
    print("result = false")
    result = False

Then the result is true.

How can this be? To make it even stranger, it worked earlier, and I am not sure what I edited that made it go boom...

The full module containing the regex compare method is below. I did not write it myself, so I am not 100% familiar with the reasons for all the replace statements and string manipulation, but I would think it shouldn't matter, since I can check the strings just before the failing match call at the bottom...

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

def regexp_compare(regexp_expression, compare_string):
    #final int DOTALL
    #try:    // include try catch for "PatternSyntaxException" while testing/including a new symbol in this method..

    #catch(PatternSyntaxException e):
    #    System.out.println("Regexp>>"+regexp_expression)
    #    e.printStackTrace()
    #*/


    if(not compare_string.strip() and (not regexp_expression.strip() or regexp_expression.strip().lower() == "*".lower()) or (regexp_expression.strip().lower() == ".*".lower())):
        print("return 1")
        return True                

    if(not compare_string or not regexp_expression):
        print("return 2")
        return False                

    regexp_expression = regexp_expression.lower()
    compare_string = compare_string.lower()

    if(not regexp_expression.strip()): 
        regexp_expression = ""

    if(not compare_string.strip() and (not regexp_expression.strip() or regexp_expression.strip().lower() == "*".lower()) or (regexp_expression.strip().lower() == ".*".lower())):
        regexp_expression = ""
    else:

        regexp_expression = regexp_expression.replace("\\","\\\\")
        regexp_expression = regexp_expression.replace("\\.","\\\\.")
        regexp_expression = regexp_expression.replace("\\*", ".*")
        regexp_expression = regexp_expression.replace("\\(", "\\\\(")
        regexp_expression = regexp_expression.replace("\\)", "\\\\)")           
        regexp_expression_arr = regexp_expression.split("|")
        regexp_expression = ""

        for i in range(0, len(regexp_expression_arr)):
            if(not(regexp_expression_arr[i].startswith("^"))):
                regexp_expression_arr[i] = "^"+regexp_expression_arr[i]

            if(not(regexp_expression_arr[i].endswith("$"))):
                regexp_expression_arr[i] = regexp_expression_arr[i]+"$"

            regexp_expression = regexp_expression_arr[i] if regexp_expression == "" else regexp_expression+"|"+regexp_expression_arr[i]  




    result = None        

    print("Regex_expression: [" + regexp_expression+"]")
    print("compare string: [" + compare_string+"]")

    if (re.match(regexp_expression, compare_string)):
        print("result true")
        result = True
    else :
        print("result = false")
        result = False

    print("return result")
    return result
  • ^log på$ is not a good use of regexes. If you don't have a pattern, why not simply use ==? Commented Jul 6, 2015 at 11:25
  • The thing is, in my case it is redundant with the ^*$, but this is a general util class used for several matches, and I'm not the author of it. I guess there are reasons for the regex syntax in other cases. Like if one needs to check if a single html class is present in an element's full class string. This time I just happen to match the buttons text instead of classes. Commented Jul 6, 2015 at 11:27
  • Just realized that I got the syntax of the regex string wrong. I thought it meant "starts with OR ends with", but that OR is an AND. I would agree that it is a waste of a regex. I don't know why they chose to code it that way... Commented Jul 7, 2015 at 12:57

1 Answer

It's likely that you are comparing a unicode string to a non-unicode string.

For example, in the following:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

regexp_expression  = "^log på$"
compare_string = u"log på"

if (re.match(regexp_expression, compare_string)):
    print("result true")
    result = True
else :
    print("result = false")
    result = False

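You will get the output "result = false". On Python 2 the plain literal holds UTF-8 encoded bytes while the u"" literal holds code points, so the two å characters never compare equal. So there is likely a point in your manipulation where something is not unicode.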
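A plain print shows the same text for both kinds of string, which is probably why your bracketed output looked identical. One way to spot the difference is to print the type, length and escaped repr of each value instead. A minimal sketch, assuming a hypothetical helper named describe that is not part of your module:

```python
# -*- coding: utf-8 -*-

def describe(label, value):
    """Return an ASCII-only line showing the type, length and escaped repr."""
    # backslashreplace keeps the output ASCII, so it prints on any console.
    escaped = repr(value).encode("ascii", "backslashreplace").decode("ascii")
    return "%s: type=%s len=%d repr=%s" % (
        label, type(value).__name__, len(value), escaped)

# "\u00e5" is the character å, written as an escape to keep this file ASCII.
text_value = u"log p\u00e5"
byte_value = text_value.encode("utf-8")  # å becomes two bytes in UTF-8

print(describe("text ", text_value))
print(describe("bytes", byte_value))
```

The two lines report different types and different lengths (6 characters versus 7 bytes), even though both values print as the same words. Calling something like this on regexp_expression and compare_string right before re.match should show which one is the odd one out.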

The following gives the same false result:

regexp_expression  = u"^log på$"
compare_string = "log på"
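A possible fix is to decode both arguments to text before matching. This is only a sketch: it assumes that any byte strings reaching the comparison are UTF-8 encoded, and to_text and unicode_safe_match are hypothetical names, not part of your module.

```python
# -*- coding: utf-8 -*-
import re

def to_text(value, encoding="utf-8"):
    """Return value as a text (unicode) string, decoding bytes if needed."""
    # On Python 2, bytes is an alias for str, so this also catches plain strings.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

def unicode_safe_match(pattern, candidate, encoding="utf-8"):
    """Match pattern against candidate after converting both to text."""
    pattern = to_text(pattern, encoding)
    candidate = to_text(candidate, encoding)
    return re.match(pattern, candidate, re.UNICODE) is not None

# "\u00e5" is the character å, written as an escape to keep this file ASCII.
byte_pattern = u"^log p\u00e5$".encode("utf-8")  # UTF-8 bytes
text_candidate = u"log p\u00e5"                  # text string

print(unicode_safe_match(byte_pattern, text_candidate))  # True
```

Converting at the point where the strings enter the comparison keeps the rest of the string manipulation unchanged. If the byte strings come from a source with a different encoding, the encoding argument would need to match it.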

1 Comment

This is great. I just used this approach: stackoverflow.com/questions/4987327/…. Turns out my regex expression is unicode, and the compare string is an ordinary string. They are already that way before they are sent to the regexp_compare method. Thanks!
