Python 'str.contains' function not returning correct values

Question

I am trying to subset a dataframe using 'pandas' if the column matches a particular pattern. Below is a reproducible example for reference.

import pandas as pd

# Create Dataframe having 10 rows and 2 columns 'code' and 'URL'
df = pd.DataFrame({'code': [1,1,2,2,3,4,1,2,2,5],
                   'URL': ['www.abc.de','https://www.abc.fr/-de','www.abc.fr','www.abc.fr','www.abc.co.uk','www.abc.es','www.abc.de','www.abc.fr','www.abc.fr','www.abc.it']})

# Create new dataframe by keeping only the rows where the column 'code' is equal to 1
new_df = df[df['code'] == 1]

# Below is what the new dataframe looks like
print(new_df)
                      URL  code
0              www.abc.de     1
1  https://www.abc.fr/-de     1
6              www.abc.de     1

Below are the dtypes for reference.

print(new_df.dtypes)
URL     object
code     int64
dtype: object

# Now I am trying to exclude all rows where the 'URL' column contains the pattern .de. This should retain only the 2nd row in new_df from the above output
new_df = new_df[~ new_df['URL'].str.contains(r".de", case = True)]

# Below is what the output looks like
print(new_df)
Empty DataFrame
Columns: [URL, code]
Index: []

Below are my questions.

1) Why is the 'URL' column appearing first even though I defined the 'code' column first?

2) What is wrong in my code when I am trying to remove all rows where the 'URL' column contains the pattern .de? In R, I would simply use the code below to get the desired result.

new_df <- new_df[grep(".de",new_df$URL, fixed = TRUE, invert = TRUE), ]

Desired output should be as below.

# Desired output for new_df
                   URL  code
https://www.abc.fr/-de     1

Any guidance on this would be really appreciated.

Regarding the first question, dictionary ordering is not guaranteed. Use df = df[['code', 'URL']] to ensure correct ordering. More background: stackoverflow.com/questions/39980323/…
– Alexander, Commented Jan 19, 2018 at 7:44
Thanks @Alexander. This information on dictionaries is really helpful. Thanks for an alternate way of creating a dataframe too. Really appreciate it. I will keep this in mind whenever ordering is important. I am sure a lot of people new to Python would find this very helpful, just like I did.
– Code_Sipra, Commented Jan 19, 2018 at 7:56

cs95 · Accepted Answer · 2018-01-19 08:07:59Z

3

Why is the 'URL' column appearing first even though I defined the 'code' column first?

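This happens because, on older Python versions, dictionaries are not ordered. Columns are created in whatever order the dictionary is iterated, which can depend on the interpreter's hash randomization. Since Python 3.7, dictionaries preserve insertion order, and pandas 0.23 and later keeps that order on Python 3.6+, so newer setups show 'code' first.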
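
A minimal sketch of pinning the column order explicitly, assuming a shortened version of the question's data. Passing `columns=` to the constructor gives the same order whether or not the dictionary itself is ordered.

```python
import pandas as pd

# Shortened sample of the question's data (assumption: three rows are enough here)
data = {'code': [1, 1, 2],
        'URL': ['www.abc.de', 'https://www.abc.fr/-de', 'www.abc.fr']}

# columns= fixes the column order regardless of how the dict is iterated
df = pd.DataFrame(data, columns=['code', 'URL'])
print(list(df.columns))  # ['code', 'URL']
```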

What is wrong in my code when I am trying to remove all rows where the 'URL' column contains the pattern .de?

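You need to escape the ., because it is a regex metacharacter that matches any character. Unescaped, .de also matches the -de at the end of https://www.abc.fr/-de, so every row with code 1 matches and the negated filter returns an empty frame.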

df[df.code.eq(1) & ~df.URL.str.contains(r'\.de$', case=True)]

                      URL  code
1  https://www.abc.fr/-de     1
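
To make the difference concrete, here is a small sketch comparing the unescaped pattern, the escaped pattern, and a plain substring match on three of the question's URLs. The `regex=False` option is likely the closest counterpart to `fixed = TRUE` in R's `grep`.

```python
import pandas as pd

urls = pd.Series(['www.abc.de', 'https://www.abc.fr/-de', 'www.abc.fr'])

# Unescaped: '.' matches any character, so '-de' also counts as a match
unescaped = urls.str.contains(r'.de')
# Escaped: only a literal dot followed by 'de' matches
escaped = urls.str.contains(r'\.de')
# regex=False treats the pattern as a plain substring, no escaping needed
literal = urls.str.contains('.de', regex=False)

print(unescaped.tolist())  # [True, True, False]
print(escaped.tolist())    # [True, False, False]
print(literal.tolist())    # [True, False, False]
```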

This may not be sufficient if the .de TLD is followed by more text (a path, for example) rather than sitting at the very end of the URL. Here's a more general solution addressing that limitation:

import re

# Raw string so the backslashes reach the regex engine unchanged
p = r'''.*       # match anything, greedily
        \.       # literal dot
        de       # "de"
        (?!.*    # negative lookahead
        \.       # literal dot (should not be found)
        )'''
df[df.code.eq(1) & ~df.URL.str.contains(p, case=True, flags=re.VERBOSE)]

                      URL  code
1  https://www.abc.fr/-de     1
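
As a rough check of the verbose pattern above, the sketch below runs it with the standard `re` module against a few URLs. The sample URLs beyond those in the question are made up for illustration.

```python
import re

# Same pattern as above, written as a raw string
pattern = re.compile(r'''.*     # match anything, greedily
                         \.     # literal dot
                         de     # "de"
                         (?!.*  # negative lookahead: no further
                         \.     # literal dot may follow
                         )''', re.VERBOSE)

# Hypothetical sample URLs, including the edge cases raised in the comments
samples = ['www.abc.de', 'www.abc.de/txt', 'https://www.abc.fr/-de',
           'www.de.com', 'www.delta.com']
matches = {url: bool(pattern.search(url)) for url in samples}
for url, hit in matches.items():
    print(url, hit)
```

Only the first two URLs match here. One limitation worth keeping in mind: a path that itself contains a dot, such as www.abc.de/index.html, would not match, because the lookahead rejects any dot after .de.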

edited Jan 19, 2018 at 8:07

answered Jan 19, 2018 at 7:38


11 Comments

Code_Sipra Over a year ago

Thanks @COLDSPEED. +1 for the answers. What is the $ sign at the end of r'\.de$' doing? Sorry, I am new to Python and still learning a lot. Will your solution work if the .de is not at the end? Something like this: www.abc.de/txt. If this was one of the values, it should not be included since it has the .de pattern.

cs95 Over a year ago

@Code_Sipra In that case, use r'\.de' only. The $ is an end of line anchor. I'll remove it from my answer.

Alexander Over a year ago

You also want to exclude cases like www.de.com or www.delta.com. Ideally look for de after final split on .

Alexander Over a year ago

df[~df.URL.str.split('.').apply(lambda s: s[-1].startswith('de'))] should do the trick, excluding the code=1 filter.

Alexander Over a year ago

@cᴏʟᴅsᴘᴇᴇᴅ Looks fine.
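
For reference, the suffix check suggested in the comments above can also be written with vectorized string methods instead of `apply`. This is a sketch on made-up sample URLs, not a replacement for the accepted answer.

```python
import pandas as pd

# Hypothetical sample, including the www.de.com and www.delta.com cases
urls = pd.Series(['www.abc.de', 'https://www.abc.fr/-de',
                  'www.de.com', 'www.delta.com'])

# Text after the final dot, e.g. 'de', 'fr/-de', 'com'
last_part = urls.str.rsplit('.', n=1).str[-1]
# Keep rows whose final segment does not start with 'de'
keep = ~last_part.str.startswith('de')
print(urls[keep].tolist())
```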
