
I am trying to subset a dataframe using 'pandas' if the column matches a particular pattern. Below is a reproducible example for reference.

import pandas as pd

# Create Dataframe having 10 rows and 2 columns 'code' and 'URL'
df = pd.DataFrame({'code': [1,1,2,2,3,4,1,2,2,5],
                   'URL': ['www.abc.de','https://www.abc.fr/-de','www.abc.fr','www.abc.fr','www.abc.co.uk','www.abc.es','www.abc.de','www.abc.fr','www.abc.fr','www.abc.it']})

# Create new dataframe by keeping only the rows where the column 'code' is equal to 1
new_df = df[df['code'] == 1]

# Below is what the new dataframe looks like
print(new_df)
                      URL  code
0              www.abc.de     1
1  https://www.abc.fr/-de     1
6              www.abc.de     1

Below are the dtypes for reference.

print(new_df.dtypes)
URL     object
code     int64
dtype: object

# Now I am trying to exclude all rows where the 'URL' column contains the pattern .de. This should retain only the 2nd row of new_df from the output above
new_df = new_df[~ new_df['URL'].str.contains(r".de", case = True)]

# Below is what the output looks like
print(new_df)
Empty DataFrame
Columns: [URL, code]
Index: []

Below are my questions.

1) Why is the 'URL' column appearing first even though I defined the 'code' column first?

2) What is wrong in my code when I am trying to remove all those rows where the 'URL' column contains the pattern .de? In R, I would simply use the code below to get the desired result.

new_df <- new_df[grep(".de",new_df$URL, fixed = TRUE, invert = TRUE), ]

Desired output should be as below.

# Desired output for new_df
                   URL  code
https://www.abc.fr/-de     1

Any guidance on this would be really appreciated.

  • r"\.de", escape the dot. Commented Jan 19, 2018 at 7:37
  • Regarding first question, dictionary ordering is not guaranteed. Use df = df[['code', 'URL']] to ensure correct ordering. More background: stackoverflow.com/questions/39980323/… Commented Jan 19, 2018 at 7:44
  • Thanks @Alexander. This information on dictionary is really helpful. Thanks for an alternate way of creating a dataframe too. Really appreciate it. I will keep this in mind whenever ordering is important. I am sure a lot of people new to Python would find this very helpful, just like I did. Commented Jan 19, 2018 at 7:56

1 Answer


Why is the 'URL' column appearing first even though I defined the 'code' column first?

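This is a consequence of building the DataFrame from a plain dictionary. Older Python versions did not guarantee dictionary ordering, and pandas versions before 0.23 sorted the dictionary keys when creating the columns, which puts 'URL' ahead of 'code' (uppercase letters sort before lowercase ones). With Python 3.6 or later and pandas 0.23 or later, the columns follow the order in which the keys were defined.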
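
If the column order matters regardless of the Python or pandas version in use, it can be stated explicitly. A minimal sketch, reusing a shortened copy of the question's data: the `columns=` argument of `pd.DataFrame` and the column selection suggested in the comments should both pin the order.

```python
import pandas as pd

# Shortened copy of the question's data, for illustration only
data = {'code': [1, 1, 2],
        'URL': ['www.abc.de', 'https://www.abc.fr/-de', 'www.abc.fr']}

# Option 1: pass the desired column order when constructing the frame
df = pd.DataFrame(data, columns=['code', 'URL'])

# Option 2: reorder an existing frame by selecting the columns in order
df2 = pd.DataFrame(data)[['code', 'URL']]

print(list(df.columns))
print(list(df2.columns))
```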


What is wrong in my code when I am trying to remove all those rows where the 'URL' column contains the pattern .de?

You'd need to escape the ., because that's a special regex meta-character.

df[df.code.eq(1) & ~df.URL.str.contains(r'\.de$', case=True)]

                      URL  code
1  https://www.abc.fr/-de     1
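
To see why the unescaped pattern emptied the frame, it may help to compare the three variants side by side. A small sketch on a made-up Series holding three of the question's URLs; `regex=False` is the `str.contains` option that treats the pattern as a plain substring, which is roughly what `fixed = TRUE` does in R's `grep`.

```python
import pandas as pd

# Three of the question's URLs, for illustration only
urls = pd.Series(['www.abc.de', 'https://www.abc.fr/-de', 'www.abc.fr'])

# Unescaped: "." matches any character, so "-de" counts as a match too
unescaped = urls.str.contains(r'.de').tolist()

# Escaped: only a literal ".de" matches
escaped = urls.str.contains(r'\.de').tolist()

# regex=False: the pattern is a plain substring, no escaping needed
literal = urls.str.contains('.de', regex=False).tolist()

print(unescaped)  # [True, True, False]
print(escaped)    # [True, False, False]
print(literal)    # [True, False, False]
```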

This may not be sufficient if .de is not at the very end of the URL, for example when a path follows the TLD. Here's a more general pattern addressing that limitation:

import re

p = r'''.*      # match anything, greedily
        \.      # literal dot
        de      # "de"
        (?!.*   # negative lookahead: further on there must be no
        \.      # literal dot
        )'''
df[df.code.eq(1) & ~df.URL.str.contains(p, case=True, flags=re.VERBOSE)]

                      URL  code
1  https://www.abc.fr/-de     1 

11 Comments

Thanks @COLDSPEED. +1 for the answers. What is the $ sign at the end of r'\.de$' doing? Sorry, I am new to Python and still learning a lot. Will your solution work if the .de is not at the end? Something like this: www.abc.de/txt. If this was one of the values, it should not be included since it has the .de pattern.
@Code_Sipra In that case, use r'\.de' only. The $ is an end of line anchor. I'll remove it from my answer.
You also want to exclude cases like www.de.com or www.delta.com. Ideally look for de after final split on .
df[~df.URL.str.split('.').apply(lambda s: s[-1].startswith('de'))] should do the trick, excluding the code=1 filter.
@cᴏʟᴅsᴘᴇᴇᴅ Looks fine.
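
The idea raised in these comments, checking only the last dot-separated part of the host name, can also be sketched without a regex. The helper below is hypothetical (the name `has_de_tld` is not from the thread) and assumes that URLs without a scheme, such as www.abc.de, should be read as bare host names.

```python
from urllib.parse import urlsplit

def has_de_tld(url):
    """Return True if the host part of `url` ends in the .de TLD."""
    # urlsplit only recognises a host after "//", so add it for bare URLs
    if '//' not in url:
        url = '//' + url
    host = urlsplit(url).hostname or ''
    # Compare only the last dot-separated label of the host
    return host.rsplit('.', 1)[-1] == 'de'

urls = ['www.abc.de', 'https://www.abc.fr/-de', 'www.abc.de/txt',
        'www.de.com', 'www.delta.com']
results = [has_de_tld(u) for u in urls]
print(results)  # [True, False, True, False, False]
```

Applied to the question's frame, this would be something like df[df['code'].eq(1) & ~df['URL'].map(has_de_tld)].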
