I am trying to subset a pandas dataframe based on whether a column matches a particular pattern. Below is a reproducible example for reference.
import pandas as pd
# Create Dataframe having 10 rows and 2 columns 'code' and 'URL'
df = pd.DataFrame({'code': [1,1,2,2,3,4,1,2,2,5],
'URL': ['www.abc.de','https://www.abc.fr/-de','www.abc.fr','www.abc.fr','www.abc.co.uk','www.abc.es','www.abc.de','www.abc.fr','www.abc.fr','www.abc.it']})
# Create a new dataframe that keeps only the rows where the column 'code' is equal to 1
new_df = df[df['code'] == 1]
# This is what the new dataframe looks like
print(new_df)
                      URL  code
0              www.abc.de     1
1  https://www.abc.fr/-de     1
6              www.abc.de     1
Below are the dtypes for reference.
print(new_df.dtypes)
URL object
code int64
dtype: object
# Now I am trying to exclude all rows where the 'URL' column contains the literal pattern .de. This should retain only the 2nd row of new_df from the output above
new_df = new_df[~ new_df['URL'].str.contains(r".de", case = True)]
# This is what the output looks like
print(new_df)
Empty DataFrame
Columns: [URL, code]
Index: []
Below are my questions.
1) Why is the 'URL' column appearing first even though I defined the 'code' column first?
2) What is wrong with my code when I try to remove all rows where the 'URL' column contains the pattern .de? In R, I would simply use the code below to get the desired result.
new_df <- new_df[grep(".de",new_df$URL, fixed = TRUE, invert = TRUE), ]
Desired output should be as below.
# Desired output for new_df
URL code
https://www.abc.fr/-de 1
Any guidance on this would be really appreciated.
Use r"\.de" to escape the dot. Use df = df[['code', 'URL']] to ensure correct ordering. More background: stackoverflow.com/questions/39980323/…
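To make the comment's two suggestions concrete, here is a minimal sketch using the data from the question. The variable names `escaped` and `literal` are only illustrative. It assumes the empty result comes from the unescaped dot being treated as a regex wildcard, so that ".de" also matches "-de". The `regex=False` variant is an alternative I am adding as the closest counterpart I know of to `fixed = TRUE` in R. The column selection is there because older pandas versions may sort the keys of a dict-built frame alphabetically, which would explain 'URL' appearing first.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    'code': [1, 1, 2, 2, 3, 4, 1, 2, 2, 5],
    'URL': ['www.abc.de', 'https://www.abc.fr/-de', 'www.abc.fr', 'www.abc.fr',
            'www.abc.co.uk', 'www.abc.es', 'www.abc.de', 'www.abc.fr',
            'www.abc.fr', 'www.abc.it'],
})

# Select the columns explicitly so their order does not depend on
# how pandas ordered the dict keys.
df = df[['code', 'URL']]

# Keep only the rows where 'code' equals 1.
new_df = df[df['code'] == 1]

# An unescaped "." matches any character, so ".de" also matches "-de".
# Escaping the dot restricts the match to a literal ".de".
escaped = new_df[~new_df['URL'].str.contains(r"\.de", case=True)]

# Alternative: regex=False treats the pattern as a plain substring,
# so no escaping is needed.
literal = new_df[~new_df['URL'].str.contains(".de", regex=False)]

print(escaped)
```

Both variants should leave only the row with https://www.abc.fr/-de, which is the desired output in the question.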