Suppose I have a dataframe,

data
id  URL
1   www.pandora.com
2   m.jcpenney.com
3   www.youtube.com
4   www.facebook.com

I want to create a new column whose value depends on whether the URL contains a particular word. For example, if the URL contains 'youtube', I want the column value to be 'Youtube'. So I tried the following:

data['test'] = 'other'

Once we do that, we have:

data['test']
other
other
other
other

Then I tried this:

data[data['URL'].str.contains("youtub") == True]['test'] = 'Youtube'
data[data['URL'].str.contains("face") == True]['test'] = 'Facebook'

Though this runs without any error, the value of the test column doesn't change. It is still 'other' for all the rows. When I run these statements, ideally only the 3rd row should change to 'Youtube' and the 4th to 'Facebook'. But nothing changes. Can anybody tell me what mistake I am making here?

3 Answers

17

I think you can use loc with a boolean mask created by contains:

print(data['URL'].str.contains("youtub"))
0    False
1    False
2     True
3    False
Name: URL, dtype: bool

data.loc[data['URL'].str.contains("youtub"),'test'] = 'Youtube'
data.loc[data['URL'].str.contains("face"),'test'] = 'Facebook'
print(data)
   id               URL      test
0   1   www.pandora.com       NaN
1   2    m.jcpenney.com       NaN
2   3   www.youtube.com   Youtube
3   4  www.facebook.com  Facebook
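
For context on why the attempt in the question had no effect: `data[mask]['test'] = ...` first builds a filtered object with `data[mask]` and then assigns into that temporary object, so `data` itself is typically left untouched (depending on the pandas version, a warning may be emitted). A single `.loc[mask, 'test']` call assigns into the original frame instead. Below is a minimal self-contained sketch that rebuilds the sample frame from the question, keeps the 'other' default, and applies the `.loc` assignments; the way the frame is constructed is an assumption, since the question only shows its printed form.

```python
import pandas as pd

# Sample frame rebuilt from the question (construction is assumed).
data = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "URL": ["www.pandora.com", "m.jcpenney.com",
            "www.youtube.com", "www.facebook.com"],
})

# Default label for every row, as in the question.
data["test"] = "other"

# One .loc call per keyword: row selection and column assignment
# happen in a single indexing operation on the original frame.
data.loc[data["URL"].str.contains("youtub"), "test"] = "Youtube"
data.loc[data["URL"].str.contains("face"), "test"] = "Facebook"

print(data["test"].tolist())
# ['other', 'other', 'Youtube', 'Facebook']
```

With the default set first, unmatched rows keep 'other' rather than NaN.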

2 Comments

This one works. Just a small correction: when we run this, we get the error "ValueError: cannot index with vector containing NA / NaN values". So we just need to add == True to the condition, as in the question above.
This is a very elegant solution to a question with many possible answers, upvote.
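
The ValueError quoted in the first comment can appear when the URL column has missing values, because `str.contains` can then return NaN for those rows and the mask is no longer purely boolean. As an alternative to appending `== True`, `str.contains` accepts an `na` argument that fills in a value for missing entries. A small sketch, assuming a made-up frame with one missing URL (not data from the question):

```python
import pandas as pd

# Hypothetical frame with a missing URL in the second row.
data = pd.DataFrame({"URL": ["www.youtube.com", None, "www.facebook.com"]})
data["test"] = "other"

# na=False treats missing URLs as "no match", giving a clean boolean mask.
mask = data["URL"].str.contains("youtub", na=False)
data.loc[mask, "test"] = "Youtube"

print(data["test"].tolist())
# ['Youtube', 'other', 'other']
```
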
9

I would do it in one shot:

replacements = {
  r'.*youtube.*': 'Youtube',
  r'.*face.*': 'Facebook',
  r'.*pandora.*': 'Pandora'
}

df['text'] = df.URL.replace(replacements, regex=True)
df.loc[df.text.str.contains(r'\.'), 'text'] = 'other'
print(df)

Output:

                 URL      text
id
1    www.pandora.com   Pandora
2     m.jcpenney.com     other
3    www.youtube.com   Youtube
4   www.facebook.com  Facebook
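
The snippet above does not show how `df` was built. The following is a self-contained sketch of the same approach, assuming `id` is the index (as the printed output suggests) and using the sample URLs from the question. Any value that still contains a dot after the replacement was not matched by a pattern and is relabelled 'other'.

```python
import pandas as pd

# Frame construction is an assumption based on the printed output above.
df = pd.DataFrame(
    {"URL": ["www.pandora.com", "m.jcpenney.com",
             "www.youtube.com", "www.facebook.com"]},
    index=pd.Index([1, 2, 3, 4], name="id"),
)

# Each regex matches the whole URL, so the whole string is replaced.
replacements = {
    r".*youtube.*": "Youtube",
    r".*face.*": "Facebook",
    r".*pandora.*": "Pandora",
}

df["text"] = df["URL"].replace(replacements, regex=True)

# URLs that matched no pattern still contain a dot; label them 'other'.
df.loc[df["text"].str.contains(r"\."), "text"] = "other"

print(df["text"].tolist())
# ['Pandora', 'other', 'Youtube', 'Facebook']
```

Note that this relies on none of the replacement labels containing a dot.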

2

Given that you probably want to check whether the host name matches (rather than any word in the URL), you could split the string on the dot and check whether the second item (the host name) is in your list.

targets = ['pandora', 'youtube', 'facebook']
data['target_url'] = [url[1] if url[1] in targets else None 
                      for url in data.URL.str.split('.')]

data
   id               URL target_url
0   1   www.pandora.com    pandora
1   2    m.jcpenney.com       None
2   3   www.youtube.com    youtube
3   4  www.facebook.com   facebook
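
The list comprehension above indexes `url[1]` directly, which would raise an error for a URL without a dot or for a missing value. A possible vectorized variant, sketched under the same assumption that the host name is the second dot-separated piece, uses the `.str` accessor so that such rows simply come out as missing. The `host` variable name is just for illustration.

```python
import pandas as pd

# Sample frame rebuilt from the question (construction is assumed).
data = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "URL": ["www.pandora.com", "m.jcpenney.com",
            "www.youtube.com", "www.facebook.com"],
})
targets = ["pandora", "youtube", "facebook"]

# Second dot-separated piece of each URL; rows with fewer than
# two pieces yield a missing value instead of raising an IndexError.
host = data["URL"].str.split(".").str[1]

# Keep the host name only where it is one of the targets.
data["target_url"] = host.where(host.isin(targets))

print(data)
```

Rows whose host is not in `targets` end up as missing values rather than `None`, which is usually what downstream pandas operations expect.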
