Assign value to a pandas dataframe column based on string condition

Question

Suppose I have a dataframe,

data
id  URL
1   www.pandora.com
2   m.jcpenney.com
3   www.youtube.com
4   www.facebook.com

I want to create a new column whose value depends on whether the URL contains a particular word. For example, if the URL contains 'youtube', I want the column value to be 'youtube'. So I tried the following:

data['test'] = 'other'

Once we do that, we have:

data['test']
other
other
other
other

Then I tried this:

data[data['URL'].str.contains("youtub") == True]['test'] = 'Youtube'
data[data['URL'].str.contains("face") == True]['test'] = 'Facebook'

Though this runs without any error, the values in the test column don't change. Every row still has 'other'. When I run these statements, the 3rd row should change to 'Youtube' and the 4th to 'Facebook', but neither does. Can anybody tell me what mistake I am making here?

jezrael · Accepted Answer · 2016-04-18 18:32:35Z

17

I think you can use loc with a boolean mask created by str.contains:

print(data['URL'].str.contains("youtub"))
0    False
1    False
2     True
3    False
Name: URL, dtype: bool

data.loc[data['URL'].str.contains("youtub"),'test'] = 'Youtube'
data.loc[data['URL'].str.contains("face"),'test'] = 'Facebook'
print(data)
   id               URL      test
0   1   www.pandora.com       NaN
1   2    m.jcpenney.com       NaN
2   3   www.youtube.com   Youtube
3   4  www.facebook.com  Facebook
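
For context, the attempt in the question does nothing because `data[mask]['test'] = ...` is chained indexing: `data[mask]` returns a copy, so the assignment lands on a temporary object rather than on `data`. The following is a minimal self-contained sketch of the `loc` approach, assuming the four-row frame from the question and pre-filling `test` with 'other' as the question does:

```python
import pandas as pd

# Rebuild the example frame from the question.
data = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "URL": ["www.pandora.com", "m.jcpenney.com",
            "www.youtube.com", "www.facebook.com"],
})
data["test"] = "other"

# A single .loc call with (row mask, column label) writes into `data` itself.
data.loc[data["URL"].str.contains("youtub"), "test"] = "Youtube"
data.loc[data["URL"].str.contains("face"), "test"] = "Facebook"

print(data)
```

Because `test` is filled with 'other' first, the rows that match neither pattern keep 'other' instead of showing NaN.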

answered Apr 18, 2016 at 18:32


2 Comments

haimen Over a year ago

This one works. Just a small correction: when we run this, we get the error "ValueError: cannot index with vector containing NA / NaN values". So we just need to add == True to the condition, as in the question.

Hatt Over a year ago

This is a very elegant solution to a question with many possible answers, upvote.
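
The `ValueError` mentioned in the first comment typically appears when the `URL` column contains missing values, because `str.contains` then returns NaN for those rows and the mask is no longer purely boolean. As an alternative to comparing with `== True`, `str.contains` accepts an `na` argument that sets the result for missing entries. A small sketch, assuming a frame with one missing URL (the data here is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a missing URL in the second row.
data = pd.DataFrame({"URL": ["www.youtube.com", np.nan, "www.facebook.com"]})
data["test"] = "other"

# na=False treats missing URLs as "no match", so the mask is purely boolean.
mask = data["URL"].str.contains("youtub", na=False)
data.loc[mask, "test"] = "Youtube"

print(data)
```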

MaxU - stand with Ukraine · Accepted Answer · 2016-04-18 18:56:09Z

9

I would do it in one shot:

replacements = {
  r'.*youtube.*': 'Youtube',
  r'.*face.*': 'Facebook',
  r'.*pandora.*': 'Pandora'
}

df['text'] = df.URL.replace(replacements, regex=True)
df.loc[df.text.str.contains(r'\.'), 'text'] = 'other'
print(df)

Output:

                 URL      text
id
1    www.pandora.com   Pandora
2     m.jcpenney.com     other
3    www.youtube.com   Youtube
4   www.facebook.com  Facebook
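
For comparison, the same labelling can be sketched with `numpy.select`, which takes a list of conditions, a list of matching choices, and a default. This is a different technique from the `replace` call above, not part of the original answer. The frame is rebuilt from the question's data with `id` as the index, to match the output shown:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"URL": ["www.pandora.com", "m.jcpenney.com",
             "www.youtube.com", "www.facebook.com"]},
    index=pd.Index([1, 2, 3, 4], name="id"),
)

# Conditions are checked in order; the first one that is True wins.
conditions = [
    df["URL"].str.contains("youtube", na=False),
    df["URL"].str.contains("face", na=False),
    df["URL"].str.contains("pandora", na=False),
]
choices = ["Youtube", "Facebook", "Pandora"]

# Rows matching no condition get the default label.
df["text"] = np.select(conditions, choices, default="other")

print(df)
```

This avoids the second pass that relabels anything still containing a dot, at the cost of spelling out each condition.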

edited Apr 18, 2016 at 18:56

answered Apr 18, 2016 at 18:37


Alexander · Accepted Answer · 2016-04-18 19:00:13Z

2

Given that you probably want to check whether the host name matches (rather than any word in the URL), you could split the string on the dot and check whether the second item (the host name) is in your list.

targets = ['pandora', 'youtube', 'facebook']
data['target_url'] = [url[1] if url[1] in targets else None 
                      for url in data.URL.str.split('.')]

data
   id               URL target_url
0   1   www.pandora.com    pandora
1   2    m.jcpenney.com       None
2   3   www.youtube.com    youtube
3   4  www.facebook.com   facebook
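
One caveat with taking the second item: a URL without a subdomain, such as `youtube.com`, would yield `com`. A hedged variant, assuming every host ends in a single-label suffix like `.com`, takes the second-to-last label instead and filters it with `isin`. The `youtube.com` row below is made up for illustration:

```python
import pandas as pd

# Hypothetical data: the third URL has no subdomain.
data = pd.DataFrame({
    "URL": ["www.pandora.com", "m.jcpenney.com",
            "youtube.com", "www.facebook.com"],
})
targets = ["pandora", "youtube", "facebook"]

# Second-to-last label, e.g. 'pandora' from 'www.pandora.com'.
host = data["URL"].str.split(".").str[-2]

# Keep the label only where it is one of the targets; other rows become missing.
data["target_url"] = host.where(host.isin(targets))

print(data)
```

Non-matching rows come out as missing values rather than `None`, and hosts with multi-label suffixes such as `.co.uk` would need different handling.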

answered Apr 18, 2016 at 19:00
