Replace column based on string

Question

I'm trying to replace the column "Name" with a new variable "Gender", based on the first letters found in each value of that column.

INPUT:

df['Name'].value_counts()

OUTPUT:

Mr. Gordon Hemmings     1
Miss Jane Wilkins       1
Mrs. Audrey North       1
Mrs. Wanda Sharp        1
Mr. Victor Hemmings     1
                       ..
Miss Heather Abraham    1
Mrs. Kylie Hart         1
Mr. Ian Langdon         1
Mr. Gordon Watson       1
Miss Irene Vance        1

Name: Name, Length: 4999, dtype: int64

Now, see the Mr., Mrs., and Miss? The first question that comes to mind is: how many different words are there?

INPUT:

df.Name.str.split().str[0].value_counts(dropna=False)

OUTPUT:

Mr.     3351
Mrs.     937
Miss     711
NaN        1

Name: Name, dtype: int64

Now I'm trying to:

# Replace missing value
df['Name'].fillna('Mr.', inplace=True)

# Create Column Gender
df['Gender'] = df['Name']

for i in range(0, df[0]):
    A = df['Name'].values[i][0:3] == "Mr."
    df['Gender'].values[i] = A

df.loc[df['Gender']==True, 'Gender']="Male"
df.loc[df['Gender']==False, 'Gender']="Female"

del df['Name'] #Delete column 'Name'

df

But I'm missing something since I get the following error:

KeyError: 0

David Erickson · Accepted Answer · 2021-04-01 12:17:24Z

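The KeyError is raised because df[0] looks up a column named 0, and your DataFrame doesn't have one (the loop presumably wanted the row count, len(df), instead). However, I would ditch that loop and try something more efficient.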
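For reference, the error can be reproduced on a small frame. This is only a sketch: the two-row DataFrame below is made up for illustration and stands in for the real data.

```python
import pandas as pd

# Tiny made-up frame standing in for the asker's data.
df = pd.DataFrame({"Name": ["Mr. Gordon Hemmings", "Miss Jane Wilkins"]})

try:
    df[0]          # column lookup: there is no column labelled 0
    raised = False
except KeyError:
    raised = True

n_rows = len(df)   # the row count that range() actually needs
print(raised, n_rows)
```

Indexing a DataFrame with square brackets selects columns by label, which is why an integer that is not a column name fails here.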

You can use np.where with str.contains to flag the names that contain Mr., after filling the missing value with fillna(). Then just drop the Name column:

df['Name'] = df['Name'].fillna('Mr.')
df['Gender'] = np.where(df['Name'].str.contains(r'Mr\.'), 'Male', 'Female')
df = df.drop('Name', axis=1)
df
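The escaped dot matters because str.contains treats its pattern as a regular expression by default. The sketch below, on a made-up three-name Series, compares an unescaped pattern with three ways of matching the dot literally; which one to prefer is a matter of taste.

```python
import pandas as pd

# Made-up sample covering the three titles from the question.
names = pd.Series(["Mr. Gordon Hemmings", "Miss Jane Wilkins", "Mrs. Audrey North"])

# Unescaped: "." matches any character, so "Mrs" also matches "Mr."
unescaped = names.str.contains("Mr.")

# Escaped dot in a raw string: only a literal "Mr." matches.
escaped = names.str.contains(r"Mr\.")

# Same result without regex handling at all.
literal = names.str.contains("Mr.", regex=False)

# Literal prefix test, if the title is always at the start.
prefix = names.str.startswith("Mr.")

print(unescaped.tolist())  # [True, False, True]
print(escaped.tolist())    # [True, False, False]
```

With the unescaped pattern every "Mrs." row is counted as "Mr.", so those rows would end up labelled Male.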

Full example:

import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': {0: 'Mr. Gordon Hemmings',
  1: 'Miss Jane Wilkins',
  2: 'Mrs. Audrey North',
  3: 'Mrs. Wanda Sharp',
  4: 'Mr. Victor Hemmings'},
 'Value': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1}})
print(df)
# Treat a missing name as 'Mr.', as in the question
df['Name'] = df['Name'].fillna('Mr.')
# The raw string keeps the backslash, so the dot is matched literally
df['Gender'] = np.where(df['Name'].str.contains(r'Mr\.'), 'Male', 'Female')
df = df.drop('Name', axis=1)
print('\n')
print(df)

Output:

                  Name  Value
0  Mr. Gordon Hemmings      1
1    Miss Jane Wilkins      1
2    Mrs. Audrey North      1
3     Mrs. Wanda Sharp      1
4  Mr. Victor Hemmings      1


   Value  Gender
0      1    Male
1      1  Female
2      1  Female
3      1  Female
4      1    Male
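As an alternative sketch (not the approach above), the first token that the question already extracts with str.split().str[0] could be mapped through a lookup table. The title_to_gender dict is a hypothetical helper built from the three titles counted in the question, and this variant leaves a missing name as missing instead of assuming 'Mr.'.

```python
import numpy as np
import pandas as pd

# Made-up sample, including one missing name as in the question's counts.
df = pd.DataFrame({"Name": ["Mr. Gordon Hemmings", "Miss Jane Wilkins",
                            "Mrs. Audrey North", np.nan]})

# Hypothetical lookup table; the keys are the titles seen in the question.
title_to_gender = {"Mr.": "Male", "Mrs.": "Female", "Miss": "Female"}

# First whitespace-separated token of each name (NaN stays NaN).
titles = df["Name"].str.split().str[0]

# Titles not in the table, and missing names, map to NaN.
df["Gender"] = titles.map(title_to_gender)
print(df["Gender"].tolist())
```

An explicit table makes unexpected titles show up as missing values rather than silently falling into the 'Female' branch.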

edited Apr 1, 2021 at 12:17

answered Apr 1, 2021 at 11:46


2 Comments

jps17183 Over a year ago

That didn't work... I get this from df['Gender'].value_counts(): Male 4289, Female 711 (Name: Gender, dtype: int64). But that is wrong... It seems it only differentiated Miss, when it should return True only for "Mr." and False otherwise.

David Erickson Over a year ago

@jps17183 I forgot that . is a regex character, so you need to escape it with \.
