Pandas Split column into multiple columns by multiple string delimiters

Question

I have a dataframe:

id    info
1     Name: John Age: 12 Sex: Male
2     Name: Sara Age: 22 Sex: Female
3     Name: Mac Donald Age: 32 Sex: Male

I'm looking to split the info column into 3 columns so that I get this final output:

id  Name        Age  Sex
1   John        12   Male
2   Sara        22   Female
3   Mac Donald  32   Male

I tried using pandas split function.

df[['Name','Age','Sex']] = df.info.split(['Name'])

I might have to do this multiple times to get desired output.

Is there a better way to achieve this?

PS: The info column also contains NaN values

Rakesh · Accepted Answer · 2020-09-14 07:05:30Z

6

Using Regex with named groups.

Ex:

import pandas as pd

df = pd.DataFrame({"Col": ['Name: John Age: 12 Sex: Male', 'Name: Sara Age: 22 Sex: Female', 'Name: Mac Donald Age: 32 Sex: Male']})
# Named groups become the column names of the result.
# If the spacing is standard, a simpler pattern also works:
# r"Name: (?P<Name>[A-Za-z\s]+) Age: (?P<Age>\d+) Sex: (?P<Sex>Male|Female)"
df = df['Col'].str.extract(r"Name:\s*(?P<Name>[A-Za-z\s]+)\s*Age:\s*(?P<Age>\d+)\s*Sex:\s*(?P<Sex>Male|Female)")
print(df)

Output:

          Name Age     Sex
0        John   12    Male
1        Sara   22  Female
2  Mac Donald   32    Male


1 Comment

David Erickson Over a year ago

This solution leaves trailing whitespace in the Name column. df['Name'].to_list() returns: ['John ', 'Sara ', 'Mac Donald ']. A simple str.strip() fixes this; tightening the regex instead would make it more complicated.
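A minimal sketch of the str.strip() fix suggested in this comment, applied to the pattern from the answer. The column name info and the extra None row (standing in for the NaN values the question mentions) are assumptions made for illustration:

```python
import pandas as pd

# Sample data from the question, plus one missing value (assumed for illustration).
df = pd.DataFrame({"info": [
    "Name: John Age: 12 Sex: Male",
    "Name: Sara Age: 22 Sex: Female",
    "Name: Mac Donald Age: 32 Sex: Male",
    None,
]})

# Same pattern as in the answer; the greedy Name group keeps the space before "Age:".
pattern = r"Name:\s*(?P<Name>[A-Za-z\s]+)\s*Age:\s*(?P<Age>\d+)\s*Sex:\s*(?P<Sex>Male|Female)"
extracted = df["info"].str.extract(pattern)

# Remove the trailing whitespace; rows that did not match (NaN) stay NaN.
extracted["Name"] = extracted["Name"].str.strip()
print(extracted)
```

Rows where info is missing come back as NaN in every extracted column, so no special handling should be needed for them here.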

David Erickson · Accepted Answer · 2020-09-14 07:13:33Z

2

The regex is pretty tough to write and read, so instead you could drop 'Name: ' and replace ' Age: ' and ' Sex: ' with commas, which marks where the text should be separated into new columns. Then use str.split() and pass expand=True. Assign the result back to three new columns with df[['Name', 'Age', 'Sex']]:

df[['Name', 'Age', 'Sex']] = (df['info'].replace(['Name: ', ' Age: ', ' Sex: '], ['',',',','], regex=True)
                              .str.split(',', expand=True))
df

Out[1]: 
   id                                info        Name Age     Sex
0   1        Name: John Age: 12 Sex: Male        John  12    Male
1   2      Name: Sara Age: 22 Sex: Female        Sara  22  Female
2   3  Name: Mac Donald Age: 32 Sex: Male  Mac Donald  32    Male
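Since the question mentions NaN values in info, here is a small sketch of how the same replace-and-split approach behaves with a missing row. The fourth row and the pd.to_numeric step are additions for illustration, not part of the answer:

```python
import pandas as pd

# Data from the question, plus one missing value (assumed for illustration).
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "info": [
        "Name: John Age: 12 Sex: Male",
        "Name: Sara Age: 22 Sex: Female",
        "Name: Mac Donald Age: 32 Sex: Male",
        None,
    ],
})

# Drop the first label, turn the other two into commas, then split on commas.
cleaned = df["info"].replace(["Name: ", " Age: ", " Sex: "], ["", ",", ","], regex=True)
df[["Name", "Age", "Sex"]] = cleaned.str.split(",", expand=True)

# Optional: Age comes back as text, so convert it to a number (missing stays NaN).
df["Age"] = pd.to_numeric(df["Age"])
print(df)
```

Note that this relies on the values themselves never containing a comma; if they might, a separator that cannot occur in the data would be a safer choice.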


Vishnudev Krishnadas · Accepted Answer · 2020-09-14 08:49:37Z

2

A quick one-liner:

df[['Name', 'Age', 'Sex']] = df['info'].str.split(r'\s?\w+:\s?', expand=True).iloc[:, 1:]

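This splits on each "someword:" label and assigns the pieces to new columns. The first piece is empty because the string starts with a label, so iloc[:, 1:] drops it.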
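A hedged sketch of the same split with a defensive str.strip() on each piece, in case the spacing around the labels is irregular. The None row, the parts variable, and the join step are assumptions added for illustration:

```python
import pandas as pd

# Data from the question, plus one missing value (assumed for illustration).
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "info": [
        "Name: John Age: 12 Sex: Male",
        "Name: Sara Age: 22 Sex: Female",
        "Name: Mac Donald Age: 32 Sex: Male",
        None,
    ],
})

# Split on every "label:" token; column 0 is the empty text before "Name:".
parts = df["info"].str.split(r"\s?\w+:\s?", expand=True).iloc[:, 1:]

# Strip each piece so stray spaces around the labels do not end up in the values.
parts = parts.apply(lambda col: col.str.strip())
parts.columns = ["Name", "Age", "Sex"]

df = df.join(parts)
print(df)
```

The row with a missing info value stays missing in all three new columns.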

edited Sep 14, 2020 at 8:49

answered Sep 14, 2020 at 7:18


8 Comments

David Erickson Over a year ago

Aaah, I forgot about \w+. Nice! However, this creates whitespace, so you either need to do str.strip() or improve the regex. For starters, you can split on '\w+: ' (with a trailing space) instead of '\w+:', but that only gets rid of one of the two spaces.

Vishnudev Krishnadas Over a year ago

Strip won't help and this doesn't create whitespace. The extra column is due to the fact that the string starts with the split string. @DavidErickson

David Erickson Over a year ago

If you do df['Name'].to_list() or df['Age'].to_list(), you will see that there is a ' ' before and after each string

Vishnudev Krishnadas Over a year ago

Oh! You mean after I get the result. Yeah, strip should take care of that.

David Erickson Over a year ago

Yes, for example df['Age'].to_list() returns: [' 12 ', ' 22 ', ' 32 ']. There is a similar issue in Rakesh's answer, but only for the Name column.


paytam · Accepted Answer · 2020-09-14 07:33:06Z

0

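  import re
  import pandas as pd

  def process_row(row):
      # Use row['info'], not row.info: in recent pandas, Series.info is a method.
      info = row['info']
      if pd.isna(info):
          return row  # leave Name/Age/Sex empty when info is missing
      # Split on the "label:" tokens rather than on spaces, so that
      # names containing a space ("Mac Donald") stay in one piece.
      items = re.split(r'\s*\w+:\s*', info.strip())
      row['Name'] = items[1]
      row['Age'] = items[2]
      row['Sex'] = items[3]
      return row

  df = pd.DataFrame({"info": ['Name: John Age: 12 Sex: Male', 'Name: Sara Age: 22 Sex: Female',
                              'Name: Mac Donald Age: 32 Sex: Male']})
  df['Name'] = pd.NA  # empty cell
  df['Age'] = pd.NA  # empty cell
  df['Sex'] = pd.NA  # empty cell

  df[['info', 'Name', 'Age', 'Sex']] = df.apply(process_row, axis=1, result_type="expand")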
