5

I have a dataframe:

id    info
1     Name: John Age: 12 Sex: Male
2     Name: Sara Age: 22 Sex: Female
3     Name: Mac Donald Age: 32 Sex: Male

I'm looking to split the info column into 3 columns so that I get this final output:

id  Name        Age  Sex
1   John        12   Male
2   Sara        22   Female
3   Mac Donald  32   Male

I tried using pandas split function.

df[['Name','Age','Sex']] = df.info.split(['Name'])

I might have to do this multiple times to get desired output.

Is there a better way to achieve this?

PS: The info column also contains NaN values

4 Answers

6

Using Regex with named groups.

Ex:

import pandas as pd
df = pd.DataFrame({"Col": ['Name: John Age: 12 Sex: Male', 'Name: Sara Age: 22 Sex: Female', 'Name: Mac Donald Age: 32 Sex: Male']})
df = df['Col'].str.extract(r"Name:\s*(?P<Name>[A-Za-z\s]+)\s*Age:\s*(?P<Age>\d+)\s*Sex:\s*(?P<Sex>Male|Female)")
# Or, if spacing is standard, use:
# df['Col'].str.extract(r"Name: (?P<Name>[A-Za-z\s]+) Age: (?P<Age>\d+) Sex: (?P<Sex>Male|Female)")
print(df)

Output:

          Name Age     Sex
0        John   12    Male
1        Sara   22  Female
2  Mac Donald   32    Male

1 Comment

This solution leaves trailing whitespace in the Name column. If you do df['Name'].to_list(), you get: ['John ', 'Sara ', 'Mac Donald ']. A simple str.strip() fixes this; tightening the regex instead would make it more complicated.
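A minimal sketch of the two fixes this comment alludes to, assuming the answer's sample data plus one missing row (the question says the column contains NaN). Option 1 keeps the answer's pattern and strips afterwards. Option 2 makes the Name group lazy with `+?`; that variant is not part of the original answer, just one way the regex could be tightened.

```python
import pandas as pd

df = pd.DataFrame({"Col": ['Name: John Age: 12 Sex: Male',
                           'Name: Sara Age: 22 Sex: Female',
                           'Name: Mac Donald Age: 32 Sex: Male',
                           None]})  # None stands in for a NaN row (assumption)

# Option 1: the answer's pattern, then strip the trailing space from Name
pattern = r"Name:\s*(?P<Name>[A-Za-z\s]+)\s*Age:\s*(?P<Age>\d+)\s*Sex:\s*(?P<Sex>Male|Female)"
stripped = df['Col'].str.extract(pattern)
stripped['Name'] = stripped['Name'].str.strip()

# Option 2: a lazy Name group stops before the whitespace that precedes "Age:"
lazy_pattern = r"Name:\s*(?P<Name>[A-Za-z\s]+?)\s*Age:\s*(?P<Age>\d+)\s*Sex:\s*(?P<Sex>Male|Female)"
lazy = df['Col'].str.extract(lazy_pattern)

print(stripped['Name'].to_list())
print(lazy['Name'].to_list())
```

In both cases the missing row simply comes back as NaN in every extracted column.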
2

The regex is pretty tough to write and read, so you could instead replace each label with a comma at the points where you want to separate into new columns, then use str.split() with expand=True. Assign the result back to three new columns with df[['Name', 'Age', 'Sex']]:

df[['Name', 'Age', 'Sex']] = (df['info'].replace(['Name: ', ' Age: ', ' Sex: '], ['',',',','], regex=True)
                              .str.split(',', expand=True))
df

Out[1]: 
   id                                info        Name Age     Sex
0   1        Name: John Age: 12 Sex: Male        John  12    Male
1   2      Name: Sara Age: 22 Sex: Female        Sara  22  Female
2   3  Name: Mac Donald Age: 32 Sex: Male  Mac Donald  32    Male
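Since the question mentions NaN values in the info column, here is a small sketch of how this approach behaves with a missing row. The fourth row and the pd.to_numeric step are additions for illustration and are not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "info": ['Name: John Age: 12 Sex: Male',
             'Name: Sara Age: 22 Sex: Female',
             'Name: Mac Donald Age: 32 Sex: Male',
             float("nan")],  # hypothetical missing row
})

# Same replace-then-split idea as above; the NaN row passes through untouched
parts = (df['info']
         .replace(['Name: ', ' Age: ', ' Sex: '], ['', ',', ','], regex=True)
         .str.split(',', expand=True))
df[['Name', 'Age', 'Sex']] = parts

# The split pieces are strings, so convert Age if numbers are needed
df['Age'] = pd.to_numeric(df['Age'])
print(df[['id', 'Name', 'Age', 'Sex']])
```

The missing row ends up as NaN in all three new columns, so no special casing should be needed before the split.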


2

A quick one-liner can be:

df[['Name', 'Age', 'Sex']] = df['info'].str.split(r'\s?\w+:\s?', expand=True).iloc[:, 1:]

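This splits on each "someword:" label and assigns the pieces to new columns. The first piece is empty because the string starts with a label, which is why .iloc[:, 1:] drops it.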

8 Comments

Aaah, I forgot about \w+. Nice! However, this creates whitespace, so you either need to do str.strip() or improve the regex. For starters, you can include a space in the \w+: pattern, but that only gets rid of the last space.
Strip won't help and this doesn't create whitespace. The extra column is due to the fact that the string starts with the split string. @DavidErickson
If you do df['Name'].to_list() or df['Age'].to_list(), you will see that there is a ' ' before and after each string
Oh! You mean after I get the result. Yeah, strip should take care of that.
Yes, for example df['Age'].to_list() returns: [' 12 ', ' 22 ', ' 32 ']. There is a similar issue in Rakesh's answer, but only for the Name column.
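Whether whitespace shows up depends on whether the split pattern consumes the spaces around each label. A small sketch, assuming a one-row Series `s` made up for illustration, comparing a bare `\w+:` pattern with the `\s?\w+:\s?` pattern:

```python
import pandas as pd

s = pd.Series(['Name: John Age: 12 Sex: Male'])

# Bare label pattern: the spaces around each value are kept
loose = s.str.split(r'\w+:', expand=True).iloc[:, 1:]

# Optional whitespace on both sides of the label is consumed by the split
tight = s.str.split(r'\s?\w+:\s?', expand=True).iloc[:, 1:]

print(loose.iloc[0].tolist())  # [' John ', ' 12 ', ' Male']
print(tight.iloc[0].tolist())  # ['John', '12', 'Male']
```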
0
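import re
import pandas as pd

def process_row(row):
    # Skip missing values; row['info'] avoids clashing with the pandas .info method
    if pd.isna(row['info']):
        return row
    # Split on the "Label:" markers so multi-word names such as "Mac Donald" stay intact
    items = re.split(r'\s*\w+:\s*', row['info'])
    row['Name'] = items[1].strip()
    row['Age'] = items[2].strip()
    row['Sex'] = items[3].strip()
    return row

df = pd.DataFrame({"info": ['Name: John Age: 12 Sex: Male', 'Name: Sara Age: 22 Sex: Female', 'Name: Mac Donald Age: 32 Sex: Male']})
df['Name'] = pd.NA  # empty cell
df['Age'] = pd.NA  # empty cell
df['Sex'] = pd.NA  # empty cell

df[['info', 'Name', 'Age', 'Sex']] = df.apply(process_row, axis=1, result_type="expand")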
