pandas DataFrame conditional string split

Question

I have a column of influenza virus names within my DataFrame. Here is a representative sampling of the name formats present:

(A/Egypt/84/2001(H1N2))
A/Brazil/1759/2004(H3N2)
A/Argentina/126/2004

I am only interested in getting out A/COUNTRY/NUMBER/YEAR from the strain names, e.g. A/Brazil/1759/2004. I have tried doing:

df['Strain Name'] = df['Original Name'].str.split("(")

However, if I take .str[0] from the result, I lose case 1 (the name starts with a parenthesis, so the first piece is empty). If I take .str[1], I lose cases 2 and 3.
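To make the problem concrete, here is a small sketch of what the split produces for the three sample names. The DataFrame is rebuilt by hand from the examples above, so treat it as an illustration rather than the real data.

```python
import pandas as pd

# The three sample names from the question; the DataFrame itself is illustrative.
df = pd.DataFrame({
    "Original Name": [
        "(A/Egypt/84/2001(H1N2))",
        "A/Brazil/1759/2004(H3N2)",
        "A/Argentina/126/2004",
    ]
})

# Splitting on "(" gives a list of pieces per row.
parts = df["Original Name"].str.split("(")
# row 0 -> ['', 'A/Egypt/84/2001', 'H1N2))']
# row 1 -> ['A/Brazil/1759/2004', 'H3N2)']
# row 2 -> ['A/Argentina/126/2004']

first = parts.str[0]   # row 0 is an empty string
second = parts.str[1]  # row 1 is 'H3N2)' and row 2 is NaN
```

Neither fixed position holds the strain name in every row, which is why a single .str[i] cannot cover all three cases.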

Is there a solution that works for all three cases? Or is there some way to apply a condition in string splits, without iterating over each row in the data frame?

Well, it looks like after futzing around, I found that doing a .apply(lambda x: max(x, key=len)) did the trick, since I was basically looking for the longest string in the split.
– ericmjl, Commented Oct 20, 2014 at 14:12
To moderators: should I leave this post around still? Please feel free to edit/modify/delete anything that isn't useful to the community.
– ericmjl, Commented Oct 20, 2014 at 14:13
What are you splitting the string on? Will your longest-string approach break on viruses like A/New York/107/2003? Depending on where you access the viruses from, this may or may not be a concern, because some databases offer the option to replace spaces with "_".
– iayork, Commented Oct 20, 2014 at 14:18
@iayork: I'm splitting the string on an opening parenthesis "(", because I wanted to remove the parentheses from the strain name.
– ericmjl, Commented Oct 20, 2014 at 14:23
@EdChum: Thanks for the note. I will post an answer + code, according to your recommendations.
– ericmjl, Commented Oct 20, 2014 at 14:23

ericmjl · Accepted Answer · 2014-10-20 14:27:54Z


So, based on EdChum's recommendation, I'll post my answer here.

Minimal data frame required for tackling this problem:

Index    Strain Name               Year
0        (A/Egypt/84/2001(H1N2))   2001
1        A/Brazil/1759/2004(H3N2)  2004
2        A/Argentina/126/2004      2004

Code for getting the strain names only, without parentheses or anything else inside the parentheses:

df['Strain Name'] = df['Strain Name'].str.split('(').apply(lambda x: max(x, key=len))

This code works for the particular cases spelled out here because, after splitting on the opening parenthesis ("("), the isolate's strain name is always the longest piece.
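For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the data frame above and applies the same line. The extra "A/New York/107/2003" check comes from iayork's comment on the question; it is there only to show that spaces in the name do not affect a split on "(".

```python
import pandas as pd

# Rebuild the minimal data frame shown above.
df = pd.DataFrame({
    "Strain Name": [
        "(A/Egypt/84/2001(H1N2))",
        "A/Brazil/1759/2004(H3N2)",
        "A/Argentina/126/2004",
    ],
    "Year": [2001, 2004, 2004],
})

# Split on "(" and keep the longest piece of each row.
df["Strain Name"] = (
    df["Strain Name"].str.split("(").apply(lambda parts: max(parts, key=len))
)
# -> ['A/Egypt/84/2001', 'A/Brazil/1759/2004', 'A/Argentina/126/2004']

# A name containing a space is unaffected, since the split is only on "(".
spaced = pd.Series(["A/New York/107/2003"])
spaced_clean = spaced.str.split("(").apply(lambda parts: max(parts, key=len))
# -> ['A/New York/107/2003']
```

The approach depends on the name being longer than anything inside the parentheses, so a name followed by a long parenthesized annotation would pick the wrong piece.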


2 Comments

DSM Over a year ago

Something like df["Strain"].str.strip("()").str.split("(").str[0] should also work, I think. Could also write a regex and use extract, although here that's probably more trouble than it's worth.
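A quick sketch of both ideas from this comment, run against the three sample names from the question. The regex is an assumption on my part (the comment does not give one): it captures the first run of characters that are not parentheses.

```python
import pandas as pd

# The three sample names from the question.
names = pd.Series([
    "(A/Egypt/84/2001(H1N2))",
    "A/Brazil/1759/2004(H3N2)",
    "A/Argentina/126/2004",
])

# Option 1: strip parentheses from both ends, then keep the text
# before the first remaining "(".
stripped = names.str.strip("()").str.split("(").str[0]

# Option 2: regex with extract. The pattern is illustrative: it captures
# the first run of characters that are not "(" or ")".
extracted = names.str.extract(r"([^()]+)", expand=False)

# Both give ['A/Egypt/84/2001', 'A/Brazil/1759/2004', 'A/Argentina/126/2004']
```

Unlike the longest-piece approach, neither option depends on the relative lengths of the name and the parenthesized subtype.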

ericmjl Over a year ago

I will, but I have to wait 2 days in order for that to happen. Thanks though, Ed!
