I have a column of influenza virus names within my DataFrame. Here is a representative sampling of the name formats present:

  1. (A/Egypt/84/2001(H1N2))
  2. A/Brazil/1759/2004(H3N2)
  3. A/Argentina/126/2004

I am only interested in getting out A/COUNTRY/NUMBER/YEAR from the strain names, e.g. A/Brazil/1759/2004. I have tried doing:

df['Strain Name'] = df['Original Name'].str.split("(")

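However, if I then take .str[0], case 1 breaks, because that name starts with "(" and so the first piece is an empty string. If I take .str[1], cases 2 and 3 break: case 2 yields the subtype fragment "H3N2)", and case 3 has no second piece at all.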
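To make the failure concrete, here is a small sketch of what the split produces for the three sample names. The Series name `names` and the variables `pieces`, `first`, and `second` are only illustrative and are not part of my real data frame.

```python
import pandas as pd

# The three representative name formats from the list above.
names = pd.Series([
    "(A/Egypt/84/2001(H1N2))",
    "A/Brazil/1759/2004(H3N2)",
    "A/Argentina/126/2004",
])

# Split each name on the opening parenthesis; every element becomes a list.
pieces = names.str.split("(")
print(pieces.tolist())

# Taking a fixed position fails for at least one of the formats.
first = pieces.str[0]   # case 1 gives an empty string
second = pieces.str[1]  # case 2 gives "H3N2)", case 3 gives NaN
print(first.tolist())
print(second.tolist())
```

No single fixed index selects the strain name in all three rows, which is why a position-independent rule is needed.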

Is there a solution that works for all three cases? Or is there some way to apply a condition in string splits, without iterating over each row in the data frame?

  • Well, it looks like after futzing around, I found that doing a .apply(lambda x: max(x, key=len)) did the trick, since I was basically looking for the longest string in the split. Commented Oct 20, 2014 at 14:12
  • To moderators: should I leave this post around still? Please feel free to edit/modify/delete anything that isn't useful to the community. Commented Oct 20, 2014 at 14:13
  • What are you splitting the string on? Will your longest-string approach break on viruses like A/New York/107/2003? Depending on where you access the viruses from, this may or may not be a concern, because some databases offer the option to replace spaces with "_". Commented Oct 20, 2014 at 14:18
  • @iayork: I'm splitting the string on an opening parenthesis "(", because I wanted to remove the parentheses from the strain name. Commented Oct 20, 2014 at 14:23
  • @EdChum: Thanks for the note. I will post an answer + code, according to your recommendations. Commented Oct 20, 2014 at 14:23

1 Answer

So, based on EdChum's recommendation, I'll post my answer here.

Minimal data frame required for tackling this problem:

Index    Strain Name               Year
0        (A/Egypt/84/2001(H1N2))   2001
1        A/Brazil/1759/2004(H3N2)  2004
2        A/Argentina/126/2004      2004

Code for getting the strain names only, without parentheses or anything else inside the parentheses:

df['Strain Name'] = df['Strain Name'].str.split('(').apply(lambda x: max(x, key=len))

This works for the cases shown here because, after splitting on the opening parenthesis "(", the isolate's strain name is always the longest piece. It is not a general solution: it would pick the wrong piece if the parenthesised part were ever longer than the name itself.
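For anyone who wants to try it, here is a self-contained sketch that rebuilds the minimal data frame above and applies the same one-liner. The helper name `longest_piece` is just an illustrative stand-in for the lambda, and the sketch assumes the column has no missing values, since `len` would fail on NaN.

```python
import pandas as pd

# Rebuild the minimal data frame shown above.
df = pd.DataFrame({
    "Strain Name": [
        "(A/Egypt/84/2001(H1N2))",
        "A/Brazil/1759/2004(H3N2)",
        "A/Argentina/126/2004",
    ],
    "Year": [2001, 2004, 2004],
})


def longest_piece(pieces):
    """Return the longest string in a list of split pieces."""
    return max(pieces, key=len)


# Split on "(" and keep the longest piece of each row.
df["Strain Name"] = df["Strain Name"].str.split("(").apply(longest_piece)

print(df["Strain Name"].tolist())
# ['A/Egypt/84/2001', 'A/Brazil/1759/2004', 'A/Argentina/126/2004']
```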

2 Comments

Something like df["Strain"].str.strip("()").str.split("(").str[0] should also work, I think. You could also write a regex and use extract, although here that's probably more trouble than it's worth.
I will, but I have to wait 2 days in order for that to happen. Thanks though, Ed!
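For reference, the two alternatives suggested in the first comment can be sketched as follows. The strip-then-split line is taken from the comment; the regular expression is only an assumption about the name format (type A, exactly four slash-separated fields, a four-digit year) and would need adjusting for names that do not follow it.

```python
import pandas as pd

names = pd.Series([
    "(A/Egypt/84/2001(H1N2))",
    "A/Brazil/1759/2004(H3N2)",
    "A/Argentina/126/2004",
])

# Alternative 1: strip parentheses from both ends, then keep whatever
# comes before the first remaining "(".
via_strip = names.str.strip("()").str.split("(").str[0]

# Alternative 2: pull out the name with a regex.
# Assumed pattern: "A/", two fields without "/" or parentheses,
# then a four-digit year.
pattern = r"(A/[^/()]+/[^/()]+/\d{4})"
via_regex = names.str.extract(pattern, expand=False)

print(via_strip.tolist())
print(via_regex.tolist())
```

Both variants avoid the longest-piece assumption, and the regex also accepts locations containing a space, such as A/New York/107/2003.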
