2

I have a pandas df:

  name    time
1  a      1 year 2 months
2  b      4 years 1 month
3  c      3 years 1 month

I want to end with:

  name    years   months
1  a      1       2
2  b      4       1
3  c      3       1

I can get as far as:

  name    time
1  a      [1, 2]
2  b      [4, 1]
3  c      [3, 1]

but I can't figure out how to split the lists into columns.

2 Answers 2

4
df = pd.DataFrame({'name': ['a', 'b', 'c'], 
                   'time': ['1 year 2 months', '4 years 1 month', '3 years 1 month']})

# Split the time column and take the first and third elements to extract the values.
df[['years', 'months']] = df.time.str.split(expand=True).iloc[:, [0, 2]].astype(int)

>>> df
   name             time  years months
0     a  1 year 2 months      1      2
1     b  4 years 1 month      4      1
2     c  3 years 1 month      3      1

You can use del df['time'] when you're ready to drop that column.

Sign up to request clarification or add additional context in comments.

4 Comments

Absolutely works. Minor issue with my data, by using this I found that there are a few entries where time is "1-11 months" without a year before them. Any thoughts on how to handle that?
Add this as a preprocessing step: mask = df.time.str.contains('years') df.time.loc[~mask] = '0 years ' + df.time.loc[~mask]
Works perfect. Thank you very much, I've learnt a ton with this project.
Just noticed something, this solution doesn't take into account the 1 year vs 2+ years case. Just changed the mask to 'year' instead of 'years' and it works.
2

You could use str.findall to find digits in your time columns and then with str.join and str.split you could get your result:

In [240]: df.time.str.findall('\d').str.join('_').str.split('_', expand=True)
Out[240]:
   0  1
0  1  2
1  4  1
2  3  1

df[['years', 'months']] = df.time.str.findall('\d').str.join('_').str.split('_', expand=True)

In [245]: df
Out[245]:
  name             time years months
0    a  1 year 2 months     1      2
1    b  4 years 1 month     4      1
2    c  3 years 1 month     3      1

It's a bit faster then @Alexander's solution, and I think more general. From timing:

In [6]: %timeit df.time.str.split(expand=True).iloc[:, [0, 2]]
1000 loops, best of 3: 1.6 ms per loop

In [8]: %timeit df.time.str.findall('\d').str.join('_').str.split('_', expand=True)
1000 loops, best of 3: 1.43 ms per loop

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.