Splitting lists into columns

Question

I have a pandas df:

  name    time
1  a      1 year 2 months
2  b      4 years 1 month
3  c      3 years 1 month

I want to end with:

  name    years   months
1  a      1       2
2  b      4       1
3  c      3       1

I can get as far as:

  name    time
1  a      [1, 2]
2  b      [4, 1]
3  c      [3, 1]

but I can't figure out how to split the lists into columns.

Alexander · Accepted Answer · 2016-03-31 15:42:56Z

import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'c'],
                   'time': ['1 year 2 months', '4 years 1 month', '3 years 1 month']})

# Split the time column and take the first and third elements to extract the values.
df[['years', 'months']] = df.time.str.split(expand=True).iloc[:, [0, 2]].astype(int)

>>> df
   name             time  years months
0     a  1 year 2 months      1      2
1     b  4 years 1 month      4      1
2     c  3 years 1 month      3      1

You can use del df['time'] when you're ready to drop that column.

edited Mar 31, 2016 at 15:42

answered Mar 31, 2016 at 0:53

Alexander

4 Comments

M Arroyo Over a year ago

Absolutely works. Minor issue with my data, by using this I found that there are a few entries where time is "1-11 months" without a year before them. Any thoughts on how to handle that?

Alexander Over a year ago

Add this as a preprocessing step: mask = df.time.str.contains('years'); df.loc[~mask, 'time'] = '0 years ' + df.loc[~mask, 'time']

M Arroyo Over a year ago

Works perfect. Thank you very much, I've learnt a ton with this project.

M Arroyo Over a year ago

Just noticed something, this solution doesn't take into account the 1 year vs 2+ years case. Just changed the mask to 'year' instead of 'years' and it works.
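As an alternative to prepending '0 years ', the cases raised in these comments (no year part, singular vs. plural units) could be handled in one pass with str.extract and optional named groups. This is only a sketch: the row 'd' with '11 months' and the regex pattern are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Sample data; row 'd' (no year part) is an assumed example of the case
# described in the comments.
df = pd.DataFrame({'name': ['a', 'b', 'c', 'd'],
                   'time': ['1 year 2 months', '4 years 1 month',
                            '3 years 1 month', '11 months']})

# Optional "N year(s)" part followed by an optional "N month(s)" part.
pattern = r'(?:(?P<years>\d+)\s*years?)?\s*(?:(?P<months>\d+)\s*months?)?'
parts = df['time'].str.extract(pattern)

# A part that is absent comes back as missing; treat it as '0',
# then convert both columns to integers.
parts = parts.fillna('0').astype(int)
df['years'] = parts['years']
df['months'] = parts['months']
```

Because both groups are optional and the unit's trailing 's' is optional, the same pattern covers '1 year', '2 years', and entries with months only.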

Anton Protopopov · Accepted Answer · 2016-03-31 07:52:03Z

You could use str.findall to find the numbers in your time column, then str.join and str.split to get your result (the pattern \d+ keeps multi-digit values such as 11 together):

In [240]: df.time.str.findall(r'\d+').str.join('_').str.split('_', expand=True)
Out[240]:
   0  1
0  1  2
1  4  1
2  3  1

df[['years', 'months']] = df.time.str.findall(r'\d+').str.join('_').str.split('_', expand=True)

In [245]: df
Out[245]:
  name             time years months
0    a  1 year 2 months     1      2
1    b  4 years 1 month     4      1
2    c  3 years 1 month     3      1
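The question itself starts from a column that already holds lists, and the columns produced above are strings rather than integers. A minimal sketch of expanding such a list column directly, assuming every list has exactly two entries (the names lists and expanded are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'c'],
                   'time': ['1 year 2 months', '4 years 1 month', '3 years 1 month']})

# Rebuild the intermediate state from the question: each cell is a list of ints.
# r'\d+' keeps multi-digit numbers together.
lists = df['time'].str.findall(r'\d+').apply(lambda xs: [int(x) for x in xs])

# Expand the lists into columns, reusing the original index so rows line up.
expanded = pd.DataFrame(lists.tolist(), index=df.index, columns=['years', 'months'])
df = df.drop(columns='time').join(expanded)
```

If some rows had fewer than two numbers, the DataFrame constructor would reject the fixed column names, so ragged data would need the preprocessing discussed in the comments on the other answer first.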

It's a bit faster than @Alexander's solution, and I think more general. From timing:

In [6]: %timeit df.time.str.split(expand=True).iloc[:, [0, 2]]
1000 loops, best of 3: 1.6 ms per loop

In [8]: %timeit df.time.str.findall('\d').str.join('_').str.split('_', expand=True)
1000 loops, best of 3: 1.43 ms per loop
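On a three-row frame, timings like these are likely dominated by fixed overhead, so the gap may not carry over to real data. A sketch for rerunning the comparison on a larger frame with the standard timeit module; the 3000-row size, the repeat count, and the function names are arbitrary assumptions, and no particular result is implied:

```python
import timeit
import pandas as pd

base = pd.DataFrame({'time': ['1 year 2 months', '4 years 1 month', '3 years 1 month']})
# Repeat the sample rows to get a larger frame (size chosen arbitrarily).
big = pd.concat([base] * 1000, ignore_index=True)

def split_approach():
    # Split on whitespace and keep the first and third tokens.
    return big['time'].str.split(expand=True).iloc[:, [0, 2]]

def findall_approach():
    # Find digits, join them with '_', then split into columns.
    return big['time'].str.findall(r'\d').str.join('_').str.split('_', expand=True)

t_split = timeit.timeit(split_approach, number=5)
t_findall = timeit.timeit(findall_approach, number=5)
print(f'split:   {t_split:.4f} s for 5 runs')
print(f'findall: {t_findall:.4f} s for 5 runs')
```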
