Pandas: Add new column with several values to groupby dataframe

Question

For every unique value in one column of my dataframe, I want to add the same set of entries from a new column. The new column consists of several datetime entries, and every unique value of the other column should get all of them.

Example:

Original Df:

New Column DF:

Date
2015/01/01
2015/02/01
2015/03/01

Resulting Df:

ID    Date
1     2015/01/01
      2015/02/01
      2015/03/01
2     2015/01/01
      2015/02/01
      2015/03/01
3     2015/01/01
      2015/02/01
      2015/03/01

I tried to follow this solution: https://stackoverflow.com/a/12394122/3856569 but it gives me the following error: Length of values does not match length of index

Does anyone have a simple solution for this? Thanks a lot!

MaxU - stand with Ukraine · Accepted Answer · 2016-04-23 13:33:00Z

1

UPDATE: replicating the IDs 6 times:

In [172]: %paste
import io
import pandas as pd

data = """\
id
1
2
3
"""
df = pd.read_csv(io.StringIO(data))
# tile the list of IDs 6 times, so every ID ends up with 6 rows
df = pd.DataFrame(df['id'].tolist()*6, columns=['id'])

start_date = pd.to_datetime('2015-01-01')

df['date'] = start_date
df['date'] = df.groupby('id', as_index=False)\
               .transform(lambda x: pd.date_range(start_date,
                                                  freq='1D',
                                                  periods=len(x)))
df.sort_values(by=['id','date'])
## -- End pasted text --
Out[172]:
    id       date
0    1 2015-01-01
3    1 2015-01-02
6    1 2015-01-03
9    1 2015-01-04
12   1 2015-01-05
15   1 2015-01-06
1    2 2015-01-01
4    2 2015-01-02
7    2 2015-01-03
10   2 2015-01-04
13   2 2015-01-05
16   2 2015-01-06
2    3 2015-01-01
5    3 2015-01-02
8    3 2015-01-03
11   3 2015-01-04
14   3 2015-01-05
17   3 2015-01-06
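
The replication shown above can also be sketched as a cross join between the frame of IDs and a frame of dates. This assumes pandas 1.2 or newer, where DataFrame.merge accepts how='cross'. The names ids, dates and result are illustrative and do not come from the original answer.

```python
import pandas as pd

# illustrative inputs: unique IDs and the dates every ID should receive
ids = pd.DataFrame({'id': [1, 2, 3]})
dates = pd.DataFrame({'date': pd.date_range('2015-01-01', periods=6, freq='D')})

# cross join: every ID is paired with every date (3 * 6 = 18 rows)
result = ids.merge(dates, how='cross')
```

The rows come out grouped by ID in the order of the left frame, so no extra sorting step should be needed.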

OLD more generic answer:

Prepare a sample DF:

start_date = pd.to_datetime('2015-01-01')

data = """\
id
1
2
2
3
1
2
3
2
1
"""
df = pd.read_csv(io.StringIO(data))

In [200]: df
Out[200]:
   id
0   1
1   2
2   2
3   3
4   1
5   2
6   3
7   2
8   1

Solution:

In [201]: %paste
df['date'] = start_date
df['date'] = df.groupby('id', as_index=False)\
               .transform(lambda x: pd.date_range(start_date,
                                                  freq='1D',
                                                  periods=len(x)))
## -- End pasted text --

In [202]: df
Out[202]:
   id       date
0   1 2015-01-01
1   2 2015-01-01
2   2 2015-01-02
3   3 2015-01-01
4   1 2015-01-02
5   2 2015-01-03
6   3 2015-01-02
7   2 2015-01-04
8   1 2015-01-03

Sorted:

In [203]: df.sort_values(by='id')
Out[203]:
   id       date
0   1 2015-01-01
4   1 2015-01-02
8   1 2015-01-03
1   2 2015-01-01
2   2 2015-01-02
5   2 2015-01-03
7   2 2015-01-04
3   3 2015-01-01
6   3 2015-01-02
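
As a possible alternative to the transform call with a lambda, the same per-group dates can be sketched with groupby().cumcount(), which numbers the rows inside each group. This is only a sketch on the sample data from above; the variable name offset is introduced here for illustration.

```python
import pandas as pd

start_date = pd.to_datetime('2015-01-01')
df = pd.DataFrame({'id': [1, 2, 2, 3, 1, 2, 3, 2, 1]})

# cumcount() numbers the rows inside each id group: 0, 1, 2, ...
offset = df.groupby('id').cumcount()

# shift the start date by that many days, row by row
df['date'] = start_date + pd.to_timedelta(offset, unit='D')
```

This avoids building a date_range per group and keeps the original row order.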

edited Apr 23, 2016 at 13:33

answered Apr 23, 2016 at 12:19


4 Comments

TheDude Over a year ago

Thanks for the reply, but the result is not exactly what I want. In your final data frame, id '2' got one extra date (2015-01-04) assigned, whereas id '3' has one date missing. And is there a way to do this without preparing the original dataframe with several occurrences of each id, like you did? I have just one single occurrence of each id in the dataframe, so I'm not able to assign several dates prior to grouping the dataframe.

MaxU - stand with Ukraine Over a year ago

@TheDude, so you just want to replicate each ID three times and add three consecutive dates to them - correct?

TheDude Over a year ago

Yes, I have consecutive dates (6 different dates in total), which should be assigned to every ID (~50.000 unique values).

MaxU - stand with Ukraine Over a year ago

@TheDude, so you want to have 6*50.000 = 300.000 rows at the end?

lanery · Accepted Answer · 2016-04-24 08:57:36Z

1

A rather straightforward numpy approach, making use of repeat and tile:

import numpy as np
import pandas as pd

N     = 3  # arbitrary number of IDs/dates
ID    = np.arange(N) + 1
dates = pd.date_range('20160101', periods=N)

df = pd.DataFrame({'ID'    : np.repeat(ID, N),
                   'dates' : np.tile(dates, N)})

Resulting DataFrame:

In [1]: df
Out[1]:
   ID      dates
0   1 2016-01-01
1   1 2016-01-02
2   1 2016-01-03
3   2 2016-01-01
4   2 2016-01-02
5   2 2016-01-03
6   3 2016-01-01
7   3 2016-01-02
8   3 2016-01-03

Update

Assuming you already have a DataFrame of IDs, as pointed out by MaxU, you can tile the IDs and repeat the dates, so that every ID is paired with every date:

df = pd.DataFrame({'ID'    : np.tile(df['ID'], N),
                   'dates' : np.repeat(dates, N)})
# now df needs sorting
df = df.sort_values(by=['ID', 'dates'])

Resulting DataFrame:

In [5]: df
Out[5]:
   ID      dates
0   1 2016-01-01
3   1 2016-01-02
6   1 2016-01-03
1   2 2016-01-01
4   2 2016-01-02
7   2 2016-01-03
2   3 2016-01-01
5   3 2016-01-02
8   3 2016-01-03
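
When the number of dates differs from the number of IDs (the comments mention 6 dates for roughly 50.000 IDs), a single N no longer works for both counts. A minimal sketch, assuming the unique IDs sit in a frame called ids (an illustrative name): repeat each ID once per date and tile the dates once per ID.

```python
import numpy as np
import pandas as pd

ids = pd.DataFrame({'ID': [1, 2, 3]})            # existing frame of unique IDs
dates = pd.date_range('2016-01-01', periods=6)   # 6 dates for every ID

df = pd.DataFrame({
    # each ID repeated once per date: 1,1,1,1,1,1,2,2,...
    'ID': np.repeat(ids['ID'].to_numpy(), len(dates)),
    # the whole date sequence repeated once per ID
    'dates': np.tile(dates.to_numpy(), len(ids)),
})
```

Because the IDs are repeated and the dates are tiled, the rows are already grouped by ID and no sorting is required.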

edited Apr 24, 2016 at 8:57

answered Apr 23, 2016 at 17:46


1 Comment

MaxU - stand with Ukraine Over a year ago

I guess your solution will be much faster than mine :). I would also use np.tile(df['id'], N) instead of np.repeat(ID, N), because the OP already has a DF containing the IDs.
