
I have a DataFrame df that looks like this:

df:

id   dob
1    7/31/2018
2    6/1992

I want to generate 88799 random dates for the dob column of the DataFrame, between 1960-01-01 and 1990-12-31, in the format mm/dd/yyyy with no timestamp.

How would I do this?

I tried:

date1 = (1960,01,01)
date2 = (1990,12,31)

for i range(date1,date2):
    df.dob = i

1 Answer

I would work out how many days are in your date range, draw 88799 random integers below that number, and add them to your minimum date as timedeltas with unit='d':

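import numpy as np
import pandas as pd

min_date = pd.to_datetime('1960-01-01')
max_date = pd.to_datetime('1990-12-31')

# number of distinct dates in the range, both ends included
d = (max_date - min_date).days + 1

# pd.np (a pandas alias for NumPy) was removed in pandas 2.0, so call NumPy directly
df['dob'] = min_date + pd.to_timedelta(np.random.randint(d, size=88799), unit='d')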

>>> df.head()
         dob
0 1963-03-05
1 1973-06-07
2 1970-08-24
3 1970-05-03
4 1971-07-03

>>> df.tail()
             dob
88794 1965-12-10
88795 1968-08-09
88796 1988-04-29
88797 1971-07-27
88798 1980-08-03
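
The same idea can be wrapped in a small helper. This is a minimal sketch, assuming NumPy's Generator API (np.random.default_rng) in place of the legacy np.random.randint; the function name random_dates and its seed parameter are hypothetical, introduced here only for illustration.

```python
import numpy as np
import pandas as pd


def random_dates(start, end, n, seed=None):
    """Return n random dates between start and end, both inclusive.

    Hypothetical helper, not part of pandas.
    """
    start = pd.Timestamp(start)
    end = pd.Timestamp(end)
    n_days = (end - start).days + 1            # distinct dates in the range
    rng = np.random.default_rng(seed)          # a fixed seed makes the draw reproducible
    offsets = rng.integers(0, n_days, size=n)  # upper bound is exclusive
    return start + pd.to_timedelta(offsets, unit="D")


dates = random_dates("1960-01-01", "1990-12-31", 88799, seed=0)
df = pd.DataFrame({"dob": dates})
```

Passing a seed is optional; it is only there so that repeated runs produce the same dates.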

EDIT: You can format your dates using .strftime('%m/%d/%Y'). Note that this turns the values into strings rather than datetimes, and that it slows down execution significantly:

# pd.np was removed in pandas 2.0; this needs NumPy imported directly (import numpy as np)
df['dob'] = (min_date + pd.to_timedelta(np.random.randint(d, size=88799), unit='d')).strftime('%m/%d/%Y')

>>> df.head()
          dob
0  02/26/1969
1  04/09/1963
2  08/29/1984
3  02/12/1961
4  08/02/1988

>>> df.tail()
              dob
88794  02/13/1968
88795  02/05/1982
88796  07/03/1964
88797  06/11/1976
88798  11/17/1965
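
If the dates are still needed for sorting or date arithmetic, one option is to keep the datetime column and derive the formatted text separately. The following is a sketch under that assumption; the column name dob_str is hypothetical, and the sample is cut to five rows to keep it short.

```python
import numpy as np
import pandas as pd

min_date = pd.Timestamp("1960-01-01")
n_days = (pd.Timestamp("1990-12-31") - min_date).days + 1

rng = np.random.default_rng(0)
df = pd.DataFrame(
    {"dob": min_date + pd.to_timedelta(rng.integers(0, n_days, size=5), unit="D")}
)

# Keep the datetime column for sorting and arithmetic,
# and derive a string column only for display or export.
df["dob_str"] = df["dob"].dt.strftime("%m/%d/%Y")

print(df.dtypes)
```

Here dob stays a datetime column while dob_str holds mm/dd/yyyy text, so the formatting cost is paid only where the string form is actually required.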

4 Comments

Could use strftime to format the date as OP asked
@sacul Thank you. How could I format the date on the fly?
@sacuL. Could I please check a couple of points? In the line with pd.np.random.randint, do we need the pd. prefix, or could we just write np.random.randint? I couldn't see any difference in my result either way. Also, for the line d = (max_date - min_date).days + 1, can you explain the use of .days here? I understand that we are using days as the unit of time (hence unit='d' later in the code), but I don't fully understand why I need .days when d is just the maximum integer value for randint. My code fails if I don't include it. Many thanks
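@mmTmmR pd.np.random.randint is there just so you don't have to import numpy explicitly with import numpy as np; it is exactly the same as np.random.randint if you have already imported numpy. (pd.np was deprecated in pandas 1.0 and removed in 2.0, so on current versions you need np.random.randint.) As for d = (max_date - min_date).days + 1: max_date - min_date is a Timedelta, and randint needs a plain integer as its upper bound, so .days gives the number of days in that Timedelta as an integer. The + 1 is there because randint excludes its upper bound; without it, max_date could never be drawn.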
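
The role of .days and the + 1 can be checked directly. This is a small illustrative sketch; the variable names span and offsets are introduced here and are not from the answer.

```python
import numpy as np
import pandas as pd

min_date = pd.Timestamp("1960-01-01")
max_date = pd.Timestamp("1990-12-31")

span = max_date - min_date   # a pandas Timedelta, not an integer
d = span.days + 1            # plain int: number of dates in the range, ends included

# randint(d) draws from 0 up to d - 1, so the largest possible
# offset lands exactly on max_date.
offsets = np.random.randint(d, size=1000)

print(type(span).__name__, span.days, d)
print(min_date + pd.Timedelta(days=d - 1))
```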
