1

Hi I have a dataset in the following format:

enter image description here

Code for replicating the data:

import pandas as pd
d1 = {'Year': 
['2008','2008','2008','2008','2008','2008','2008','2008','2008','2008'],
  'Month':['1','1','2','6','7','8','8','11','12','12'],
'Day':['6','22','6','18','3','10','14','6','16','24'],
'Subject_A':['','30','','','','35','','','',''],
'Subject_B':['','','','','','','','40','',''],
'Subject_C': ['','','','','','65','','50','','']}
d1 = pd.DataFrame(d1)

I input the numbers as a string to show blank cells

Where the first three columns denotes date (Year, Month and Day) and the following columns represent individuals (My actual data file consists of about 300 such rows and about 1000 subjects. I presented a subset of the data here).

Where the column value refers to expenditure on FMCG products. What I would like to do is the following:

Part 1 (Beginning and end points)

a) For each individual locate the first observation and duplicate the value of the first observation for atleast the previous six months. For example: Subject C's 1st observation is on the 10th of August 2008. In that case I would want all the rows from June 10, 2008 to be equal to 65 for Subject C (Roughly 2/12/2008 is the cutoff date. SO we leave the 3rd cell from the top for Subject_C's column blank).

b) Locate last observation and repeat the last observation for the following 3 months. For example for Subject_A, we repeat 35 twice (till 6th November 2008).

Please refer to the following diagram for the highlighted cell with the solutions.

enter image description here

Part II - (Rows in between)

Next I would like to do two things (I would need to do the following three steps separately, not all at one time):

For individuals like Subject_A, locate two observations that come one after the other (30 and 35).

i) Use the average of the two observations. In this case we would have 32.5 in the four rows without caring about time.

for eg:

enter image description here

ii) Find the total time between two observations and take the mean of the time. For the 1st half of the time period assign the first value and for the 2nd half assign the second value. For example - for subject 1, the total days between 01/22/208 and 08/10/2008 is 201 days. For the first 201/2 = 100.5 days assign the value of 30 to Subject_A and for the remaining value assign 35. In this case the columns for Subject_A and Subject_C will look like:

enter image description here

The final dataset will use (a), (b) & (i) or (a), (b) & (ii)

Final data I [using a,b and i]

enter image description here

Final data II [using a,b and ii]

enter image description here

I would appreciate any help with this. Thanks in advance. Please let me know if the steps are unclear.

Follow up question and Issues

Thanks @Juan for the initial answer. Here's my follow up question. Suppose that Subject_A has more than 2 observations (code for the example data below). Would we be able to extend this code to incorporate more than 2 observations?

import pandas as pd
d1 = {'Year': 
['2008','2008','2008','2008','2008','2008','2008','2008','2008','2008'],
  'Month':['1','1','2','6','7','8','8','11','12','12'],
'Day':['6','22','6','18','3','10','14','6','16','24'],
'Subject_A':['','30','','45','','35','','','',''],
'Subject_B':['','','','','','','','40','',''],
'Subject_C': ['','','','','','65','','50','','']}
d1 = pd.DataFrame(d1)

Issues For the current code, I found an issue for part II (ii). This is the output that I get:

enter image description here

This is actually on the right track. The two cells above 35 does not seem to get updated. Is there something wrong on my end? Also the same question as before, would we be able to extend it to the case of >2 observations?

2
  • I think it would be better to split this into three questions and have each question stand alone. Commented Mar 20, 2018 at 21:25
  • I realize this might have been a very long question. I will keep this in mind while asking future questions here. My apologies. I should have asked separate questions for each part. Commented Mar 21, 2018 at 0:23

1 Answer 1

1

Here a code solution for subject A. Should work with the other subjects:

d1 = {'Year': 
['2008','2008','2008','2008','2008','2008','2008','2008','2008','2008'],
  'Month':['1','1','2','6','7','8','8','11','12','12'],
'Day':['6','22','6','18','3','10','14','6','16','24'],
'Subject_A':['','30','','45','','35','','','',''],
'Subject_B':['','','','','','','','40','',''],
'Subject_C': ['','','','','','65','','50','','']}
d1 = pd.DataFrame(d1)
d1 = pd.DataFrame(d1)
## Create a variable named date
d1['date']= pd.to_datetime(d1['Year']+'/'+d1['Month']+'/'+d1['Day'])
# convert to float, to calculate mean
d1['Subject_A'] = d1['Subject_A'].replace('',np.nan).astype(float)
# index of the not null rows
subja = d1['Subject_A'].notnull()

### max and min index row with notnull value
max_id_subja = d1.loc[subja,'date'].idxmax()
min_id_subja = d1.loc[subja,'date'].idxmin()
### max and min date for Sub A with notnull value
max_date_subja = d1.loc[subja,'date'].max()
min_date_subja = d1.loc[subja,'date'].min()
### value for max and min date
max_val_subja = d1.loc[max_id_subja,'Subject_A']
min_val_subja = d1.loc[min_id_subja,'Subject_A']
#### Cutoffs
min_cutoff = min_date_subja-pd.Timedelta(6, unit='M')
max_cutoff = max_date_subja+pd.Timedelta(3, unit='M')

## PART I.a
d1.loc[(d1['date']<min_date_subja) & (d1['date']>min_cutoff),'Subject_A'] = min_val_subja
## PART I.b
d1.loc[(d1['date']>max_date_subja) & (d1['date']<max_cutoff),'Subject_A'] = max_val_subja
## PART II
d1_2i = d1.copy()
d1_2ii = d1.copy()

lower_date = min_date_subja
lower_val = min_val_subja.copy()
next_dates_index = d1_2i.loc[(d1['date']>min_date_subja) & subja].index
for N in next_dates_index:
    next_date = d1_2i.loc[N,'date']
    next_val =  d1_2i.loc[N,'Subject_A']
    #PART II.i
    d1_2i.loc[(d1['date']>lower_date) & (d1['date']<next_date),'Subject_A'] = np.mean([lower_val,next_val])
    #PART II.ii
    mean_time_a = pd.Timedelta((next_date-lower_date).days/2, unit='d')
    d1_2ii.loc[(d1['date']>lower_date) & (d1['date']<=lower_date+mean_time_a),'Subject_A'] = lower_val
    d1_2ii.loc[(d1['date']>lower_date+mean_time_a) & (d1['date']<=next_date),'Subject_A'] = next_val
    lower_date = next_date
    lower_val = next_val
print(d1_2i)
print(d1_2ii)
Sign up to request clarification or add additional context in comments.

3 Comments

Thanks @Juan. The code works perfectly for Part I (a) & I (b). For part 2 (i) however, the current setup works when there are two observations. I believe this is because the current code uses the max and min values. Would you be able to help in case there are more than 2 observations? Also the code did not work fully for part 2 (ii). I will attach the result and the follow up question in my original question since I cannot attach the jpg here. Thanks a lot once again.
I edited the answer to include more than 2 observations, but only tested in subject A.
Works like a charm. I tested it out on several different columns. Thank you so much. @Juan

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.