Hi I have a dataset in the following format:
Code for replicating the data:
import pandas as pd
d1 = {'Year':
['2008','2008','2008','2008','2008','2008','2008','2008','2008','2008'],
'Month':['1','1','2','6','7','8','8','11','12','12'],
'Day':['6','22','6','18','3','10','14','6','16','24'],
'Subject_A':['','30','','','','35','','','',''],
'Subject_B':['','','','','','','','40','',''],
'Subject_C': ['','','','','','65','','50','','']}
d1 = pd.DataFrame(d1)
I input the numbers as a string to show blank cells
Where the first three columns denotes date (Year, Month and Day) and the following columns represent individuals (My actual data file consists of about 300 such rows and about 1000 subjects. I presented a subset of the data here).
Where the column value refers to expenditure on FMCG products. What I would like to do is the following:
Part 1 (Beginning and end points)
a) For each individual locate the first observation and duplicate the value of the first observation for atleast the previous six months. For example: Subject C's 1st observation is on the 10th of August 2008. In that case I would want all the rows from June 10, 2008 to be equal to 65 for Subject C (Roughly 2/12/2008 is the cutoff date. SO we leave the 3rd cell from the top for Subject_C's column blank).
b) Locate last observation and repeat the last observation for the following 3 months. For example for Subject_A, we repeat 35 twice (till 6th November 2008).
Please refer to the following diagram for the highlighted cell with the solutions.
Part II - (Rows in between)
Next I would like to do two things (I would need to do the following three steps separately, not all at one time):
For individuals like Subject_A, locate two observations that come one after the other (30 and 35).
i) Use the average of the two observations. In this case we would have 32.5 in the four rows without caring about time.
for eg:
ii) Find the total time between two observations and take the mean of the time. For the 1st half of the time period assign the first value and for the 2nd half assign the second value. For example - for subject 1, the total days between 01/22/208 and 08/10/2008 is 201 days. For the first 201/2 = 100.5 days assign the value of 30 to Subject_A and for the remaining value assign 35. In this case the columns for Subject_A and Subject_C will look like:
The final dataset will use (a), (b) & (i) or (a), (b) & (ii)
Final data I [using a,b and i]
Final data II [using a,b and ii]
I would appreciate any help with this. Thanks in advance. Please let me know if the steps are unclear.
Follow up question and Issues
Thanks @Juan for the initial answer. Here's my follow up question. Suppose that Subject_A has more than 2 observations (code for the example data below). Would we be able to extend this code to incorporate more than 2 observations?
import pandas as pd
d1 = {'Year':
['2008','2008','2008','2008','2008','2008','2008','2008','2008','2008'],
'Month':['1','1','2','6','7','8','8','11','12','12'],
'Day':['6','22','6','18','3','10','14','6','16','24'],
'Subject_A':['','30','','45','','35','','','',''],
'Subject_B':['','','','','','','','40','',''],
'Subject_C': ['','','','','','65','','50','','']}
d1 = pd.DataFrame(d1)
Issues For the current code, I found an issue for part II (ii). This is the output that I get:
This is actually on the right track. The two cells above 35 does not seem to get updated. Is there something wrong on my end? Also the same question as before, would we be able to extend it to the case of >2 observations?






