
I've got about 7.5 million rows of data in the following format:

ndc_description               effective_date        ...                             
12-HR DECONGEST 120MG CAPLET  2015-08-19            2015-08-26          G   NaN     NaN     1   0.36062     36800005452     Y   C/I     EA
                              2015-07-22            2015-08-12          G   NaN     NaN     1   0.37681     36800005452     Y   C/I     EA
                              2015-06-17            2015-07-15          G   NaN     NaN     1   0.36651     36800005452     Y   C/I     EA
Some Other drug               2016-11-21            2015-08-26          G   NaN     NaN     1   0.36062     36800005452     Y   C/I     EA
                              2016-07-23            2015-08-12          G   NaN     NaN     1   0.37681     36800005452     Y   C/I     EA
                              2016-05-17            2015-07-15          G   NaN     NaN     1   0.36651     36800005452     Y   C/I     EA

ndc_description and effective_date form a MultiIndex.

I have an additional dataset that I'm merging with the above. They'll be merged on the columns ndc_description and effective_date (the other columns shown are purely to demonstrate that other data of various types exists in the dataset).
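
The merge itself would presumably look something like this (a sketch; other_data is a stand-in name for the second dataset):

import pandas as pd

# other_data is a stand-in name for the second dataset, assumed to have
# ndc_description and effective_date columns
merged = Cleaned_Price_Data.reset_index().merge(
    other_data,
    on=['ndc_description', 'effective_date'],
    how='left')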

Current problem: the dates in the two datasets don't match up. In the dataset above they're (mostly) weekly, but that can't be guaranteed, and there's no guaranteed regularity in the other dataset either. I therefore suppose I'll need to add rows for all dates in between those listed in the effective_date column above, so that I can then merge on ndc_description and effective_date. Is this the best way to go about this? Given the amount of data involved, I'd like to optimize the code before running the full dataset through it.

Potential solutions: I've seen that .resample() may be of value here, but I haven't yet been able to get it to work. Something like this:

Cleaned_Price_Data.effective_date.resample('1D', fill_method = 'ffill', level = 1)
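
For reference, a per-group version that does run (a sketch, assuming the frame is indexed by (ndc_description, effective_date) and the date level is datetime64):

# resample() has to run per drug, so drop the group level inside each
# group and upsample the remaining DatetimeIndex to daily rows
daily = (Cleaned_Price_Data
         .groupby(level='ndc_description', group_keys=True)
         .apply(lambda g: g.droplevel('ndc_description')
                           .resample('D')
                           .ffill()))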

I think it would also improve efficiency to include each drug's max and min dates somewhere in the above, so that nothing is ffilled further than necessary. Likewise, the fill should be keyed off the ndc_description value, so that the dates added to the effective_date column aren't duplicated for each identical ndc_description value; a sketch of that follows.
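
Resampling per group already stops at each drug's own first and last date; an equivalent way to make that bound explicit is to build each range by hand and reindex (a sketch; densify is a hypothetical helper):

def densify(g):
    # g has a plain DatetimeIndex once the drug level is dropped; only
    # build the daily calendar between this drug's own min and max date
    full = pd.date_range(g.index.min(), g.index.max(), freq='D')
    return g.reindex(full).ffill()

bounded = (Cleaned_Price_Data
           .groupby(level='ndc_description', group_keys=True)
           .apply(lambda g: densify(g.droplevel('ndc_description'))))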

EDIT: Here's some code illustrating the current state of my dataframe and how it should look once the transformation is complete. I'm trying to transform a dataframe like this:

import pandas as pd

idx = pd.MultiIndex.from_product([['drug_a', 'drug_b', 'drug_c'],
                                  ['2015-08-19', '2015-08-17', '2015-08-14']],
                                 names=['drug_name', 'effective_date'])
col = ['other_data_1', 'other_data_2', 'other_data_3']

pre_transform = pd.DataFrame('-', idx, col)
pre_transform

To one like this (note the added dates):

idx = pd.MultiIndex.from_product([['drug_a', 'drug_b', 'drug_c'],
                                  ['2015-08-19', '2015-08-18', '2015-08-17', '2015-08-16', '2015-08-15', '2015-08-14']],
                                 names=['drug_name', 'effective_date'])
col = ['other_data_1', 'other_data_2', 'other_data_3']

post_change = pd.DataFrame('-', idx, col)
post_change
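
For what it's worth, one way to get from pre_transform to post_change (a sketch: parse the string dates, densify each drug's own date range, forward-fill, and rebuild the MultiIndex; add_missing_dates is a hypothetical helper):

pre = pre_transform.reset_index()
pre['effective_date'] = pd.to_datetime(pre['effective_date'])

def add_missing_dates(g):
    # reindex one drug's rows onto a dense daily calendar spanning that
    # drug's own first and last effective_date
    full = pd.date_range(g['effective_date'].min(),
                         g['effective_date'].max(), freq='D')
    return (g.set_index('effective_date')
             .reindex(full)
             .rename_axis('effective_date')
             .ffill())

post = (pre.groupby('drug_name', group_keys=True)
           .apply(add_missing_dates)
           .drop(columns='drug_name'))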

EDIT 2: I've come up with the following code (via Parfait's answer here) which seems to do the trick:

def expand_dates(ser):
    # one row per calendar day between each NDC's first and last date
    return pd.DataFrame({'effective_date': pd.date_range(
        ser['effective_date'].min(), ser['effective_date'].max(), freq='D')})

price_cols = list(Cleaned_Price_Data.columns)

# left-join the original rows onto the dense dates, then forward-fill
all_effective_dates = (Cleaned_Price_Data.groupby(['ndc'])
                       .apply(expand_dates).reset_index()
                       .merge(Cleaned_Price_Data, how='left')[price_cols]
                       .ffill())

However, at 55 million rows the result is quite bloated, and I still need to merge it with the other dataset. Any attempt at optimizing this (or a suggestion for a more efficient alternative) would be greatly appreciated.
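
One alternative I'm considering is pd.merge_asof, which would match each row of the second dataset to the most recent prior effective_date per drug without materializing every calendar day. A sketch (other_data and its column names are stand-ins; both frames need datetime effective_date columns sorted ascending):

# merge_asof requires both frames sorted by the 'on' column
left = other_data.sort_values('effective_date')
right = Cleaned_Price_Data.reset_index().sort_values('effective_date')

# for each left row, take the latest right row with an effective_date
# at or before it, matched within the same ndc_description
merged = pd.merge_asof(left, right,
                       on='effective_date',
                       by='ndc_description',
                       direction='backward')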

  • Can you create a minimal, complete, and verifiable example? Could you also add a sample of the additional dataset and the expected output? Commented Oct 30, 2019 at 6:19
  • @jezrael - I've added reproducible code illustrating the current state of my dataframe and how I'd like it to end up. The problem itself isn't verifiable, as it's more of a "how do I do this" question than an "I get this error, what do I do" one. Commented Oct 30, 2019 at 17:12
