
Here is the code for sample simulated data. Actual data can have varying start and end dates.

import pandas as pd
import numpy as np

dates = pd.date_range("20100121", periods=3653)
df = pd.DataFrame(np.random.randn(3653, 1), index=dates, columns=list("A"))
# keep the last observation in each business-day bin
dfb = df.resample('B').last()

From dfb, I want to select only the rows that belong to months with data for every day of the month. In dfb, January 2010 and January 2020 are incomplete, so I would like the data from February 2010 through December 2019.

For this particular dataset, I could do

df_out=dfb['2010-02':'2019-12']

But please help me with a better solution.

Edit: There seems to be some confusion about the question. I want to omit the rows of a month whose data does not begin on the first day of that month, and the rows of a month whose data does not end on its last day. I hope that is clear.

  • Could you elaborate on "contain values for all the days of the month"? Do you mean every day in a month has data? Commented Feb 21, 2022 at 8:17
  • Yes, every day in a month has data. So if the data starts from 2013-3-13, the subset should start from the next month. It's assumed that the data is continuous after the start date. Commented Feb 21, 2022 at 8:46
  • If by "incomplete data" you mean NaN, you can drop rows with NaN values. Doesn't that solve your problem? Commented Feb 21, 2022 at 8:51
  • There is no NaN. Someone gives me this data. It starts in the middle of some month 1 and ends in the middle of another month 12. I want to subset the data from the beginning of month 2 to the end of month 11. Commented Feb 21, 2022 at 8:59

2 Answers

By a "better" solution, I assume you mean deriving the range dynamically from the input data.

Since you mention that your data is continuous after the start date, it is safe to assume that the dates are sorted in increasing order. With this in mind, consider the code:

import pandas as pd
import numpy as np
from datetime import timedelta

dates = pd.date_range("20100121", periods=3653)
df = pd.DataFrame(np.random.randn(3653, 1), index=dates, columns=list("A"))
print(df)
# keep the last observation in each business-day bin
dfb = df.resample('B').last()

# fd is the first index in your dataframe
fd = df.index[0]
# if the first month is incomplete (does not start on day 1), move to the
# first day of the next month, rolling the year over after December
if fd.day != 1:
    if fd.month == 12:
        first_day_of_next_month = fd.replace(year=fd.year + 1, month=1, day=1)
    else:
        first_day_of_next_month = fd.replace(month=fd.month + 1, day=1)
else:
    first_day_of_next_month = fd

# ld is the last index in your dataframe
ld = df.index[-1]
# computes the next day
next_day = ld + timedelta(days=1)
# compare with != rather than > so that Dec 31 -> Jan 1 is also detected
if next_day.month != ld.month:
    last_day_of_prev_month = ld  # ld is already the last day of its month
else:
    last_day_of_prev_month = ld.replace(day=1) - timedelta(days=1)


df_out=dfb[first_day_of_next_month:last_day_of_prev_month]

Another option is dateutil.relativedelta, which requires the python-dateutil module. The solution above avoids using any extra modules.
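For comparison, here is a minimal sketch of the relativedelta variant mentioned above. It assumes python-dateutil is importable (it is normally installed alongside pandas) and that the index is sorted and continuous, as stated in the question. The helper name complete_month_bounds is hypothetical and only used for illustration.

```python
import numpy as np
import pandas as pd
from dateutil.relativedelta import relativedelta


def complete_month_bounds(index):
    """Return (start, end) spanning only the complete months of a sorted daily index.

    Hypothetical helper for illustration; not part of pandas or dateutil.
    """
    fd, ld = index[0], index[-1]
    # Keep fd if it is already the 1st, otherwise jump to the 1st of the
    # following month (relativedelta handles the December -> January rollover).
    start = fd if fd.day == 1 else fd + relativedelta(months=1, day=1)
    # Keep ld if the next day starts a new month, otherwise go to the 1st of
    # ld's month and step back one day to the last day of the previous month.
    if (ld + relativedelta(days=1)).day == 1:
        end = ld
    else:
        end = ld + relativedelta(day=1, days=-1)
    return start, end


dates = pd.date_range("20100121", periods=3653)
df = pd.DataFrame(np.random.randn(3653, 1), index=dates, columns=list("A"))
dfb = df.resample("B").last()

# Bounds are computed on the original daily index, then applied to dfb
start, end = complete_month_bounds(df.index)
df_out = dfb[start:end]
```

The bounds are taken from the daily df rather than from dfb, because a business-day index never contains a weekend 1st or last day of the month.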


I assume that in the general case the table is chronologically ordered (if not, use .sort_index first). The idea is to extract the year and month from the date and keep only the rows whose (year, month) differs from that of both the first row and the last row.

dfb['year'] = dfb.index.year    # helper column
dfb['month'] = dfb.index.month  # helper column

# rows that share (year, month) with the first row / the last row
first_month = (dfb['year'] == dfb['year'].iloc[0]) & (dfb['month'] == dfb['month'].iloc[0])
last_month = (dfb['year'] == dfb['year'].iloc[-1]) & (dfb['month'] == dfb['month'].iloc[-1])

dfb = dfb.loc[(~first_month) & (~last_month)]
dfb = dfb.drop(['year', 'month'], axis=1)
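One caveat with the approach above: it drops the first and last months unconditionally, even when they happen to be complete. A possible variant, sketched here under the assumption that the original daily df from the question is still available, uses monthly periods instead of helper columns and drops an edge month only when it is actually incomplete.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20100121", periods=3653)
df = pd.DataFrame(np.random.randn(3653, 1), index=dates, columns=list("A"))
dfb = df.resample("B").last()

# Monthly period (e.g. 2010-01) of every row in the resampled frame
months = dfb.index.to_period("M")

# Decide completeness from the original daily index, not from dfb
first, last = df.index[0], df.index[-1]

keep = np.ones(len(dfb), dtype=bool)
if first.day != 1:
    # first month does not start on day 1: drop all of its rows
    keep &= months != first.to_period("M")
if last.day != last.days_in_month:
    # last month does not reach its final day: drop all of its rows
    keep &= months != last.to_period("M")

df_out = dfb[keep]
```

No helper columns are added to dfb, so nothing needs to be dropped afterwards.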
