1

I am new to python and have the following problem:

I have daily measurements that I imported into a pd.series object. The problem is that the data is such that there are always 31 measurements per month, irrespective of whether it is January or February. In case the respective month has less than 31 days, the measurements are set to zero for those days that exceed the last day of that month. However, missing data within a month is also set to zero. The data for the months of April and May looks like this.

1990-04-01    25.870
1990-04-01    26.205
1990-04-01    12.283
1990-04-01    19.630
1990-04-01    19.239
1990-04-01    23.614
1990-04-01    40.891
1990-04-01    41.152
1990-04-01    35.935
1990-04-01    25.682
1990-04-01    21.674
1990-04-01    15.818
1990-04-01    11.413
1990-04-01    16.522
1990-04-01    33.543
1990-04-01    28.727
1990-04-01    18.043
1990-04-01    10.326
1990-04-01    19.159
1990-04-01    21.848
1990-04-01    35.250
1990-04-01    39.152
1990-04-01    31.522
1990-04-01    23.152
1990-04-01    13.250
1990-04-01    20.705
1990-04-01    27.304
1990-04-01    24.478
1990-04-01    33.674
1990-04-01    32.591
1990-04-01     0.000
1990-05-01    40.370
1990-05-01    41.609
1990-05-01    47.478
1990-05-01    40.682
1990-05-01    42.587
1990-05-01    38.826
1990-05-01    35.543
1990-05-01    30.955
1990-05-01    23.543
1990-05-01     7.857
1990-05-01     0.000
1990-05-01     0.000
1990-05-01     0.000
1990-05-01     0.000
1990-05-01     0.000
1990-05-01     0.000
1990-05-01    54.133
1990-05-01    41.114
1990-05-01    44.739
1990-05-01    43.848
1990-05-01    26.739
1990-05-01    21.318
1990-05-01    26.750
1990-05-01    54.864
1990-05-01    33.000
1990-05-01    33.304
1990-05-01    34.304
1990-05-01    20.886
1990-05-01    20.250
1990-05-01    24.804
1990-05-01    28.091
Length: 62

Is there a way to drop those entries that do not belong to the respective months and provide a new time-index for each day? I need to keep the zeros within a month.

1
  • I'm taking a bit of a punt here, because I don't have pandas installed, but what happens if you try s[['1990-04-01', '1990-04-02']] ? (Assuming your series is called s, of course.) If it gives you something sensible then the problem is now a much more tractable one of generating a list of all the valid indices. Commented Nov 22, 2013 at 17:19

1 Answer 1

1

First I would replace the 0s with NaN (to symbolise missing data):

s.replace(0, np.nan, inplace=True)

One way is to create a function which acts on each group (month):

def f(s_month):
    date = s_month.index[0]
    # create the month long range
    rng = pd.date_range(date, date + pd.offsets.MonthEnd(1), freq='D')
    # throw away results longer than month length
    s_month = s_month.iloc[0:len(rng)]
    # reindex those remaining
    s_month.index = rng
    return s_month

Note: this requires that you have a DatetimeIndex, i.e. s.index = pd.to_datetime(s.index).

In [11]: s_month = s.loc['1990-04-01']

In [12]: f(s_month)
Out[12]: 
1990-04-01    25.870
1990-04-02    26.205
1990-04-03    12.283
1990-04-04    19.630
...
1990-04-28    24.478
1990-04-29    33.674
1990-04-30    32.591
Freq: D, Name: Value, dtype: float64

Use this with a groupby apply:

In [13]: res = s.groupby(s.index).apply(f)

In [14]: res
Out[14]: 
1990-04-01  1990-04-01    25.870
            1990-04-02    26.205
            1990-04-03    12.283
            1990-04-04    19.630
...

Correct the MultiIndex:

In [15]: res.index = res.index.droplevel(0)

In [16]: res
Out[16]: 
1990-04-01    25.870
1990-04-02    26.205
1990-04-03    12.283
...
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.