
I have time-series data (epoch, values) which I have transformed into (datetime, values) and stored in NumPy arrays. Now I wish to find the indices of the first row corresponding to each day, so only a single index per day is needed.

The following is a pure-Python function, which is very slow.

def day_wise_datetime(datetimes, dataseries):
    unique_dates = []
    unique_indices = []
    for i in range(len(datetimes)):
        # compare calendar dates, not the bare .day attribute, so that
        # the membership test actually matches earlier entries
        if datetimes[i].date() not in [d.date() for d in unique_dates]:
            unique_dates.append(datetimes[i])
            unique_indices.append(i)
    return [unique_dates, unique_indices]

NumPy provides a unique function, but it says that it cannot sort datetime objects. So what NumPy-based technique can be used for this?

I know that pandas is recommended, but while I am learning it, I would like to know if some NumPy/SciPy solution might suffice.

EDIT The values in the datetimes variable look like the following. I have just sliced the first five elements.

[datetime.datetime(2011, 4, 18, 18, 52, 9),
datetime.datetime(2011, 4, 18, 18, 52, 10),
datetime.datetime(2011, 4, 18, 18, 52, 11),
datetime.datetime(2011, 4, 18, 18, 52, 12),
datetime.datetime(2011, 4, 18, 18, 52, 13)]
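
For reference, a NumPy-only sketch of the kind of technique asked about (not part of the original post): cast the datetimes to day precision as datetime64[D], then call np.unique with return_index=True, which returns the index of the first occurrence of each unique day. For time-sorted data that is exactly the first row of each day.

import datetime as dt
import numpy as np

datetimes = [dt.datetime(2011, 4, 18, 18, 52, 9),
             dt.datetime(2011, 4, 18, 18, 52, 10),
             dt.datetime(2011, 4, 19, 7, 0, 0)]

# cast to day precision; timestamps from the same day collapse to one value
days = np.array(datetimes, dtype='datetime64[us]').astype('datetime64[D]')

# np.unique returns the sorted unique days and the index of each first occurrence
unique_days, unique_indices = np.unique(days, return_index=True)

print(unique_days)       # ['2011-04-18' '2011-04-19']
print(unique_indices)    # [0 2]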
  • Is it possible to provide a simple example input? Commented May 8, 2013 at 11:09
  • @waitingkuo: Added sample input Commented May 8, 2013 at 13:27
  • can my answer solve your problem? Commented May 8, 2013 at 15:41

1 Answer


pandas's DataFrame provides drop_duplicates, which can easily achieve your goal:

In [121]: arr1 = np.array([dt.datetime(2013, 1, 1), dt.datetime(2013, 1, 1), dt.datetime(2013, 1, 2)]) 

In [122]: arr2 = np.array([1, 2, 3]) 

In [123]: df = pd.DataFrame({'date': arr1, 'value': arr2})

In [124]: df
Out[124]: 
                 date  value
0 2013-01-01 00:00:00      1   
1 2013-01-01 00:00:00      2   
2 2013-01-02 00:00:00      3   

In [125]: df.drop_duplicates('date')
Out[125]: 
                 date  value
0 2013-01-01 00:00:00      1   
2 2013-01-02 00:00:00      3 
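
Note that drop_duplicates on the raw timestamps only removes rows whose timestamps are exactly equal, which is not the same as one row per calendar day. A minimal sketch of the per-day variant, assuming a pandas version with the .dt accessor:

# derive a day column first, then drop duplicates on it;
# this keeps the first row of each calendar day
df['day'] = df['date'].dt.normalize()
first_per_day = df.drop_duplicates('day')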

EDIT

I misunderstood your problem at the very beginning. Please try the following:

It seems sorting is one of your main problems, so I create the example as a reversed datetime list:

In [74]: now = dt.datetime.utcnow()
In [75]: datetimes = [now - dt.timedelta(hours=6) * i for i in range(10)]

In [76]: datetimes
Out[76]:
[datetime.datetime(2013, 5, 8, 16, 47, 32, 60500),
 datetime.datetime(2013, 5, 8, 10, 47, 32, 60500),
 datetime.datetime(2013, 5, 8, 4, 47, 32, 60500),
 datetime.datetime(2013, 5, 7, 22, 47, 32, 60500),
 datetime.datetime(2013, 5, 7, 16, 47, 32, 60500),
 datetime.datetime(2013, 5, 7, 10, 47, 32, 60500),
 datetime.datetime(2013, 5, 7, 4, 47, 32, 60500),
 datetime.datetime(2013, 5, 6, 22, 47, 32, 60500),
 datetime.datetime(2013, 5, 6, 16, 47, 32, 60500),
 datetime.datetime(2013, 5, 6, 10, 47, 32, 60500)]

Create a DataFrame from datetimes and set the column name to date:

In [81]: df = pd.DataFrame(datetimes, columns=['date'])

In [82]: df
Out[82]:
                        date
0 2013-05-08 16:47:32.060500
1 2013-05-08 10:47:32.060500
2 2013-05-08 04:47:32.060500
3 2013-05-07 22:47:32.060500
4 2013-05-07 16:47:32.060500
5 2013-05-07 10:47:32.060500
6 2013-05-07 04:47:32.060500
7 2013-05-06 22:47:32.060500
8 2013-05-06 16:47:32.060500
9 2013-05-06 10:47:32.060500

Next, sort your DataFrame by the date column:

In [83]: df = df.sort_values('date')

Then append a new column, index, holding the day number:

In [85]: df['index'] = df['date'].apply(lambda x:x.day)

In [86]: df
Out[86]:
                        date  index
9 2013-05-06 10:47:32.060500      6
8 2013-05-06 16:47:32.060500      6
7 2013-05-06 22:47:32.060500      6
6 2013-05-07 04:47:32.060500      7
5 2013-05-07 10:47:32.060500      7
4 2013-05-07 16:47:32.060500      7
3 2013-05-07 22:47:32.060500      7
2 2013-05-08 04:47:32.060500      8
1 2013-05-08 10:47:32.060500      8
0 2013-05-08 16:47:32.060500      8

Then group your data by index and get the first row of each group. If you are familiar with SQL, it is just like SELECT FIRST(*) FROM table GROUP BY table.index:

In [87]: df = df.groupby('index').first()
In [88]: df
Out[88]: 
                            date
index                           
6     2013-05-06 10:47:32.060500
7     2013-05-07 04:47:32.060500
8     2013-05-08 04:47:32.060500

Now you can get the unique day numbers (the index of the grouped DataFrame):

In [91]: df.index.values
Out[91]: array([6, 7, 8])

And get the unique dates:

In [92]: df['date'].values
Out[92]: 
array(['2013-05-06T18:47:32.060500000+0800',
       '2013-05-07T12:47:32.060500000+0800',
       '2013-05-08T12:47:32.060500000+0800'], dtype='datetime64[ns]')
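
The question actually asks for the positional row indices of the first sample of each day rather than the day numbers. A variant of the same groupby idea (a sketch, not from the original answer) that keeps the original row positions, reusing the datetimes list from In [75]:

# the default RangeIndex records the original row positions;
# sort_values reorders the rows but keeps those labels
df = pd.DataFrame({'date': datetimes}).sort_values('date')

# one row per calendar day: the earliest timestamp of each day
first_per_day = df.groupby(df['date'].dt.date).head(1)

unique_indices = first_per_day.index.to_numpy()   # original row positions
unique_dates = first_per_day['date'].to_numpy()   # first timestamp of each day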

2 Comments

Since I would need to do data manipulation like averaging and other things for all records within a day, I would not like to delete the other data. Moreover, my datetime objects also contain hour, minute, and second information.
It just generates a new object and does not replace the original one.
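
For the per-day averaging mentioned in the first comment, the same grouping idea applies without deleting any rows; a small sketch, assuming the df with date and value columns from In [123]:

# average the value column per calendar day; the original frame is untouched
daily_mean = df.groupby(df['date'].dt.date)['value'].mean()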
