Example setup:
import pandas as pd
df = pd.DataFrame(data={
    'ts': [
        '2008-11-05 07:45:23.100',
        '2008-11-17 06:53:25.150',
        '2008-12-02 07:36:18.643',
        '2008-12-15 07:36:24.837',
        '2009-01-06 07:03:47.387',
    ],
    'val': range(5),
})
df.ts = pd.to_datetime(df.ts)
df.set_index('ts', drop=False, inplace=True)
df
ts (index) | ts | val
2008-11-05 07:45:23.100 | 2008-11-05 07:45:23.100 | 0
2008-11-17 06:53:25.150 | 2008-11-17 06:53:25.150 | 1
2008-12-02 07:36:18.643 | 2008-12-02 07:36:18.643 | 2
2008-12-15 07:36:24.837 | 2008-12-15 07:36:24.837 | 3
2009-01-06 07:03:47.387 | 2009-01-06 07:03:47.387 | 4
Although the index holds pd.Timestamp values, I can filter it with a string representation of a timestamp. For example:
df.loc['2008-11-05']
ts (index) | ts | val
2008-11-05 07:45:23.100 | 2008-11-05 07:45:23.100 | 0
Moreover, pandas comes with a very convenient feature: when my filter string is less specific, it still returns the desired slice. For example:
df.loc['2008-12']
ts (index) | ts | val
2008-12-02 07:36:18.643 | 2008-12-02 07:36:18.643 | 2
2008-12-15 07:36:24.837 | 2008-12-15 07:36:24.837 | 3
My first question is: how can I filter the df with a list of timestamp strings? For example, when I run the code below
df.loc[['2008-11-05','2008-12']]
the result I want to get is
ts (index) | ts | val
2008-11-05 07:45:23.100 | 2008-11-05 07:45:23.100 | 0
2008-12-02 07:36:18.643 | 2008-12-02 07:36:18.643 | 2
2008-12-15 07:36:24.837 | 2008-12-15 07:36:24.837 | 3
but in fact I get the following error:
KeyError: "None of [Index(['2008-11-05', '2008-12'], dtype='object', name='ts')] are in the [index]"
My second question is: can I apply similar filtering logic to a regular column? I.e., if I don't set ts as the index, how do I filter the ts column directly with a string filter?
-------------------- Follow up 2019-9-10 10:00 --------------------
All the answers below are very much appreciated. I didn't know that pd.Series.str.startswith accepts a tuple of multiple prefixes, or that pd.Series.str.contains supports '|' alternation. New skills learned!
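For reference, my understanding of those two suggestions looks roughly like this (the intermediate name s is mine, and it assumes the default string form produced by astype(str)):

s = df.ts.astype(str)  # string view of the timestamps, e.g. '2008-11-05 07:45:23.100'

# a tuple of prefixes with str.startswith
df[s.str.startswith(('2008-11-05', '2008-12'))]

# or a single regex with '|' alternation in str.contains
df[s.str.contains('2008-11-05|2008-12')]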
I think all the methods based on astype(str) have one major shortcoming for me: in the US people use all kinds of datetime formats. Besides '2008-11-05', formats commonly used in my company are '2008-11-5', '11/05/2008', '11/5/2008', '20081105', and '05nov2008', all of which would fail with the string-based methods.
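By contrast, a parsing-based lookup copes with all of them, as far as I can tell. A quick check (assuming pd.to_datetime applies roughly the same dateutil-style parsing that .loc uses on a DatetimeIndex):

for fmt in ['2008-11-05', '2008-11-5', '11/05/2008', '11/5/2008', '20081105', '05nov2008']:
    print(fmt, '->', pd.to_datetime(fmt))
# every line should come out as 2008-11-05 00:00:00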
For now I still have to stick with the following method, which requires the column to be the index and doesn't seem efficient (I haven't profiled it), but it should be sufficiently robust. I don't understand why this is not supported natively by pandas.
L = ['5nov2008','2008/12']
pd.concat([df.loc[val] for val in L]).drop_duplicates()
ts (index) | ts | val
2008-11-05 07:45:23.100 | 2008-11-05 07:45:23.100 | 0
2008-12-02 07:36:18.643 | 2008-12-02 07:36:18.643 | 2
2008-12-15 07:36:24.837 | 2008-12-15 07:36:24.837 | 3
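For my own use I've wrapped this into a small helper (just a sketch; select_by_dates is my own name, not a pandas function, and it assumes each query string is less specific than the index resolution, so every .loc call returns a slice rather than a single row):

def select_by_dates(frame, queries):
    # one partial-string slice per query, then drop rows matched by more than one query
    parts = [frame.loc[q] for q in queries]
    return pd.concat(parts).drop_duplicates()

select_by_dates(df, ['5nov2008', '2008/12'])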