Example setup:

import pandas as pd
df = pd.DataFrame(
    data={'ts':
          [
                '2008-11-05 07:45:23.100',
                '2008-11-17 06:53:25.150',
                '2008-12-02 07:36:18.643',
                '2008-12-15 07:36:24.837',
                '2009-01-06 07:03:47.387',
          ], 
          'val': range(5)})

df.ts = pd.to_datetime(df.ts)

df.set_index('ts', drop=False, inplace=True)

df


                        | ts                      | val
2008-11-05 07:45:23.100 | 2008-11-05 07:45:23.100 | 0
2008-11-17 06:53:25.150 | 2008-11-17 06:53:25.150 | 1
2008-12-02 07:36:18.643 | 2008-12-02 07:36:18.643 | 2
2008-12-15 07:36:24.837 | 2008-12-15 07:36:24.837 | 3
2009-01-06 07:03:47.387 | 2009-01-06 07:03:47.387 | 4

Although the index is a DatetimeIndex (its values are pd.Timestamp objects), I can use a string representation of a timestamp to filter it. For example:

df.loc['2008-11-05']

                        | ts                      | val
2008-11-05 07:45:23.100 | 2008-11-05 07:45:23.100 | 0

Moreover, pandas has a very convenient feature: when my filter string is only a partial date (for example, just a year and month), it returns every row that falls within that period. For example:

df.loc['2008-12']
                        | ts                      | val
2008-12-02 07:36:18.643 | 2008-12-02 07:36:18.643 | 2
2008-12-15 07:36:24.837 | 2008-12-15 07:36:24.837 | 3

My first question is: how can I filter the df with a list of string timestamps? For example, if I run the code below

df.loc[['2008-11-05','2008-12']]

the result I want to get is

                        | ts                      | val
2008-11-05 07:45:23.100 | 2008-11-05 07:45:23.100 | 0
2008-12-02 07:36:18.643 | 2008-12-02 07:36:18.643 | 2
2008-12-15 07:36:24.837 | 2008-12-15 07:36:24.837 | 3

but in fact I get the following error:

KeyError: "None of [Index(['2008-11-05', '2008-12'], dtype='object', name='ts')] are in the [index]"

My second question is: can I apply similar filtering logic to a regular column? That is, if I don't set ts as the index, can I filter the ts column directly with a string?

-------------------- Follow up 2019-9-10 10:00 --------------------

All the answers below are very much appreciated. I didn't know pd.Series.str.startswith can support the tuple input of multiple strings, or that pd.Series.str.contains can support the usage of '|'. New skills learned!

I think all the methods based on astype(str) have one major shortcoming for me: in the US, people use all kinds of date-time formats. Besides '2008-11-05', the ones commonly used in my company are '2008-11-5', '11/05/2008', '11/5/2008', '20081105', and '05nov2008', all of which would fail with the string-based methods.
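One possible workaround for the format problem is sketched below. It assumes every key names a full day and that the default month-first reading of '11/05/2008' by pd.to_datetime is the intended one. The names raw_keys and iso_keys are illustrative only. The idea is to normalize each key to the ISO form that astype(str) produces before matching. It does not cover partial keys such as '2008/12'.

```python
import pandas as pd

df = pd.DataFrame({
    'ts': pd.to_datetime([
        '2008-11-05 07:45:23.100',
        '2008-11-17 06:53:25.150',
        '2008-12-02 07:36:18.643',
        '2008-12-15 07:36:24.837',
        '2009-01-06 07:03:47.387',
    ]),
    'val': range(5),
})

# Keys in non-ISO formats (assumed to be month-first, full-day keys).
raw_keys = ['11/05/2008', '20081105']

# Parse each key, then render it in the ISO layout used by astype(str).
iso_keys = tuple(pd.to_datetime(k).strftime('%Y-%m-%d') for k in raw_keys)

ts_text = df['ts'].astype(str)
unmatched = ts_text.str.startswith(tuple(raw_keys))  # raw keys never match
matched = ts_text.str.startswith(iso_keys)           # normalized keys do

result = df[matched]
print(result)
```

If the keys are day-first instead, pd.to_datetime accepts dayfirst=True, so the ambiguity has to be settled by whoever knows the source of the keys.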

For now I still have to stick with the following method, which requires the column to be the index and doesn't seem efficient (I haven't profiled), but should be sufficiently robust. I don't understand why it is not supported natively by pandas.

L = ['5nov2008','2008/12']
pd.concat([df.loc[val] for val in L]).drop_duplicates()

                        | ts                      | val
2008-11-05 07:45:23.100 | 2008-11-05 07:45:23.100 | 0
2008-12-02 07:36:18.643 | 2008-12-02 07:36:18.643 | 2
2008-12-15 07:36:24.837 | 2008-12-15 07:36:24.837 | 3
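A variant of the same idea that avoids concat and drop_duplicates is to build one boolean mask from per-key slices. This is only a sketch: select_partial is a hypothetical helper, not a pandas function, and it assumes the index is a sorted DatetimeIndex so that slicing with a partial string (key:key) is allowed and returns an empty result for keys that match nothing.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ts': pd.to_datetime([
        '2008-11-05 07:45:23.100',
        '2008-11-17 06:53:25.150',
        '2008-12-02 07:36:18.643',
        '2008-12-15 07:36:24.837',
        '2009-01-06 07:03:47.387',
    ]),
    'val': range(5),
}).set_index('ts', drop=False)


def select_partial(frame, keys):
    """Return the rows of `frame` whose index falls in any of the periods in `keys`."""
    mask = np.zeros(len(frame), dtype=bool)
    for key in keys:
        # key:key is a partial-string slice covering the whole period named by key.
        mask |= frame.index.isin(frame.loc[key:key].index)
    # A mask keeps the original row order and never repeats a row.
    return frame[mask]


print(select_partial(df, ['2008-11-05', '2008-12']))
```

Because rows are selected by a mask, overlapping keys (for example '2008-12' and '2008-12-02') do not produce repeated rows, and rows that are genuinely identical are not dropped the way drop_duplicates would drop them.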

5 Answers

1

You can use .str.contains() after first converting the index to str:

res = df.loc[(df.index.astype(str).str.contains("2008-12")) 
             | (df.index.astype(str).str.contains('2008-11-05'))]
print(res)
                                             ts  val
ts                                                  
2008-11-05 07:45:23.100 2008-11-05 07:45:23.100    0
2008-12-02 07:36:18.643 2008-12-02 07:36:18.643    2
2008-12-15 07:36:24.837 2008-12-15 07:36:24.837    3

For the second question: yes, you can apply the same filter to a regular column, like this:

df.loc[(df.ts.astype(str).str.contains("2008-12"))
    |(df.ts.astype(str).str.contains('2008-11-05'))]
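An alternative for the column case that skips the string conversion is sketched here. It is not what the code above does: period_mask is a hypothetical helper, and it assumes each key is a string that pd.Period can parse (such as '2008-12' or '2008-11-05'). Each key becomes a period, and rows are kept when their timestamp lies between the start and end of any period.

```python
import pandas as pd

df = pd.DataFrame({
    'ts': pd.to_datetime([
        '2008-11-05 07:45:23.100',
        '2008-11-17 06:53:25.150',
        '2008-12-02 07:36:18.643',
        '2008-12-15 07:36:24.837',
        '2009-01-06 07:03:47.387',
    ]),
    'val': range(5),
})


def period_mask(ts, keys):
    """Boolean Series: True where a timestamp in `ts` falls in any period in `keys`."""
    mask = pd.Series(False, index=ts.index)
    for key in keys:
        period = pd.Period(key)  # '2008-12' -> monthly, '2008-11-05' -> daily
        # between() is inclusive on both ends by default.
        mask |= ts.between(period.start_time, period.end_time)
    return mask


print(df[period_mask(df['ts'], ['2008-11-05', '2008-12'])])
```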

1

This should get you going.

>>> df
                       ts  val
0 2008-11-05 07:45:23.100    0
1 2008-11-17 06:53:25.150    1
2 2008-12-02 07:36:18.643    2
3 2008-12-15 07:36:24.837    3
4 2009-01-06 07:03:47.387    4

Result:

>>> df[df.apply(lambda row: row.astype(str).str.contains('2008-11-05')).any(axis=1)]
                       ts  val
0 2008-11-05 07:45:23.100    0

Or, with ts set as the index:

>>> df
                                             ts  val
ts
2008-11-05 07:45:23.100 2008-11-05 07:45:23.100    0
2008-11-17 06:53:25.150 2008-11-17 06:53:25.150    1
2008-12-02 07:36:18.643 2008-12-02 07:36:18.643    2
2008-12-15 07:36:24.837 2008-12-15 07:36:24.837    3
2009-01-06 07:03:47.387 2009-01-06 07:03:47.387    4

Result:

>>> df[df.apply(lambda row: row.astype(str).str.contains('2008-11-05')).any(axis=1)]
                                             ts  val
ts
2008-11-05 07:45:23.100 2008-11-05 07:45:23.100    0

To look for multiple values, separate them with | in the pattern:

>>> df[df.apply(lambda row: row.astype(str).str.contains('2008-11-05|2008-12')).any(axis=1)]
                                             ts  val
ts
2008-11-05 07:45:23.100 2008-11-05 07:45:23.100    0
2008-12-02 07:36:18.643 2008-12-02 07:36:18.643    2
2008-12-15 07:36:24.837 2008-12-15 07:36:24.837    3

1

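For your first question, you could use pd.DataFrame.append (removed in pandas 2.0; on newer versions use pd.concat([df.loc['2008-11-05'], df.loc['2008-12']]) instead):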

df.loc['2008-11-05'].append(df.loc['2008-12'])

#                                              ts  val
# ts                                                  
# 2008-11-05 07:45:23.100 2008-11-05 07:45:23.100    0
# 2008-12-02 07:36:18.643 2008-12-02 07:36:18.643    2
# 2008-12-15 07:36:24.837 2008-12-15 07:36:24.837    3

For your second question, you could use pd.Series.str.match:

df.ts.astype(str).str.match('2008-11-05|2008-12')

# ts
# 2008-11-05 07:45:23.100     True
# 2008-11-17 06:53:25.150    False
# 2008-12-02 07:36:18.643     True
# 2008-12-15 07:36:24.837     True
# 2009-01-06 07:03:47.387    False
# Name: ts, dtype: bool

You can then use this as a boolean mask:

df[df.ts.astype(str).str.match('2008-11-05|2008-12')]

#                                              ts  val
# ts                                                  
# 2008-11-05 07:45:23.100 2008-11-05 07:45:23.100    0
# 2008-12-02 07:36:18.643 2008-12-02 07:36:18.643    2
# 2008-12-15 07:36:24.837 2008-12-15 07:36:24.837    3

Note that you can leave out the astype(str) part if your ts column is already of type string.
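One detail worth keeping in mind when choosing between str.match and str.contains: match only tests the pattern at the start of each string, while contains searches anywhere in it. A small sketch of the difference, using the same data as the question (the name ts_text is just for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'ts': pd.to_datetime([
        '2008-11-05 07:45:23.100',
        '2008-11-17 06:53:25.150',
        '2008-12-02 07:36:18.643',
        '2008-12-15 07:36:24.837',
        '2009-01-06 07:03:47.387',
    ]),
    'val': range(5),
})

ts_text = df['ts'].astype(str)

# '11-05' is inside the first string but not at its start.
at_start = ts_text.str.match('11-05')      # anchored at the beginning
anywhere = ts_text.str.contains('11-05')   # searched anywhere

# A pattern that begins with the year works with match.
prefix = ts_text.str.match('2008-11-05|2008-12')

print(at_start.tolist(), anywhere.tolist(), prefix.tolist())
```

For date prefixes such as '2008-12', match is therefore the stricter choice, since it cannot accidentally hit the same digits later in the string.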

1

The first idea is simply to join the pieces together with concat:

df1 = pd.concat([df.loc['2008-11-05'], df.loc['2008-12']], sort=True)
print (df1)
                                             ts  val
ts                                                  
2008-11-05 07:45:23.100 2008-11-05 07:45:23.100    0
2008-12-02 07:36:18.643 2008-12-02 07:36:18.643    2
2008-12-15 07:36:24.837 2008-12-15 07:36:24.837    3

Or filter by boolean indexing, with a mask built by Series.str.contains and | as the regex OR:

df1 = df[df.index.astype(str).str.contains('2008-11-05|2008-12')]

Or with Series.str.startswith and tuple:

df1 = df[df.index.astype(str).str.startswith(('2008-11-05', '2008-12'))]
print (df1)
                                             ts  val
ts                                                  
2008-11-05 07:45:23.100 2008-11-05 07:45:23.100    0
2008-12-02 07:36:18.643 2008-12-02 07:36:18.643    2
2008-12-15 07:36:24.837 2008-12-15 07:36:24.837    3

If the input is a list of strings:

L = ['2008-11-05','2008-12']

df2 = df[df.ts.astype(str).str.contains('|'.join(L))]

And similar:

df2 = df[df.ts.astype(str).str.startswith(tuple(L))]
print (df2)
                       ts  val
0 2008-11-05 07:45:23.100    0
2 2008-12-02 07:36:18.643    2
3 2008-12-15 07:36:24.837    3
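Since str.contains treats the joined string as a regular expression, keys that contain regex metacharacters (a '.' in a time fragment, for instance) can match more than intended. A hedged sketch of one way to guard against that, escaping each key with re.escape before joining; the key '45.23' is made up purely to show the effect:

```python
import re

import pandas as pd

df = pd.DataFrame({
    'ts': pd.to_datetime([
        '2008-11-05 07:45:23.100',
        '2008-11-17 06:53:25.150',
        '2008-12-02 07:36:18.643',
        '2008-12-15 07:36:24.837',
        '2009-01-06 07:03:47.387',
    ]),
    'val': range(5),
})

ts_text = df['ts'].astype(str)
keys = ['2008-12', '45.23']  # second key is a contrived example

# Unescaped: '.' matches any character, so '45.23' also hits '45:23'.
loose = ts_text.str.contains('|'.join(keys))

# Escaped: every key is matched literally.
strict = ts_text.str.contains('|'.join(re.escape(k) for k in keys))

print(df[loose]['val'].tolist(), df[strict]['val'].tolist())
```

str.startswith with a tuple does not have this issue, because it compares plain strings rather than patterns.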

And for a column, just change index to ts:

df2 = df[df.ts.astype(str).str.contains('2008-11-05|2008-12')]

Or:

df2 = df[df.ts.astype(str).str.startswith(('2008-11-05', '2008-12'))]
print (df2)
                       ts  val
0 2008-11-05 07:45:23.100    0
2 2008-12-02 07:36:18.643    2
3 2008-12-15 07:36:24.837    3

0

You seem to have stumbled upon a bug!

This works

df.loc['2008-11-05']

This works

df.loc['2008-11-05':'2008-12-15']

but this doesn't, as you mentioned.

df.loc[['2008-11-05','2008-12-15']]

However, you can select the rows you want by position, as below:

df.iloc[[0,2,3]]
                                             ts  val
ts                                                  
2008-11-05 07:45:23.100 2008-11-05 07:45:23.100    0
2008-12-02 07:36:18.643 2008-12-02 07:36:18.643    2
2008-12-15 07:36:24.837 2008-12-15 07:36:24.837    3
