Python Pandas select value from column based on time

Question

I'm new to Python and ML, and I'm trying to work with a CSV file and build a model that predicts how long a host takes to respond.

First, I parsed the logs from the CSV file using Pandas. I now have a CSV file with the following columns, shown here with example rows:

                               _time             host  duration
202     2020-09-26T10:56:33.630+0200           malcon       850
203     2020-09-26T10:56:33.630+0200          malcon2       878
703     2020-09-25T21:26:04.651+0200           malcon       973

My first idea was to use an anomaly-detection model, but maybe there is an easier way to do what I want. I'd like to find duration values higher than 800 within a 3-minute interval by timestamp and, based on one week of data, predict the values.

I started with code that finds duration values higher than or equal to 800, but I don't know how to associate them with the time and define the interval.

My code so far is:

import pandas as pd

data = pd.read_csv("example_all.csv")

df = pd.DataFrame(data,columns=['_time','host','duration'])

high = (df.loc[df['duration'] >= 800])

print(high)

Any hints and suggestions would be much appreciated! Thanks!

Update:

I'm trying to work with the rolling function, but I don't think I understand it correctly and I got a bit lost.

As advised here, I convert the timestamp using the to_datetime function and sort the data by time. Unfortunately, I cannot find a way to specify a 3-minute time interval in which duration was higher than 800.

My code looks like this now:

import pandas as pd

data = pd.read_csv("example_all.csv")

data["_time"] = pd.to_datetime(data["_time"], utc=True)

df = pd.DataFrame(data,columns=['_time','host','duration'])

df.sort_values('_time')

high = (df.loc[df['duration'] >= 800])

print(high)

Output:

                                  _time             host  duration
202    2020-09-26 08:56:33.630000+00:00           malcon       850
203    2020-09-26 08:56:33.630000+00:00          malcon2       850
702    2020-09-25 19:26:05.573000+00:00           malcon       878
703    2020-09-25 19:26:04.651000+00:00           malcon       973
704    2020-09-25 19:26:03.667000+00:00           malcon       993
...

If I understand correctly, you want to identify all instances where duration exceeds 800 for three consecutive samples. If so, you need to (1) convert your _time to a pandas timestamp using to_datetime, (2) sort your data by time using sort_values, and (3) filter your dataframe by duration > 800, as you have done. Then look at rolling or shift.
– itprorh66, Commented Oct 26, 2020 at 23:35
Thanks a lot, guys. I'll try this today and share my insights here.
– vloubes, Commented Oct 27, 2020 at 14:19

Diego Veralli · Accepted Answer · 2020-10-28 10:46:17Z

If you're looking for any values >= 800 where you haven't recorded any values < 800 in the previous 3 minutes, this approach will work:

import pandas as pd
from pandas.tseries.offsets import Minute

data = pd.read_csv("example_all.csv", parse_dates=[0])

data = data.sort_values('_time')


def all_over_800(values):
    return values.map(lambda x: x >= 800).all()


data['over_threshold'] = data[['_time', 'duration']].rolling(
    Minute(3), on='_time').apply(lambda win: all_over_800(win))['duration']

Note that, in the pandas versions this was tested with (1.0 and 1.1.3), the center window option is not implemented for datetime offset windows, so checking the 3 preceding (or succeeding, depending on sort order) minutes is the only option with this approach. If you don't mind sorting the dataframe twice, you can combine the preceding and succeeding results to check both sides of your sample.
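One way to combine the preceding and succeeding checks is sketched below. This is not the answer's code: instead of sorting twice, it mirrors the timestamps around the latest one so the same backward-looking window can cover the following 3 minutes. The function name `flag_both_sides`, the column names `over_prev` and `over_both`, and the sample rows are all made up for illustration.

```python
import pandas as pd


def flag_both_sides(df, threshold=800, window="3min"):
    """Flag rows whose neighbours within `window` on both sides are all >= threshold."""
    out = df.sort_values("_time").reset_index(drop=True)
    over = (out["duration"] >= threshold).astype(float).to_numpy()
    t = out["_time"]

    # Preceding side: the window covers (t - 3min, t], current row included.
    back = pd.Series(over, index=pd.DatetimeIndex(t)).rolling(window).min().to_numpy()

    # Following side: mirror time around the latest timestamp so later rows come
    # first, then reuse the same backward-looking window. Covers [t, t + 3min).
    mirrored = pd.Timestamp("2000-01-01") + (t.max() - t)
    fwd = (
        pd.Series(over[::-1], index=pd.DatetimeIndex(mirrored.to_numpy()[::-1]))
        .rolling(window)
        .min()
        .to_numpy()[::-1]
    )

    out["over_prev"] = back == 1.0
    out["over_both"] = (back == 1.0) & (fwd == 1.0)
    return out


# Hypothetical sample rows, not taken from the original CSV.
sample = pd.DataFrame({
    "_time": pd.to_datetime([
        "2020-09-26 10:00:00", "2020-09-26 10:01:00", "2020-09-26 10:02:00",
        "2020-09-26 10:04:30", "2020-09-26 10:06:00", "2020-09-26 10:10:00",
    ]),
    "duration": [850, 900, 700, 950, 990, 820],
})
result = flag_both_sides(sample)
print(result)
```

The rolling minimum over a 0/1 series equals 1 only when every sample in the window is at or above the threshold, which is the same test as the `all()` check above. Note that the two sides are not perfectly symmetric: the preceding window excludes a sample exactly 3 minutes earlier, and the following window excludes one exactly 3 minutes later.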

8 Comments

vloubes Over a year ago

Thanks for the reply! I'm trying your code but getting a ValueError: "ValueError: window must be an integer". Why is rolling complaining about the date format and requiring an integer?

Diego Veralli Over a year ago

I've tested the exact code from my answer with pandas 1.0 and pandas 1.1.3. It works fine in both cases, and datetime offsets have been supported since pandas 0.19 in any case. So I'm afraid I don't know what issue you might be having.

Diego Veralli Over a year ago

If you didn't copy and paste the code from my answer exactly, then the issue could be that the _time column is not a datetime in your dataframe. In my example it's read as a datetime column via the parse_dates argument in the read_csv method.
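A quick way to check this is to inspect the column's dtype before calling rolling. The sketch below uses a small in-memory CSV that assumes the file stores a saved index as its first column, as the printed frame in the question suggests. That layout is a guess, and the CSV text is made up from the question's example rows.

```python
import io

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Hypothetical CSV text: the first column is assumed to be a saved index.
csv_text = """,_time,host,duration
202,2020-09-26T10:56:33.630+0200,malcon,850
203,2020-09-26T10:56:33.630+0200,malcon2,878
703,2020-09-25T21:26:04.651+0200,malcon,973
"""

data = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without parse_dates the timestamps are still plain strings.
was_datetime = is_datetime64_any_dtype(data["_time"])

# Convert explicitly; utc=True normalizes the +0200 offset to UTC.
data["_time"] = pd.to_datetime(data["_time"], utc=True)
is_datetime = is_datetime64_any_dtype(data["_time"])

print(was_datetime, is_datetime)
print(data.dtypes)
```

If the file really does have an index column first, `parse_dates=[0]` would point at that column rather than `_time`. Naming the column, as in `parse_dates=["_time"]`, avoids depending on its position.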

Diego Veralli Over a year ago

The code I pasted will add a new column that tells you whether the current row has a value over 800 and there are no other values under 800 in the preceding 3 minutes. After this you can do whatever you need with the dataframe. If you're having problems writing to CSV, I think you should probably ask a new question. Try print(data) to get a sense of the data in it.

Diego Veralli Over a year ago

Note also that the code I pasted doesn't filter the dataframe. If you want to keep just the samples of interest in the 3-minute interval, that's a bit more complex and, as far as I know, will force you to sort twice to get forward and backward windows.
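For the simpler one-sided case, where only the preceding 3 minutes matter, filtering can be a plain boolean mask on the flag column. The sketch below runs on made-up sample rows and builds the flag with a rolling minimum rather than the answer's apply call. Both the data and that formulation are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample rows, deliberately out of time order.
data = pd.DataFrame({
    "_time": pd.to_datetime([
        "2020-09-26 10:06:00", "2020-09-26 10:04:30", "2020-09-26 10:02:00",
        "2020-09-26 10:01:00", "2020-09-26 10:00:00",
    ], utc=True),
    "host": ["malcon", "malcon", "malcon2", "malcon", "malcon2"],
    "duration": [990, 950, 700, 900, 850],
})

# Time-based rolling windows need the rows in time order.
data = data.sort_values("_time")

# 1.0 where every duration in the preceding 3 minutes (current row included)
# is at least 800, otherwise 0.0.
flags = (data["duration"] >= 800).astype(float)
data["over_threshold"] = (
    pd.Series(flags.to_numpy(), index=pd.DatetimeIndex(data["_time"]))
    .rolling("3min")
    .min()
    .to_numpy()
)

# Keep only the flagged rows.
kept = data[data["over_threshold"] == 1.0]
print(kept)
```

Here the 10:04:30 row is dropped even though its own duration is 950, because the 700 recorded at 10:02:00 falls inside its 3-minute look-back window.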
