I'm new to Python and ML. I'm trying to work with a CSV file and create a model that predicts how long a host takes to respond.

The first thing I did was parse the logs from the CSV file using Pandas. I now have a CSV file with the following columns, in this order (with example rows):

                               _time             host  duration
202     2020-09-26T10:56:33.630+0200           malcon       850
203     2020-09-26T10:56:33.630+0200          malcon2       878
703     2020-09-25T21:26:04.651+0200           malcon       973

At first I wanted to use an anomaly detection model, but maybe there is an easier way to do what I want. I'd like to find duration values higher than 800 within a 3-minute interval by timestamp and, based on the one week of data I have, predict the values.

I started with code that finds the duration values greater than or equal to 800, but I don't know how to associate them with the time and define the interval.

My code so far is:

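import pandas as pd

# Load the parsed log data.
data = pd.read_csv("example_all.csv")

# Keep only the columns of interest.
df = pd.DataFrame(data, columns=['_time', 'host', 'duration'])

# Select rows whose duration is at least 800.
high = df.loc[df['duration'] >= 800]

print(high)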

Any hints and suggestions would be much appreciated! Thanks!

Update:

I'm trying to work with the rolling function, but I don't think I understand it correctly and I got a bit lost.

As advised here, I convert the timestamps using the to_datetime function and sort the data by time. Unfortunately, I cannot find a way to specify a 3-minute time interval in which the duration was higher than 800.

My code looks like this now:

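import pandas as pd

data = pd.read_csv("example_all.csv")

# Parse the timestamps and normalize them to UTC.
data["_time"] = pd.to_datetime(data["_time"], utc=True)

df = pd.DataFrame(data, columns=['_time', 'host', 'duration'])

# Note: sort_values returns a new DataFrame; the result is not assigned here,
# so df itself stays in its original order.
df.sort_values('_time')

high = df.loc[df['duration'] >= 800]

print(high)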

Output:

                                  _time             host  duration
202    2020-09-26 08:56:33.630000+00:00           malcon       850
203    2020-09-26 08:56:33.630000+00:00          malcon2       850
702    2020-09-25 19:26:05.573000+00:00           malcon       878
703    2020-09-25 19:26:04.651000+00:00           malcon       973
704    2020-09-25 19:26:03.667000+00:00           malcon       993
...
  • check out df.rolling() Commented Oct 26, 2020 at 22:37
  • If I understand correctly, you want to identify all instances where duration exceeds 800 for three consecutive samples. If so, you need to (1) convert your _time to a pandas timestamp using to_datetime, (2) sort your data by time using sort_values, and (3) filter your dataframe by duration > 800, as you have done. Then look at rolling or shift. Commented Oct 26, 2020 at 23:35
  • Thanks a lot guys, I'll try to do so today and share my insights here. Commented Oct 27, 2020 at 14:19
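
The steps suggested in the comments above could be sketched roughly as follows. This is only an illustration under assumptions: the rows are made-up sample data shaped like the question's CSV, the column name three_in_a_row is hypothetical, and "three consecutive samples" is read as the current sample plus the two before it, ignoring the host column.

```python
import pandas as pd

# Hypothetical sample rows shaped like the question's data (not real measurements).
df = pd.DataFrame({
    "_time": [
        "2020-09-26T10:56:30.000+0200",
        "2020-09-26T10:56:33.000+0200",
        "2020-09-26T10:56:31.000+0200",
        "2020-09-26T10:56:32.000+0200",
        "2020-09-26T10:56:34.000+0200",
    ],
    "host": ["malcon"] * 5,
    "duration": [850, 700, 878, 973, 900],
})

# (1) Convert the strings to timestamps and (2) sort by time.
df["_time"] = pd.to_datetime(df["_time"], utc=True)
df = df.sort_values("_time").reset_index(drop=True)

# (3) Flag samples over the threshold.
over = df["duration"] > 800

# A row is flagged when it and the two preceding samples all exceed 800.
# shift() moves the flags down by one or two rows; missing values count as False.
df["three_in_a_row"] = (
    over
    & over.shift(1, fill_value=False)
    & over.shift(2, fill_value=False)
)

print(df)
```

If the check should be done per host, one option would be to apply the same logic inside a groupby on the host column, but that goes beyond what the comments describe.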

1 Answer

If you're looking for any values >= 800 where you haven't recorded any values < 800 in the previous 3 minutes, this approach will work:

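import pandas as pd
from pandas.tseries.offsets import Minute

# Parse the first column (_time) as datetimes while reading.
data = pd.read_csv("example_all.csv", parse_dates=[0])

# Time-based rolling windows need the time column to be sorted.
data = data.sort_values('_time')


def all_over_800(values):
    # True only if every duration in the window is >= 800.
    return (values >= 800).all()


# For each row, look at the 3 minutes up to and including that row.
data['over_threshold'] = data[['_time', 'duration']].rolling(
    Minute(3), on='_time').apply(all_over_800)['duration']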

Note that, at the time of writing, the center window option is not implemented for datetime offset windows, so checking the 3 preceding (or succeeding, depending on sort order) minutes is the only option with this approach. If you don't mind sorting the dataframe twice, you can combine the preceding and succeeding results to check on both sides of your sample.
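
A minimal sketch of combining the preceding and succeeding windows is shown below. It makes several assumptions: the rows are made-up sample data with timezone-naive timestamps, the names preceding, succeeding and over_both_sides are hypothetical, the frame is reversed rather than sorted a second time, and rolling min over a 0/1 flag is used in place of the apply call above. It also relies on pandas accepting a monotonically decreasing time index for offset windows.

```python
import pandas as pd

# Hypothetical sample data (not taken from the question's CSV).
df = pd.DataFrame({
    "_time": pd.to_datetime([
        "2020-09-26 10:00:00", "2020-09-26 10:01:00", "2020-09-26 10:02:00",
        "2020-09-26 10:04:30", "2020-09-26 10:06:30", "2020-09-26 10:07:00",
    ]),
    "duration": [850, 900, 700, 950, 880, 990],
}).sort_values("_time").reset_index(drop=True)

# 1.0 where the sample meets the threshold, 0.0 otherwise, indexed by time.
over = pd.Series(
    (df["duration"] >= 800).astype(float).to_numpy(),
    index=pd.DatetimeIndex(df["_time"]),
)

# Ascending order: each window covers the 3 minutes up to and including the sample.
# The minimum of the 0/1 flags is 1.0 only if every value in the window is >= 800.
preceding = over.rolling("3min").min()

# Descending order: each window now covers the 3 minutes starting at the sample.
# The result is reversed again so it lines up with the ascending frame.
succeeding = over.iloc[::-1].rolling("3min").min().iloc[::-1]

# Combine positionally, so duplicate timestamps cannot cause alignment problems.
df["over_both_sides"] = (preceding.to_numpy() == 1.0) & (succeeding.to_numpy() == 1.0)

print(df)
```

With this sample, only the last two rows have no value under 800 within 3 minutes on either side.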

8 Comments

Thanks for the reply! I'm trying your code but I'm getting a value error: "ValueError: window must be an integer". Why is rolling complaining about the date format and requiring an integer?
I've tested the exact code from my answer with pandas 1.0 and pandas 1.1.3 and it works fine in both cases. The datetime offset has been supported since pandas 0.19 in any case, so I'm afraid I don't know what issue you might be having.
If you didn't copy and paste the code from my answer exactly, then the issue could be that the _time column is not a datetime in your dataframe. In my example it's read as a datetime column via the parse_dates argument in the read_csv method.
The code I pasted will add a new column that tells you whether the current row has a value over 800 and there are no other values under 800 in the preceding 3 minutes. After this you can do whatever you need with the dataframe. If you're having problems writing to CSV, I think you should probably ask a new question. Try print(data) to get a sense of the data in it.
Note also the code I pasted doesn't filter the dataframe. If you want to just keep the samples of interest in the 3 minute interval, that's a bit more complex and as far as I know will force you to sort twice to get forwards and backwards windows.
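
As a rough illustration of the points raised in these comments, the sketch below checks whether _time is really a datetime column before rolling, then keeps only the flagged rows. The CSV text, the io.StringIO stand-in for the file, and the names is_datetime_before, is_datetime_after and kept are assumptions made for the example. A non-datetime _time column is only one possible cause of the "window must be an integer" error.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for example_all.csv.
csv_text = """_time,host,duration
2020-09-26T10:56:33.630+0200,malcon,850
2020-09-26T10:57:10.000+0200,malcon2,878
2020-09-26T10:58:00.000+0200,malcon,700
2020-09-26T11:05:00.000+0200,malcon,973
"""

# Read without parse_dates: _time comes back as plain strings.
data = pd.read_csv(io.StringIO(csv_text))
is_datetime_before = pd.api.types.is_datetime64_any_dtype(data["_time"])

# Convert explicitly, then check again before using a time-based window.
data["_time"] = pd.to_datetime(data["_time"], utc=True)
is_datetime_after = pd.api.types.is_datetime64_any_dtype(data["_time"])

data = data.sort_values("_time")

# 1.0 if every duration in the 3 minutes up to and including the row is >= 800.
data["over_threshold"] = data.rolling("3min", on="_time")["duration"].apply(
    lambda win: float((win >= 800).all()), raw=True
)

# Keep only the flagged rows (this looks backwards in time only).
kept = data[data["over_threshold"] == 1.0]

print(is_datetime_before, is_datetime_after)
print(kept)
```

Printing data.dtypes right after read_csv is a quick way to see whether the parse_dates argument picked up the intended column.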