I'm new to Python and ML, and I'm trying to work with a CSV file and build a model that predicts how long a host takes to respond.
First, I parsed the logs from the CSV file with pandas. I now have a CSV file with the following columns, in this order (example rows shown):
     _time                         host     duration
202  2020-09-26T10:56:33.630+0200  malcon   850
203  2020-09-26T10:56:33.630+0200  malcon2  878
703  2020-09-25T21:26:04.651+0200  malcon   973
My first idea was to use an anomaly detection model, but maybe there is an easier way to do what I want. I'd like to find the duration values higher than 800 within 3-minute intervals of the timestamp and, based on the one week of data I have, predict those values.
I started with code that finds the duration values greater than or equal to 800, but I don't know how to associate them with the time and define the interval.
My code so far is:
import pandas as pd

# read_csv already returns a DataFrame; keep only the columns of interest
data = pd.read_csv("example_all.csv")
df = data[['_time', 'host', 'duration']]

# rows where the duration is 800 or more
high = df.loc[df['duration'] >= 800]
print(high)
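One possible way to tie the filtered rows to time is to parse `_time` and group the slow responses into fixed 3-minute bins per host. The following is only a minimal sketch under assumptions: the sample rows are made up to mirror the columns above, and the column name `slow_count` is hypothetical.

```python
import pandas as pd

# Hypothetical sample rows mirroring the columns in the question.
df = pd.DataFrame({
    "_time": [
        "2020-09-26T10:56:33.630+0200",
        "2020-09-26T10:56:33.630+0200",
        "2020-09-25T21:26:04.651+0200",
        "2020-09-25T21:27:10.000+0200",
    ],
    "host": ["malcon", "malcon2", "malcon", "malcon"],
    "duration": [850, 878, 973, 120],
})

# Parse the timestamps so pandas can do time arithmetic on them.
df["_time"] = pd.to_datetime(df["_time"], utc=True)

# Keep only the slow responses (duration of 800 or more).
high = df[df["duration"] >= 800]

# Count slow responses per host in fixed 3-minute bins.
counts = (
    high.groupby(["host", pd.Grouper(key="_time", freq="3min")])
        .size()
        .rename("slow_count")  # hypothetical column name
        .reset_index()
)
print(counts)
```

Fixed bins are the simpler option; a sliding 3-minute window would need a time-based `rolling` on a datetime index instead.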
Any hints and suggestions would be much appreciated! Thanks!
Update:
I'm trying to use the rolling function, but I don't think I understand it correctly and I got a bit lost.
As advised here, I convert the timestamps with the to_datetime function and sort the data by time. Unfortunately, I can't find a way to specify a 3-minute time interval in which the duration was higher than 800.
My code looks like this now:
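import pandas as pd

data = pd.read_csv("example_all.csv")
# utc expects a boolean, not the string 'true'
data["_time"] = pd.to_datetime(data["_time"], utc=True)
df = pd.DataFrame(data, columns=['_time', 'host', 'duration'])
# sort_values returns a sorted copy; it is not assigned here, so df keeps its original order
df.sort_values('_time')
high = df.loc[df['duration'] >= 800]
print(high)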
Output:
     _time                             host     duration
202  2020-09-26 08:56:33.630000+00:00  malcon   850
203  2020-09-26 08:56:33.630000+00:00  malcon2  850
702  2020-09-25 19:26:05.573000+00:00  malcon   878
703  2020-09-25 19:26:04.651000+00:00  malcon   973
704  2020-09-25 19:26:03.667000+00:00  malcon   993
...
df.rolling()
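For reference, `rolling` accepts a time offset such as `"3min"` when the DataFrame has a sorted datetime index. Below is a minimal sketch of that idea, assuming made-up rows for a single host; the column names `is_slow` and `slow_last_3min` are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the real data would come from example_all.csv.
df = pd.DataFrame({
    "_time": pd.to_datetime([
        "2020-09-25T21:26:04.651+0200",
        "2020-09-25T21:26:03.667+0200",
        "2020-09-25T21:40:00.000+0200",
        "2020-09-25T21:26:05.573+0200",
    ], utc=True),
    "host": ["malcon"] * 4,
    "duration": [973, 993, 300, 878],
})

# sort_values returns a new DataFrame, so the result has to be kept.
# A time-based rolling window needs a sorted datetime index.
df = df.sort_values("_time").set_index("_time")

# 1 where the response was slow (800 or more), 0 otherwise.
df["is_slow"] = (df["duration"] >= 800).astype(int)

# For each row, count the slow responses in the 3 minutes up to and
# including that row's timestamp.
df["slow_last_3min"] = df["is_slow"].rolling("3min").sum()
print(df)
```

With several hosts in the data, the same window could probably be applied per host via `df.groupby("host")["is_slow"].rolling("3min").sum()`, though that returns a result indexed by host and time and would need to be merged back.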