How do I extract the IP address that occurs 10 times within a one-second time interval?

In the following case:

241.7118.197.10

28.252.8

2 Answers

You could collect the data into a dict where each IP is a key and the value holds the timestamps seen for that IP. Then, every time a timestamp is added, you can check whether that IP has three timestamps within one second:

from datetime import datetime, timedelta
from collections import defaultdict, deque
import re

THRESHOLD = timedelta(seconds=1)
COUNT = 3

res = set()
d = defaultdict(deque)

with open('test.txt') as f:
    for line in f:
        # Capture IP and timestamp
        m = re.match(r'(\S*)[^\[]*\[(\S*)', line)
        if m is None:
            continue  # skip lines that contain no "[" (e.g. blank lines)
        ip, dt = m.groups()

        # Parse timestamp
        dt = datetime.strptime(dt, '%d/%b/%Y:%H:%M:%S:%f')

        # Remove timestamps from deque if they are older than threshold
        que = d[ip]
        while que and (dt - que[0]) > THRESHOLD:
            que.popleft()

        # Add timestamp; record the IP once COUNT or more items are in the window
        que.append(dt)
        if len(que) >= COUNT:
            res.add(ip)

print(res)

Result:

{'28.252.89.140'}

The code above reads the log file line by line. For every line, a regular expression captures the data in two groups: IP and timestamp. Then strptime is used to parse the time.

The first group (\S*) captures a run of non-whitespace characters, which is the IP. Then [^\[]* matches everything up to the first [, and \[ matches that bracket, the last character before the timestamp. Finally, (\S*) is used again to capture everything up to the next whitespace, which is the timestamp. See the example on regex101.
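To see the two groups in isolation, here is a minimal sketch that applies the same pattern to a single line. The sample line is hypothetical (the question's log format is not fully shown here), and the names sample_line, raw_ts and ts are only for illustration:

```python
import re
from datetime import datetime

# Hypothetical log line; the real file's exact layout is an assumption.
sample_line = '10.0.0.1 - - [13/Jan/2016:10:00:00:123 -0800] "GET / HTTP/1.1" 200 512'

# Same pattern as in the answer: group 1 is the IP, group 2 the timestamp.
m = re.match(r'(\S*)[^\[]*\[(\S*)', sample_line)
ip, raw_ts = m.groups()

# %f accepts one to six digits, so "123" is read as 123000 microseconds.
ts = datetime.strptime(raw_ts, '%d/%b/%Y:%H:%M:%S:%f')

print(ip)      # 10.0.0.1
print(raw_ts)  # 13/Jan/2016:10:00:00:123
print(ts.microsecond)  # 123000
```

Note that the timezone offset after the space is not captured, so the parsed value is a naive datetime.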

Once we have the IP and the time, they are added to a defaultdict where the IP is the key and the value is a deque of timestamps. Before a new timestamp is added, existing ones are removed if they are more than THRESHOLD older than it. This assumes that the log lines are already sorted by time. After the addition the length is checked, and if there are COUNT or more items in the deque, the IP is added to the result set.
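If the lines might not be sorted by time, one possible variation (not part of the answer above) is to group the timestamps per IP, sort each group, and compare each timestamp with the one COUNT - 1 positions later. The sketch below assumes the (ip, datetime) pairs have already been parsed; the function name bursty_ips and the sample events are hypothetical:

```python
from collections import defaultdict
from datetime import datetime, timedelta


def bursty_ips(events, count=3, threshold=timedelta(seconds=1)):
    """Return IPs that have at least `count` timestamps within `threshold`.

    `events` is an iterable of (ip, datetime) pairs in any order.
    """
    by_ip = defaultdict(list)
    for ip, ts in events:
        by_ip[ip].append(ts)

    result = set()
    for ip, stamps in by_ip.items():
        stamps.sort()
        # Pair each timestamp with the one count-1 positions after it.
        for first, last in zip(stamps, stamps[count - 1:]):
            if last - first <= threshold:
                result.add(ip)
                break
    return result


# Hypothetical, deliberately unsorted events.
base = datetime(2016, 1, 13, 10, 0, 0)
events = [
    ('10.0.0.1', base + timedelta(milliseconds=900)),
    ('10.0.0.2', base),
    ('10.0.0.1', base),
    ('10.0.0.2', base + timedelta(seconds=2)),
    ('10.0.0.1', base + timedelta(milliseconds=400)),
    ('10.0.0.2', base + timedelta(seconds=4)),
]
print(bursty_ips(events))  # {'10.0.0.1'}
```

This holds all timestamps in memory, so the deque approach is probably the better fit for large logs that are already in order.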


6 Comments

So each dict[ip_addr] contains a queue?
Yes, a deque of timestamps from which the oldest items may be removed every time a new timestamp is added.
@Maria Added explanation and link to regex101.
@niemmi Nice solution. I would add the number of occurrences as a constant, the same way you did with THRESHOLD.
@Fomalhaut Thanks for the suggestion, updated the answer accordingly.

The first step is to parse the data. You can do so with this:

data = [(ip, datetime.datetime.strptime(time, '%d/%b/%Y:%H:%M:%S:%f')) for (ip, time) in re.findall(r"((?:[0-9]{1,3}\.){3}[0-9]{1,3}).+?\[(.+?) -", text)]

where text is the input text.

This returns a list with one tuple per entry. The first element of each tuple is the IP address and the second is the parsed date.
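As a rough illustration of the shape of data, here is a small self-contained sketch. The two log lines are hypothetical, and it assumes `import datetime` (the module), so strptime is reached as datetime.datetime.strptime. Note that the pattern expects " -" right after the timestamp, so it relies on a negative timezone offset such as -0800:

```python
import datetime
import re

# Hypothetical input text; the real log's layout is an assumption.
text = (
    '10.0.0.1 - - [13/Jan/2016:10:00:00:123 -0800] "GET / HTTP/1.1" 200 512\n'
    '10.0.0.2 - - [13/Jan/2016:10:00:01:500 -0800] "GET /a HTTP/1.1" 200 128\n'
)

# Group 1 is a dotted-quad IP, group 2 the timestamp before " -".
pattern = r"((?:[0-9]{1,3}\.){3}[0-9]{1,3}).+?\[(.+?) -"
data = [
    (ip, datetime.datetime.strptime(time, '%d/%b/%Y:%H:%M:%S:%f'))
    for ip, time in re.findall(pattern, text)
]

# Each entry is an (ip, datetime) tuple.
for ip, stamp in data:
    print(ip, stamp.isoformat())
```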

The next step is to find the IPs that have three entries within a one-second interval:

print({a[0] for a in data for b in data for c in data
       if a[0] == b[0] == c[0]
       and datetime.timedelta(seconds=0) < a[1] - b[1] < datetime.timedelta(seconds=1)
       and datetime.timedelta(seconds=0) < a[1] - c[1] < datetime.timedelta(seconds=1)
       and datetime.timedelta(seconds=0) < b[1] - c[1] < datetime.timedelta(seconds=1)})

Output:

{'28.252.89.140'}

2 Comments

Which file? You only specified the input text. If that text is in a file, you should read it first.
To use these two lines you will need two modules: re and datetime.
