Locate multiple keywords in lines using Python

Question

I have a line like this:

20:28:26.684597 24:d5:6e:76:9s:10 (oui Unknown) > 45:83:r4:7u:9s:i2 (oui Unknown), ethertype 802.1Q (0x8100), length 78: vlan 64, p 0, ethertype IPv4, (tos 0x48, ttl 34, id 5643, offset 0, flags [none], proto TCP (6), length 60) 192.168.45.28.56982 > 172.68.54.28.webcache: Flags [S], cksum 0xg654 (correct), seq 576485934, win 65535, options [mss 1460,sackOK,TS val 2544789 ecr 0,wscale 0,eol], length 0

In this line I need to extract the ID value (5643) from "id 5643" and the port value (56982) from 192.168.45.28.56982. The keyword "id" and the address 192.168.45.28 are constant across lines.

I have written the script below. Please suggest a way to shorten the code, as my script involves multiple steps:

file = open('test.txt')
fi = file.readlines()

for line in fi:
    test = (line.split(","))
    for word2 in test:
        if "id" in word2:
            find2 = word2.split(" ")[-1]
            print("************", find2)
    for word in test:
        if "192.168.45.28" in word:
            find = word.split(".")
            print(find)
            for word1 in find:
                if ">" in word1:
                    find1 = word1.split(">")[0]
                    print(find1)


Just edited my question as per your suggestion. For such cases, is 'readlines' best suited, or is there a more efficient method available?
– Zoro99, Commented Mar 13, 2016 at 10:18

jDo · Accepted Answer · 2016-03-13 09:55:28Z

2

Same approach as the others, with a few differences: it won't add empty lists to your results, it compiles the regexes for efficiency, it doesn't read the whole file into memory in one go, and it doesn't use id as a variable name (it's a built-in function, so best to avoid it). There can be duplicates in the output (I couldn't just assume that you wanted unique entries only).

import re

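# Raw strings keep the backslashes in \d and \. intact
re_id = re.compile(r"id (\d+)")
re_ip = re.compile(r"192\.168\.45\.28\.(\d+)")

ids = []
ips = []

with open("test.txt", "r") as f:
    # Iterate over the file object so only one line is in memory at a time
    for line in f:
        id_res = re_id.findall(line)
        if id_res:  # skip lines without a match
            ids.append(id_res[0])
        ip_res = re_ip.findall(line)
        if ip_res:
            ips.append(ip_res[0])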
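
As a quick check of what the two patterns capture, here is a minimal sketch that applies them to a shortened copy of the sample line from the question. The shortening is an assumption made for readability; the full line should give the same captures, since the omitted parts contain no other "id <digits>" text and no other occurrence of the address.

```python
import re

re_id = re.compile(r"id (\d+)")
re_ip = re.compile(r"192\.168\.45\.28\.(\d+)")

# Shortened copy of the sample line from the question (assumption: the
# parts left out contain no further matches for either pattern).
sample = ("(tos 0x48, ttl 34, id 5643, offset 0, flags [none], proto TCP (6), "
          "length 60) 192.168.45.28.56982 > 172.68.54.28.webcache: Flags [S]")

# findall returns a list with the text captured by the single group
id_res = re_id.findall(sample)
ip_res = re_ip.findall(sample)

print(id_res, ip_res)  # ['5643'] ['56982']
```

Note that the captured values are strings; convert them with int() if you need numbers.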

edited Mar 13, 2016 at 9:55

answered Mar 13, 2016 at 8:29

jDo


dantiston · Accepted Answer · 2016-03-14 04:07:07Z

2

You could use regular expressions:

import re

# This searches for the literal "id"
# followed by a space and one or more digits
idPattern = re.compile(r"id (\d+)")
# This searches for your IP followed by
# a dot and one or more digits
ipPattern = re.compile(r"192\.168\.45\.28\.(\d+)")

with open("test.txt", 'r') as data:
    for line in data:
        # findall returns a list of the captured digit strings (empty if no match).
        # The names id1/ip1 avoid shadowing the built-in id().
        id1 = idPattern.findall(line)
        ip1 = ipPattern.findall(line)

See the Python regular expression docs

edited Mar 14, 2016 at 4:07

answered Mar 13, 2016 at 7:46

dantiston


4 Comments

Zoro99 Over a year ago

Got the following error: "AttributeError: 'set' object has no attribute 'extend'". But I want the values to be stored in variables id1 and ip1 for every line, as I need to perform some more operations on them. Could you please suggest code for that?

jDo Over a year ago

@dantiston Are you sure set() has extend? It's a list method. Didn't you mean set.add()?

dantiston Over a year ago

@jDo you're right, I wrote and tested as a list and forgot to change extend when I switched to set.

dantiston Over a year ago

@Zoro99 I updated the code to store the results at each line.
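
One possible way to get a pair of values per line, as requested in the comments, is a small generator. This is only a sketch: the function name extract_fields is hypothetical, and the \b word boundary in the id pattern is an added assumption (it keeps the pattern from matching inside words such as "valid 12"). It uses search rather than findall because only the first occurrence per line is wanted here.

```python
import re

ID_PATTERN = re.compile(r"\bid (\d+)")          # \b is an added assumption
PORT_PATTERN = re.compile(r"192\.168\.45\.28\.(\d+)")

def extract_fields(lines):
    """Yield (id1, ip1) as strings for each line that contains both values."""
    for line in lines:
        id_match = ID_PATTERN.search(line)
        port_match = PORT_PATTERN.search(line)
        if id_match and port_match:
            yield id_match.group(1), port_match.group(1)

# Illustrative input; any iterable of strings works, including an open file.
lines = [
    "ttl 34, id 5643, offset 0 ... 192.168.45.28.56982 > 172.68.54.28.webcache",
    "a line without either keyword",
]
results = list(extract_fields(lines))
print(results)  # [('5643', '56982')]
```

With a real file this could be used as `for id1, ip1 in extract_fields(f):` inside a `with open("test.txt") as f:` block, which still reads one line at a time.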

BramV · Accepted Answer · 2016-03-13 11:17:21Z

0

You can use a regex. Some more info here: https://docs.python.org/2/library/re.html

You could write it like this:

import re

# readlines() loads the whole file into a list of lines
with open('test.txt') as file:
    fi = file.readlines()

for line in fi:
    # re.match anchors at the start of the line, hence the leading .*
    match = re.match(r'.*id (\d+).*', line)
    if match:
        print("************ %s" % match.group(1))
    match = re.match(r'.*192\.168\.45\.28\.(\d+).*', line)
    if match:
        print(match.group(1))

**update**

As jDo pointed out, it is better to use findall, compile the regex up front and not use readlines, so you will get something like this:

import re

re_id = re.compile(r"id (\d+)")
re_ip = re.compile(r"192\.168\.45\.28\.(\d+)")
with open("test.txt", "r") as f:
    for line in f:
        # findall returns a list of captured strings, not a match object
        matches = re_id.findall(line)
        if matches:
            print("************ %s" % matches[0])
        matches = re_ip.findall(line)
        if matches:
            print(matches[0])

edited Mar 13, 2016 at 11:17

answered Mar 13, 2016 at 7:42

BramV


7 Comments

Zoro99 Over a year ago

It didn't give any output, though the script executed fine.

BramV Over a year ago

I think the regex wasn't fully correct. I updated it. I quickly tested it here and it should work.

jDo Over a year ago

You're reading the whole file into memory though. As someone pointed out here "The efficient way to use readlines() is to not use it. Ever." Also, compile your regex for extra efficiency and use findall to search within strings rather than from the beginning (then you could do away with the asterisks)

BramV Over a year ago

You are right, but he only asked for shorter code, not for memory optimisation.

jDo Over a year ago

@BramV Well, I guess it's a matter of definition whether or not avoiding something you should almost never use can be called an "optimisation" :D
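
The remark in the comments about matching "from the beginning" can be illustrated with a short sketch. The sample string is an illustrative fragment, not taken verbatim from the question. re.match only succeeds if the pattern matches at the start of the string, while re.search (like findall) scans the whole string, so the surrounding `.*` is not needed.

```python
import re

line = "ttl 34, id 5643, offset 0"  # illustrative fragment

# re.match is anchored at position 0, so this finds nothing
anchored = re.match(r"id (\d+)", line)

# Padding with .* lets re.match succeed, at the cost of a noisier pattern
padded = re.match(r".*id (\d+).*", line)

# re.search scans the string for the first occurrence
searched = re.search(r"id (\d+)", line)

print(anchored)            # None
print(padded.group(1))     # 5643
print(searched.group(1))   # 5643
```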
