How to compare strings of lines from two different files using python?

Question

I want to find the email addresses that appear in both of two files, along with the date each message was sent. I have two files: 1) maillog.txt (a Postfix mail log) and 2) testmail.txt (email addresses, one per line). I have used re to extract the email address and sent date from maillog.txt, which looks like this:

Nov  3 10:08:43 server postfix/smtp[150754]: 78FA8209EDEF: to=<[email protected]>, relay=aspmx.l.google.com[74.125.24.26]:25, delay=3.2, delays=0.1/0/1.6/1.5, dsn=2.0.0, status=sent (250 2.0.0 OK 1509718076 m11si5060862pls.447 - gsmtp)
Nov  3 10:10:45 server postfix/smtp[150754]: 7C42A209EDEF: to=<[email protected]>, relay=mxa-000f9e01.gslb.pphosted.com[67.231.152.217]:25, delay=5.4, delays=0.1/0/3.8/1.5, dsn=2.0.0, status=sent (250 2.0.0 2dvkvt5tgc-1 Message accepted for delivery)
Nov  3 10:15:45 server postfix/smtp[150754]: 83533209EDE8: to=<[email protected]>, relay=mxa-000f9e01.gslb.pphosted.com[67.231.144.222]:25, delay=4.8, delays=0.1/0/3.3/1.5, dsn=2.0.0, status=sent (250 2.0.0 2dvm8yww64-1 Message accepted for delivery)
Nov  3 10:16:42 server postfix/smtp[150754]: 83A5E209EDEF: to=<[email protected]>, relay=aspmx.l.google.com[74.125.200.27]:25, delay=1.6, delays=0.1/0/0.82/0.69, dsn=2.0.0, status=sent (250 2.0.0 OK 1509718555 j186si6198120pgc.455 - gsmtp)
Nov  3 10:17:44 server postfix/smtp[150754]: 8636D209EDEF: to=<[email protected]>, relay=mxa-000f9e01.gslb.pphosted.com[67.231.144.222]:25, delay=4.1, delays=0.11/0/2.6/1.4, dsn=2.0.0, status=sent (250 2.0.0 2dvm8ywwdh-1 Message accepted for delivery)
Nov  3 10:18:42 server postfix/smtp[150754]: 8A014209EDEF: to=<[email protected]>, relay=aspmx.l.google.com[74.125.200.27]:25, delay=1.9, delays=0.1/0/0.72/1.1, dsn=2.0.0, status=sent (250 2.0.0 OK 1509718675 o2si6032950pgp.46 - gsmtp)

Here is my other file, testmail.txt:

[email protected]
[email protected]

Below is what I have tried. It works, but I want to know whether there is a more efficient way to do this for a large number of mail logs and email addresses:

import re
pattern=r'(?P<month>[A-Za-z]{3})\s{1,3}(?P<day>\d{1,2})\s{1,2}(?P<ts>\d+:\d+:\d+).*to=<(?P<email>([\w\.-]+)@([\w\.-]+))'
with open("testmail.txt") as fh1:
    for addr in fh1:
        if addr:
            with open("maillog.txt") as fh:
                for line in fh:
                    if line:
                        match=re.finditer(pattern,line)
                        for obj in match:
                            addr=addr.strip()
                            addr2=obj.group('email').strip()
                            if addr == addr2:
                                print(obj.groupdict('email'))

This prints output like the following when a match is found:

{'month': 'Nov', 'day': '3', 'ts': '10:08:43', 'email': '[email protected]'}

But in general, you don't want to read maillog over and over again for each test mail. Instead, read the test mails into a set and then scan through maillog once, testing whether the address on each line is in the set.
– Ilja Everilä, Commented Nov 21, 2017 at 7:09
Which one is larger, maillog.txt or testmail.txt? The common way is to load the smaller file into memory (if possible in a dict or a set, for faster lookups) and then scan the larger one a line at a time.
– Serge Ballesta, Commented Nov 21, 2017 at 7:38
@SergeBallesta The file "testmail.txt" is smaller. I guess I should put this file in a set and compare against the large maillog file while reading it line by line.
– sherpaurgen, Commented Nov 21, 2017 at 8:56

Serge Ballesta · Accepted Answer · 2017-11-21 09:16:20Z

My advice would be to store all the emails from testmail.txt in a set, compile the regex, and then iterate over the lines of maillog.txt and check whether the address on each line is in the set. That way, only the shorter of the files has to reside in memory, the regex pattern is only compiled once, and lookups are done in a set, which is optimized for this kind of access:

import re
pattern=r'(?P<month>[A-Za-z]{3})\s{1,3}(?P<day>\d{1,2})\s{1,2}(?P<ts>\d+:\d+:\d+).*to=<(?P<email>([\w\.-]+)@([\w\.-]+))'

# load the testmail file into a set
mails = set()
with open('testmail.txt') as fd:
    for line in fd:
        mails.add(line.strip())

#compile the regex once
rx = re.compile(pattern)

#process the maillog file:
with open('maillog.txt') as fd:
    for line in fd:
        m = rx.match(line)
        if m is not None and m.groupdict()['email'] in mails:
            print(m.groupdict())

The output with your example data is as expected:

{'month': 'Nov', 'day': '3', 'ts': '10:08:43', 'email': '[email protected]'}
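
The same idea can be wrapped in two small functions so that it works on any iterable of lines, not only on open files. This is a minimal sketch, not part of the original answer: the names `load_addresses` and `find_deliveries` and the sample data are made up for illustration, and it assumes that blank lines should be ignored and that addresses should compare case-insensitively, which the code above does not do.

```python
import re

# Same pattern as in the answer above, split over two lines for readability.
LOG_RX = re.compile(
    r'(?P<month>[A-Za-z]{3})\s{1,3}(?P<day>\d{1,2})\s{1,2}'
    r'(?P<ts>\d+:\d+:\d+).*to=<(?P<email>([\w\.-]+)@([\w\.-]+))'
)


def load_addresses(lines):
    """Return a set of lower-cased addresses, skipping blank lines (assumption)."""
    return {line.strip().lower() for line in lines if line.strip()}


def find_deliveries(log_lines, wanted):
    """Yield the named groups of every log line whose recipient is in `wanted`."""
    for line in log_lines:
        m = LOG_RX.match(line)
        if m is not None and m.group('email').lower() in wanted:
            yield m.groupdict()


# Hypothetical sample data standing in for testmail.txt and maillog.txt.
sample_addresses = ["Alice@Example.com\n", "\n", "carol@example.org\n"]
sample_log = [
    "Nov  3 10:08:43 server postfix/smtp[150754]: 78FA8209EDEF: "
    "to=<alice@example.com>, relay=mx.example.com[192.0.2.1]:25, status=sent\n",
    "Nov  3 10:10:45 server postfix/smtp[150754]: 7C42A209EDEF: "
    "to=<bob@example.net>, relay=mx.example.net[192.0.2.2]:25, status=sent\n",
]

wanted = load_addresses(sample_addresses)
results = list(find_deliveries(sample_log, wanted))
for row in results:
    print(row)
```

With real files, the open file objects can be passed in directly, for example `load_addresses(fd)` inside a `with open('testmail.txt') as fd:` block, because a file object is itself an iterable of lines. The generator keeps only one log line in memory at a time.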

theBuzzyCoder · Accepted Answer · 2017-11-21 07:28:00Z

This is my solution

In [1]: import re

In [2]: pat = r'(?P<month>[A-Za-z]{3})\s{1,3}(?P<day>\d{1,2})\s{1,2}(?P<ts>\d+:\d+:\d+).*to=<(?P<email>([\w\.-]+)@([\w\.-]+))'

In [3]: emails = set()

In [4]: date_email = {}

In [6]: with open('maillog.txt', mode='r') as f:
   ...:     for line in f:
   ...:         month, day, ts, email = re.search(pat, line).group('month', 'day', 'ts', 'email')
   ...:         date_email[email] = (month, day, ts)
   ...:         

In [7]: date_email
Out[7]: 
{'[email protected]': ('Nov', '3', '10:08:43'),
 '[email protected]': ('Nov', '3', '10:10:45'),
 '[email protected]': ('Nov', '3', '10:16:42'),
 '[email protected]': ('Nov', '3', '10:15:45'),
 '[email protected]': ('Nov', '3', '10:18:42'),
 '[email protected]': ('Nov', '3', '10:17:44')}

In [11]: with open('testmail.txt', mode='r') as f:
    ...:     for line in f:
    ...:         emails.add(line.strip())
    ...:         

In [12]: emails
Out[12]: {'[email protected]', '[email protected]'}

In [15]: for email in emails:
    ...:     if email in date_email:
    ...:         print(email, date_email[email])
    ...:         
('[email protected]', ('Nov', '3', '10:08:43'))

You can format the output the way you want.

The open calls can be combined in a single with statement, like this:

with open(file1, mode='r') as f1, open(file2, mode='r') as f2:
    # do something with f1
    # do something with f2

IMHO, since the files were said to be large, it would be better to load only one of them into memory and scan the other...
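
One way to follow that advice while keeping the dictionary idea is to load only the wanted addresses and record log entries for those alone. The sketch below is an illustration under stated assumptions, not the answerer's code: the function name `collect_sent_dates` and the sample lines are hypothetical. It also differs from the session above in two ways: lines that do not match the pattern are skipped instead of raising an AttributeError, and every delivery is kept in a list, whereas `date_email[email] = ...` keeps only the last one per address.

```python
import re
from collections import defaultdict

pat = re.compile(
    r'(?P<month>[A-Za-z]{3})\s{1,3}(?P<day>\d{1,2})\s{1,2}'
    r'(?P<ts>\d+:\d+:\d+).*to=<(?P<email>([\w\.-]+)@([\w\.-]+))'
)


def collect_sent_dates(log_lines, emails):
    """Map each wanted address to a list of (month, day, ts) tuples."""
    date_email = defaultdict(list)
    for line in log_lines:
        m = pat.search(line)
        if m is None:
            continue  # not a delivery line; skip it instead of raising
        email = m.group('email')
        if email in emails:
            date_email[email].append(m.group('month', 'day', 'ts'))
    return dict(date_email)


# Hypothetical data: one wanted address receives two messages.
emails = {"alice@example.com"}
log_lines = [
    "Nov  3 10:08:43 server postfix/smtp[1]: A1: to=<alice@example.com>, status=sent",
    "Nov  3 10:09:00 server postfix/qmgr[2]: A2: removed",
    "Nov  3 10:10:45 server postfix/smtp[1]: A3: to=<bob@example.net>, status=sent",
    "Nov  4 11:00:01 server postfix/smtp[1]: A4: to=<alice@example.com>, status=sent",
]

sent = collect_sent_dates(log_lines, emails)
print(sent)
```

Memory use then grows with the number of matching deliveries rather than with the size of the whole log.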

tripleee · Accepted Answer · 2017-11-21 07:58:07Z

Quick and untested but simple enough conceptually: compile a big whoppin' regex with all the addresses already in it.

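import re

with open("testmail.txt") as fh1:
    # Escape each address so characters like "." are matched literally;
    # skip blank lines, which would add an empty alternative that matches anything.
    emails = [re.escape(addr.strip()) for addr in fh1 if addr.strip()]

# The closing ">" keeps an address from matching as a prefix of a longer one.
pattern = re.compile(
    r'(?P<month>[A-Za-z]{3})\s{1,3}(?P<day>\d{1,2})\s{1,2}(?P<ts>\d+:\d+:\d+).*to=<(?P<email>%s)>' %
    '|'.join(emails))

with open("maillog.txt") as fh:
    for line in fh:
        for match in pattern.finditer(line):
            print(match.groupdict())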
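
When the addresses are baked into the regex, it can matter whether the alternation is followed by the closing `>` of the log field. The following is a small sketch with made-up addresses and log lines; `build_pattern` and its `anchored` flag are hypothetical helpers introduced only for this comparison.

```python
import re

PREFIX = (r'(?P<month>[A-Za-z]{3})\s{1,3}(?P<day>\d{1,2})\s{1,2}'
          r'(?P<ts>\d+:\d+:\d+).*to=<')


def build_pattern(addresses, anchored=True):
    """Compile one regex that matches any of the given addresses."""
    escaped = [re.escape(a.strip()) for a in addresses if a.strip()]
    alternation = '(?P<email>%s)' % '|'.join(escaped)
    return re.compile(PREFIX + alternation + ('>' if anchored else ''))


# Hypothetical address list and log lines.
addresses = ["bob@example.com\n"]
exact = "Nov  3 10:08:43 server postfix/smtp[1]: A1: to=<bob@example.com>, status=sent"
longer = "Nov  3 10:09:43 server postfix/smtp[1]: A2: to=<bob@example.com.au>, status=sent"

loose = build_pattern(addresses, anchored=False)
strict = build_pattern(addresses, anchored=True)

print(bool(loose.search(longer)))   # the unanchored pattern accepts the longer address
print(bool(strict.search(longer)))  # the anchored pattern rejects it
print(strict.search(exact).groupdict())
```

Without the trailing `>`, `bob@example.com` also matches a log line addressed to `bob@example.com.au`, because the regex is satisfied as soon as the listed address has been consumed.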

Aaditya Ura · Accepted Answer · 2017-11-21 08:05:20Z

You can try with a regex and capture groups:

Let's solve your problem in three steps:

First step: capture all email addresses from emails.txt:

emails=[]
with open('emails.txt','r') as f:
    for line in f:
        emails.append(re.search(email_pattern,line).group())

Second step: capture the needed data from data.txt:

with open('data.txt','r') as f:
    month_day=[[find.group(4) if find.group(4) != None else [find.group(1), find.group(2), find.group(3)] for find in re.finditer(pattern,line)]for line in f]

Third step: now that we have all the data, check whether each email is in our list and, if so, add that group's info to a dict:

for item in month_day:
    final_dict = {}
    if item[1] in emails:
        final_dict['month'] = item[0][0]
        final_dict['day'] = item[0][1]
        final_dict['ts'] = item[0][2]
        final_dict['email'] = item[1]
    if final_dict:
        print(final_dict)

Full code:

import re
# Raw strings keep backslash sequences such as \w and \s intact for the regex engine.
pattern=r'^(\w{0,3})\s.(\d)\s(\d.+?\s)|<(\w+[@]\w+[.]\w+)>'
email_pattern=r'\w+[@]\w+[.]\w+'

emails=[]
with open('emails.txt','r') as f:
    for line in f:
        emails.append(re.search(email_pattern,line).group())
with open('data.txt','r') as f:
    month_day=[[find.group(4) if find.group(4) != None else [find.group(1), find.group(2), find.group(3)] for find in re.finditer(pattern,line)]for line in f]


for item in month_day:
    final_dict = {}
    if item[1] in emails:
        final_dict['month'] = item[0][0]
        final_dict['day'] = item[0][1]
        final_dict['ts'] = item[0][2]
        final_dict['email'] = item[1]
    if final_dict:
        print(final_dict)

output:

{'ts': '10:08:43 ', 'month': 'Nov', 'email': '[email protected]', 'day': '3'}

Regex information :

^ asserts position at the start of the line
\w{0,3} matches between zero and three word characters (equal to [a-zA-Z0-9_])
\s matches any whitespace character (equal to [\r\n\t\f\v ])
\d matches a digit (equal to [0-9])
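
To see how these pieces behave together, the pattern can be run against a single line. This is an illustrative sketch only: the log line and address are made up, and the second check is there to show a limitation that follows from the `\s.(\d)` part of the pattern.

```python
import re

# Pattern from the answer above, written as a raw string.
pattern = r'^(\w{0,3})\s.(\d)\s(\d.+?\s)|<(\w+[@]\w+[.]\w+)>'

# Hypothetical log line with a made-up address.
line = ("Nov  3 10:08:43 server postfix/smtp[150754]: 78FA8209EDEF: "
        "to=<user@example.com>, relay=mx.example.com[192.0.2.1]:25, status=sent")

# The alternation produces two separate matches per line:
# one for the date part (groups 1-3) and one for the address (group 4).
matches = [m.groups() for m in re.finditer(pattern, line)]
for groups in matches:
    print(groups)

# With a two-digit day, "." consumes the first digit,
# so group 2 holds only the second digit.
two_digit = re.match(pattern, "Nov 13 10:08:43 server ...")
print(two_digit.group(2))
```

Because the date and the address arrive as two matches, the code above indexes them as `item[0]` and `item[1]`. The timestamp group also includes its trailing space, which is why the output shows `'10:08:43 '`.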
