
I want to find the email addresses that appear in both of two files, along with the date each message was sent. I have two files: 1) maillog.txt (a Postfix mail log) and 2) testmail.txt (email addresses, one per line). I have used re to extract the email address and sent date from maillog.txt, which looks like this:

Nov  3 10:08:43 server postfix/smtp[150754]: 78FA8209EDEF: to=<[email protected]>, relay=aspmx.l.google.com[74.125.24.26]:25, delay=3.2, delays=0.1/0/1.6/1.5, dsn=2.0.0, status=sent (250 2.0.0 OK 1509718076 m11si5060862pls.447 - gsmtp)
Nov  3 10:10:45 server postfix/smtp[150754]: 7C42A209EDEF: to=<[email protected]>, relay=mxa-000f9e01.gslb.pphosted.com[67.231.152.217]:25, delay=5.4, delays=0.1/0/3.8/1.5, dsn=2.0.0, status=sent (250 2.0.0 2dvkvt5tgc-1 Message accepted for delivery)
Nov  3 10:15:45 server postfix/smtp[150754]: 83533209EDE8: to=<[email protected]>, relay=mxa-000f9e01.gslb.pphosted.com[67.231.144.222]:25, delay=4.8, delays=0.1/0/3.3/1.5, dsn=2.0.0, status=sent (250 2.0.0 2dvm8yww64-1 Message accepted for delivery)
Nov  3 10:16:42 server postfix/smtp[150754]: 83A5E209EDEF: to=<[email protected]>, relay=aspmx.l.google.com[74.125.200.27]:25, delay=1.6, delays=0.1/0/0.82/0.69, dsn=2.0.0, status=sent (250 2.0.0 OK 1509718555 j186si6198120pgc.455 - gsmtp)
Nov  3 10:17:44 server postfix/smtp[150754]: 8636D209EDEF: to=<[email protected]>, relay=mxa-000f9e01.gslb.pphosted.com[67.231.144.222]:25, delay=4.1, delays=0.11/0/2.6/1.4, dsn=2.0.0, status=sent (250 2.0.0 2dvm8ywwdh-1 Message accepted for delivery)
Nov  3 10:18:42 server postfix/smtp[150754]: 8A014209EDEF: to=<[email protected]>, relay=aspmx.l.google.com[74.125.200.27]:25, delay=1.9, delays=0.1/0/0.72/1.1, dsn=2.0.0, status=sent (250 2.0.0 OK 1509718675 o2si6032950pgp.46 - gsmtp)

Here is my other file, testmail.txt:

[email protected]
[email protected]

Below is what I have tried. It works, but I want to know if there is a more efficient way to do this for a large number of mail logs and email addresses:

import re
pattern=r'(?P<month>[A-Za-z]{3})\s{1,3}(?P<day>\d{1,2})\s{1,2}(?P<ts>\d+:\d+:\d+).*to=<(?P<email>([\w\.-]+)@([\w\.-]+))'
with open("testmail.txt") as fh1:
    for addr in fh1:
        if addr:
            with open("maillog.txt") as fh:
                for line in fh:
                    if line:
                        match=re.finditer(pattern,line)
                        for obj in match:
                            addr=addr.strip()
                            addr2=obj.group('email').strip()
                            if addr == addr2:
                                print(obj.groupdict('email'))

If a match is found, this prints output like:

{'month': 'Nov', 'day': '3', 'ts': '10:08:43', 'email': '[email protected]'}
  • This might be better suited to code review. Commented Nov 21, 2017 at 7:03
  • But in general, you don't want to read maillog over and over again for each test mail. Instead, read the test mails into a set and then scan through maillog once, testing whether the address in each row is in the set. Commented Nov 21, 2017 at 7:09
  • Which one is larger, maillog.txt or testmail? The common way is to load the smaller file into memory (if possible in a dict or a set for faster lookups) and then scan the larger one line at a time. Commented Nov 21, 2017 at 7:38
  • @SergeBallesta The file "testmail.txt" is smaller. I guess I should put this file in a set and compare against the large maillog file while reading it line by line. Commented Nov 21, 2017 at 8:56

4 Answers


My advice would be to store all the emails from testmail.txt in a set, compile the regex, and then iterate over the lines of maillog.txt, checking whether each address is in the set. That way, only the shorter of the files has to reside in memory, the regex pattern is only compiled once, and lookups are done in a set, which is optimized for this kind of access:

import re
pattern=r'(?P<month>[A-Za-z]{3})\s{1,3}(?P<day>\d{1,2})\s{1,2}(?P<ts>\d+:\d+:\d+).*to=<(?P<email>([\w\.-]+)@([\w\.-]+))'

# load the testmail file into a set
mails = set()
with open('testmail.txt') as fd:
    for line in fd:
        mails.add(line.strip())

#compile the regex once
rx = re.compile(pattern)

#process the maillog file:
with open('maillog.txt') as fd:
    for line in fd:
        m = rx.match(line)
        if m is not None and m.groupdict()['email'] in mails:
            print(m.groupdict())

The output with your example data is as expected:

{'month': 'Nov', 'day': '3', 'ts': '10:08:43', 'email': '[email protected]'}
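
As a rough sketch of the same set-based idea, the scan can be wrapped in a generator so it works on any iterable of lines (an open file or a list). The names `find_sent` and `LOG_RX`, the `status=sent` pre-filter, the closing `>` in the pattern, and the example.com addresses are all assumptions added for illustration; they are not part of the code above.

```python
import re

# Same named groups as the pattern above; the closing '>' is this sketch's addition.
LOG_RX = re.compile(
    r'(?P<month>[A-Za-z]{3})\s{1,3}(?P<day>\d{1,2})\s{1,2}'
    r'(?P<ts>\d+:\d+:\d+).*to=<(?P<email>[\w.-]+@[\w.-]+)>'
)


def find_sent(log_lines, wanted_lines):
    """Yield month/day/ts/email dicts for delivered mail to a wanted address."""
    # Strip newlines and drop blank lines before building the lookup set.
    wanted = {w.strip() for w in wanted_lines if w.strip()}
    for line in log_lines:
        # Assumption: only deliveries matter, so skip anything not marked sent.
        if 'status=sent' not in line:
            continue
        m = LOG_RX.match(line)
        if m is not None and m.group('email') in wanted:
            yield m.groupdict()


# Made-up sample data (example.com / example.org are placeholder domains).
sample_log = [
    'Nov  3 10:08:43 server postfix/smtp[1]: A1: to=<alice@example.com>, dsn=2.0.0, status=sent (250 OK)',
    'Nov  3 10:10:45 server postfix/smtp[1]: B2: to=<bob@example.org>, dsn=2.0.0, status=sent (250 OK)',
    'Nov  3 10:11:00 server postfix/smtp[1]: C3: to=<alice@example.com>, dsn=4.0.0, status=deferred (timeout)',
    'Nov  3 10:12:00 server postfix/qmgr[2]: A1: removed',
]
sample_wanted = ['alice@example.com\n', '\n']

for hit in find_sent(sample_log, sample_wanted):
    print(hit)
```

With real files you would pass the open file objects instead of the lists, so the log is still read one line at a time.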

This is my solution

In [1]: import re

In [2]: pat = r'(?P<month>[A-Za-z]{3})\s{1,3}(?P<day>\d{1,2})\s{1,2}(?P<ts>\d+:\d+:\d+).*to=<(?P<email>([\w\.-]+)@([\w\.-]+))'

In [3]: emails = set()

In [4]: date_email = {}

In [6]: with open('maillog.txt', mode='r') as f:
   ...:     for line in f:
   ...:         m = re.search(pat, line)
   ...:         if m is None:  # skip log lines without a to=<...> field
   ...:             continue
   ...:         month, day, ts, email = m.group('month', 'day', 'ts', 'email')
   ...:         date_email[email] = (month, day, ts)
   ...:         

In [7]: date_email
Out[7]: 
{'[email protected]': ('Nov', '3', '10:08:43'),
 '[email protected]': ('Nov', '3', '10:10:45'),
 '[email protected]': ('Nov', '3', '10:16:42'),
 '[email protected]': ('Nov', '3', '10:15:45'),
 '[email protected]': ('Nov', '3', '10:18:42'),
 '[email protected]': ('Nov', '3', '10:17:44')}

In [11]: with open('testmail.txt', mode='r') as f:
    ...:     for line in f:
    ...:         emails.add(line.strip())
    ...:         

In [12]: emails
Out[12]: {'[email protected]', '[email protected]'}

In [15]: for email in emails:
    ...:     if email in date_email:
    ...:         print(email, date_email[email])
    ...:         
('[email protected]', ('Nov', '3', '10:08:43'))

You can format output the way you want.

Multiple open() calls can be combined in a single with statement like this:

with open(file1, mode='r') as f1, open(file2, mode='r') as f2:
    # do something with f1
    # do something with f2
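
To make that concrete, here is a minimal sketch that opens both files in one `with` statement and collects every matching timestamp per address in a `defaultdict`, so an address that appears more than once in the log is not overwritten. The function name `sent_dates`, the closing `>` in the pattern, and the temporary files with example.com addresses are assumptions made only so the snippet runs on its own.

```python
import os
import re
import tempfile
from collections import defaultdict

rx = re.compile(
    r'(?P<month>[A-Za-z]{3})\s{1,3}(?P<day>\d{1,2})\s{1,2}'
    r'(?P<ts>\d+:\d+:\d+).*to=<(?P<email>[\w.-]+@[\w.-]+)>'
)


def sent_dates(mail_path, log_path):
    """Map each wanted address to every (month, day, ts) found in the log."""
    hits = defaultdict(list)
    # Both files are opened (and later closed) by a single with statement.
    with open(mail_path) as mails, open(log_path) as log:
        wanted = {line.strip() for line in mails if line.strip()}
        for line in log:
            m = rx.search(line)
            if m is not None and m.group('email') in wanted:
                hits[m.group('email')].append(m.group('month', 'day', 'ts'))
    return dict(hits)


# Demo with made-up data written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    mail_path = os.path.join(tmp, 'testmail.txt')
    log_path = os.path.join(tmp, 'maillog.txt')
    with open(mail_path, 'w') as f:
        f.write('alice@example.com\n')
    with open(log_path, 'w') as f:
        f.write(
            'Nov  3 10:08:43 server postfix/smtp[1]: A1: to=<alice@example.com>, status=sent (250 OK)\n'
            'Nov  3 10:10:45 server postfix/smtp[1]: B2: to=<bob@example.org>, status=sent (250 OK)\n'
            'Nov  4 09:00:01 server postfix/smtp[1]: C3: to=<alice@example.com>, status=sent (250 OK)\n'
        )
    result = sent_dates(mail_path, log_path)

print(result)
```

Only the set of wanted addresses and the matching hits are kept in memory; the log itself is streamed.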

1 Comment

IMHO, since the files were said to be large, it would be better to load only one of them into memory and scan the other...

Quick and untested but simple enough conceptually: compile a big whoppin' regex with all the addresses already in it.

import re

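with open("testmail.txt") as fh1:
    # escape each address so '.' and friends match literally; skip blank lines
    emails = [re.escape(addr.strip()) for addr in fh1 if addr.strip()]

# the closing '>' stops one address from matching as a prefix of a longer one
pattern=re.compile(
    r'(?P<month>[A-Za-z]{3})\s{1,3}(?P<day>\d{1,2})\s{1,2}(?P<ts>\d+:\d+:\d+).*to=<(?P<email>%s)>' %
        '|'.join(emails))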

with open("maillog.txt") as fh:
    for line in fh:
        for match in pattern.finditer(line):
            print(match.groupdict())
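
Since the snippet above is described as untested, here is a small self-check of the same idea on in-memory strings. The helper name `build_pattern`, the closing `>` after the email group, the empty-list guard, and the example.com / example.net addresses are assumptions made for this sketch.

```python
import re


def build_pattern(addresses):
    """Compile one regex whose email group is an alternation of the given addresses."""
    escaped = [re.escape(a.strip()) for a in addresses if a.strip()]
    if not escaped:
        # An empty alternation would match 'to=<>', which is never what we want.
        raise ValueError('no addresses given')
    return re.compile(
        r'(?P<month>[A-Za-z]{3})\s{1,3}(?P<day>\d{1,2})\s{1,2}'
        r'(?P<ts>\d+:\d+:\d+).*to=<(?P<email>%s)>' % '|'.join(escaped)
    )


big_rx = build_pattern(['alice@example.com\n', 'carol@example.net\n', '\n'])

prefix = 'Nov  3 10:08:43 server postfix/smtp[1]: A1: '
exact = big_rx.search(prefix + 'to=<alice@example.com>, status=sent')
# re.escape makes the dot literal, so an arbitrary character there does not match.
wrong_dot = big_rx.search(prefix + 'to=<alice@exampleXcom>, status=sent')
# The closing '>' rejects a longer address that merely starts with a wanted one.
longer = big_rx.search(prefix + 'to=<alice@example.community>, status=sent')

print(exact.groupdict())
print(wrong_dot, longer)
```

Whether one large alternation beats a set lookup for a long address list is something to measure on your own data rather than assume.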


You can try it with a regex and capture groups:

Let's solve your problem in three steps:

First step: capture all email addresses from emails.txt (the question's testmail.txt):

emails=[]
with open('emails.txt','r') as f:
    for line in f:
        found = re.search(email_pattern,line)
        if found:  # skip blank or malformed lines
            emails.append(found.group())

Second step: capture the needed data from data.txt (the question's maillog.txt):

with open('data.txt','r') as f:
    # each line becomes [[month, day, ts], email]
    month_day=[[find.group(4) if find.group(4) is not None else [find.group(1), find.group(2), find.group(3)] for find in re.finditer(pattern,line)]for line in f]

Third step: now that we have all the data, just check whether each email is in our emails list, and if so add that group info to a dict:

for item in month_day:
    final_dict = {}
    # entries with fewer than two items had no <address> on that line
    if len(item) >= 2 and item[1] in emails:
        final_dict['month'] = item[0][0]
        final_dict['day'] = item[0][1]
        final_dict['ts'] = item[0][2]
        final_dict['email'] = item[1]
    if final_dict:
        print(final_dict)

Full code:

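import re

# either the date fields at the start of the line, or an address inside <...>
# (\s+ and \d{1,2} so that two-digit days such as "Nov 13" are captured whole)
pattern=r'^(\w{0,3})\s+(\d{1,2})\s(\d.+?\s)|<(\w+[@]\w+[.]\w+)>'
email_pattern=r'\w+[@]\w+[.]\w+'

emails=[]
with open('emails.txt','r') as f:
    for line in f:
        found = re.search(email_pattern,line)
        if found:  # skip blank or malformed lines
            emails.append(found.group())
with open('data.txt','r') as f:
    # each line becomes [[month, day, ts], email]
    month_day=[[find.group(4) if find.group(4) is not None else [find.group(1), find.group(2), find.group(3)] for find in re.finditer(pattern,line)]for line in f]


for item in month_day:
    final_dict = {}
    # entries with fewer than two items had no <address> on that line
    if len(item) >= 2 and item[1] in emails:
        final_dict['month'] = item[0][0]
        final_dict['day'] = item[0][1]
        final_dict['ts'] = item[0][2]
        final_dict['email'] = item[1]
    if final_dict:
        print(final_dict)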

output:

{'ts': '10:08:43 ', 'month': 'Nov', 'email': '[email protected]', 'day': '3'}

Regex information :

^ asserts position at the start of the line
\w{0,3} matches up to three word characters (each equal to [a-zA-Z0-9_] for ASCII text)
\s matches any whitespace character (equal to [\r\n\t\f\v ] for ASCII text)
\d matches a digit (equal to [0-9] for ASCII text)
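
To see how an alternation like this behaves, here is a small sketch using a pattern of the same shape as the one above. `re.finditer` returns one match for the date part and a separate match for the address, and the groups that belong to the other branch come back as `None`. The helper name `parse_line`, the `\s+(\d{1,2})` day part (chosen here so two-digit days are captured whole), the `.strip()` on the timestamp, and the example.com address are assumptions of this sketch.

```python
import re

# Branch 1: date fields at the start of the line. Branch 2: an address inside <...>.
pattern = r'^(\w{0,3})\s+(\d{1,2})\s(\d.+?\s)|<(\w+[@]\w+[.]\w+)>'


def parse_line(line):
    """Return ([month, day, ts], email), or None if either part is missing."""
    date, email = None, None
    for m in re.finditer(pattern, line):
        if m.group(4) is not None:
            # The address branch matched; groups 1-3 are None for this match.
            email = m.group(4)
        else:
            # The date branch matched; strip the trailing space captured with ts.
            date = [m.group(1), m.group(2), m.group(3).strip()]
    if date is None or email is None:
        return None
    return date, email


one_digit = parse_line('Nov  3 10:08:43 server postfix/smtp[1]: A1: to=<alice@example.com>, status=sent')
two_digit = parse_line('Nov 13 10:08:43 server postfix/smtp[1]: A1: to=<alice@example.com>, status=sent')
no_address = parse_line('Nov  3 10:12:00 server postfix/qmgr[2]: A1: removed')

print(one_digit)
print(two_digit)
print(no_address)
```

Note that `\w+[@]\w+[.]\w+` only accepts simple addresses; anything with a dot or hyphen in the local part or a subdomain would need a broader character class.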
