Python Script - Email Parser

Question

Good morning everyone,

I am taking a Python class right now, and we haven't covered what I am about to ask, so any help would be great. I have a Python script that parses email addresses out of a document, but it only lets me process one document at a time. I have roughly 500 GB of documents, and most of them contain email addresses. Is there a way to change this script so that it reads all subfolders and documents and skips over any errors? I understand there are some file types it may not be able to read. Some of the common file types are .txt, .csv, .sql, and .xlsx.

Here is the script I found; it works very well for one file at a time. As always, thanks everyone for the help.

#!/usr/bin/env python
#
# Extracts email addresses from one or more plain text files.
#
# Notes:
# - Does not save to file (pipe the output to a file if you want it saved).
# - Does not check for duplicates (which can easily be done in the terminal).
#


from optparse import OptionParser
import os.path
import re

regex = re.compile(("([a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`"
                    "{|}~-]+)*(@|\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|"
                    "\sdot\s))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"))

def file_to_str(filename):
    """Returns the contents of filename as a string."""
    with open(filename) as f:
        return f.read().lower() # Case is lowered to prevent regex mismatches.

def get_emails(s):
    """Returns an iterator of matched emails found in string s."""
    # Removing lines that start with '//' because the regular expression
    # mistakenly matches patterns like 'http://[email protected]' as '//[email protected]'.
    return (email[0] for email in re.findall(regex, s) if not email[0].startswith('//'))

if __name__ == '__main__':
    parser = OptionParser(usage="Usage: python %prog [FILE]...")
    # No options added yet. Add them here if you ever need them.
    options, args = parser.parse_args()

    if not args:
        parser.print_usage()
        exit(1)

    for arg in args:
        if os.path.isfile(arg):
            for email in get_emails(file_to_str(arg)):
                # print() with a single argument works on both Python 2 and 3.
                print(email)
        else:
            print('"{}" is not a file.'.format(arg))
            parser.print_usage()

Well, you could call that script from another one that navigates the subfolders, so you would have one process per document. That has the benefit that a parsing error in one document does not stop the whole run, and you could process multiple documents at once. I'd also recommend adding a list of supported file types.
– Lucas Wieloch, Commented Sep 5, 2018 at 12:50
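
A rough sketch of what this comment describes, assuming Python 3.7+ and that the parser script runs under the same interpreter as the wrapper. The names `SUPPORTED_EXTENSIONS`, `iter_supported_files` and `run_parser_on_file` are made up for illustration, and this version runs the files one after another rather than in parallel:

```python
import os
import subprocess
import sys

# Assumption: the extensions the parser can read as plain text.
SUPPORTED_EXTENSIONS = (".txt", ".csv", ".sql")


def iter_supported_files(root_folder, extensions=SUPPORTED_EXTENSIONS):
    """Yield the path of every file under root_folder with a supported extension."""
    for root, _dirs, files in os.walk(root_folder):
        for name in files:
            # str.endswith accepts a tuple of suffixes.
            if name.lower().endswith(extensions):
                yield os.path.join(root, name)


def run_parser_on_file(script_path, file_path):
    """Run the parser script on one file in its own process.

    Returns the script's standard output, or None if the script exited
    with a non-zero status (for example because it crashed on that file).
    """
    result = subprocess.run(
        [sys.executable, script_path, file_path],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout
```

Looping over `iter_supported_files(root)` and calling `run_parser_on_file("parser.py", path)` for each path means a crash on one document only loses that document's output.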

Alexandru Martalogu · Accepted Answer · 2018-09-06 08:09:31Z

1

You could use os.walk like this:

# Root directory given on the command line; args comes from parser.parse_args() in the question's script.
root_folder = args[0]
# Example entries only; list the extensions you want to skip here.
not_parseble_files = ['.txt', '.csv']
# os.walk recursively visits every subdirectory of root_folder.
for root, dirs, files in os.walk(root_folder):
    for file in files:
        _, file_ext = os.path.splitext(file)  # extension of the file, e.g. '.txt'
        file_path = os.path.join(root, file)
        if file_ext in not_parseble_files:  # skip extensions on the banned list
            print("File %s is not parseble" % file_path)
            continue  # move on to the next file
        if os.path.isfile(file_path):
            for email in get_emails(file_to_str(file_path)):
                print(email)
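
The question also asks to skip over files that cannot be read. One possible way to do that, sketched for Python 3, is to wrap the read in a try/except. `EMAIL_RE` is a deliberately simplified stand-in for the question's regex and `emails_under` is a hypothetical helper name; UTF-8 is an assumed encoding:

```python
import os
import re
import sys

# Assumption: a simplified stand-in for the question's much longer regex.
EMAIL_RE = re.compile(r"[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+")


def emails_under(root_folder):
    """Yield (path, email) pairs for every readable text file under root_folder."""
    for root, _dirs, files in os.walk(root_folder):
        for name in files:
            path = os.path.join(root, name)
            try:
                # Assumption: the documents are UTF-8 text.
                with open(path, encoding="utf-8") as f:
                    text = f.read().lower()
            except (OSError, UnicodeDecodeError) as exc:
                # Unreadable or non-text file: report it on stderr and move on.
                print("Skipping %s: %s" % (path, exc), file=sys.stderr)
                continue
            for email in EMAIL_RE.findall(text):
                yield path, email
```

Because the skip messages go to stderr, redirecting stdout to a file still captures only the email addresses.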

edited Sep 6, 2018 at 8:09

answered Sep 5, 2018 at 12:53

Alexandru Martalogu

11 Comments

Alex Over a year ago

Thank you. I will give this a shot.

Alexandru Martalogu Over a year ago

As an input to the script you will now have to give the root directory of the subfolders

Alex Over a year ago

  File "parser.py", line 38
    print "File %s is not parseble"%file_path
                                  ^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print("File %s is not parseble"%file_path)?

That is the message I get.

Alexandru Martalogu Over a year ago

Yes, sorry, I am using Python 2.7. You have to use the parentheses; I updated my answer.

Alex Over a year ago

I get errors with this too.

Traceback (most recent call last):
  File "/Documents/parser.py", line 32, in <module>
    for root, dirs, files in os.walk(root_folder):#This recursively searches all sub directories for files
NameError: name 'root_folder' is not defined

@alexandru-martalogu thank you for the help.
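
The NameError above means nothing ever assigned `root_folder`. One way to supply it, sketched here with the standard `argparse` module instead of the script's `optparse`, is to take it from the command line. `parse_root_folder` is a hypothetical helper name:

```python
import argparse
import os


def parse_root_folder(argv=None):
    """Return the root directory named on the command line.

    argv defaults to sys.argv[1:] when None, which is argparse's own behaviour.
    """
    parser = argparse.ArgumentParser(
        description="Extract email addresses from a directory tree."
    )
    parser.add_argument("root_folder", help="directory to search recursively")
    args = parser.parse_args(argv)
    if not os.path.isdir(args.root_folder):
        # parser.error prints the usage text and exits with status 2.
        parser.error('"%s" is not a directory' % args.root_folder)
    return args.root_folder
```

With this in place, `root_folder = parse_root_folder()` would go before the `os.walk` loop.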

blhsing · Accepted Answer · 2018-09-06 02:41:33Z

1

You can use os.walk to traverse all the subdirectories:

import os

if __name__ == '__main__':
    parser = OptionParser(usage="Usage: python %prog [DIRECTORIES]...")
    # No options added yet. Add them here if you ever need them.
    options, args = parser.parse_args()

    if not args:
        parser.print_usage()
        exit(1)

    extensions = ('.txt', '.csv', '.sql', '.xlsx')
    for directory in args:  # 'directory' avoids shadowing the built-in dir()
        for root, _, files in os.walk(directory):
            for filename in files:
                # str.endswith accepts a tuple of suffixes.
                if filename.endswith(extensions):
                    for email in get_emails(file_to_str(os.path.join(root, filename))):
                        print(email)
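
One caveat about '.xlsx': as far as I know, an .xlsx file is a ZIP archive of XML parts, so reading it as plain text will not give usable results. A minimal sketch that uses only the standard library and scans every XML part inside the archive is below. `emails_from_xlsx` is a hypothetical helper, and `EMAIL_RE` is a simplified stand-in for the question's regex:

```python
import re
import zipfile

# Assumption: a simplified stand-in for the question's much longer regex.
EMAIL_RE = re.compile(r"[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+")


def emails_from_xlsx(path):
    """Return the set of email addresses found in the XML parts of an .xlsx file.

    Returns an empty set if the file is not a valid ZIP archive.
    """
    found = set()
    try:
        with zipfile.ZipFile(path) as archive:
            for name in archive.namelist():
                if name.endswith(".xml"):
                    # Cell text is stored as XML; decode leniently and lowercase it.
                    text = archive.read(name).decode("utf-8", errors="ignore").lower()
                    found.update(EMAIL_RE.findall(text))
    except zipfile.BadZipFile:
        pass  # corrupt or mislabelled file: skip it
    return found
```

In the loop above, files ending in '.xlsx' could be routed to this helper while the other extensions keep going through `file_to_str`.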

edited Sep 6, 2018 at 2:41

answered Sep 5, 2018 at 12:53

blhsing

7 Comments

Alex Over a year ago

Thank you I will give this a shot.

Alex Over a year ago

So I just replace this with what you wrote?

if __name__ == '__main__':
    parser = OptionParser(usage="Usage: python %prog [FILE]...")
    # No options added yet. Add them here if you ever need them.
    options, args = parser.parse_args()

    if not args:
        parser.print_usage()
        exit(1)

    for arg in args:
        if os.path.isfile(arg):
            for email in get_emails(file_to_str(arg)):
                print email
        else:
            print '"{}" is not a file.'.format(arg)
            parser.print_usage()

blhsing Over a year ago

Yes the code in my answer is meant as a replacement to your main block.

Alex Over a year ago

Thank you sir. I will give this a shot.

Alex Over a year ago

So when I tried this I got an error message. I ran python parser.py \documents > Master.txt (because that's where all the folders are) and got this error:

  File "parser.py", line 46
    print email(email)
              ^
SyntaxError: invalid syntax
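
The line in that error, `print email(email)`, mixes the Python 2 statement form with the Python 3 function form; in Python 3 the call is just `print(email)`. Since the results are being redirected into Master.txt anyway, here is a small Python 3 sketch that writes them to a file directly and drops duplicates (the original script notes that it does not check for duplicates). `write_unique_emails` is a hypothetical helper name:

```python
def write_unique_emails(emails, output_path):
    """Write each distinct email once, sorted, one per line.

    Returns the number of distinct addresses written.
    """
    unique = sorted(set(emails))
    with open(output_path, "w", encoding="utf-8") as out:
        for email in unique:
            # In Python 3, print is a function; file= sends the line to the file.
            print(email, file=out)
    return len(unique)
```

Note that this keeps every distinct address in memory at once, which should be fine for a list of addresses but is still an assumption about the data.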
