
Good morning everyone,

I am taking a Python class right now and we haven't covered what I am about to ask, so any help would be great. I have a Python script that parses emails out of a document, but it only allows me to do one document at a time. I have roughly 500 GB of documents and most of them contain email addresses. I was wondering if there is a way to change this script to read all subfolders and documents and skip over any errors. I understand there are some file types it may not be able to read. Some of the common file types are .txt, .csv, .sql, and .xlsx.

Here is the script I found and it works very well for one file at a time. As always thanks everyone for the help.

#!/usr/bin/env python
#
# Extracts email addresses from one or more plain text files.
#
# Notes:
# - Does not save to file (pipe the output to a file if you want it saved).
# - Does not check for duplicates (which can easily be done in the terminal).
#


from optparse import OptionParser
import os.path
import re

regex = re.compile(("([a-z0-9!#$%&'*+\/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+\/=?^_`"
                    "{|}~-]+)*(@|\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|"
                    "\sdot\s))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"))

def file_to_str(filename):
    """Returns the contents of filename as a string."""
    with open(filename) as f:
        return f.read().lower() # Case is lowered to prevent regex mismatches.

def get_emails(s):
    """Returns an iterator of matched emails found in string s."""
    # Removing lines that start with '//' because the regular expression
    # mistakenly matches patterns like 'http://user@example.com' as '//user@example.com'.
    return (email[0] for email in re.findall(regex, s) if not email[0].startswith('//'))

if __name__ == '__main__':
    parser = OptionParser(usage="Usage: python %prog [FILE]...")
    # No options added yet. Add them here if you ever need them.
    options, args = parser.parse_args()

    if not args:
        parser.print_usage()
        exit(1)

    for arg in args:
        if os.path.isfile(arg):
            for email in get_emails(file_to_str(arg)):
                print email
        else:
            print '"{}" is not a file.'.format(arg)
            parser.print_usage()
  • Well, you could call that script from another one that navigates the subfolders, so you would have one process per document (this has the benefit that a parsing error would not stop your code, and you could process multiple documents at once). I'd also recommend adding a list of supported file types. Commented Sep 5, 2018 at 12:50

2 Answers


You could use os.walk like this:

import os

root_folder = '.'  # Placeholder: set this to the top-level directory you want to scan
# Extensions to skip. Example values only; list the types you do not want parsed.
not_parseble_files = ['.txt', '.csv']
for root, dirs, files in os.walk(root_folder):  # Recursively visits every subdirectory
    for file in files:
        _, file_ext = os.path.splitext(file)  # Get the extension of the file
        file_path = os.path.join(root, file)
        if file_ext in not_parseble_files:  # Skip extensions in the banned list
            print("File %s is not parseble" % file_path)
            continue  # Move on to the next file
        if os.path.isfile(file_path):
            # get_emails and file_to_str are the functions from the question's script
            for email in get_emails(file_to_str(file_path)):
                print(email)
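
For the "skip over any errors" part of the question, here is a minimal Python 3 sketch of the same os.walk loop with the file reading wrapped in try/except. The names iter_emails, EMAIL_RE and TEXT_EXTENSIONS are made up for this example, and the regex is a simplified stand-in, not the pattern from the question's script. It also reads line by line, on the assumption that some files are too large to load in one go.

```python
import os
import re
import sys

# Simplified pattern for illustration only; NOT the regex from the question.
EMAIL_RE = re.compile(r"[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+")

# Assumption: only these extensions hold plain text worth scanning.
TEXT_EXTENSIONS = ('.txt', '.csv', '.sql')


def iter_emails(root_folder):
    """Yield (path, email) pairs for each readable text file under root_folder."""
    for root, _dirs, files in os.walk(root_folder):
        for name in files:
            if not name.lower().endswith(TEXT_EXTENSIONS):
                continue  # unsupported type, skip it
            path = os.path.join(root, name)
            try:
                # errors='ignore' drops undecodable bytes instead of raising.
                with open(path, encoding='utf-8', errors='ignore') as f:
                    for line in f:  # one line at a time keeps memory use low
                        for email in EMAIL_RE.findall(line.lower()):
                            yield path, email
            except OSError as exc:
                # Unreadable file (permissions, broken link, ...): report and move on.
                print('Skipping %s: %s' % (path, exc), file=sys.stderr)
```

Something like `for path, email in iter_emails(root_folder): print(email)` would then print every match and keep going past files that cannot be opened, with the skipped paths reported on stderr so they stay out of a redirected output file.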

11 Comments

Thank you. I will give this a shot.
As an input to the script you will now have to give the root directory of the subfolders
File "parser.py", line 38 print "File %s is not parseble"%file_path ^ SyntaxError: Missing parentheses in call to 'print'. Did you mean print("File %s is not parseble"%file_path)? -- That is the message I get
Yes, sorry, I am using Python 2.7. You have to use the parentheses; I updated my answer.
I get errors with this too. Traceback (most recent call last): File "/Documents/parser.py", line 32, in <module> for root, dirs, files in os.walk(root_folder):#This recursively searches all sub directories for files NameError: name 'root_folder' is not defined @alexandru-martalogu thank you for the help.

You can use os.walk to traverse all the subdirectories:

import os
from optparse import OptionParser

# Replacement for the main block of the question's script;
# get_emails and file_to_str are the functions defined there.
if __name__ == '__main__':
    parser = OptionParser(usage="Usage: python %prog [DIRECTORIES]...")
    # No options added yet. Add them here if you ever need them.
    options, args = parser.parse_args()

    if not args:
        parser.print_usage()
        exit(1)

    for directory in args:
        for root, _, files in os.walk(directory):
            for name in files:
                # str.endswith accepts a tuple of suffixes
                if name.endswith(('.txt', '.csv', '.sql', '.xlsx')):
                    for email in get_emails(file_to_str(os.path.join(root, name))):
                        print(email)
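
One caveat about the .xlsx entry in that extension list: an .xlsx workbook is a ZIP archive of XML files, so reading it as plain text is unlikely to surface the addresses inside. A rough Python 3 sketch of one way to handle it with only the standard library is below. The names emails_from_xlsx and EMAIL_RE are hypothetical, the regex is a simplified stand-in for the one in the question, and scanning every XML member is an assumption that will not cover every workbook (encrypted files, for instance, are skipped).

```python
import re
import zipfile

# Simplified pattern for illustration only; NOT the regex from the question.
EMAIL_RE = re.compile(r"[a-z0-9._%+-]+@[a-z0-9-]+(?:\.[a-z0-9-]+)+")


def emails_from_xlsx(path):
    """Return the set of emails found in the XML parts of an .xlsx workbook."""
    found = set()
    try:
        with zipfile.ZipFile(path) as archive:
            for member in archive.namelist():
                if not member.endswith('.xml'):
                    continue  # skip images, binary parts, etc.
                text = archive.read(member).decode('utf-8', errors='ignore')
                found.update(EMAIL_RE.findall(text.lower()))
    except (zipfile.BadZipFile, OSError):
        # Not a valid workbook, or not readable: treat it as having no matches.
        pass
    return found
```

Inside the loop you could call this for names ending in .xlsx and keep file_to_str for the plain-text types. Returning a set also removes duplicates within a single workbook.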

7 Comments

Thank you I will give this a shot.
So I just replace this with what you wrote? if __name__ == '__main__': parser = OptionParser(usage="Usage: python %prog [FILE]...") # No options added yet. Add them here if you ever need them. options, args = parser.parse_args() if not args: parser.print_usage() exit(1) for arg in args: if os.path.isfile(arg): for email in get_emails(file_to_str(arg)): print email else: print '"{}" is not a file.'.format(arg) parser.print_usage()
Yes, the code in my answer is meant as a replacement for your main block.
Thank you sir. I will give this a shot.
So when I tried this I got an error message. I run python parser.py \documents (because that's where all the folders are) > Master.txt and I get this error: File "parser.py", line 46 print email(email) ^ SyntaxError: invalid syntax