
I have a list of keywords and I want to check whether any of them appear in a file containing more than 100,000 domain names. For faster processing, I want to use multiprocessing so that each keyword can be validated in parallel.

My code doesn't seem to be working well: single processing is much faster. What's wrong? :(

import time
from multiprocessing import Pool


def multiprocessing_func(keyword):

    # File containing more than 100k domain names
    # URL: https://raw.githubusercontent.com/CERT-MZ/projects/master/Domain-squatting/domain-names.txt
    file_domains = open("domain-names.txt", "r")

    for domain in file_domains:
        if keyword in domain:
            print("similar domain identified:", domain)
            
    # Rewind the file, start from the beginning
    file_domains.seek(0)


if __name__ == '__main__':

    starttime = time.time()

    # Keywords to check
    keywords = ["google","facebook", "amazon", "microsoft", "netflix"]

    # Create a multiprocessing Pool
    pool = Pool()  

    for keyword in keywords:
        print("Checking keyword:", keyword)
        
        # Without multiprocessing pool
        #multiprocessing_func(keyword)
        
        # With multiprocessing pool
        pool.map(multiprocessing_func, keyword)

    # Total run time
    print('That took {} seconds'.format(time.time() - starttime))
  • You should just do pool.map(multiprocessing_func, keywords) Commented Sep 10, 2020 at 20:02
  • Note that much of the delay of the parallel method is due to printing. The print messages are ordered, which takes a huge amount of time. If I comment out the print statement in the function, I get a 60x speedup in the parallel task on my machine (a sketch that returns results instead of printing follows these comments). However, that time is still 2x the time of the single-thread solution; for that, refer to the answer. Commented Sep 10, 2020 at 20:03
  • @MarkTolonen I agree that reading the file multiple times is hugely unnecessary, but I think there are a few things that need to be addressed in the set approach you suggested: 0. This code is clearly a multiprocessing exercise and not an application, so the goal is to use multiprocessing. 1. How would you use a set to search for substrings? Would you insert the whole domain? If so, the set is useless, as it has to be iterated. Would you split by some notion of words? How would you define them? If so, a trie data structure is a better option. [1/2] Commented Sep 10, 2020 at 20:19
  • ... 2. Asymptotic complexity is not a relevant measurement in most cases when we are talking about multiprocessing, as it disregards linear-time speedups, which are crucial in many non-algorithmic applications. [2/2] Commented Sep 10, 2020 at 20:19
  • @kyriakosSt Didn't notice the code was doing a "keyword in <string>". Misread it as "keyword in <list>", which could use a set. Commented Sep 10, 2020 at 20:27
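Picking up on the comment about print overhead, below is a minimal sketch (not part of the original post) of a worker that collects matching domains and returns them, so all printing happens once in the parent process. The find_matches name is a hypothetical variant of multiprocessing_func, and the filename is assumed from the question:

import time
from multiprocessing import Pool


def find_matches(keyword):
    # Hypothetical variant of multiprocessing_func: collect matching
    # domains instead of printing them inside the worker process.
    matches = []
    with open("domain-names.txt", "r") as file_domains:
        for domain in file_domains:
            if keyword in domain:
                matches.append(domain.strip())
    return matches


if __name__ == '__main__':
    starttime = time.time()

    keywords = ["google", "facebook", "amazon", "microsoft", "netflix"]

    with Pool() as pool:
        # One result list per keyword, returned in the same order as keywords
        results = pool.map(find_matches, keywords)

    for keyword, matches in zip(keywords, results):
        print(keyword, "->", len(matches), "similar domains")

    print('That took {} seconds'.format(time.time() - starttime))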

1 Answer


Think about why this program:

import multiprocessing as mp

def work(keyword):
    print("working on", repr(keyword))

if __name__ == "__main__":
    with mp.Pool(4) as pool:
        pool.map(work, "google")

prints

working on 'g'
working on 'o'
working on 'o'
working on 'g'
working on 'l'
working on 'e'

map() works on a sequence, and a string is a sequence. Instead of sticking the map() call in a loop, you presumably want to invoke it only once with keywords (the whole list) as its second argument.
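For completeness, here is a minimal sketch of that fix, reusing multiprocessing_func and the keywords list from the question; the only change is that map() is called once with the whole list, so the pool hands one keyword to each worker call:

if __name__ == '__main__':

    starttime = time.time()

    keywords = ["google", "facebook", "amazon", "microsoft", "netflix"]

    # One map() call over the whole list: the pool distributes the
    # keywords across worker processes, one keyword per call.
    with Pool() as pool:
        pool.map(multiprocessing_func, keywords)

    print('That took {} seconds'.format(time.time() - starttime))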
