
I have a list of keywords and I want to check whether any of them appear in a file containing more than 100,000 domain names. For faster processing, I want to use multiprocessing so that each keyword can be validated in parallel.

My code doesn't seem to be working well: single processing is much faster. What's wrong? :(

import time
from multiprocessing import Pool


def multiprocessing_func(keyword):

    # File containing more than 100k domain names
    # URL: https://raw.githubusercontent.com/CERT-MZ/projects/master/Domain-squatting/domain-names.txt
    file_domains = open("domain-names.txt", "r")

    for domain in file_domains:
        if keyword in domain:
            print("similar domain identified:", domain)
            
    # Rewind the file, start from the beginning
    file_domains.seek(0)


if __name__ == '__main__':

    starttime = time.time()

    # Keywords to check
    keywords = ["google","facebook", "amazon", "microsoft", "netflix"]

    # Create a multiprocessing Pool
    pool = Pool()  

    for keyword in keywords:
        print("Checking keyword:", keyword)
        
        # Without multiprocessing pool
        #multiprocessing_func(keyword)
        
        # With multiprocessing pool
        pool.map(multiprocessing_func, keyword)

    # Total run time
    print('That took {} seconds'.format(time.time() - starttime))
  • You should just do pool.map(multiprocessing_func, keywords) Commented Sep 10, 2020 at 20:02
  • Note that much of the delay of the parallel method is due to printing. The print messages are ordered, which takes a huge amount of time. If I comment out the print statement in the function, I get a 60x speedup in the parallel task on my machine (a sketch that returns results instead of printing follows these comments). However, that time is still 2x the time of the single-thread solution; for that, refer to the answer. Commented Sep 10, 2020 at 20:03
  • @MarkTolonen I agree that reading the file multiple times is hugely unnecessary, but I think there are a few things that need to be addressed in the set approach you suggested: 0. This code is clearly a multiprocessing exercise and not an application, so the goal is to use multiprocessing. 1. How would you use a set to search for substrings? Would you insert the whole domain? If so, the set is useless, as it has to be iterated. Would you split by some notion of words? How would you define them? If so, a trie data structure is a better option. [1/2] Commented Sep 10, 2020 at 20:19
  • ... 2. Asymptotic complexity is not a relevant measurement in most cases when we are talking about multiprocessing, as it disregards linear-time speedups, which are crucial in many non-algorithmic applications. [2/2] Commented Sep 10, 2020 at 20:19
  • @kyriakosSt Didn't notice the code was doing a "keyword in <string>". Misread it as "keyword in <list>", which could use a set. Commented Sep 10, 2020 at 20:27
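Picking up on the comment about print overhead, below is a minimal sketch (not part of the original post) of a worker that collects matching domains and returns them, so all printing happens once in the parent process. The find_matches name is a hypothetical variant of multiprocessing_func, and the filename is assumed from the question:

import time
from multiprocessing import Pool


def find_matches(keyword):
    # Hypothetical variant of multiprocessing_func: collect matching
    # domains instead of printing them inside the worker process.
    matches = []
    with open("domain-names.txt", "r") as file_domains:
        for domain in file_domains:
            if keyword in domain:
                matches.append(domain.strip())
    return matches


if __name__ == '__main__':
    starttime = time.time()

    keywords = ["google", "facebook", "amazon", "microsoft", "netflix"]

    with Pool() as pool:
        # One result list per keyword, returned in the same order as keywords
        results = pool.map(find_matches, keywords)

    for keyword, matches in zip(keywords, results):
        print(keyword, "->", len(matches), "similar domains")

    print('That took {} seconds'.format(time.time() - starttime))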

1 Answer


Think about why this program:

import multiprocessing as mp

def work(keyword):
    print("working on", repr(keyword))

if __name__ == "__main__":
    with mp.Pool(4) as pool:
        pool.map(work, "google")

prints

working on 'g'
working on 'o'
working on 'o'
working on 'g'
working on 'l'
working on 'e'

map() works on a sequence, and a string is a sequence. Instead of sticking the map() call in a loop, you presumably want to invoke it only once with keywords (the whole list) as its second argument.
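For completeness, here is a minimal sketch of that fix, reusing multiprocessing_func and the keywords list from the question; the only change is that map() is called once with the whole list, so the pool hands one keyword to each worker call:

if __name__ == '__main__':

    starttime = time.time()

    keywords = ["google", "facebook", "amazon", "microsoft", "netflix"]

    # One map() call over the whole list: the pool distributes the
    # keywords across worker processes, one keyword per call.
    with Pool() as pool:
        pool.map(multiprocessing_func, keywords)

    print('That took {} seconds'.format(time.time() - starttime))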
