
I want to search for a pre-defined list of keywords in a given article and increment the score by 1 if a keyword is found in the article. I want to use multiprocessing since the pre-defined list of keywords is very large (10k keywords) and the number of articles is 100k.

I came across this question, but it does not address my problem.

I tried the following implementation, but I am getting None as the result.

import multiprocessing as mp

keywords = ["threading", "package", "parallelize"]

def search_worker(keyword):
    score = 0
    article = """
    The multiprocessing package also includes some APIs that are not in the threading module at all. For example, there is a neat Pool class that you can use to parallelize executing a function across multiple inputs."""

    if keyword in article:
        score += 1
    return score

I tried the two methods below, but I am getting three Nones as the result.

Method1:

pool = mp.Pool(processes=4)
result = [pool.apply(search_worker, args=(keyword,)) for keyword in keywords]

Method2:

result = pool.map(search_worker, keywords)
print(result)

Actual output: [None, None, None]

Expected output: 3

I am thinking of sending the worker the pre-defined list of keywords and the article together, but I am not sure if I am going in the right direction, as I don't have prior experience with multiprocessing.

Thanks in advance.

  • Why don't you use ElasticSearch as your search engine? Commented Jan 9, 2018 at 6:05
  • I am not sure how to do this using ElasticSearch. I want to calculate the confidence score for each article against a list of keywords and index article along with confidence score. Commented Jan 9, 2018 at 6:08
  • ElasticSearch can easily do this! You should really try it. Commented Jan 9, 2018 at 6:10
  • There are different solutions for your case. One, you can use shared memory, like a database; Redis is really simple and works very well. Depending on your scale and planned complexity, adopt a map-reduce technique. Commented Jan 9, 2018 at 6:12
  • Your code works just fine-ish when I run it (python3.5). (I get [1, 1, 1], you just need a global count or to sum the result). Did you remember to run method 1 and method 2 using if __name__ == '__main__'? Commented Jan 9, 2018 at 6:24
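Following up on that last comment, a minimal sketch of how the question's worker can be driven so that the per-keyword results are summed into a single score (the 4-process pool size is just an illustrative choice):

import multiprocessing as mp

keywords = ["threading", "package", "parallelize"]

def search_worker(keyword):
    # Same idea as the worker in the question: 1 if the keyword occurs in the article, else 0.
    article = """
    The multiprocessing package also includes some APIs that are not in the threading module at all. For example, there is a neat Pool class that you can use to parallelize executing a function across multiple inputs."""
    return int(keyword in article)

if __name__ == '__main__':
    # The main guard matters on platforms that spawn worker processes (e.g. Windows).
    with mp.Pool(processes=4) as pool:
        per_keyword = pool.map(search_worker, keywords)   # e.g. [1, 1, 1]
    print(sum(per_keyword))                               # 3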

2 Answers


Here's a function using Pool. You can pass text and keyword_list and it will work. You could use Pool.starmap to pass tuples of (text, keyword), but you would need to deal with an iterable that had 10k references to text.

from functools import partial
from multiprocessing import Pool

def search_worker(text, keyword):
    # 1 if the keyword occurs in the text, 0 otherwise
    return int(keyword in text)

def parallel_search_text(text, keyword_list):
    processes = 4
    chunk_size = 10
    total = 0
    # Bind the text so each worker call only needs a keyword.
    func = partial(search_worker, text)
    with Pool(processes=processes) as pool:
        for result in pool.imap_unordered(func, keyword_list, chunksize=chunk_size):
            total += result

    return total

if __name__ == '__main__':
    texts = []  # a list of texts
    keywords = []  # a list of keywords
    for text in texts:
        print(parallel_search_text(text, keywords))
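For comparison, a hedged sketch of the Pool.starmap alternative mentioned above; note that the argument list repeats a reference to the same text for every keyword, which is the cost the answer points out (the function name parallel_search_text_starmap is made up for illustration):

from multiprocessing import Pool

def search_worker(text, keyword):
    return int(keyword in text)

def parallel_search_text_starmap(text, keyword_list):
    # One (text, keyword) tuple per keyword; every tuple references the same text object.
    args = [(text, keyword) for keyword in keyword_list]
    with Pool(processes=4) as pool:
        return sum(pool.starmap(search_worker, args, chunksize=10))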

There is overhead in creating a pool of workers, so it might be worth timeit-testing this against a simple single-process text search function (a baseline sketch follows the next snippet). Repeated calls can be sped up by creating one instance of Pool and passing it into the function.

def parallel_search_text2(text, keyword_list, pool):
    chunk_size = 10
    results = 0
    func = partial(search_worker, text)

    for result in pool.imap_unordered(func, keyword_list, chunksize=chunk_size):
        results += result
    return results

if __name__ == '__main__':
    pool = Pool(processes=4)
    texts = []  # a list of texts
    keywords = []  # a list of keywords
    for text in texts:
        print(parallel_search_text2(text, keywords, pool))
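As one way to run the timeit comparison suggested above, a sketch of a single-process baseline; it assumes parallel_search_text from the first snippet is defined in the same module, and the sample text and keyword list are placeholders:

import timeit

def single_process_search_text(text, keyword_list):
    # Same scoring, no process pool.
    return sum(int(keyword in text) for keyword in keyword_list)

if __name__ == '__main__':
    sample_text = "multiprocessing threading Pool parallelize " * 1000   # placeholder article
    sample_keywords = ["threading", "package", "parallelize"] * 1000     # placeholder keyword list

    print(timeit.timeit(lambda: single_process_search_text(sample_text, sample_keywords), number=10))
    print(timeit.timeit(lambda: parallel_search_text(sample_text, sample_keywords), number=10))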

User e.s resolved the main problem in his comment but I'm posting a solution to Om Prakash's comment requesting to pass in:

both article and pre-defined list of keywords to worker method

Here is a simple way to do that. All you need to do is construct a tuple containing the arguments that you want the worker to process:

from multiprocessing import Pool

def search_worker(article_and_keyword):
    # unpack the tuple
    article, keyword = article_and_keyword

    # score 1 if the keyword appears in the article
    score = 0
    if keyword in article:
        score += 1

    return score

if __name__ == "__main__":
    # the article and the keywords
    article = """The multiprocessing package also includes some APIs that are not in the threading module at all. For example, there is a neat Pool class that you can use to parallelize executing a function across multiple inputs."""
    keywords = ["threading", "package", "parallelize"]

    # construct the arguments for the search_worker; one keyword per worker but same article
    args = [(article, keyword) for keyword in keywords]

    # construct the pool and map to the workers
    with Pool(3) as pool:
        result = pool.map(search_worker, args)
    print(result)

If you're on a later version of Python, I would recommend trying starmap, as that will make this a bit cleaner.
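A hedged sketch of what that starmap version might look like, reusing the article and keywords from the snippet above; the two-parameter worker signature is the only real change:

from multiprocessing import Pool

def search_worker(article, keyword):
    # starmap unpacks each (article, keyword) tuple into the two parameters.
    return int(keyword in article)

if __name__ == "__main__":
    article = """The multiprocessing package also includes some APIs that are not in the threading module at all. For example, there is a neat Pool class that you can use to parallelize executing a function across multiple inputs."""
    keywords = ["threading", "package", "parallelize"]

    args = [(article, keyword) for keyword in keywords]
    with Pool(3) as pool:
        result = pool.starmap(search_worker, args)
    print(result)   # [1, 1, 1]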
