I have a function that counts how many of the rows below contain every item in a given list:
    def count(pair_list):
        # `rows` is the module-level list defined in the main block below
        return float(sum(1 for row in rows if all(item in row.split() for item in pair_list)))

    if __name__ == "__main__":
        pairs = [['apple', 'banana'], ['cookie', 'popsicle'], ['candy', 'cookie'], ...]
        # grocery transaction data
        rows = ['apple cookie banana popsicle wafer', 'almond milk eggs butter bread', 'bread almonds apple', 'cookie candy popsicle pop', ...]
        res = [count(pair) for pair in pairs]
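For the four example rows shown (ignoring the elided entries), the function would return, for instance:

    count(['apple', 'banana'])     # -> 1.0 (only the first row has both items)
    count(['cookie', 'popsicle'])  # -> 2.0 (first and fourth rows)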
In reality, len(rows) is 10000 and pairs has 18000 elements, so between the list comprehension in count() and the one over pairs in the main block, the code does roughly 180 million row checks, which is expensive.
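One thing I noticed: count() re-splits every row for every pair, so each of the 10000 rows gets split 18000 times. Here is a minimal sketch of an alternative I'm considering (row_sets and count_fast are names I made up), which splits each row once into a set so each pair check becomes a cheap subset test:

    # Split each row once and keep the tokens as a set; membership tests
    # are then O(1) instead of scanning the split list for every pair.
    row_sets = [set(row.split()) for row in rows]

    def count_fast(pair_list):
        wanted = set(pair_list)
        # a row counts if it contains every item in the pair
        return float(sum(1 for row_set in row_sets if wanted <= row_set))

    res = [count_fast(pair) for pair in pairs]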
I tried some parallel processing:

    from multiprocessing.dummy import Pool as ThreadPool
    import multiprocessing as mp

    threadpool = ThreadPool(processes=mp.cpu_count())
    res = threadpool.map(count, pairs)
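I realize multiprocessing.dummy uses threads, so the GIL may be limiting this CPU-bound work. A sketch of a process-based version that would also let me watch progress (the chunksize of 100 and the print interval are arbitrary guesses on my part):

    import multiprocessing as mp

    if __name__ == "__main__":
        # Caveat: on Windows, worker processes re-import the module, so
        # `rows` must be defined at module level (outside the __main__
        # guard) for count() to see it.
        with mp.Pool(processes=mp.cpu_count()) as pool:
            res = []
            # imap preserves order and yields results as they finish,
            # so the loop can report how many pairs are done
            for done, value in enumerate(pool.imap(count, pairs, chunksize=100), 1):
                res.append(value)
                if done % 1000 == 0:
                    print("%d/%d pairs processed" % (done, len(pairs)))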
This doesn't run quickly either; after 15 minutes I killed the job because it didn't look like it was going to finish. Two questions: 1) How can I speed up the actual searching that takes place in count()? 2) How can I check the status of the threadpool.map call (i.e., see how many pairs are left to iterate over)?
Edit: One suggestion was to try passing chunksize=100 to threadpool.map(). Wouldn't iterating through all p in pairs to get count(p) be the bottleneck here, though? Also, should I run this from cmd on my Windows system, or could it be because I'm running it through IDLE that it's so slow?