
Given a DataFrame like the one below:

     id  days  cluster
0   aaa     0        0
1   bbb     0        0
2   ccc     0        1
3   ddd     0        1
4   eee     0        0
5   fff     0        1
6   ggg     1        0
7   hhh     1        1
8   iii     1        0
9   lll     1        1
10  mmm     1        1
11  aaa     1        3
12  bbb     1        3

My aim is to create a dictionary whose keys are tuples of elements of the id column and whose values are lists of the cluster values the two ids share, computed separately for each value of the days column. That is, when days changes, if a pair of ids again shares a cluster value, that value should be appended to the pair's existing list. The desired output is shown below:

{('aaa', 'bbb'): [0, 3],('aaa', 'eee'): [0], ('bbb', 'eee'): [0], ('ccc', 'ddd'): [1], 
('ccc', 'fff'): [1], ('ddd', 'fff'): [1], ('ggg', 'iii'): [0],
 ('hhh', 'lll'): [1], ('hhh', 'mmm'): [1], ('lll', 'mmm'): [1]}

I obtained this result with the following snippet of code, but with millions of rows it becomes too slow. How can I optimize it?

y = {}
for i in range(0, max(df.iloc[:, 1]) + 1):
    # keep only the rows for the current day
    x = df.loc[df['days'] == i]
    for j in range(0, len(x)):
        for z in range(1, len(x)):
            if (x.iloc[z, 0], x.iloc[j, 0]) in y:
                pass
            elif (x.iloc[j, 0], x.iloc[z, 0]) not in y:
                if x.iloc[j, 0] != x.iloc[z, 0] and x.iloc[j, 2] == x.iloc[z, 2]:
                    y[(x.iloc[j, 0], x.iloc[z, 0])] = [x.iloc[j, 2]]
            else:
                if x.iloc[j, 0] != x.iloc[z, 0] and x.iloc[j, 2] == x.iloc[z, 2]:
                    y[(x.iloc[j, 0], x.iloc[z, 0])].append(x.iloc[j, 2])

  • In your example, ID 'aaa' has possible cluster values of 0 and 3 (for days 0 and 1 respectively). But in your desired output, ID 'aaa' is grouped with 'ccc', 'ddd', 'fff', 'hhh', 'lll', and 'mmm', which have cluster values of either 1 or 2. So I don't understand your statement "if the two id have the same cluster value". Commented Jul 17, 2020 at 15:18
  • @mtrw you are right! Fixed it, the desired output that I posted was wrong! Thank you Commented Jul 17, 2020 at 15:35

2 Answers


Considering that the bottleneck is obtaining the combinations of ids, why not leave it until the very end?

Group the data by id; each id maps to the set of (day, cluster) "bins" in which it appears:

import collections

grouped = collections.defaultdict(set)
for index, (id_, day, cluster) in df.iterrows():
    grouped[id_].add((day, cluster))

For each bin combination found, make a list of the ids that belong to it:

binned = collections.defaultdict(list)
for id_, bins in grouped.items():
    binned[tuple(sorted(bins))].append(id_)

Simplify only by cluster if that is what you need:

clustered = collections.defaultdict(list)
for bins, ids in binned.items():
    clusters = set(cluster for (day, cluster) in bins)
    clustered[tuple(sorted(clusters))].extend(ids)

And finally, getting the combinations of ids for each cluster bin shouldn't be a problem:

import itertools

for bins, ids in clustered.items():
    if len(ids) > 1:
        for comb_id in itertools.combinations(ids, 2):
            print(bins, comb_id)
            # or do other stuff with it
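Putting the steps above together into a self-contained sketch, run on the question's sample data (the DataFrame construction here is reconstructed from the table in the question). Note that the simplify-by-cluster step merges ids across days, so the resulting pair set can be broader than the question's exact desired output:

```python
import collections
import itertools

import pandas as pd

# Sample data reconstructed from the question's table.
df = pd.DataFrame({
    "id": ["aaa", "bbb", "ccc", "ddd", "eee", "fff",
           "ggg", "hhh", "iii", "lll", "mmm", "aaa", "bbb"],
    "days": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
    "cluster": [0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 3, 3],
})

# 1. Group by id: each id maps to the set of (day, cluster) bins it appears in.
grouped = collections.defaultdict(set)
for index, (id_, day, cluster) in df.iterrows():
    grouped[id_].add((day, cluster))

# 2. Invert: each bin combination maps to the list of ids sharing it.
binned = collections.defaultdict(list)
for id_, bins in grouped.items():
    binned[tuple(sorted(bins))].append(id_)

# 3. Keep only the cluster values.
clustered = collections.defaultdict(list)
for bins, ids in binned.items():
    clusters = set(cluster for (day, cluster) in bins)
    clustered[tuple(sorted(clusters))].extend(ids)

# 4. Expand to id pairs only at the very end.
result = {}
for clusters, ids in clustered.items():
    for pair in itertools.combinations(sorted(ids), 2):
        result[pair] = list(clusters)

print(result[("aaa", "bbb")])  # [0, 3]
```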



You can take advantage of the pandas.DataFrame.groupby method:

import collections
import itertools

result = collections.defaultdict(list)

for (day, cluster), group in df.groupby(["days", "cluster"]):
    for comb in itertools.combinations(df["id"][group.index], 2):
        result[comb].append(cluster)

which will give you the result you need:

defaultdict(<class 'list'>, {('aaa', 'bbb'): [0, 3], ('aaa', 'eee'): [0], ('bbb', 'eee'): [0], ('ccc', 'ddd'): [1], ('ccc', 'fff'): [1], ('ddd', 'fff'): [1], ('ggg', 'iii'): [0], ('hhh', 'lll'): [1], ('hhh', 'mmm'): [1], ('lll', 'mmm'): [1]})
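For reference, a runnable end-to-end check of this groupby approach on the question's sample data (the DataFrame construction here is reconstructed from the table in the question):

```python
import collections
import itertools

import pandas as pd

# Sample data reconstructed from the question's table.
df = pd.DataFrame({
    "id": ["aaa", "bbb", "ccc", "ddd", "eee", "fff",
           "ggg", "hhh", "iii", "lll", "mmm", "aaa", "bbb"],
    "days": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
    "cluster": [0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 3, 3],
})

result = collections.defaultdict(list)

# Each (days, cluster) group holds the ids that share a cluster on that day;
# every pair drawn from a group gets the cluster value appended to its list.
for (day, cluster), group in df.groupby(["days", "cluster"]):
    for comb in itertools.combinations(df["id"][group.index], 2):
        result[comb].append(cluster)

print(result[("aaa", "bbb")])  # [0, 3]
```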

2 Comments

Your code looks faster, but unfortunately with 200,000 rows, 9,000 different ids, 2 different days and 8 different clusters, my session crashed after using all available RAM.
It's no surprise. With such data you'd expect to see almost every id many times in each cluster if the numbers are random, so the number of combinations is practically unbounded. Have you tried the inverse approach: use the cluster combination as the dictionary key and the list of ids appearing in that cluster combination as the value? Getting the combinations of ids for each cluster combination would then be feasible. I managed to do it in about 2 minutes using 6 GB of RAM on a random sample with the numbers you mentioned. You could also try a producer-consumer approach.
