
Given a DataFrame like the one below:

     id  days  cluster
0   aaa     0        0
1   bbb     0        0
2   ccc     0        1
3   ddd     0        1
4   eee     0        0
5   fff     0        1
6   ggg     1        0
7   hhh     1        1
8   iii     1        0
9   lll     1        1
10  mmm     1        1
11  aaa     1        3
12  bbb     1        3

My aim is to create a dictionary whose keys are tuples of elements of the id column and whose values are lists of the cluster values the two ids share, computed separately for each value of the days column. That is, when days changes, if a pair of ids again shares a cluster value, that value should be appended to the pair's existing list. The desired output is shown below:

{('aaa', 'bbb'): [0, 3],('aaa', 'eee'): [0], ('bbb', 'eee'): [0], ('ccc', 'ddd'): [1], 
('ccc', 'fff'): [1], ('ddd', 'fff'): [1], ('ggg', 'iii'): [0],
 ('hhh', 'lll'): [1], ('hhh', 'mmm'): [1], ('lll', 'mmm'): [1]}

I obtained this result with the following snippet of code, but with millions of rows it becomes too slow. How can I optimize it?

y = {}
for i in range(0, max(df.iloc[:, 1]) + 1):
    # keep only the rows for the current day
    x = df.loc[df['days'] == i]
    for j in range(0, len(x)):
        for z in range(1, len(x)):
            if (x.iloc[z, 0], x.iloc[j, 0]) in y:
                pass
            elif (x.iloc[j, 0], x.iloc[z, 0]) not in y:
                if x.iloc[j, 0] != x.iloc[z, 0] and x.iloc[j, 2] == x.iloc[z, 2]:
                    y[(x.iloc[j, 0], x.iloc[z, 0])] = [x.iloc[j, 2]]
            else:
                if x.iloc[j, 0] != x.iloc[z, 0] and x.iloc[j, 2] == x.iloc[z, 2]:
                    y[(x.iloc[j, 0], x.iloc[z, 0])].append(x.iloc[j, 2])

  • In your example, ID 'aaa' has possible cluster values of 0 and 3 (for days 0 and 1 respectively). But in your desired output, ID 'aaa' is grouped with 'ccc', 'ddd', 'fff', 'hhh', 'lll', and 'mmm', which have cluster values of either 1 or 2. So I don't understand your statement "if the two id have the same cluster value". Commented Jul 17, 2020 at 15:18
  • @mtrw you are right! Fixed it, the desired output that I posted was wrong! Thank you Commented Jul 17, 2020 at 15:35

2 Answers


Considering that the bottleneck is obtaining the combinations of ids, why not leave it until the very end?

Group the data by id; each id maps to the set of (day, cluster) "bins" in which it appears:

import collections

grouped = collections.defaultdict(set)
for index, (id_, day, cluster) in df.iterrows():
    grouped[id_].add((day, cluster))

For each bin combination found, make a list of the ids that belong to it:

binned = collections.defaultdict(list)
for id_, bins in grouped.items():
    binned[tuple(sorted(bins))].append(id_)

Simplify only by cluster if that is what you need:

clustered = collections.defaultdict(list)
for bins, ids in binned.items():
    clusters = set(cluster for (day, cluster) in bins)
    clustered[tuple(sorted(clusters))].extend(ids)

And finally, getting the combinations of ids for each cluster bin shouldn't be a problem:

import itertools

for bins, ids in clustered.items():
    if len(ids) > 1:
        for comb_id in itertools.combinations(ids, 2):
            print(bins, comb_id)
            # or do other stuff with it
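Putting the steps above together into a self-contained sketch, run on the question's sample data (the DataFrame construction here is reconstructed from the table in the question). Note that the simplify-by-cluster step merges ids across days, so the resulting pair set can be broader than the question's exact desired output:

```python
import collections
import itertools

import pandas as pd

# Sample data reconstructed from the question's table.
df = pd.DataFrame({
    "id": ["aaa", "bbb", "ccc", "ddd", "eee", "fff",
           "ggg", "hhh", "iii", "lll", "mmm", "aaa", "bbb"],
    "days": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
    "cluster": [0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 3, 3],
})

# 1. Group by id: each id maps to the set of (day, cluster) bins it appears in.
grouped = collections.defaultdict(set)
for index, (id_, day, cluster) in df.iterrows():
    grouped[id_].add((day, cluster))

# 2. Invert: each bin combination maps to the list of ids sharing it.
binned = collections.defaultdict(list)
for id_, bins in grouped.items():
    binned[tuple(sorted(bins))].append(id_)

# 3. Keep only the cluster values.
clustered = collections.defaultdict(list)
for bins, ids in binned.items():
    clusters = set(cluster for (day, cluster) in bins)
    clustered[tuple(sorted(clusters))].extend(ids)

# 4. Expand to id pairs only at the very end.
result = {}
for clusters, ids in clustered.items():
    for pair in itertools.combinations(sorted(ids), 2):
        result[pair] = list(clusters)

print(result[("aaa", "bbb")])  # [0, 3]
```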



You can take advantage of the pandas.DataFrame.groupby method:

import collections
import itertools

result = collections.defaultdict(list)

for (day, cluster), group in df.groupby(["days", "cluster"]):
    for comb in itertools.combinations(df["id"][group.index], 2):
        result[comb].append(cluster)

which will give you the result you need:

defaultdict(<class 'list'>, {('aaa', 'bbb'): [0, 3], ('aaa', 'eee'): [0], ('bbb', 'eee'): [0], ('ccc', 'ddd'): [1], ('ccc', 'fff'): [1], ('ddd', 'fff'): [1], ('ggg', 'iii'): [0], ('hhh', 'lll'): [1], ('hhh', 'mmm'): [1], ('lll', 'mmm'): [1]})
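For reference, a runnable end-to-end check of this groupby approach on the question's sample data (the DataFrame construction here is reconstructed from the table in the question):

```python
import collections
import itertools

import pandas as pd

# Sample data reconstructed from the question's table.
df = pd.DataFrame({
    "id": ["aaa", "bbb", "ccc", "ddd", "eee", "fff",
           "ggg", "hhh", "iii", "lll", "mmm", "aaa", "bbb"],
    "days": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
    "cluster": [0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 3, 3],
})

result = collections.defaultdict(list)

# Each (days, cluster) group holds the ids that share a cluster on that day;
# every pair drawn from a group gets the cluster value appended to its list.
for (day, cluster), group in df.groupby(["days", "cluster"]):
    for comb in itertools.combinations(df["id"][group.index], 2):
        result[comb].append(cluster)

print(result[("aaa", "bbb")])  # [0, 3]
```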

2 Comments

Your code looks faster, but unfortunately with 200,000 rows, 9,000 different ids, 2 different days and 8 different clusters, my session crashed after using all available RAM.
It's no surprise. With such data you'd expect to see almost every id many times in each cluster if the numbers are random, so the number of combinations is practically unbounded. Have you tried the inverse approach: use the cluster combination as the dictionary key and the list of ids appearing in that cluster combination as the value? Getting the combinations of ids for each cluster combination would then be feasible. I managed to do it in about 2 minutes using 6 GB of RAM on a random sample with the numbers you mentioned. You could also try a producer-consumer approach.
