Count unique groups within a pandas data frame

Question

I have a data frame of patent numbers and the inventors who invented those patents. For example:

patent_number	inventor_id
1	A
1	B
2	B
2	C
3	A
3	B

I define a team as a group of inventors who produce a patent together. E.g. the team (A,B) produced patent 1, (B,C) patent 2 and again (A,B) produced patent 3. I want to count the number of unique teams. In this case the answer is 2.

What is the fastest way of counting the number of unique teams using python?

I have written this code, but it is very slow when I run it on my entire data set which includes over 6 million patent numbers and 3.5 million unique inventor ids.

teams = []

for pat_id, pat_df in inventor_data.groupby("patent_number"):
    if list(pat_df["inventor_id"]) not in teams:
        teams.append(list(pat_df["inventor_id"]))

print("Number of teams ", len(teams))

I am looking for speed improvements. If you can also help me understand why a suggested approach is faster, I am always keen to learn.

Thank you!

mozway · Accepted Answer · 2022-02-01 20:47:51Z

4

You can groupby and aggregate as frozenset and count the unique values:

df.groupby('patent_number')['inventor_id'].agg(frozenset).nunique()

Output: 2
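As a minimal sketch of why this helps, assuming the sample data from the question: the original loop checks `not in teams` against a list, which scans every stored team each time, so the total work grows roughly quadratically with the number of patents. A frozenset is hashable and ignores order, so uniqueness can be decided by hashing instead. The variable names below are illustrative only.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "patent_number": [1, 1, 2, 2, 3, 3],
    "inventor_id": ["A", "B", "B", "C", "A", "B"],
})

# One frozenset of inventors per patent; order within a team is irrelevant.
teams = df.groupby("patent_number")["inventor_id"].agg(frozenset)
n_teams = teams.nunique()

# The same idea in plain Python: a set of frozensets gives
# average constant-time membership checks instead of a list scan.
seen = set()
for _, ids in df.groupby("patent_number")["inventor_id"]:
    seen.add(frozenset(ids))

print(n_teams, len(seen))
```

Both counts should agree on this sample. The pandas version avoids the explicit Python loop, which is usually the more convenient form.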

Interestingly, you can also easily get the number of occurrences of each team with value_counts:

df.groupby('patent_number')['inventor_id'].agg(frozenset).value_counts()

Output:

(B, A)    2
(B, C)    1
Name: inventor_id, dtype: int64

edited Feb 1, 2022 at 20:47

answered Feb 1, 2022 at 20:43


Grégoire · Accepted Answer · 2022-02-01 20:47:20Z

1

You could go for:

   inventor_data = inventor_data.sort_values("inventor_id")
   inventor_data.groupby("patent_number").inventor_id.sum().nunique()

A few explanations:

Sorting the values is necessary so that the order of inventors within a patent does not matter: (A,B) and (B,A) are then treated as a single team.
Summing the strings "A" and "B" concatenates them into the string "AB", which represents the team (A, B).
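A hedged variant of the same idea, assuming the sample data from the question (with the rows deliberately out of order): joining with a separator rather than summing avoids one possible collision, since without a separator the ids "AB" + "C" and "A" + "BC" would both become "ABC". The "|" separator is an assumption and must not occur inside any inventor id.

```python
import pandas as pd

# Sample data; inventors of patent 1 are listed out of order on purpose.
inventor_data = pd.DataFrame({
    "patent_number": [1, 1, 2, 2, 3, 3],
    "inventor_id": ["B", "A", "B", "C", "A", "B"],
})

# Sort so each team is always written in the same order, then build
# one string key per patent by joining the ids with a separator.
team_keys = (
    inventor_data.sort_values("inventor_id")
    .groupby("patent_number")["inventor_id"]
    .agg("|".join)
)
n_teams = team_keys.nunique()

print(team_keys.to_dict())
print(n_teams)
```

Using `join` also avoids building the key through repeated string concatenation.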

answered Feb 1, 2022 at 20:47


6 Comments

Joe Emmens Over a year ago

Thank you! I have timed your code and the answer given by @mozway above; yours is method 1 and theirs is method 2. Theirs appears fractionally faster, but the two give very slightly different results. Do you have any idea why?
Number of teams: 3667014. Time elapsed to count teams, method 1 (@Grégoire): 0:00:43.931967
Number of teams: 3666748. Time elapsed to count teams, method 2 (@mozway): 0:00:38.515821

mozway Over a year ago

@Joe The set guarantees that the groups are unordered. Sorting is more computationally expensive (although for 2 values the cost should be minimal), which might explain the time difference. Regarding the results, can you give examples of differences?

Grégoire Over a year ago

Indeed, I believe the frozenset approach from @mozway is cleaner. Do you have duplicates in your data? For example, consider 3 rows for a given patent with inventor = ["A", "A", "B"]: approach #1 would give "AAB", while approach #2 would give {"A", "B"}. @JoeEmmens

mozway Over a year ago

I guess it doesn't really make sense to duplicate one person, but we never know ;)

mozway Over a year ago

Btw, better use agg(''.join) rather than sum. Repeated string concatenation is very inefficient.
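The duplicate-row hypothesis from this thread can be illustrated with a small sketch. The data below is made up for illustration and is not the asker's; whether duplicates actually explain the reported difference is not confirmed in the thread, although a larger count for the concatenation method is what duplicates would produce.

```python
import pandas as pd

# Hypothetical data: patent 1 lists inventor "A" twice.
df = pd.DataFrame({
    "patent_number": [1, 1, 1, 2, 2],
    "inventor_id": ["A", "A", "B", "A", "B"],
})

grouped = df.sort_values("inventor_id").groupby("patent_number")["inventor_id"]

# Concatenation keeps the repeat: "AAB" and "AB" count as two teams.
by_concat = grouped.agg("".join).nunique()

# A frozenset drops the repeat: both patents map to {"A", "B"}.
by_set = grouped.agg(frozenset).nunique()

# Removing duplicate (patent, inventor) rows first reconciles the two.
deduped = df.drop_duplicates(["patent_number", "inventor_id"])
by_concat_dedup = (
    deduped.sort_values("inventor_id")
    .groupby("patent_number")["inventor_id"]
    .agg("".join)
    .nunique()
)

print(by_concat, by_set, by_concat_dedup)
```

If the two methods still disagree after dropping duplicates, the cause lies elsewhere, for example in ids whose concatenations happen to coincide.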
