1

I have a data frame of patent numbers and the inventors who invented those patents. For example:

patent_number inventor_id
1 A
1 B
2 B
2 C
3 A
3 B

I define a team as a group of inventors who produce a patent together. E.g. the team (A,B) produced patent 1, (B,C) patent 2 and again (A,B) produced patent 3. I want to count the number of unique teams. In this case the answer is 2.

What is the fastest way of counting the number of unique teams using Python?

I have written this code, but it is very slow when I run it on my entire data set, which includes over 6 million patent numbers and 3.5 million unique inventor ids.

teams = []

for pat_id, pat_df in inventor_data.groupby("patent_number"):
    if list(pat_df["inventor_id"]) not in teams:
        teams.append(list(pat_df["inventor_id"]))

print("Number of teams ", len(teams))

I am looking for speed improvements. If you can also help me understand why a suggestion is faster, I am always keen to learn.

Thank you!

2 Answers

4

You can groupby and aggregate as frozenset and count the unique values:

df.groupby('patent_number')['inventor_id'].agg(frozenset).nunique()

Output: 2

Interestingly, you can also easily get the number of occurrences of each team with value_counts:

df.groupby('patent_number')['inventor_id'].agg(frozenset).value_counts()

Output:

(B, A)    2
(B, C)    1
Name: inventor_id, dtype: int64
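
On why this is likely faster than the loop in the question: `list(...) not in teams` scans the whole list for every patent, so the work grows roughly quadratically with the number of patents, whereas a frozenset is hashable and can be counted with hash lookups. Below is a minimal sketch on the sample data from the question; the variable names `teams_per_patent` and `unique_teams` are illustrative only.

```python
import pandas as pd

# Sample data from the question.
inventor_data = pd.DataFrame({
    "patent_number": [1, 1, 2, 2, 3, 3],
    "inventor_id": ["A", "B", "B", "C", "A", "B"],
})

# One frozenset of inventors per patent. A frozenset is unordered and hashable,
# so (A, B) and (B, A) compare equal and can be used as set members.
teams_per_patent = (
    inventor_data.groupby("patent_number")["inventor_id"].agg(frozenset)
)

# Hash-based deduplication: average constant time per team,
# instead of scanning a growing list for every patent.
unique_teams = set(teams_per_patent)

n_teams = teams_per_patent.nunique()
print("Number of teams", n_teams)
```

Here `len(unique_teams)` and `n_teams` should agree, since both deduplicate the same frozensets.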

1

You could go for:

   inventor_data = inventor_data.sort_values("inventor_id")
   inventor_data.groupby("patent_number").inventor_id.sum().nunique()

A few explanations:

  • Sorting the values is required so that the order of inventors does not matter, i.e. (A,B) and (B,A) are counted as a single team.
  • Summing the strings "A" and "B" concatenates them into the string "AB", which represents the team (A, B).
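
One caveat with plain concatenation is that ids of different lengths could produce the same key (for example "AB" + "C" and "A" + "BC"). A minimal sketch of the same idea with a separator, assuming the character "|" never occurs in an inventor id (that assumption, and the name `team_keys`, are mine and not from the question):

```python
import pandas as pd

# Sample data from the question, with patent 1 listed in reverse order
# to show that sorting makes (B, A) and (A, B) the same team.
inventor_data = pd.DataFrame({
    "patent_number": [1, 1, 2, 2, 3, 3],
    "inventor_id": ["B", "A", "B", "C", "A", "B"],
})

# Sort by inventor so every team is written in one canonical order.
sorted_data = inventor_data.sort_values("inventor_id")

# Join the ids of each patent with a separator assumed to be absent from the ids.
team_keys = sorted_data.groupby("patent_number")["inventor_id"].agg("|".join)

n_teams = team_keys.nunique()
print("Number of teams", n_teams)
```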

6 Comments

Thank you! I have timed your code (method 1) and the answer given by @mozway above (method 2). Theirs appears fractionally faster, but the two give very slightly different results. Do you have any idea why? Method 1 (@Grégoire): 3667014 teams, time elapsed 0:00:43.931967. Method 2 (@mozway): 3666748 teams, time elapsed 0:00:38.515821.
@Joe The frozenset makes each group unordered, so no sorting is needed. Sorting is more computationally expensive (although for 2 values the cost should be quite minimal), which might be the cause of the time difference. Regarding the results, can you give examples of the differences?
Indeed, I believe the frozenset approach from @mozway is cleaner. Do you have duplicates in your data? For example, consider 3 rows for a given patent with inventor = ["A", "A", "B"]: approach #1 would give "AAB", while approach #2 would give {"A", "B"}. @JoeEmmens
I guess it doesn't really make sense to list one person twice, but you never know ;)
Btw, better to use agg(''.join) rather than sum. Repeated string concatenation is very inefficient.
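
To illustrate the duplicate-row explanation suggested in the comments, here is a small sketch on made-up data (not the asker's data set) in which one inventor appears twice on a patent. The helper names `count_by_join` and `count_by_frozenset` are hypothetical. It shows one way the two methods could disagree, not a confirmed diagnosis of the difference reported above.

```python
import pandas as pd

# Hypothetical data: inventor "A" is listed twice on patent 1.
dup_data = pd.DataFrame({
    "patent_number": [1, 1, 1, 2, 2],
    "inventor_id": ["A", "A", "B", "A", "B"],
})


def count_by_join(df):
    # Sort, then concatenate ids per patent with ''.join instead of sum.
    ordered = df.sort_values("inventor_id")
    return ordered.groupby("patent_number")["inventor_id"].agg("".join).nunique()


def count_by_frozenset(df):
    # A frozenset drops repeated members automatically.
    return df.groupby("patent_number")["inventor_id"].agg(frozenset).nunique()


join_count = count_by_join(dup_data)      # keys are "AAB" and "AB"
set_count = count_by_frozenset(dup_data)  # both patents give {"A", "B"}

# Dropping repeated (patent, inventor) rows first makes the two methods agree.
deduped = dup_data.drop_duplicates(subset=["patent_number", "inventor_id"])
join_count_deduped = count_by_join(deduped)

print(join_count, set_count, join_count_deduped)
```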