
I'm just getting started with pandas, and I would like to know how to count the number of unique documents per year per company.

My data is in a DataFrame, df:

  year  document_id  company
0   1999    3     Orange
1   1999    5     Orange
2   1999    3     Orange
3   2001    41    Banana
4   2001    21    Strawberry
5   2001    18    Strawberry
6   2002    44    Orange

In the end, I would like to have a new DataFrame like this:

  year    document_id  company nbDocument
0   1999    [3,5]     Orange       2
1   2001    [21]      Banana       1
2   2001    [21,18]   Strawberry   2
3   2002    [44]      Orange       1

I tried:

count2 = df.groupby(['year','company']).agg({'document_id': pd.Series.value_counts})

But with this groupby operation I can't get that structure, and I can't count the unique values (for Orange in 1999, for example). Is there a way to do this?

Thx

  • Shouldn't the document_id of Banana be [41]? Commented Dec 22, 2015 at 16:38

2 Answers


You could create a new DataFrame and add the unique document_id values for each group using a list comprehension, as follows:

result = pd.DataFrame()
result['document_id'] = df.groupby(['company', 'year']).apply(lambda x: [d for d in x['document_id'].drop_duplicates()])

Now that each row holds a list of unique document_id values, you only need the length of that list:

result['nbDocument'] = result['document_id'].apply(len)

Resetting the index and sorting then gives:

result.reset_index().sort_values(['company', 'year'])

      company  year document_id  nbDocument
0      Banana  2001        [41]           1
1      Orange  1999      [3, 5]           2
2      Orange  2002        [44]           1
3  Strawberry  2001    [21, 18]           2

Another option that produces the desired output, grouped by year and then company:

out = pd.DataFrame()
grouped = df.groupby(['year', 'company'])
out['document_id'] = grouped.apply(lambda x: list(x['document_id'].drop_duplicates()))
out['nbDocument'] = out['document_id'].apply(len)
print(out.reset_index().sort_values(['year', 'company']))

   year     company document_id  nbDocument
0  1999      Orange      [3, 5]           2
1  2001      Banana        [41]           1
2  2001  Strawberry    [21, 18]           2
3  2002      Orange        [44]           1
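
As an alternative to the apply-based approaches above, the unique list and the count can probably be computed in a single pass with named aggregation. This is only a sketch: it assumes pandas 0.25 or later (where named aggregation is available) and rebuilds the question's sample data so that it runs on its own. The column names are taken from the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "year": [1999, 1999, 1999, 2001, 2001, 2001, 2002],
    "document_id": [3, 5, 3, 41, 21, 18, 44],
    "company": ["Orange", "Orange", "Orange", "Banana",
                "Strawberry", "Strawberry", "Orange"],
})

result = (
    df.groupby(["year", "company"])
      .agg(
          # Unique ids per group, in order of first appearance.
          document_id=("document_id", lambda s: s.unique().tolist()),
          # Number of distinct ids per group.
          nbDocument=("document_id", "nunique"),
      )
      .reset_index()
)

print(result)
```

Here nunique does the counting directly, so the second apply over the lists is not needed. Note that groupby sorts the group keys by default, which is why the rows come out ordered by year and then company.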
