Add counter as an additional column in Python pandas dataframe

Question

I have the following dataframe as the output of my Python script. I would like to add another column with the count of rows per PMID, placing the count on the first row of each PMID while keeping the other rows.

The dataframe looks like this:

df

       PMID gene_symbol    gene_label gene_mentions
0  33377242       MTHFR  Matched Gene             2
1  33414971       CSF3R  Matched Gene            13
2  33414971         BCR    Other Gene             2
3  33414971        ABL1  Matched Gene             1
4  33414971        ESR1  Matched Gene             1
5  33414971      NDUFB3    Other Gene             1
6  33414971        CSF3    Other Gene             1
7  33414971        TP53  Matched Gene             2
8  33414971         SRC  Matched Gene             1
9  33414971        JAK1  Matched Gene             1

Expected output is:

        PMID gene_symbol    gene_label  gene_mentions  count
0   33377242       MTHFR  Matched Gene              2      1
1   33414971       CSF3R  Matched Gene             13      9
2   33414971         BCR    Other Gene              2      9
3   33414971        ABL1  Matched Gene              1      9
4   33414971        ESR1  Matched Gene              1      9
5   33414971      NDUFB3    Other Gene              1      9
6   33414971        CSF3    Other Gene              1      9
7   33414971        TP53  Matched Gene              2      9
8   33414971         SRC  Matched Gene              1      9
9   33414971        JAK1  Matched Gene              1      9
10  33414972        MAK2  Matched Gene              1      1

How can I achieve this output?

Thanks

I would like to show the count per PMID. It can be provided on each row. I have updated the output.
– rshar, Commented Feb 10, 2021 at 20:58

Quang Hoang · Accepted Answer · 2021-02-10 21:04:38Z

You can add the count to every row with groupby().transform:

df['count'] = df.groupby('PMID')['PMID'].transform('size')

Output:

       PMID gene_symbol    gene_label  gene_mentions  count
0  33377242       MTHFR  Matched Gene              2      1
1  33414971       CSF3R  Matched Gene             13      9
2  33414971         BCR    Other Gene              2      9
3  33414971        ABL1  Matched Gene              1      9
4  33414971        ESR1  Matched Gene              1      9
5  33414971      NDUFB3    Other Gene              1      9
6  33414971        CSF3    Other Gene              1      9
7  33414971        TP53  Matched Gene              2      9
8  33414971         SRC  Matched Gene              1      9
9  33414971        JAK1  Matched Gene              1      9
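As a runnable illustration of the line above, here is a minimal sketch on a shortened sample (the four rows and the reduced column set are assumptions made for brevity, not the asker's full data). It also shows how transform('size') differs from a plain size(), which returns one value per group rather than one per row.

```python
import pandas as pd

# Shortened, illustrative sample with the same column names as the question.
df = pd.DataFrame({
    "PMID": [33377242, 33414971, 33414971, 33414971],
    "gene_symbol": ["MTHFR", "CSF3R", "BCR", "ABL1"],
    "gene_mentions": [2, 13, 2, 1],
})

# size() collapses to one entry per PMID ...
per_group = df.groupby("PMID").size()

# ... while transform("size") broadcasts that size back onto every row,
# so the result has the same length and index as df.
df["count"] = df.groupby("PMID")["PMID"].transform("size")

print(per_group)
print(df)
```

Because the transformed result keeps the original index, it can be assigned directly as a new column without any merge or reindexing.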

If you really want the count only on the first row of each PMID, you can use mask:

df['count'] = df['count'].mask(df['PMID'].duplicated())

Then you would have:

       PMID gene_symbol    gene_label  gene_mentions  count
0  33377242       MTHFR  Matched Gene              2    1.0
1  33414971       CSF3R  Matched Gene             13    9.0
2  33414971         BCR    Other Gene              2    NaN
3  33414971        ABL1  Matched Gene              1    NaN
4  33414971        ESR1  Matched Gene              1    NaN
5  33414971      NDUFB3    Other Gene              1    NaN
6  33414971        CSF3    Other Gene              1    NaN
7  33414971        TP53  Matched Gene              2    NaN
8  33414971         SRC  Matched Gene              1    NaN
9  33414971        JAK1  Matched Gene              1    NaN
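The counts above are shown as floats (1.0, 9.0) because NaN forces the column to a floating-point dtype. If integer display matters, one option is pandas' nullable integer dtype. The following is a sketch on an assumed four-row sample, not part of the original answer:

```python
import pandas as pd

# Illustrative sample; only the PMID column is needed here.
df = pd.DataFrame({"PMID": [33377242, 33414971, 33414971, 33414971]})

counts = df.groupby("PMID")["PMID"].transform("size")

# Blank out every repeated PMID, then cast to the nullable "Int64" dtype
# so the remaining counts stay integers and the blanks become <NA>.
df["count"] = counts.mask(df["PMID"].duplicated()).astype("Int64")

print(df)
```

Note the capital "I" in "Int64": that is the nullable extension dtype, distinct from NumPy's "int64", which cannot hold missing values.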

juanpa.arrivillaga · Accepted Answer · 2021-02-10 20:59:11Z


You just want to group by the ID and get the size of the group:

df.set_index("PMID", inplace=True)
df['count'] = df.groupby("PMID").size()

And if you want, you can:

df.reset_index(inplace=True)
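The snippet above moves PMID into the index so that the per-group sizes can line up with the rows on assignment. A sketch of an alternative that leaves the index untouched, using Series.map with value_counts (the three-row sample is an assumption for illustration):

```python
import pandas as pd

# Illustrative sample with the question's column names.
df = pd.DataFrame({
    "PMID": [33377242, 33414971, 33414971],
    "gene_symbol": ["MTHFR", "CSF3R", "BCR"],
})

# value_counts() gives one count per distinct PMID (indexed by PMID);
# map() looks each row's PMID up in that Series.
df["count"] = df["PMID"].map(df["PMID"].value_counts())

print(df)
```

This avoids the set_index/reset_index round trip and keeps the column order unchanged.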

edited Feb 10, 2021 at 20:59

answered Feb 10, 2021 at 20:54


3 Comments

juanpa.arrivillaga Over a year ago

@QuangHoang why?

Behzad Shayegh Over a year ago

ValueError: Length of values (3) does not match length of index (11)

juanpa.arrivillaga Over a year ago

@BehzadShayegh ah, yes yes

Ángel Igualada · Accepted Answer · 2021-02-10 21:03:01Z

You can do it with something like this:

import pandas as pd

data = [
    {"PMID": "33377242", "gene_label":"MTHFR", "gene_mentions": "Matched Gene"},
    {"PMID": "33377242", "gene_label":"MTHFR", "gene_mentions": "Matched Gene"},
    {"PMID": "33377242", "gene_label":"MTHFR", "gene_mentions": "Matched Gene"},
    {"PMID": "33414971", "gene_label":"MTHFR", "gene_mentions": "Matched Gene"},
    {"PMID": "33377242", "gene_label":"MTHFR", "gene_mentions": "Matched Gene"},
    {"PMID": "33414972", "gene_label":"MTHFR", "gene_mentions": "Matched Gene"},
    {"PMID": "33414971", "gene_label":"MTHFR", "gene_mentions": "Matched Gene"}
]

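df = pd.DataFrame(data)
# sort so that rows with the same PMID are adjacent
df = df.sort_values("PMID")
df["count"] = df.groupby("PMID")["PMID"].transform("count")
# mark the first row of each group
conditions = df["PMID"] != df["PMID"].shift()
# keep the count on the first row of each group, NaN elsewhere
# (Series.where avoids chained assignment such as df["count"][...] = None)
df["count"] = df["count"].where(conditions)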

or:

df = df.sort_values("PMID")
# mark the first row of each group
conditions = df["PMID"] != df["PMID"].shift()
# write the count only in the first row of each group, NaN elsewhere
df["count"] = df.groupby("PMID")["PMID"].transform("count").where(conditions)
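Both variants sort first because comparing each row with the previous one only finds group starts when equal PMIDs are adjacent. A small sketch of what happens without the sort, next to duplicated(), which does not depend on row order (the interleaved sample is made up for illustration):

```python
import pandas as pd

# Rows for the same PMID are deliberately not adjacent.
df = pd.DataFrame({"PMID": ["33377242", "33414971", "33377242", "33414971"]})

# Comparing with the previous row flags every row here, because no two
# neighbouring rows share a PMID.
first_by_shift = df["PMID"] != df["PMID"].shift()

# duplicated() marks every repeat of an already seen PMID, wherever it occurs,
# so its negation picks out the true first occurrences.
first_by_duplicated = ~df["PMID"].duplicated()

print(first_by_shift.tolist())
print(first_by_duplicated.tolist())
```

If the original row order should be preserved, the duplicated() form can be used as the condition without sorting.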
