1

I have following dataframe as an output of my python script. I would like to add another column with count per pmid and add the counter to the first row, keeping the other rows.

The dataframe looks like this:

df

       PMID gene_symbol    gene_label gene_mentions
0  33377242       MTHFR  Matched Gene             2
1  33414971       CSF3R  Matched Gene            13
2  33414971         BCR    Other Gene             2
3  33414971        ABL1  Matched Gene             1
4  33414971        ESR1  Matched Gene             1
5  33414971      NDUFB3    Other Gene             1
6  33414971        CSF3    Other Gene             1
7  33414971        TP53  Matched Gene             2
8  33414971         SRC  Matched Gene             1
9  33414971        JAK1  Matched Gene             1

Expected out is:

       PMID gene_symbol    gene_label gene_mentions   count
0  33377242       MTHFR  Matched Gene             2   1
1  33414971       CSF3R  Matched Gene            13   9
2  33414971         BCR    Other Gene             2   9
3  33414971        ABL1  Matched Gene             1   9
4  33414971        ESR1  Matched Gene             1   9
5  33414971      NDUFB3    Other Gene             1   9
6  33414971        CSF3    Other Gene             1   9
7  33414971        TP53  Matched Gene             2   9
8  33414971         SRC  Matched Gene             1   9
9  33414971        JAK1  Matched Gene             1   9
10 33414972        MAK2  Matched Gene             1   1

How can I achieve this output?

Thanks

2
  • Is it necessary that "just" first row track count? Commented Feb 10, 2021 at 20:56
  • I would like to show count per pmid. That can be provided for each row.. I have updated the output. Commented Feb 10, 2021 at 20:58

3 Answers 3

1

You can add count for each row with groupby().transform:

df['count'] = df.groupby('PMID')['PMID'].transform('size')

Output:

       PMID gene_symbol    gene_label  gene_mentions  count
0  33377242       MTHFR  Matched Gene              2      1
1  33414971       CSF3R  Matched Gene             13      9
2  33414971         BCR    Other Gene              2      9
3  33414971        ABL1  Matched Gene              1      9
4  33414971        ESR1  Matched Gene              1      9
5  33414971      NDUFB3    Other Gene              1      9
6  33414971        CSF3    Other Gene              1      9
7  33414971        TP53  Matched Gene              2      9
8  33414971         SRC  Matched Gene              1      9
9  33414971        JAK1  Matched Gene              1      9

Now if you really want only count at the first row for each PMID, you can use mask:

df['count'] = df['count'].mask(df['PMID'].duplicated())

Then you would have:

       PMID gene_symbol    gene_label  gene_mentions  count
0  33377242       MTHFR  Matched Gene              2    1.0
1  33414971       CSF3R  Matched Gene             13    9.0
2  33414971         BCR    Other Gene              2    NaN
3  33414971        ABL1  Matched Gene              1    NaN
4  33414971        ESR1  Matched Gene              1    NaN
5  33414971      NDUFB3    Other Gene              1    NaN
6  33414971        CSF3    Other Gene              1    NaN
7  33414971        TP53  Matched Gene              2    NaN
8  33414971         SRC  Matched Gene              1    NaN
9  33414971        JAK1  Matched Gene              1    NaN
Sign up to request clarification or add additional context in comments.

Comments

0

You just want to group by the ID and get the size of the group:

df.set_index("PMID", inplace=True)
df['count'] = df.groupby("PMID").size()

And if you want, you can:

df.reset_index(inplace=True)

3 Comments

@QuangHoang why?
ValueError: Length of values (3) does not match length of index (11)
@BehzadShayegh ah, yes yes
0

You can do it with something like this:

data = [
    {"PMID": "33377242", "gene_label":"MTHFR", "gene_mentions": "Matched Gene"},
    {"PMID": "33377242", "gene_label":"MTHFR", "gene_mentions": "Matched Gene"},
    {"PMID": "33377242", "gene_label":"MTHFR", "gene_mentions": "Matched Gene"},
    {"PMID": "33414971", "gene_label":"MTHFR", "gene_mentions": "Matched Gene"},
    {"PMID": "33377242", "gene_label":"MTHFR", "gene_mentions": "Matched Gene"},
    {"PMID": "33414972", "gene_label":"MTHFR", "gene_mentions": "Matched Gene"},
    {"PMID": "33414971", "gene_label":"MTHFR", "gene_mentions": "Matched Gene"}
]

df = pd.DataFrame(data)
df = df.sort_values("PMID")
df["count"] = df.groupby("PMID")["PMID"].transform("count")
# get position of first element of each group
conditions = df["PMID"] != df["PMID"].shift() 
# remove count of non first row of each group
df["count"][~conditions] = None 

or:

df = df.sort_values("PMID")
# get position of first element of each group
conditions = df["PMID"] != df["PMID"].shift() 
# write count only in first row
df["count"] = None
df["count"][conditions] = df.groupby("PMID")["PMID"].transform("count")[conditions] 

enter image description here

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.