1

I have a data frame that is created by an application and saves info with the following structure: Each row is one mutation that affects genes and transcripts (these are the same gene but different configuration)

data = {'ID': ['mut1', 'mut1', 'mut1', 'mut1', 'mut2'],
       'transcript_affected': ["00001", "00002", "00003", "00001", "00001"],
       'gene_affected' : ['DIABLO','DIABLO','DIABLO','PLNH3','BRCA1']
       }
df = pd.DataFrame(data)

df
ID      transcript_affected          gene_affected
mut1       00001                       DIABLO    
mut1       00002                       DIABLO
mut1       00003                       DIABLO
mut1       00001                       PLNH3
mut2       00001                       BRCA1

# Mut 1 affects 2 genes (DIABLO and PLNH3), mut2 affects 1 gene

From this I have two questions:

  1. How many mutations affect 1 gene.

  2. How likely (in %) more than one transcript is affected when one gene is affected. I mean, proportion of genes with more than one transcript. For this example would be 33% (there is 3 genes affected but only one affected more than 1)

Some ideas I think it could be done

To know how many mutations affect more than 1 gene I think this would be a start

df.groupby('ID')['transcript_affected'].count()
ID
mut1    4
mut2    1

And then I could count how many IDs its value is more than 1

For the second question

df.groupby('gene_affected')['transcript_affected'].count()
gene_affected
BRCA1     1
DIABLO    3
PLNH3     1
Name: transcript_affected, dtype: int64

Then I could count somehow (I don't) how many were more than 1 (>=2).

2
  • It would be helpful if you also post a sample of the output format you required Commented Mar 23, 2022 at 11:31
  • Will do in a minute Commented Mar 23, 2022 at 11:31

1 Answer 1

1

please try this:

Answer 1:

df_1 = df.groupby('ID')['gene_affected'].count().reset_index()
answer_1 = df_1[df_1['gene_affected']>1].shape[0]

Answer 2:

df_2 = df.groupby('gene_affected')['transcript_affected'].count().reset_index()
answer_2 = (df_2[df_2['transcript_affected']>1].shape[0]/df_2.shape[0]) * 100
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.