I am a beginner in Python and have run into the following problem. I have a long list of strings (only three are shown here as an example):

ENSEMBL_IDs = ['ENSG00000040608',
               'ENSG00000070371',
               'ENSG00000070413']

which are partial matches of the data in column 0 of my DataFrame genes_df (first 3 entries shown):

genes_list = (['ENSG00000040608.28', 'RTN4R'],
            ['ENSG00000070371.91', 'CLTCL1'],
            ['ENSG00000070413.17', 'DGCR2'])

genes_df = pd.DataFrame(genes_list)

The task is conceptually simple: I want to compare each element of ENSEMBL_IDs to genes_df.iloc[:,0]. These are partial matches: each element of ENSEMBL_IDs is contained in an entry of column 0 of genes_df, as outlined above. If an element of ENSEMBL_IDs matches an entry in genes_df.iloc[:,0] (which it does, apart from the extra version suffix ".XX" after the period), I want to return the corresponding value stored in column 1 of the genes_df DataFrame, i.e. the actual gene name, for example 'RTN4R'.

I want to store these in a list. So, in the end, I would be left with a list like follows:

`genenames = ['RTN4R', 'CLTCL1', 'DGCR2']`

Some info that might be helpful: all of the entries in ENSEMBL_IDs are unique, and all of them are for sure contained in column 0 of genes_df.

I think I am looking for something along the lines of:

`genenames = []
for i in ENSEMBL_IDs:
    if i in genes_df.iloc[:,0]:
        genenames.append(# corresponding value in genes_df.iloc[:,1])`

I am sorry if the question has been asked before; I kept looking and was not able to find a solution that was applicable to my problem.

Thank you for your help!

Thanks also for the edit, English is not my first language, so the improvements were insightful.

1 Answer

You can get rid of the part after the dot (with str.extract or str.replace) before matching the values with isin:

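# keep everything before the first dot, then test membership in the list
m = genes_df[0].str.extract(r'([^.]+)', expand=False).isin(ENSEMBL_IDs)
# or: delete the dot and everything after it (raw string avoids an invalid-escape warning)
m = genes_df[0].str.replace(r'\..*$', '', regex=True).isin(ENSEMBL_IDs)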

out = genes_df.loc[m, 1].tolist()
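
Note that the boolean mask above returns the gene names in the row order of genes_df, not in the order of ENSEMBL_IDs. If the order of the query list matters, one possible sketch is to build a lookup Series indexed by the stripped IDs. The names `query_ids`, `bare_ids`, and `lookup` are illustrative and not from the question, the query list is shuffled on purpose, and the sketch assumes every query ID is present and the stripped IDs are unique:

```python
import pandas as pd

# Sample data from the question; the query list is deliberately shuffled
# (an assumption for illustration) to show that its order is preserved.
query_ids = ['ENSG00000070413', 'ENSG00000040608', 'ENSG00000070371']
genes_df = pd.DataFrame([
    ['ENSG00000040608.28', 'RTN4R'],
    ['ENSG00000070371.91', 'CLTCL1'],
    ['ENSG00000070413.17', 'DGCR2'],
])

# Strip the version suffix and use the bare ID as the index of a lookup Series.
bare_ids = genes_df[0].str.extract(r'([^.]+)', expand=False)
lookup = pd.Series(genes_df[1].to_numpy(), index=bare_ids)

# Selecting with a list of labels returns the values in the order of that list.
# This raises a KeyError if any query ID is missing from the index.
genenames = lookup.loc[query_ids].tolist()
print(genenames)  # ['DGCR2', 'RTN4R', 'CLTCL1']
```

If some IDs might be missing, `lookup.reindex(query_ids)` would return NaN for them instead of raising.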

Or use a regex with str.match:

pattern = '|'.join(ENSEMBL_IDs)
m = genes_df[0].str.match(pattern)

out = genes_df.loc[m, 1].tolist()

Output: ['RTN4R', 'CLTCL1', 'DGCR2']
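
One caveat with the str.match variant: it only anchors the pattern at the start of each string, so an ID that happens to be a prefix of a longer ID would also match. A stricter variant, as a sketch, escapes each ID and requires the whole string to match. It assumes the version suffix is always a dot followed by digits, and the `DECOY` row is a made-up entry added only to show the difference:

```python
import re
import pandas as pd

ENSEMBL_IDs = ['ENSG00000040608', 'ENSG00000070371', 'ENSG00000070413']
genes_df = pd.DataFrame([
    ['ENSG00000040608.28', 'RTN4R'],
    ['ENSG00000070371.91', 'CLTCL1'],
    ['ENSG00000070413.17', 'DGCR2'],
    ['ENSG000000406080.1', 'DECOY'],  # hypothetical row: longer ID sharing a prefix
])

# Escape each ID so it is treated literally, then allow an optional ".digits" suffix.
alternatives = '|'.join(map(re.escape, ENSEMBL_IDs))
pattern = rf'(?:{alternatives})(?:\.\d+)?'

# fullmatch requires the entire string to match, so the decoy row is rejected.
m = genes_df[0].str.fullmatch(pattern)
out = genes_df.loc[m, 1].tolist()
print(out)  # ['RTN4R', 'CLTCL1', 'DGCR2']
```

For a very long list of IDs, the isin approach is likely the simpler choice, since it avoids building one large alternation pattern.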

1 Comment

Thank you mozway! That was helpful and solved the problem.
