Python: (partial) matching elements of a list to DataFrame columns, returning entry of a different column

Question

I am a beginner in Python and have encountered the following problem: I have a long list of strings (only 3 are shown here as an example):

ENSEMBL_IDs = ['ENSG00000040608',
               'ENSG00000070371',
               'ENSG00000070413']

which are partial matches of the data in column 0 of my DataFrame genes_df (first 3 entries shown):

import pandas as pd

genes_list = (['ENSG00000040608.28', 'RTN4R'],
            ['ENSG00000070371.91', 'CLTCL1'],
            ['ENSG00000070413.17', 'DGCR2'])

genes_df = pd.DataFrame(genes_list)

The task I want to perform is conceptually not that difficult: I want to compare each element of ENSEMBL_IDs to genes_df.iloc[:,0]. These are partial matches: each element of ENSEMBL_IDs is contained within an entry of column 0 of genes_df, as outlined above. If the element of ENSEMBL_IDs matches the element in genes_df.iloc[:,0] (which it does, apart from the extra numbers after the period, ".XX"), I want to return the corresponding value stored in column 1 of the genes_df DataFrame: the actual gene name, 'RTN4R' as an example.

I want to store these in a list. So, in the end, I would be left with a list like the following:

`genenames = ['RTN4R', 'CLTCL1', 'DGCR2']`

Some info that might be helpful: all of the entries in ENSEMBL_IDs are unique, and all of them are for sure contained in column 0 of genes_df.

I think I am looking for something along the lines of:

`genenames = []
for i in ENSEMBL_IDs:
    if i in genes_df.iloc[:,0]:
        genenames.append(# corresponding value in genes_df.iloc[:,1])`

I am sorry if the question has been asked before; I kept looking and was not able to find a solution that was applicable to my problem.

Thank you for your help!

Thanks also for the edit, English is not my first language, so the improvements were insightful.

mozway · Accepted Answer · 2023-01-05 12:57:07Z

You can get rid of the part after the dot (with str.extract or str.replace) before matching the values with isin:

m = genes_df[0].str.extract('([^.]+)', expand=False).isin(ENSEMBL_IDs)
# or
m = genes_df[0].str.replace(r'\..*$', '', regex=True).isin(ENSEMBL_IDs)

out = genes_df.loc[m, 1].tolist()
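
For reference, here is the first variant as a self-contained sketch that can be run as-is. The fourth row (ID and name) is made up for illustration and is not in the question's data; it is only there to show that non-listed IDs are filtered out.

```python
import pandas as pd

ENSEMBL_IDs = ['ENSG00000040608', 'ENSG00000070371', 'ENSG00000070413']

# Same layout as genes_df in the question, plus one hypothetical row
# (made-up ID and name) that should NOT be selected.
genes_df = pd.DataFrame([
    ['ENSG00000040608.28', 'RTN4R'],
    ['ENSG00000070371.91', 'CLTCL1'],
    ['ENSG00000070413.17', 'DGCR2'],
    ['ENSG00000999999.1', 'NOT_WANTED'],
])

# Keep only the part of each ID that comes before the first dot.
unversioned = genes_df[0].str.extract('([^.]+)', expand=False)

# Boolean mask: True where the unversioned ID is in the list.
m = unversioned.isin(ENSEMBL_IDs)

# Select the gene names (column 1) for the matching rows.
out = genes_df.loc[m, 1].tolist()
print(out)  # ['RTN4R', 'CLTCL1', 'DGCR2']
```

Note that the result follows the row order of genes_df, not the order of ENSEMBL_IDs.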

Or use a regex with str.match:

pattern = '|'.join(ENSEMBL_IDs)
m = genes_df[0].str.match(pattern)

out = genes_df.loc[m, 1].tolist()
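
One caveat with the plain joined pattern: str.match only anchors at the start of the string, so a longer ID that merely begins with one of the listed IDs would also match. The sketch below is one possible way to tighten it, assuming every ID is followed either by a dot or by the end of the string. The 'LOOKALIKE' row and the names loose and strict are made up for illustration.

```python
import re
import pandas as pd

ENSEMBL_IDs = ['ENSG00000040608', 'ENSG00000070371', 'ENSG00000070413']

genes_df = pd.DataFrame([
    ['ENSG00000040608.28', 'RTN4R'],
    ['ENSG00000070371.91', 'CLTCL1'],
    ['ENSG00000070413.17', 'DGCR2'],
    # Hypothetical row: its ID starts with a listed ID but has one more digit.
    ['ENSG000000406081.5', 'LOOKALIKE'],
])

# Plain alternation: matches any string that STARTS with a listed ID.
loose = '|'.join(ENSEMBL_IDs)
loose_hits = genes_df.loc[genes_df[0].str.match(loose), 1].tolist()

# Stricter pattern: escape each ID and require a dot or end of string after it.
strict = '(?:' + '|'.join(map(re.escape, ENSEMBL_IDs)) + r')(?:\.|$)'
strict_hits = genes_df.loc[genes_df[0].str.match(strict), 1].tolist()

print(loose_hits)   # includes 'LOOKALIKE'
print(strict_hits)  # only the three intended gene names
```

With a very long list of IDs the alternation pattern grows accordingly, so the isin variant is probably the simpler choice there.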

Output: ['RTN4R', 'CLTCL1', 'DGCR2']
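
Both approaches return the names in the row order of genes_df. If the result should instead follow the order of ENSEMBL_IDs, one option is to build a lookup from unversioned ID to gene name first. This is a minimal sketch, assuming each unversioned ID occurs only once in genes_df (the question only states that the entries of ENSEMBL_IDs are unique). The name name_by_id is made up for illustration, and the list is shuffled on purpose.

```python
import pandas as pd

# Deliberately in a different order than the rows of genes_df.
ENSEMBL_IDs = ['ENSG00000070413', 'ENSG00000040608', 'ENSG00000070371']

genes_df = pd.DataFrame([
    ['ENSG00000040608.28', 'RTN4R'],
    ['ENSG00000070371.91', 'CLTCL1'],
    ['ENSG00000070413.17', 'DGCR2'],
])

# Strip the version suffix (everything from the first dot onwards).
unversioned = genes_df[0].str.replace(r'\..*$', '', regex=True)

# Plain dict: unversioned ID -> gene name.
# Assumption: no unversioned ID is duplicated, otherwise later rows win.
name_by_id = dict(zip(unversioned, genes_df[1]))

# Look up each ID in list order; a missing ID would raise KeyError.
genenames = [name_by_id[i] for i in ENSEMBL_IDs]
print(genenames)  # ['DGCR2', 'RTN4R', 'CLTCL1']
```

If some IDs might be missing from genes_df, name_by_id.get(i) would return None for those instead of raising.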

edited Jan 5, 2023 at 12:57

answered Jan 5, 2023 at 12:51


1 Comment

tom Over a year ago

Thank you mozway! That was helpful and solved the problem.
