1

I have a column data as follows:

abc|frame|gtk|enst.24|pc|hg|,abc|framex|gtk4|enst.35|pxc|h5g|,abc|frbx|hgk4|enst.23|pix|hokg|
abc|frame|gtk|enst.15|pc|hg|,abc|framex|gtk2|enst.59|pxc|h5g|,abc|frbx|hgk4|enst.18|pif|homg|
abc|frame|gtk|enst.98|pc|hg|,abc|framex|gtk1|enst.45|pxc|h5g|,abc|frbx|hgk4|enst.74|pig|hofg|
abc|frame|gtk|enst.34|pc|hg|,abc|framex|gtk1|enst.67|pxc|h5g|,abc|frbx|hgk4|enst.39|pik|hoqg|

I want to search and extract specific keywords within the frame and extract only that data with in the separators

Specific keywords are

enst.35
enst.18
enst.98
enst.63

The expected output is

abc|framex|gtk4|enst.35|pxc|h5g|
abc|frbx|hgk4|enst.18|pif|homg|
abc|frame|gtk|enst.98|pc|hg|
NA

I tried this herebut was not working effectively

3 Answers 3

1

You can construct a regex pattern using the given keywords then use Series.str.findall to find all occurrences of regex in series:

import re

keywords = ['enst.35','enst.18','enst.98','enst.63']
pattern = '|'.join([rf'[^,]+{re.escape(k)}[^,]+'for k in keywords])
result = df['col'].str.findall(pattern).str.get(0)

#print(result)

0    abc|framex|gtk4|enst.35|pxc|h5g|
1     abc|frbx|hgk4|enst.18|pif|homg|
2        abc|frame|gtk|enst.98|pc|hg|
3                                 NaN
Name: col, dtype: object

You can test the regex pattern here

Sign up to request clarification or add additional context in comments.

Comments

0

You can try in bashscript as follows:

for STRING in enst.35 enst.18 enst.98 enst.63; do
  tr \, \\n < file.txt | grep "$STRING" || echo NA
done

Comments

0

With str.extract and str.split(',) to take the last comma separated value:

df['Data2'] = df['Data'].str.extract('(^.*enst.35\|.+?\|.+?\||^.*enst.18\|.+?\|.+?\||^.*enst.98\|.+?\|.+?\||^.*enst.63\|.+?\|.+?\||)', expand=False).str.split(',').str[-1]

You could create a list of keywords and do list comprehension as well per another answer.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.