How to find specific keywords within a frame and extract only the data between the separators

Question

I have a column of data as follows:

abc|frame|gtk|enst.24|pc|hg|,abc|framex|gtk4|enst.35|pxc|h5g|,abc|frbx|hgk4|enst.23|pix|hokg|
abc|frame|gtk|enst.15|pc|hg|,abc|framex|gtk2|enst.59|pxc|h5g|,abc|frbx|hgk4|enst.18|pif|homg|
abc|frame|gtk|enst.98|pc|hg|,abc|framex|gtk1|enst.45|pxc|h5g|,abc|frbx|hgk4|enst.74|pig|hofg|
abc|frame|gtk|enst.34|pc|hg|,abc|framex|gtk1|enst.67|pxc|h5g|,abc|frbx|hgk4|enst.39|pik|hoqg|

I want to search each row for specific keywords and, when a keyword is found, extract only the frame that contains it, that is, the data between the comma separators.

The keywords are:

enst.35
enst.18
enst.98
enst.63

The expected output is

abc|framex|gtk4|enst.35|pxc|h5g|
abc|frbx|hgk4|enst.18|pif|homg|
abc|frame|gtk|enst.98|pc|hg|
NA

I tried this here, but it did not work effectively.

Shubham Sharma · Accepted Answer · 2020-06-06 09:05:53Z

1

You can construct a regex pattern using the given keywords then use Series.str.findall to find all occurrences of regex in series:

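import re

keywords = ['enst.35', 'enst.18', 'enst.98', 'enst.63']
# One alternative per keyword; [^,]+ keeps each match inside a single comma-separated frame.
pattern = '|'.join([rf'[^,]+{re.escape(k)}[^,]+' for k in keywords])
# Take the first match in each row; rows without a match become NaN.
result = df['col'].str.findall(pattern).str.get(0)

# print(result)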

0    abc|framex|gtk4|enst.35|pxc|h5g|
1     abc|frbx|hgk4|enst.18|pif|homg|
2        abc|frame|gtk|enst.98|pc|hg|
3                                 NaN
Name: col, dtype: object

You can test the regex pattern here
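
To see what the joined pattern does outside pandas, here is a minimal sketch that applies the same pattern with the standard `re` module to the sample rows from the question. The `rows` list and the `first_hits` name are illustrative assumptions; `Series.str.findall` is documented as equivalent to applying `re.findall` to every element, so the matches should agree, except that pandas reports NaN where this sketch uses `None`.

```python
import re

# Sample rows copied from the question (held in a plain list here, not a DataFrame).
rows = [
    'abc|frame|gtk|enst.24|pc|hg|,abc|framex|gtk4|enst.35|pxc|h5g|,abc|frbx|hgk4|enst.23|pix|hokg|',
    'abc|frame|gtk|enst.15|pc|hg|,abc|framex|gtk2|enst.59|pxc|h5g|,abc|frbx|hgk4|enst.18|pif|homg|',
    'abc|frame|gtk|enst.98|pc|hg|,abc|framex|gtk1|enst.45|pxc|h5g|,abc|frbx|hgk4|enst.74|pig|hofg|',
    'abc|frame|gtk|enst.34|pc|hg|,abc|framex|gtk1|enst.67|pxc|h5g|,abc|frbx|hgk4|enst.39|pik|hoqg|',
]
keywords = ['enst.35', 'enst.18', 'enst.98', 'enst.63']

# re.escape turns the '.' in each keyword into a literal dot;
# [^,]+ on both sides stops the match at the surrounding commas.
pattern = '|'.join(rf'[^,]+{re.escape(k)}[^,]+' for k in keywords)

# Keep the first match per row, or None when no keyword occurs in the row.
first_hits = []
for row in rows:
    hits = re.findall(pattern, row)
    first_hits.append(hits[0] if hits else None)

for hit in first_hits:
    print(hit)
```

Note that this is a substring match, so `enst.35` would also hit a field such as `enst.350`. If that matters for your data, one option is to require a literal `|` on both sides of the keyword in the pattern.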

edited Jun 6, 2020 at 9:05

answered Jun 6, 2020 at 8:50


user2110417 · Accepted Answer · 2020-06-06 09:01:27Z

0

You can try a Bash script as follows:

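# Put each frame on its own line, then print the frames containing each keyword (NA if none).
for STRING in enst.35 enst.18 enst.98 enst.63; do
  # -F matches the keyword as a fixed string, so '.' is not treated as a regex wildcard.
  tr ',' '\n' < file.txt | grep -F -- "$STRING" || echo NA
done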
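
The loop above works per keyword rather than per row: it searches every frame in the file for each keyword in turn. A rough Python sketch of the same idea follows. The function name `frames_for_keywords` and the `sample_text` variable are hypothetical, and the sketch uses a plain substring test instead of a regular expression.

```python
def frames_for_keywords(text, keywords):
    """Map each keyword to the frames containing it, or ['NA'] if none do."""
    # Treat both commas and line breaks as frame separators and skip empty pieces.
    frames = [f for line in text.splitlines() for f in line.split(',') if f]
    result = {}
    for kw in keywords:
        hits = [f for f in frames if kw in f]  # plain substring test
        result[kw] = hits or ['NA']
    return result


# Sample data from the question, standing in for the contents of the input file.
sample_text = (
    'abc|frame|gtk|enst.24|pc|hg|,abc|framex|gtk4|enst.35|pxc|h5g|,abc|frbx|hgk4|enst.23|pix|hokg|\n'
    'abc|frame|gtk|enst.15|pc|hg|,abc|framex|gtk2|enst.59|pxc|h5g|,abc|frbx|hgk4|enst.18|pif|homg|\n'
    'abc|frame|gtk|enst.98|pc|hg|,abc|framex|gtk1|enst.45|pxc|h5g|,abc|frbx|hgk4|enst.74|pig|hofg|\n'
    'abc|frame|gtk|enst.34|pc|hg|,abc|framex|gtk1|enst.67|pxc|h5g|,abc|frbx|hgk4|enst.39|pik|hoqg|\n'
)

found = frames_for_keywords(sample_text, ['enst.35', 'enst.18', 'enst.98', 'enst.63'])
for keyword, frames in found.items():
    print(keyword, frames)
```

Because the lookup is per keyword, a keyword that appears in several rows yields several frames, which differs from the per-row pandas approaches in the other answers.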

answered Jun 6, 2020 at 9:01


David Erickson · Accepted Answer · 2020-06-06 09:40:39Z

0

With str.extract to capture everything up to and including the matching frame, and str.split(',') to take the last comma-separated value:

pattern = r'(^.*enst\.35\|.+?\|.+?\||^.*enst\.18\|.+?\|.+?\||^.*enst\.98\|.+?\|.+?\||^.*enst\.63\|.+?\|.+?\|)'
df['Data2'] = df['Data'].str.extract(pattern, expand=False).str.split(',').str[-1]

You could also keep the keywords in a list and build the pattern with a list comprehension, as in another answer.
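
A minimal sketch of that list-based variant, assuming the same keyword list as in the question. The helper `extract_frame` and the `rows` list are hypothetical and use the standard `re` module so the pattern can be checked without a DataFrame.

```python
import re

keywords = ['enst.35', 'enst.18', 'enst.98', 'enst.63']

# One alternative per keyword: everything up to the keyword,
# then the two pipe-terminated fields that follow it.
alternatives = [rf'^.*{re.escape(k)}\|.+?\|.+?\|' for k in keywords]
pattern = '(' + '|'.join(alternatives) + ')'


def extract_frame(cell):
    """Return the frame containing a keyword, or None if no keyword occurs."""
    m = re.search(pattern, cell)
    if m is None:
        return None
    # The match runs from the start of the cell, so keep only the last frame.
    return m.group(1).split(',')[-1]


# Sample rows copied from the question.
rows = [
    'abc|frame|gtk|enst.24|pc|hg|,abc|framex|gtk4|enst.35|pxc|h5g|,abc|frbx|hgk4|enst.23|pix|hokg|',
    'abc|frame|gtk|enst.15|pc|hg|,abc|framex|gtk2|enst.59|pxc|h5g|,abc|frbx|hgk4|enst.18|pif|homg|',
    'abc|frame|gtk|enst.98|pc|hg|,abc|framex|gtk1|enst.45|pxc|h5g|,abc|frbx|hgk4|enst.74|pig|hofg|',
    'abc|frame|gtk|enst.34|pc|hg|,abc|framex|gtk1|enst.67|pxc|h5g|,abc|frbx|hgk4|enst.39|pik|hoqg|',
]
extracted = [extract_frame(r) for r in rows]
for frame in extracted:
    print(frame)
```

Since the pattern has a single capture group, the same string should also work with the pandas chain from this answer, i.e. df['Data'].str.extract(pattern, expand=False).str.split(',').str[-1], where the column name 'Data' follows the answer.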

edited Jun 6, 2020 at 9:40

answered Jun 6, 2020 at 9:21
