I am quite new to Python and trying out some new things. I have two lists in a dictionary, for example:

List1:                              List2:
Anterior                            cord
cuneate nucleus                     Medulla oblongata
nucleus                             Spinal cord
Intermediolateral nucleus           Spinal 
                                    sksdsj
british                             7

And I have some text lines as below:

<s id="5239778-2">The name refers collectively to the cuneate nucleus and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.</s>
<s id="3691284-1">In the medulla oblongata, the arcuate nucleus is a group of neurons located on the anterior surface of the medullary pyramids.</s>
<s id="21120-99">Anterior horn cells, motoneurons located in the spinal.</s>
<s id="1053949-16">The Anterior cord syndrome results from injury to the anterior part of the spinal cord, causing weakness and loss of pain and thermal sensations below the injury site but preservation of proprioception that is usually carried in the posterior part of the spinal cord.</s>
<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc.</s>

I need to return only those lines that contain a string from both list1 and list2. I tried the following code:

result = ""
if list1 in line and list2 in line:
    i1 = re.sub('(?i)(\s+)(%s)(\s+)'%list1, '\\1<e1>\\2</e1>\\3', line)
    i2 = re.sub('(?i)(\s+)(%s)(\s+)'%list2, '\\1<e2>\\2</e2>\\3', i1)
    result = result + i2 + "\n"
    continue

But I am getting the following result:

<s id="5239778-2">The name refers collectively to the <e1>cuneate nucleus</e1> and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.</s>
<s id="3691284-1">In the medulla oblongata, the arcuate <e1>nucleus</e1> is a group of neurons located on the anterior surface of the medullary pyramids.</s>
<s id="21120-99">Anterior horn cells, motoneurons located in the spinal.</s>
<s id="1053949-16">The <e1>Anterior</e1> <e2>cord</e2> syndrome results from injury to the <e1>anterior</e1> part of the spinal cord, causing weakness and loss of pain and thermal sensations below the injury site but preservation of proprioception that is usually carried in the posterior part of the spinal cord.</s>
<s id="69-7">...Meanwhile is the studio 7 album by British pop band 10cc.</s>

Here, only result line 4 matches strings from both lists, which is what I want. I don't want the lines that match a string from only one list, or from neither (e.g. result lines 1 and 3). Also, if a line matches strings from both lists, both matches should be tagged (e.g. result line 2).

Any kind of help will be greatly appreciated.

2 Answers

Basically, you want to put some words in <e1> tags and other words in <e2> tags. Is that right?

If so, then something like this will do:

#!/usr/bin/python

from __future__ import print_function
import re

text = '''\
<s id="5239778-2">The name refers collectively to the cuneate nucleus and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.</s>
<s id="3691284-1">In the medulla oblongata, the arcuate nucleus is a group of neurons located on the anterior surface of the medullary pyramids.</s>
<s id="21120-99">Anterior horn cells, motoneurons located in the spinal cord.</s>
<s id="1053949-16">The Anterior cord syndrome results from injury to the anterior part of the spinal cord, causing weakness and loss of pain and thermal sensations below the injury site but preservation of proprioception that is usually carried in the posterior part of the spinal cord.</s>'''

list1 = ('Anterior', 'cuneate nucleus', 'Intermediolateral nucleus')
list2 = ('cord', 'Medulla oblongata', 'Spinal cord')

# wrap each phrase in \b so that it matches whole words only;
# re.escape keeps any regex metacharacters in a phrase literal
re1 = re.compile("(%s)" % "|".join(r"\b%s\b" % re.escape(i) for i in list1), re.IGNORECASE)
re2 = re.compile("(%s)" % "|".join(r"\b%s\b" % re.escape(i) for i in list2), re.IGNORECASE)

for line in text.split("\n"):
    line = re1.sub(r"<e1>\1</e1>", line)
    line = re2.sub(r"<e2>\1</e2>", line)
    print(line)

Output:

<s id="5239778-2">The name refers collectively to the <e1>cuneate nucleus</e1> and gracile nucleus, which are present at the junction between the <e2>spinal cord</e2> and the <e2>medulla oblongata</e2>.</s>
<s id="3691284-1">In the <e2>medulla oblongata</e2>, the arcuate nucleus is a group of neurons located on the <e1>anterior</e1> surface of the medullary pyramids.</s>
<s id="21120-99"><e1>Anterior</e1> horn cells, motoneurons located in the <e2>spinal cord</e2>.</s>
<s id="1053949-16">The <e1>Anterior</e1> <e2>cord</e2> syndrome results from injury to the <e1>anterior</e1> part of the <e2>spinal cord</e2>, causing weakness and loss of pain and thermal sensations below the injury site but preservation of proprioception that is usually carried in the posterior part of the <e2>spinal cord</e2>.</s>
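
The loop above tags every line. If, as the question asks, only lines that match a phrase from both lists should be kept, one possible approach is to test both patterns before substituting. This is a minimal sketch, not part of the original answer; `build_pattern` and `tag_if_both` are hypothetical helper names, and the lists are the ones used above.

```python
import re

list1 = ('Anterior', 'cuneate nucleus', 'Intermediolateral nucleus')
list2 = ('cord', 'Medulla oblongata', 'Spinal cord')

def build_pattern(terms):
    # Hypothetical helper: longest phrases first, each escaped, whole words only.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    return re.compile(r"\b(%s)\b" % alternation, re.IGNORECASE)

re1 = build_pattern(list1)
re2 = build_pattern(list2)

def tag_if_both(line):
    """Return the tagged line if it matches both lists, otherwise None."""
    if not (re1.search(line) and re2.search(line)):
        return None
    line = re1.sub(r"<e1>\1</e1>", line)
    return re2.sub(r"<e2>\1</e2>", line)
```

Calling `tag_if_both(line)` for each line and skipping the `None` results keeps only the lines that contain a phrase from each list.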
5 Comments

I want to put exactly those strings from list1 in <e1> tags and those from list2 in <e2> tags, wherever they match strings in the lines.
Please consider that I also have numeric strings in the lists that I want to match. So I always need to skip the <s id="697"> part.
You extended the question after I gave you the answer. Accept this answer and post a new question then.
Updated the answer. Just added "\b" to match the whole words.
I mean that for each line I will take a single string from each list per query!
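
On the comment about numeric strings such as 7: one way to keep the `<s id="...">` part out of the match is to split each line into its opening tag, body, and closing tag, and substitute only inside the body. The following is a rough sketch under the assumption that every line has exactly the `<s id="...">...</s>` shape shown in the question; `SENTENCE` and `tag_body` are hypothetical names.

```python
import re

# Assumes each line looks like <s id="...">body</s>, as in the question.
SENTENCE = re.compile(r'^(<s id="[^"]*">)(.*)(</s>)$')

def tag_body(line, terms, tag):
    """Wrap whole-word matches of `terms` in <tag>...</tag>, body only."""
    m = SENTENCE.match(line)
    if m is None or not terms:
        return line  # leave unexpected lines untouched
    open_tag, body, close_tag = m.groups()
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(%s)\b" % "|".join(re.escape(t) for t in ordered),
        re.IGNORECASE,
    )
    body = pattern.sub(r"<%s>\1</%s>" % (tag, tag), body)
    return open_tag + body + close_tag
```

With this, `tag_body(line, ['7'], 'e2')` tags the standalone 7 in the sentence text but leaves the 7 inside `id="69-7"` alone.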

How about this:

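# list1 and list2 as given in the question
list1 = ['Anterior', 'cuneate nucleus', 'nucleus', 'Intermediolateral nucleus', 'british']
list2 = ['cord', 'Medulla oblongata', 'Spinal cord', 'Spinal', 'sksdsj', '7']
result = ""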
lines = ['<s id="5239778-2">The name refers collectively to the cuneate nucleus and gracile nucleus, which are present at the junction between the spinal cord and the medulla oblongata.</s>',
'<s id="3691284-1">In the medulla oblongata, the arcuate nucleus is a group of neurons located on the anterior surface of the medullary pyramids.</s>',
'<s id="21120-99">Anterior horn cells, motoneurons located in the spinal cord.</s>',
'<s id="1053949-16">The Anterior cord syndrome results from injury to the anterior part of the spinal cord, causing weakness and loss of pain and thermal sensations below the injury site but preservation of proprioception that is usually carried in the posterior part of the spinal cord.</s>']

for line in lines:
    for item1 in list1:
        if item1 in line:  # case-sensitive substring test
            for item2 in list2:
                if item2 in line:
                    # the line contains a string from each list: keep it
                    result = result + line + '\n'
                    break
            break
print(result)
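
The same check can be written more compactly with `any()`. This is a sketch rather than a drop-in replacement: it compares lowercased text, so "Medulla oblongata" also matches "medulla oblongata", and it still uses plain substring matching, so "cord" would also match inside "record". `matches_both` is a hypothetical name, and the lists are shortened versions of the ones in the question.

```python
list1 = ['Anterior', 'cuneate nucleus', 'Intermediolateral nucleus']
list2 = ['cord', 'Medulla oblongata', 'Spinal cord']

def matches_both(line, terms1, terms2):
    """True if the line contains a term from each list, ignoring case."""
    lowered = line.lower()
    return (any(t.lower() in lowered for t in terms1)
            and any(t.lower() in lowered for t in terms2))

def filter_lines(lines, terms1, terms2):
    # Keep only the lines that match both lists, in their original order.
    return [line for line in lines if matches_both(line, terms1, terms2)]
```

`"\n".join(filter_lines(lines, list1, list2))` then gives the same kind of `result` string as the loop above.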
