Selecting all lines/strings that fall between pattern in text file

Question

Given a text file that looks like this when loaded:

>rice1 1ALBRGHAER
NNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNN
>peanuts2 2LAEKaq
SSSSSSSSSSS
>OIL3 3hkasUGSV
ppppppppppppppppppppp
ppppppppppppppppppppp

How can I extract all the lines that fall between lines containing '>', including the final block, which has no closing '>' after it?

For example, the result should look like this:

result = ['NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN','SSSSSSSSSSS','pppppppppppppppppppppppppppppppppppppppppp']

I realize that what I did won't work, because it's looking for text between each newline and the next '>'. Running this just gives me empty strings.

import re

def findtext(inputtextfile, start, end):
    try:
        pattern = rf'{start}(.*?){end}'
        return re.findall(pattern, inputtextfile)
    except ValueError:
        return -1

result = findtext(inputtextfile, "\n", ">")
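
As a rough illustration of why this attempt returns empty strings, here is a minimal sketch on a shortened copy of the sample (the shorter rows are an assumption made only to keep the output readable). By default `.` does not match a newline, so the lazy group can only succeed where '>' directly follows a '\n'. Adding `re.DOTALL` lets the group span lines, but the last block is still missed because no '>' follows it.

```python
import re

# Shortened copy of the sample data (row lengths reduced for readability)
text = (
    ">rice1 1ALBRGHAER\n"
    "NNNN\n"
    "NNNN\n"
    ">peanuts2 2LAEKaq\n"
    "SSS\n"
    ">OIL3 3hkasUGSV\n"
    "pppp\n"
    "pppp"
)

# Default flags: '.' stops at newlines, so only "\n>" pairs match
no_dotall = re.findall(r"\n(.*?)>", text)
print(no_dotall)    # ['', '']

# re.DOTALL: the group may span lines, but the final block has no closing '>'
with_dotall = re.findall(r"\n(.*?)>", text, flags=re.DOTALL)
print(with_dotall)  # ['NNNN\nNNNN\n', 'SSS\n']
```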

You can try >.*\s*([^>]+), extract the contents of group 1 from each match, and store them in a list.
– Gurmanjot Singh, Commented Sep 21, 2022 at 16:07
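
A minimal sketch of how the pattern from this comment might be applied, using a shortened copy of the sample data. The comment only gives the pattern; the use of `re.finditer` and the whitespace removal with `split()` are assumptions added here so that the rows of each block end up joined.

```python
import re

# Shortened copy of the sample data (row lengths reduced for readability)
text = (
    ">rice1 1ALBRGHAER\n"
    "NNNN\n"
    "NNNN\n"
    ">peanuts2 2LAEKaq\n"
    "SSS\n"
    ">OIL3 3hkasUGSV\n"
    "pppp\n"
    "pppp"
)

# '>.*' consumes the header line, '\s*' the newline after it, and
# group 1 captures everything up to the next '>' (or the end of the text)
pattern = r">.*\s*([^>]+)"

# Group 1 still contains newlines, so remove all whitespace from each block
result = ["".join(m.group(1).split()) for m in re.finditer(pattern, text)]
print(result)  # ['NNNNNNNN', 'SSS', 'pppppppp']
```

Because group 1 is defined as "anything except '>'", this sketch assumes that '>' never appears inside the data rows.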

Chris · Accepted Answer · 2022-09-21 16:04:49Z

1

Maybe try splitting on the rows that start with >. That way you get back a list of the data between the headers, and you can join each piece after removing the \n characters.

s = """>rice1 1ALBRGHAER
NNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNN
>peanuts2 2LAEKaq
SSSSSSSSSSS
>OIL3 3hkasUGSV
ppppppppppppppppppppp
ppppppppppppppppppppp"""

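def findtext(inputtextfile, start, end):
    import re
    try:
        # Split on each header line (start ... end) and drop the empty pieces
        parts = filter(None, re.split(f'{start}.*{end}', inputtextfile))
        # Remove the remaining newlines so each block becomes one string
        return [x.replace('\n', '') for x in parts]
    except ValueError:
        return -1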

Trying with your provided case

findtext(s, '>','\n')

Output

['NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN',
 'SSSSSSSSSSS',
 'pppppppppppppppppppppppppppppppppppppppppp']
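
For comparison, the same split-and-join idea can be sketched without a regular expression by walking the text line by line. This is a different technique from the `re.split` call above, not a restatement of it; the function name `join_blocks` and the `marker` parameter are hypothetical, and the sample is shortened for readability.

```python
def join_blocks(text, marker=">"):
    """Join the rows that follow each header line into one string per block."""
    blocks = []
    current = None  # stays None until the first header line is seen
    for line in text.splitlines():
        if line.startswith(marker):
            # A new header closes the previous block, if there was one
            if current is not None:
                blocks.append("".join(current))
            current = []
        elif current is not None:
            current.append(line.strip())
    # The last block has no header after it, so flush it explicitly
    if current is not None:
        blocks.append("".join(current))
    return blocks


sample = ">rice1 1ALBRGHAER\nNNNN\nNNNN\n>peanuts2 2LAEKaq\nSSS\n>OIL3 3hkasUGSV\npppp\npppp"
print(join_blocks(sample))  # ['NNNNNNNN', 'SSS', 'pppppppp']
```

One difference in behavior: rows that appear before the first header are ignored here, and a header with no rows after it produces an empty string.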

answered Sep 21, 2022 at 16:04

Chris


The fourth bird · Accepted Answer · 2022-09-21 21:04:58Z

1

One option is to use re.split on the lines that start with > and then remove all whitespace characters from each part.

text = (">rice1 1ALBRGHAER\n"
     "NNNNNNNNNNNNNNNNNNNNN\n"
     "NNNNNNNNNNNNNNNNNNNNN\n"
     ">peanuts2 2LAEKaq\n"
     "SSSSSSSSSSS\n"
     ">OIL3 3hkasUGSV\n"
     "ppppppppppppppppppppp\n"
     "ppppppppppppppppppppp")


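def findtext(inputtextfile):
    import re

    # With re.M, ^ matches at the start of every line, so this matches each header line
    pattern = r"^>.*"

    try:
        # Pass flags by keyword; positional maxsplit/flags are deprecated in newer Python versions
        return [re.sub(r"\s+", "", s) for s in re.split(pattern, inputtextfile, flags=re.M) if s]
    except ValueError:
        return -1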


print(findtext(text))

Output (formatted a bit)

[
  'NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN',
  'SSSSSSSSSSS',
  'pppppppppppppppppppppppppppppppppppppppppp'
]
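
Since the question starts from a text file, here is a hedged sketch of how the same split might be applied to a file on disk. The function name `blocks_from_file`, the temporary file, and its shortened contents are assumptions made for illustration only.

```python
import re
import tempfile
from pathlib import Path


def blocks_from_file(path):
    """Return the joined rows of each block that follows a '>' header line."""
    content = Path(path).read_text()
    # With re.M, ^ anchors at each line start, so every header line is a split point
    parts = re.split(r"^>.*", content, flags=re.M)
    # Skip pieces that hold only whitespace, then strip whitespace from the rest
    return [re.sub(r"\s+", "", part) for part in parts if part.strip()]


# Demo with a throwaway file holding a shortened version of the sample
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"
    sample_path.write_text(">a 1\nNN\nNN\n>b 2\nSS\n>c 3\npp\npp\n")
    result = blocks_from_file(sample_path)

print(result)  # ['NNNN', 'SS', 'pppp']
```

Filtering with `part.strip()` rather than a plain truthiness check is a small design choice: a header that is immediately followed by another header is then skipped instead of contributing an empty string.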

See a Python demo.

edited Sep 21, 2022 at 21:04

answered Sep 21, 2022 at 16:15

The fourth bird
