
Given a text file that looks like this when loaded:

>rice1 1ALBRGHAER
NNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNN
>peanuts2 2LAEKaq
SSSSSSSSSSS
>OIL3 3hkasUGSV
ppppppppppppppppppppp
ppppppppppppppppppppp

How can I extract the lines that fall between each pair of lines containing '>', as well as the lines after the last '>' line, which have no closing '>'?

For example, the result should look like this:

result = ['NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN','SSSSSSSSSSS','pppppppppppppppppppppppppppppppppppppppppp']

I realize that what I did won't work, because it's looking for text between each newline and the next '>'. Running this just gives me empty strings.

import re

def findtext(inputtextfile, start, end):
    try:
        # Non-greedy match of whatever sits between `start` and `end`
        pattern = rf'{start}(.*?){end}'
        return re.findall(pattern, inputtextfile)
    except ValueError:
        return -1

result = findtext(inputtextfile, "\n", ">")
Comment: You can try the pattern >.*\s*([^>]+), take the contents of group 1 from each match, and store them in a list. (Commented Sep 21, 2022 at 16:07)
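A minimal sketch of how the pattern from this comment might be applied. The helper name extract_blocks is hypothetical. Group 1 still contains the newlines inside each block, so this sketch assumes they should be stripped to get the joined strings the question asks for.

```python
import re

text = (
    ">rice1 1ALBRGHAER\n"
    "NNNNNNNNNNNNNNNNNNNNN\n"
    "NNNNNNNNNNNNNNNNNNNNN\n"
    ">peanuts2 2LAEKaq\n"
    "SSSSSSSSSSS\n"
    ">OIL3 3hkasUGSV\n"
    "ppppppppppppppppppppp\n"
    "ppppppppppppppppppppp"
)


def extract_blocks(data):
    # Hypothetical helper, not part of the original comment.
    # ">.*" consumes a header line, "\s*" the newline after it, and
    # "([^>]+)" captures everything up to the next ">" (or end of text).
    bodies = re.findall(r">.*\s*([^>]+)", data)
    # Each captured body still has embedded newlines; drop all whitespace.
    return ["".join(body.split()) for body in bodies]


result = extract_blocks(text)
print(result)
```

Because re.findall returns only the captured group when the pattern has exactly one group, no explicit loop over match objects is needed here.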

2 Answers


Maybe try splitting on the rows that start with >. That way you get back a list of the data between the headers, and you can join each piece into one string by removing its \n characters.

s = """>rice1 1ALBRGHAER
NNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNN
>peanuts2 2LAEKaq
SSSSSSSSSSS
>OIL3 3hkasUGSV
ppppppppppppppppppppp
ppppppppppppppppppppp"""

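import re

def findtext(inputtextfile, start, end):
    try:
        # Split on each header line (from `start` through `end`), drop the
        # empty pieces, and strip the newlines left inside each block.
        pieces = re.split(f'{start}.*{end}', inputtextfile)
        return [x.replace('\n', '') for x in filter(None, pieces)]
    except ValueError:
        return -1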

Trying it with your provided case:

findtext(s, '>','\n')

Output

['NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN',
 'SSSSSSSSSSS',
 'pppppppppppppppppppppppppppppppppppppppppp']

One option is to use re.split on every line that starts with >, and then remove all whitespace characters from the resulting parts.

text = (">rice1 1ALBRGHAER\n"
     "NNNNNNNNNNNNNNNNNNNNN\n"
     "NNNNNNNNNNNNNNNNNNNNN\n"
     ">peanuts2 2LAEKaq\n"
     "SSSSSSSSSSS\n"
     ">OIL3 3hkasUGSV\n"
     "ppppppppppppppppppppp\n"
     "ppppppppppppppppppppp")


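import re


def findtext(inputtextfile):
    # Match each whole header line; re.M lets ^ anchor at every line start.
    pattern = r"^>.*"

    try:
        # Split on the headers, skip empty pieces, and strip all whitespace
        # (including newlines) from each remaining block.
        parts = re.split(pattern, inputtextfile, flags=re.M)
        return [re.sub(r"\s+", "", s) for s in parts if s]
    except ValueError:
        return -1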


print(findtext(text))

Output (formatted a bit)

[
  'NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN',
  'SSSSSSSSSSS',
  'pppppppppppppppppppppppppppppppppppppppppp'
]

See a Python demo.
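If a regex is not required, the same split-on-header idea can be sketched as a plain loop over lines. This is an alternative under stated assumptions, not the method from either answer. The name collect_records is hypothetical, and the sketch assumes every header line starts with > and that any text before the first header should be ignored.

```python
def collect_records(lines):
    # Hypothetical helper: joins the lines that follow each ">" header.
    records = []
    current = None  # None until the first header has been seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the block collected so far.
            if current is not None:
                records.append("".join(current))
            current = []
        elif line and current is not None:
            current.append(line)
    # The last block has no closing header, so flush it after the loop.
    if current is not None:
        records.append("".join(current))
    return records


sample = """>rice1 1ALBRGHAER
NNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNN
>peanuts2 2LAEKaq
SSSSSSSSSSS
>OIL3 3hkasUGSV
ppppppppppppppppppppp
ppppppppppppppppppppp""".splitlines()

records = collect_records(sample)
print(records)
```

Since the function accepts any iterable of lines, an open file object could be passed in directly, which avoids loading a large file into memory at once. Unlike the re.split versions, a header with no lines after it produces an empty string in the result rather than being skipped.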
