I am trying to extract the lines of text that fall between marker lines, where the start markers and the end markers come from two separate lists.

For example:
start = ['intro','Intro','[intro','Introduction',(intro)]
end = ['P1','P2','[P1','[P2']

Input:
intro
L1
L2
P1
L3
L4
[intro]
L5
L6

Expected Output:
L1
L2
L5
L6

How can I achieve this? I have tried:

text = 'I want to find a string between two substrings'
start = 'find a '
end = 'between two'

print(text[text.index(start)+len(start):text.index(end)])

However, I want output like the first example above, where start and end are lists.

  • Can you explain it properly? It's hard to understand what you want. Commented Apr 16, 2019 at 16:20
  • How come your start list isn't producing an error? Commented Apr 16, 2019 at 16:23
  • Your example code text[text.index(start)+len(start):text.index(end)] should output "string". Are you expecting a different output? Also, how does that example relate to the list of input and output posted above it? Commented Apr 16, 2019 at 16:24
  • @benvc, I believe start and end are lists, as described in the first example. Commented Apr 16, 2019 at 16:25

1 Answer

Quick and dirty example based on your second example:

text = 'I want to find a string between two substrings'
start = 'find a '
end = 'substrings'

s_idx = text.index(start) + len(start) if start in text else -1

e_idx = text.index(end) if end in text else -1

if s_idx > -1 and e_idx > -1:
    print(text[s_idx:e_idx])

You have to check that the substring is actually part of the string first; otherwise str.index() raises a ValueError.
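As an alternative to the membership check, here is a minimal sketch using str.find(), which returns -1 instead of raising when the substring is missing. The helper name `between` is made up for illustration. It also looks for the end marker only after the start marker, which the snippet above does not do.

```python
def between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    s_idx = text.find(start)          # -1 if `start` is not in `text`
    if s_idx == -1:
        return None
    s_idx += len(start)               # move past the start marker
    e_idx = text.find(end, s_idx)     # search only after the start marker
    if e_idx == -1:
        return None
    return text[s_idx:e_idx]


text = 'I want to find a string between two substrings'
print(repr(between(text, 'find a ', 'substrings')))
```

Returning None for a missing marker is just one possible convention; raising an exception or returning an empty string would work equally well.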

EDIT: Output based on first example:

start_list = ["work", "start", "also"]
end_list = ["of", "end", "substrings"]
text = "This can also work on a list of start and end substrings"

print("* Example with a list of start and end strings, stops on a first match")
print("- Text: {0}".format(text))
print("- Start: {0}".format(start_list))
print("- End: {0}".format(end_list))

s_idx = -1
for string in start_list:
    if string in text:
        s_idx = text.index(string) + len(string)
        # we're breaking on a first find.
        break

e_idx = -1
for string in end_list:
    if string in text:
        e_idx = text.index(string)
        # we're breaking on a first find.
        break

if e_idx > -1 and s_idx > -1:
    print(text[s_idx:e_idx])

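Or, if you want to go further and print the substring for every pairing of a found start string with a found end string (note that text.index() only reports the first occurrence of each string):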

print("* Example with a list of start and end strings, finds all matches")
print("- Text: {0}".format(text))
print("- Start: {0}".format(start_list))
print("- End: {0}".format(end_list))

s_idxs = []
e_idxs = []

for string in start_list:
    if string in text:
        s_idxs.append(text.index(string) + len(string))

for string in end_list:
    if string in text:
        e_idxs.append(text.index(string))


for s_idx in s_idxs:
    for e_idx in e_idxs:
        if e_idx <= s_idx:
            print("ignoring end index {0}, it's before our start at {1}!".format(e_idx, s_idx))
            # end index is lower than start index, ignoring it.
            continue

        print("{0}:{1} => {2}".format(s_idx, e_idx, text[s_idx:e_idx]))
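If what you need is every non-overlapping span from a start string to the nearest following end string, rather than every start/end pairing, a regular expression is one option. This is only a sketch: the function name `find_all_between` is hypothetical, and it assumes both lists are non-empty and that the shortest match after each start is the one you want.

```python
import re


def find_all_between(text, start_list, end_list):
    """Return the shortest span after each start marker, up to the next end marker."""
    # re.escape keeps characters such as '[' or '(' from being read as regex syntax.
    starts = "|".join(re.escape(s) for s in start_list)
    ends = "|".join(re.escape(e) for e in end_list)
    # (.*?) is non-greedy, so each match stops at the nearest end marker.
    pattern = re.compile("(?:{0})(.*?)(?:{1})".format(starts, ends), re.DOTALL)
    return [match.group(1) for match in pattern.finditer(text)]


print(find_all_between("a[x]b[y]c", ["["], ["]"]))
```

Because finditer() returns non-overlapping matches, a start string that falls inside an earlier match is skipped, which differs from the nested loops above.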

You can shorten and improve this code further; this is just a quick and dirty write-up.
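For the line-based input in the question, a small state machine may be closer to what is wanted. The sketch below makes several assumptions that the question does not confirm: a line counts as a marker if it starts with one of the listed strings (so '[intro]' matches '[intro'), marker lines themselves are dropped, and a section with no end marker runs to the end of the input. The function name `extract_sections` is made up, and '(intro)' is written as a string here.

```python
def extract_sections(lines, start_markers, end_markers):
    """Collect the lines that follow a start marker, up to the next end marker."""
    starts = tuple(start_markers)   # str.startswith accepts a tuple of prefixes
    ends = tuple(end_markers)
    collected = []
    capturing = False
    for line in lines:
        stripped = line.strip()
        if stripped.startswith(starts):
            capturing = True        # begin a section; the marker line is not kept
        elif stripped.startswith(ends):
            capturing = False       # close the current section
        elif capturing:
            collected.append(stripped)
    return collected


start = ['intro', 'Intro', '[intro', 'Introduction', '(intro)']
end = ['P1', 'P2', '[P1', '[P2']
data = "intro\nL1\nL2\nP1\nL3\nL4\n[intro]\nL5\nL6"
print("\n".join(extract_sections(data.splitlines(), start, end)))
```

Prefix matching is a loose test: a content line that happens to begin with "intro" or "P1" would also be treated as a marker, so compare against the whole line if the markers are exact.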


4 Comments

My question is: if start and end are each a list of words, how can I handle that situation?
You'll have to iterate through each list and basically do the same thing I did. If you need an example, I can provide one.
Yes, can you provide an example?
I have edited my answer and added two more examples.
