
The | symbol in regular expressions seems to divide the entire pattern, but I need it to divide only a smaller part of the pattern. I want to find a match that starts with either "Q: " or "A: " and ends just before the next "Q: " or "A: ". In between can be anything, including newlines.

My attempt:

import re

string = "Q: This is a question. \nQ: This is a 2nd question \non two lines. \n\nA: This is an answer. \nA: This is a 2nd answer \non two lines.\nQ: Here's another question. \nA: And another answer."

# Raw string so that \w and \W reach the regex engine unchanged
pattern = re.compile(r"(A: |Q: )[\w\W]*(A: |Q: |$)")

matches = pattern.finditer(string)
for match in matches:
    print('-', match.group(0))

The regex I am using is (A: |Q: )[\w\W]*(A: |Q: |$).

Here is the same string over multiple lines, just for reference:

Q: This is a question. 
Q: This is a 2nd question 
on two lines. 

A: This is an answer. 
A: This is a 2nd answer 
on two lines.
Q: Here's another question. 
A: And another answer.

So I was hoping the parentheses would isolate the two possible patterns at the start and the three at the end, but instead it seems to treat them as 4 separate patterns. The match would also include the next A: or Q: at the end, but hopefully you can see what I was going for; I was planning to simply not use that group.

If it's helpful, this is for a simple study program that grabs the questions and answers from a text file to quiz the user. I was able to make it with the questions and answers being only one line each, but I'm having trouble getting an "A: " or "Q: " that has multiple lines.

  • Do you need to map each question to the right answer? Are they all following in set order? Commented Oct 28, 2021 at 14:39
  • @WiktorStribiżew Originally (when each Q: and A: was one line) I would go through and get all the Qs first into a list, then all the As. So the correct Qs and As would all have matching index numbers. Commented Oct 28, 2021 at 14:53
  • Then use two separate regexps and then zip the outputs. Use my regex I shared in the comments. I could provide an answer but I am on a mobile now. Commented Oct 28, 2021 at 15:01
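
A minimal sketch of the two-regex idea from the comment above, assuming the lookahead pattern discussed in this thread and the sample string from the question. The names Q_BLOCK, A_BLOCK and pairs are made up for illustration.

```python
import re

text = (
    "Q: This is a question. \n"
    "Q: This is a 2nd question \non two lines. \n\n"
    "A: This is an answer. \n"
    "A: This is a 2nd answer \non two lines.\n"
    "Q: Here's another question. \n"
    "A: And another answer."
)

# One pattern per marker; a block continues over any following line
# that does not itself start with "A:" or "Q:".
Q_BLOCK = re.compile(r"^Q:.*(?:\n(?![AQ]:).*)*", re.MULTILINE)
A_BLOCK = re.compile(r"^A:.*(?:\n(?![AQ]:).*)*", re.MULTILINE)

# rstrip() drops trailing spaces and blank lines captured at the end of a block
questions = [m.group(0).rstrip() for m in Q_BLOCK.finditer(text)]
answers = [m.group(0).rstrip() for m in A_BLOCK.finditer(text)]

# Pair the n-th question with the n-th answer
pairs = list(zip(questions, answers))
for question, answer in pairs:
    print(repr(question), "->", repr(answer))
```

Note that zip pairs entries purely by position and stops at the shorter list, so this only works if the file really contains the same number of questions and answers in matching order.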

2 Answers


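One approach is to use a negative lookahead, (?!...), so that a newline is only consumed when it is not followed by the start of the next A: or Q: block. Note that ^ needs the re.MULTILINE flag (or an inline (?m)) to match at the start of each line: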

^([AQ]):(?:.|\n(?![AQ]:))+

You can also try it out here on the Regex Demo.

Here's another approach suggested by @Wiktor that should be a little faster:

^[AQ]:.*(?:\n+(?![AQ]:).+)*

A slight modification matches \n and .* instead of \n+ and .+ (note that this also captures blank lines at the end of a block):

^[AQ]:.*(?:\n(?![AQ]:).*)*
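
Here is a short sketch of how the last pattern might be used from Python, run against the sample string from the question. The variable names are illustrative only, and the rstrip() call is an addition to drop the trailing spaces and blank lines that the pattern captures.

```python
import re

text = (
    "Q: This is a question. \n"
    "Q: This is a 2nd question \non two lines. \n\n"
    "A: This is an answer. \n"
    "A: This is a 2nd answer \non two lines.\n"
    "Q: Here's another question. \n"
    "A: And another answer."
)

# re.MULTILINE lets ^ match at the start of every line,
# not only at the very start of the string.
BLOCK = re.compile(r"^[AQ]:.*(?:\n(?![AQ]:).*)*", re.MULTILINE)

blocks = [m.group(0).rstrip() for m in BLOCK.finditer(text)]
for block in blocks:
    print("-", repr(block))
```

With the sample string this yields six blocks, and the two-line question and answer each stay in one piece with the inner newline preserved.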

10 Comments

I think (?m)^[AQ]:.*(?:\n(?![AQ]:).+)* would be a much faster pattern for what you tried to achieve with yours.
You should never use an alternation of . and whitespace/line break patterns. There are too many issues related to that pattern. Simply use re.DOTALL to make . match line breaks.
Maybe. In my regex, to match empty lines, you need to replace .+ with .*.
@WiktorStribiżew rv.kvetch Great! You've both been very helpful because I would definitely like to save blank lines if possible. Trailing ones don't matter but there might be some in the middle of a Q/A
I have just published a YT video about the evil (?:\s|.)* pattern.
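
For reference, one possible way to follow the re.DOTALL advice from the comments is sketched below. The exact pattern is an assumption and was not posted in this thread: it lets . match newlines, keeps the quantifier lazy, and stops at a lookahead for the next marker at a line start or the end of the input.

```python
import re

text = (
    "Q: This is a question. \n"
    "Q: This is a 2nd question \non two lines. \n\n"
    "A: This is an answer. \n"
    "A: This is a 2nd answer \non two lines.\n"
    "Q: Here's another question. \n"
    "A: And another answer."
)

# Hypothetical pattern: DOTALL makes . match "\n"; MULTILINE makes ^ match
# at each line start; \Z matches only at the very end of the string.
BLOCK = re.compile(r"^[AQ]: .*?(?=^[AQ]: |\Z)", re.DOTALL | re.MULTILINE)

# Each raw match includes the newline(s) before the next marker, so trim them
blocks = [m.group(0).rstrip() for m in BLOCK.finditer(text)]
for block in blocks:
    print("-", repr(block))
```

Blank lines in the middle of a block are kept, since nothing in the pattern treats them specially.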

I suggest just using a for-loop for this as it's easier for me at least. To answer your question, why not just target until the period rather than the next A: | Q:? You'd probably have to use lookaheads otherwise.

(A: |Q: )[\s\S]*?\.

[\s\S] matches any character, including newlines (this is the conventional spelling, though [\w\W] works as well).

*? is a lazy quantifier. It matches as few characters as it can. If we had just (A: |Q: )[\s\S]*?, then it'd only match the (A: |Q: ) part, but the trailing \. forces the match to extend up to the first period.

\. matches a literal period.
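
To make the difference concrete, here is a small sketch comparing the greedy pattern from the question with the lazy one above, using the question's sample string. The second input ("What is 3.14?") is a made-up example to show where stopping at a period can cut a block short.

```python
import re

text = (
    "Q: This is a question. \n"
    "Q: This is a 2nd question \non two lines. \n\n"
    "A: This is an answer. \n"
    "A: This is a 2nd answer \non two lines.\n"
    "Q: Here's another question. \n"
    "A: And another answer."
)

greedy = re.compile(r"(A: |Q: )[\w\W]*(A: |Q: |$)")
lazy = re.compile(r"(A: |Q: )[\s\S]*?\.")

# Greedy: [\w\W]* runs to the end of the string, and $ then matches there,
# so the whole text comes back as a single match.
greedy_matches = [m.group(0) for m in greedy.finditer(text)]

# Lazy: each match stops at the first period after the marker.
lazy_matches = [m.group(0) for m in lazy.finditer(text)]

print(len(greedy_matches))  # 1
print(len(lazy_matches))    # 6

# Hypothetical input where a period appears inside the question
tricky = [m.group(0) for m in lazy.finditer("Q: What is 3.14?\nA: Pi.")]
print(tricky)  # ['Q: What is 3.', 'A: Pi.']
```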

For the for-loop:

questions_and_answers = []
for line in string.splitlines():
    if line.startswith(("Q: ", "A: ")):
        questions_and_answers.append(line)
    else:
        questions_and_answers[-1] += line

# ['Q: This is a question. ', 'Q: This is a 2nd question on two lines. ', 'A: This is an answer. ', 'A: This is a 2nd answer on two lines.', "Q: Here's another question. ", 'A: And another answer.']
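
The loop above joins continuation lines without a separator and would raise an IndexError if the text did not start with a marker. A possible variant, sketched here with a hypothetical helper named split_blocks, keeps the newlines (including blank lines in the middle of a block) and skips any lines before the first marker.

```python
def split_blocks(text):
    """Group lines into blocks that each start with 'Q: ' or 'A: '."""
    blocks = []
    for line in text.splitlines():
        if line.startswith(("Q: ", "A: ")):
            # A marker line starts a new block
            blocks.append(line)
        elif blocks:
            # Continuation line: keep the line break that splitlines() removed
            blocks[-1] += "\n" + line
        # Lines before the first marker are ignored (an assumption of this sketch)
    # Trim trailing spaces and trailing blank lines from every block
    return [block.rstrip() for block in blocks]


sample = (
    "Q: This is a question. \n"
    "Q: This is a 2nd question \non two lines. \n\n"
    "A: This is an answer. \n"
    "A: This is a 2nd answer \non two lines.\n"
    "Q: Here's another question. \n"
    "A: And another answer."
)

for block in split_blocks(sample):
    print("-", repr(block))
```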

2 Comments

Unfortunately I can't use a period because the text might not include a period. But according to you and the other answer it seems like lookaheads are what I was looking for, so thank you.
Actually your alternative without re is very good! I'll give it a try. I want to upvote your answer but unfortunately I don't have enough points apparently.
