Regex to match specific strings but only first string on new line

Question

Using Python regex, I'm trying to scrape some Behat scenarios. Here is a regex: https://regex101.com/r/EGdK3O/1 (Scenario:([\s\S]*?)(And|When|Then|Given)).

The current version of my code is items = re.findall(r'Scenario:([\s\S]*?)(And|When|Then|Given|#)', contents, re.MULTILINE). This works, except when one of these strings is in the scenario.

What I'm having trouble figuring out is how to match (And|When|Then|Given) only when it is the first string on a new line. Even better would be if I could also match when that line starts with a tab or a number of spaces.

The ultimate goal here is to get the Scenario description but not the steps.

Do you need the performance of a regex? If not, have you considered writing a simple stateful parser? It might be easier to understand later on. Having said that, I have to admit that I don't fully understand what your expected outcome is from the question.
– exhuma, Commented Sep 26, 2019 at 15:22
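
To make the stateful-parser suggestion concrete, here is a minimal sketch using only the standard library. It assumes each scenario starts with a line beginning with `Scenario:` and that its description ends at the first line whose first word is one of the four keywords from the question (or that starts with `#`, as in the asker's own pattern). The function name `scenario_descriptions` and the sample text are made up for illustration.

```python
STEP_KEYWORDS = ("Given", "When", "Then", "And")  # keywords from the question


def scenario_descriptions(text):
    """Return (title, description) pairs for each scenario, skipping steps."""
    results = []
    title = None        # title of the scenario currently being read
    description = []    # free-text lines collected for that scenario
    in_steps = False    # True once the first step line has been seen

    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("Scenario:"):
            if title is not None:
                results.append((title, "\n".join(description)))
            title = stripped[len("Scenario:"):].strip()
            description = []
            in_steps = False
        elif title is not None and not in_steps and stripped:
            # Only the FIRST word of the line decides whether it is a step,
            # so keywords appearing later in the line are ignored.
            first_word = stripped.split(None, 1)[0]
            if first_word in STEP_KEYWORDS or stripped.startswith("#"):
                in_steps = True
            else:
                description.append(stripped)

    if title is not None:
        results.append((title, "\n".join(description)))
    return results


sample = """Feature: Login

  Scenario: Login works When the password is valid
    Users are Given a session And redirected.
    Given I am on the login page
    When I submit valid credentials
    Then I see the dashboard

  Scenario: Logout
    When I click logout
    Then I see the login page
"""

for scenario_title, scenario_text in scenario_descriptions(sample):
    print(repr(scenario_title), repr(scenario_text))
```

Because only the first word of each line is inspected, a keyword inside the title or the description does not end the description early, which is the case the original regex trips over.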

zmo · Accepted Answer · 2019-09-26 15:32:06Z

You could probably parse the Behat language with a very complex regex, but this is a typical case of 'I had one problem, I used a regex, now I have two problems'.

Instead of losing your mind trying to solve this with a regex, you would be better off using a library that can read and parse the Behat language.

The reason is that regular expressions are great for simple string-parsing problems (working with the tokens of a language). Parsing a complex language is more abstract, even though extended regexes can sometimes manage it: you need to look not only at the tokens (the words) but also at the grammar (the syntax and its meaning).

A typical issue (the one you're facing) is a word whose meaning depends on its context, and that is exactly what a grammar helps with. And even if you work out the first step of splitting out the scenarios, you're likely to hit a similar issue when you look inside each scenario.

So that's why you need a full-blown parser… But writing a parser is not easy (the most complex part being writing the grammar). So if you're lucky, someone else has done it for you!

And you're lucky! According to the Behat documentation, the language used is called Gherkin. With some googling, I found at least one Python package that understands that language: cucumber/gherkin-python, which has now moved to the cucumber/cucumber repository.

The snippet to use the parser is the following:

from gherkin.parser import Parser
from gherkin.pickles.compiler import compile

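# Parse Gherkin source text into a document structure (AST)
parser = Parser()
gherkin_document = parser.parse("Feature: ...")
# Compile the document into pickles (flattened scenarios)
pickles = compile(gherkin_document)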

Then you'll get structured data that you can navigate easily in Python.
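
As a rough sketch of that navigation, the helper below pulls each scenario's name and description out of a parsed document. It assumes `gherkin_document` is a plain dict whose `feature.children` list wraps each scenario under a `scenario` key with `name` and `description` fields. That layout is an assumption about the installed parser version and may differ in other releases, so inspect the real output first. The helper `scenario_summaries` and the hand-written `sample_document` are hypothetical stand-ins, not real parser output.

```python
def scenario_summaries(gherkin_document):
    """Collect (name, description) for each scenario in a parsed document."""
    feature = gherkin_document.get("feature") or {}
    summaries = []
    for child in feature.get("children", []):
        scenario = child.get("scenario")
        if scenario is None:
            continue  # skip non-scenario entries such as a background
        description = scenario.get("description", "").strip()
        summaries.append((scenario.get("name", ""), description))
    return summaries


# Hand-written stand-in for parser output (assumed shape, not real output).
sample_document = {
    "feature": {
        "name": "Login",
        "children": [
            {"background": {"name": "", "steps": []}},
            {"scenario": {
                "name": "Login works When the password is valid",
                "description": "    Users are Given a session And redirected.",
                "steps": [{"keyword": "Given ", "text": "I am on the login page"}],
            }},
        ],
    }
}

for name, description in scenario_summaries(sample_document):
    print(name, "|", description)
```

With a parsed structure there is no need to guess where the steps begin, since they sit in their own field next to the description.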

You are correct and I'm very grateful you chimed in with this! 2 minutes and now I have beautiful json that I can use to build my UI.

The fourth bird · Accepted Answer · 2019-09-26 15:32:36Z

You could match Scenario: followed by a capturing group that first matches the rest of that line (.* does not match a newline).

Inside that same capturing group, repeat matching the following lines as long as they do not start with 1+ tabs or spaces followed by (And|When|Then|Given). After the capturing group, match the start of the line that does begin with one of those options.

\bScenario:(.*(?:\r?\n(?![ \t]+(And|[WT]hen|Given)).*)*)\r?\n[ \t]+(?:And|[WT]hen|Given)

\bScenario: Match Scenario: prepended by a word boundary
( Capture group 1
- .* Match 0+ times any char except a newline
- (?: Non capturing group
  - \r?\n Match a newline
  - (?! Negative lookahead, assert that what is directly to the right is not [ \t]+(And|[WT]hen|Given), i.e. 1+ spaces or tabs followed by 1 of the options
  - ).* Close lookahead and match 0+ times any char except a newline
- )* Close group and repeat 0+ times
) Close capture group
\r?\n[ \t]+ Match a newline and 1+ spaces or tabs
(?:And|[WT]hen|Given) Match any of the listed
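
For reference, this is roughly how the pattern could be used from Python. The pattern is the one above, unchanged. Since it contains a second capturing group inside the lookahead, the sketch uses `re.finditer` and reads `group(1)` instead of `re.findall`, which would return tuples here. The function name `scenario_headers` and the sample text are made up for illustration.

```python
import re

# The pattern from this answer, split over two raw string literals.
PATTERN = re.compile(
    r"\bScenario:(.*(?:\r?\n(?![ \t]+(And|[WT]hen|Given)).*)*)"
    r"\r?\n[ \t]+(?:And|[WT]hen|Given)"
)


def scenario_headers(contents):
    """Return the title plus description of each scenario, without steps."""
    headers = []
    for match in PATTERN.finditer(contents):
        # Group 1 holds the title line and any description lines.
        lines = [line.strip() for line in match.group(1).splitlines()]
        headers.append("\n".join(line for line in lines if line))
    return headers


sample = """Feature: Login

  Scenario: Login works When the password is valid
    Users are Given a session And redirected.
    Given I am on the login page
    When I submit valid credentials
    Then I see the dashboard

  Scenario: Logout
    When I click logout
    Then I see the login page
"""

for header in scenario_headers(sample):
    print(header)
    print("---")
```

Note that the pattern requires at least one space or tab before the step keyword, so steps written flush against the left margin would not end the match.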

