Introduction
I am trying to use a regex to parse information that is structured like this:
1. Data
A. Data sub 1
B. Data sub 2
2. Data
A. Data sub 1
B. Data sub 2
C. Data sub 3
D. Data sub 4
3. Data
A. Data sub 1
Each piece of information is on its own line. I could go line by line, but I believe a single regex should be enough to solve this.
Intention
I would like to extract it block by block, where a block would be:
1. Data
A. Data sub 1
B. Data sub 2
My attempt
I noticed that there is a "pattern" in this data and thought that I could try to extract it using the following regex:
(?s)(?=1.)(.*?)(?=(2. ))
This successfully extracts a block, but if the block itself contains a number that the expression treats as the end marker, the extracted block is incomplete and corrupts the output file.
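To make the failure concrete, here is a minimal reproduction. The sample text is hypothetical: I added a sub-item containing "2, " to trigger the problem, and I use re.search with group(1) only for illustration.

```python
import re

# Hypothetical sample: sub-item "B." contains "2, ", which the pattern treats
# as the end marker because the unescaped "." matches any character.
text = "1. Data\nA. Data sub 1\nB. Total 2, maybe more\n2. Data\nA. Data sub 1"

# The pattern from my attempt, unchanged.
pattern = r'(?s)(?=1.)(.*?)(?=(2. ))'

match = re.search(pattern, text)
truncated = match.group(1)
print(repr(truncated))
```

The captured block stops in the middle of the "B." line, just before "2, maybe more", instead of running up to the line that starts with "2. ". Escaping the dots alone would not fully fix this, since a literal "2. " could still appear inside a sub-item.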
What I expect
I would like to extract the data blocks without the match being cut short by a string or character that appears between the defined start and end.
import re
blocks = re.findall(r'(?m)^\d.+(?:\n .+)*', text)
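The pattern above only continues a block over lines that begin with a space. Below is a sketch of a variant that does not depend on indentation. It assumes that every block header starts with digits followed by a dot and that no sub-item line starts that way. The names BLOCK_RE and blocks are just for illustration.

```python
import re

# Sample input copied from the question (no indentation on sub-items).
text = """1. Data
A. Data sub 1
B. Data sub 2
2. Data
A. Data sub 1
B. Data sub 2
C. Data sub 3
D. Data sub 4
3. Data
A. Data sub 1"""

# A block starts at a line beginning with digits and a literal dot, and
# continues over every following line that does NOT start that way.
# Only the start of each line is checked, so numbers inside a line are ignored.
BLOCK_RE = re.compile(r'^\d+\..*(?:\n(?!\d+\.).*)*', re.MULTILINE)

blocks = BLOCK_RE.findall(text)
for block in blocks:
    print(block)
    print('---')
```

Because the negative lookahead is tested only right after a newline, a sub-item such as "B. Data sub 2" cannot end the block early. If sub-items can themselves begin with digits and a dot, this assumption breaks and the header pattern would need to be made stricter.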