Introduction

I am trying to parse information using RegEx which is structured like this:

1. Data
  A. Data sub 1
  B. Data sub 2
2. Data
  A. Data sub 1
  B. Data sub 2
  C. Data sub 3
  D. Data sub 4  
3. Data
  A. Data sub 1

Each piece of information is on its own line. I could go line by line, but I believe a single regex should be enough to solve this.

Intention

I would like to extract it block by block, where a block would be:

1. Data
  A. Data sub 1
  B. Data sub 2

My attempt

I noticed that there is a "pattern" in this data and thought that I could try to extract it using the following regex:

(?s)(?=1.)(.*?)(?=(2. ))

This successfully extracts a block, but if the block itself contains text that also matches the end of the expression (for example a number followed by a dot and a space), the extracted block is incomplete and corrupts the output file.
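A minimal sketch of the failure described above. The sample string is made up for illustration (a sub-item that happens to contain "2. "), and the anchored pattern is only one possible adjustment, shown as an assumption rather than a tested solution for the real file.

```python
import re

# Hypothetical sample: the sub-item text contains "2. ", which the end lookahead also matches.
sample = "1. Data\n  A. Version 2. 0\n  B. Data sub 2\n2. Data\n  A. Data sub 1\n"

# The pattern from the attempt: nothing ties "2. " to the start of a line,
# and the unescaped dots match any character.
attempt = re.search(r"(?s)(?=1.)(.*?)(?=(2. ))", sample)
truncated = attempt.group(1)

# Possible adjustment (assumption): anchor both markers to the start of a line
# with (?m)^ and escape the dots so they only match a literal ".".
anchored = re.search(r"(?ms)^1\..*?(?=^2\. )", sample)
complete = anchored.group(0)

print(repr(truncated))  # stops in the middle of sub-item A
print(repr(complete))   # runs up to the line that starts with "2. "
```

The lazy `.*?` stops at the first place the end lookahead can match, so any "2. " inside the block cuts it short unless the lookahead is tied to the start of a line.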

What I expect

I would like to extract the data blocks without being interrupted by a string or char found between the defined start and end.

  • Why use a regex? Just read it line by line and accumulate until you see a digit in column 1. Commented Jun 27, 2022 at 20:29
  • More ideas: re.findall(r'(?m)^\d.+(?:\n .+)*', str) Commented Jun 27, 2022 at 21:15
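The two comments above can be sketched as follows. The helper name `split_blocks` and the shortened sample text are illustrative assumptions; the `re.findall` pattern is the one from the second comment, applied to a variable named `text` instead of `str`.

```python
import re

def split_blocks(text):
    """Group lines into blocks; a new block starts at a digit in column 1."""
    blocks = []
    current = []
    for line in text.splitlines():
        # A digit in the first column closes the block collected so far.
        if line[:1].isdigit() and current:
            blocks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        blocks.append("\n".join(current))
    return blocks

# Shortened version of the sample from the question.
text = "1. Data\n  A. Data sub 1\n  B. Data sub 2\n2. Data\n  A. Data sub 1\n3. Data\n  A. Data sub 1"

blocks = split_blocks(text)

# Regex alternative from the second comment: a line starting with a digit,
# followed by any number of lines that start with a space.
regex_blocks = re.findall(r"(?m)^\d.+(?:\n .+)*", text)

for block in blocks:
    print("--- NEW BLOCK ---")
    print(block)
```

On this sample both approaches give the same three blocks. The line-by-line version makes no assumption about how sub-items are indented, while the `findall` pattern relies on every sub-item line starting with a space.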

1 Answer

I would use re.split here, splitting on a newline (plus any whitespace before it) only if it is followed by digits and a dot (\d+\.):

text = '''1. Data
  A. Data sub 1
  B. Data sub 2
2. Data
  A. Data sub 1
  B. Data sub 2
  C. Data sub 3
  D. Data sub 4  
3. Data
  A. Data sub 1'''

import re

blocks = re.split(r'\s*\n(?=\d+\.)', text)

output:

['1. Data\n  A. Data sub 1\n  B. Data sub 2',
 '2. Data\n  A. Data sub 1\n  B. Data sub 2\n  C. Data sub 3\n  D. Data sub 4',
 '3. Data\n  A. Data sub 1']

In a loop:

for block in re.split(r'\s*\n(?=\d+\.)', text):
    print('--- NEW BLOCK ---')
    print(block)

output:

--- NEW BLOCK ---
1. Data
  A. Data sub 1
  B. Data sub 2
--- NEW BLOCK ---
2. Data
  A. Data sub 1
  B. Data sub 2
  C. Data sub 3
  D. Data sub 4
--- NEW BLOCK ---
3. Data
  A. Data sub 1
2 Comments

It's a good idea to use split; however, the \s* could make it a bit inefficient (see e.g. this longer string: 22000 steps vs. 460 without). That said, I doubt that could be a reason for a downvote (I'll remove this comment in 10 minutes).
@bobblebubble The \s* is not really necessary; I used it as a way to remove the trailing spaces. It will work fine with just \n(?=\d+\.), and the trailing spaces can be removed later if needed. You can leave your comment, the remark is quite interesting! Thanks for the feedback.
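A short sketch of the variant discussed in these comments: split on the simpler pattern, then strip trailing whitespace in a second step. The sample text is a shortened, made-up variation of the one in the answer, with trailing spaces added to two lines.

```python
import re

# Shortened sample; two lines end with trailing spaces on purpose.
text = "1. Data\n  A. Data sub 1  \n2. Data\n  A. Data sub 1\n  D. Data sub 4  \n3. Data\n  A. Data sub 1"

# Split on a newline only when the next line starts with digits and a dot.
raw_blocks = re.split(r"\n(?=\d+\.)", text)

# Remove the trailing whitespace of each block afterwards.
blocks = [block.rstrip() for block in raw_blocks]

print(blocks)
```

Without the leading \s*, the regex engine has less to backtrack over on long inputs, and `str.rstrip` handles the cleanup at the end of each block.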
