How to properly extract blocks of data from a file using my RegEx string?

Question

Introduction

I am trying to parse information using RegEx which is structured like this:

1. Data
  A. Data sub 1
  B. Data sub 2
2. Data
  A. Data sub 1
  B. Data sub 2
  C. Data sub 3
  D. Data sub 4  
3. Data
  A. Data sub 1

Each piece of information is on its own line. I could go line by line, but I believe a single RegEx should be enough to solve this.

Intention

I would like to extract it block by block, where a block would be:

1. Data
  A. Data sub 1
  B. Data sub 2

My attempt

I noticed that there is a "pattern" in this data and thought that I could try to extract it using the following RegEx string:

(?s)(?=1.)(.*?)(?=(2. ))

This successfully extracts a block, but if the block itself contains a number that also matches part of the expression, the extracted block is incomplete and corrupts the output file.
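
To make the failure concrete, here is a minimal sketch. It assumes the same pattern shape is reused for the second block by swapping the digits; that variant is an assumption for illustration and is not shown above. Two things go wrong: the unescaped . matches any character (including a newline under (?s)), and nothing ties the digits to the start of a line. The anchored pattern at the end is only one possible fix, included for comparison.

```python
import re

text = (
    "1. Data\n"
    "  A. Data sub 1\n"
    "  B. Data sub 2\n"
    "2. Data\n"
    "  A. Data sub 1\n"
    "  B. Data sub 2\n"
    "  C. Data sub 3\n"
    "  D. Data sub 4  \n"
    "3. Data\n"
    "  A. Data sub 1"
)

# Same shape as the pattern above, with the digits swapped to target block 2
# (hypothetical variant, for illustration only).
naive = re.search(r"(?s)(?=2.)(.*?)(?=(3. ))", text).group(1)

# Starts at the "2" of "Data sub 2" in block 1 and stops before the "3" of
# "Data sub 3", because "." also matches the newline that follows each digit.
print(repr(naive))

# One possible fix: escape the dot and anchor the digits to the start of a line.
anchored = re.search(r"(?ms)^2\. .*?(?=^\d+\. |\Z)", text).group(0)
print(repr(anchored))
```

The anchored result still carries the trailing spaces and final newline of the block; those can be stripped afterwards if they are not wanted.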

What I expect

I would like to extract the data blocks without being interrupted by a string or char found between the defined start and end.

Why use a regex? Just read it line by line and accumulate until you see a digit in column 1.
– Tim Roberts, Commented Jun 27, 2022 at 20:29
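
The line-by-line approach suggested in this comment could be sketched roughly as follows. The function name split_blocks and the sample input are hypothetical; the only rule taken from the comment is that a digit in column 1 starts a new block.

```python
def split_blocks(lines):
    """Group lines into blocks; a line whose first character is a digit starts a new block."""
    blocks = []
    current = []
    for line in lines:
        line = line.rstrip("\n")
        # A digit in column 1 closes the block collected so far.
        if line[:1].isdigit() and current:
            blocks.append("\n".join(current))
            current = []
        current.append(line)
    # Do not forget the last block, which has no following digit line.
    if current:
        blocks.append("\n".join(current))
    return blocks


sample = """1. Data
  A. Data sub 1
  B. Data sub 2
2. Data
  A. Data sub 1
3. Data
  A. Data sub 1"""

blocks = split_blocks(sample.splitlines())
for block in blocks:
    print("--- NEW BLOCK ---")
    print(block)
```

Because the function only iterates over lines, an open file object can be passed in place of sample.splitlines(), which avoids loading the whole file into memory.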

mozway · Accepted Answer · 2022-06-27 20:30:16Z

I would use re.split here, splitting on a newline if it is followed by one or more digits and a dot (\d+\.):

text = '''1. Data
  A. Data sub 1
  B. Data sub 2
2. Data
  A. Data sub 1
  B. Data sub 2
  C. Data sub 3
  D. Data sub 4  
3. Data
  A. Data sub 1'''

import re

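# split on a newline (and any whitespace just before it) that is followed by "<digits>."
# raw string, so the backslashes reach the regex engine unchanged
blocks = re.split(r'\s*\n(?=\d+\.)', text)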

output:

['1. Data\n  A. Data sub 1\n  B. Data sub 2',
 '2. Data\n  A. Data sub 1\n  B. Data sub 2\n  C. Data sub 3\n  D. Data sub 4',
 '3. Data\n  A. Data sub 1']

In a loop:

for block in re.split(r'\s*\n(?=\d+\.)', text):
    print('--- NEW BLOCK ---')
    print(block)

output:

--- NEW BLOCK ---
1. Data
  A. Data sub 1
  B. Data sub 2
--- NEW BLOCK ---
2. Data
  A. Data sub 1
  B. Data sub 2
  C. Data sub 3
  D. Data sub 4
--- NEW BLOCK ---
3. Data
  A. Data sub 1

2 Comments

bobble bubble Over a year ago

It's a good idea to use split; however, the \s* could make it a bit inefficient (see e.g. this longer string: 22000 steps vs 460 without). However, I doubt that could be a reason for a downvote (I'll remove this comment in 10 minutes).

mozway Over a year ago

@bobblebubble the \s* is not really necessary; I used it as a way to remove the trailing spaces. It will work fine with just \n(?=\d+\.). The trailing spaces can be removed later, if needed. You can leave your comment, the remark is quite interesting! Thanks for the feedback.
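
A minimal sketch of that simpler variant, splitting on \n(?=\d+\.) only and stripping trailing whitespace afterwards. The shortened sample text and the names raw_blocks and blocks are assumptions for illustration.

```python
import re

text = (
    "1. Data\n"
    "  A. Data sub 1\n"
    "2. Data\n"
    "  A. Data sub 1\n"
    "  D. Data sub 4  \n"  # note the two trailing spaces
    "3. Data\n"
    "  A. Data sub 1"
)

# Split on a newline only when "<digits>." follows; no leading \s* in the pattern.
raw_blocks = re.split(r"\n(?=\d+\.)", text)

# Remove trailing whitespace from each block in a separate step.
blocks = [block.rstrip() for block in raw_blocks]

for block in blocks:
    print(repr(block))
```

One small difference from the \s* version: rstrip() also trims trailing whitespace on the last block, which the split pattern alone would leave in place.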
