Regex text between two strings

Question

I am trying to extract data fields from PDF texts using regex.

The text is:

"SAMPLE EXPERIAN CUSTOMER\n2288150 - EXPERIAN SAMPLE REPORTS\nData Dictionary Report\nFiltered By:\nCustom Selection\nMarketing Element:\nPage 1 of 284\n2014-11-11 21:52:01 PM\nExperian and the marks used herein are service marks or registered trademarks of Experian.\n© Experian 2014 All rights reserved. Confidential and proprietary.\n**Data Dictionary**\nDate of Birth is acquired from public and proprietary files. These sources provide, at a minimum, the year of birth; the month is provided where available. Exact date of birth at various levels of detail is available for \n\n\n\n\n\nNOTE: Records coded with DOB are exclusive of Estimated Age (101E)\n**Element Number**\n0100\nDescription\nDate Of Birth / Exact Age\n**Data Dictionary**\n\n\n\n\n\n\n\n\n\n\nFiller, three bytes\n**Element Number**\n0000\n**Description**\nEnhancement Mandatory Append\n**Data Dictionary**\n\n\nWhen there is insufficient data to match a customer's record to our enrichment master for estimated age, a median estimated age based on the ages of all other adult individuals in the same ZIP+4 area is provided. \n\n\n\n\n\n\n00 = Unknown\n**Element Number**\n0101E\n**Description**\nEstimated Age\n"

The field names are in bold. The texts between field names are the field values.

First, I tried to extract the 'Description' field using the following regex:

pattern = re.compile('\nDescription\n(.*?)\nData Dictionary\n')
re.findall(pattern,text)

The results are correct:

['Date Of Birth / Exact Age', 'Enhancement Mandatory Append']

But using the same idea to extract the 'Data Dictionary' field gives an empty result:

pattern = re.compile('\nData Dictionary\n(.*?)\nElement Number\n')
re.findall(pattern,text)

Results:

[]

Any idea why?

You might want to use raw strings to define your pattern, by the way (e.g. re.compile(r'\nDescription...).
– TigerhawkT3, Commented Aug 7, 2015 at 19:39

kirbyfan64sos · Accepted Answer · 2015-08-07 19:37:38Z

4

The dot (.) doesn't match newlines by default, so (.*?) can't span a field value that runs over several lines. Try:

pattern = re.compile('\nData Dictionary\n(.*?)\nElement Number\n', flags=re.DOTALL)
re.findall(pattern,text)

Notice how I passed re.DOTALL as the flags argument to re.compile.
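
To see the difference, here is a minimal sketch on a shortened stand-in for the report text. The sample string is abbreviated from the question with the bold markers removed, so it is an assumption about how the real extract looks. It runs the same pattern, written as a raw string, with and without re.DOTALL.

```python
import re

# Shortened stand-in for the text in the question (bold markers removed).
sample = (
    "Confidential and proprietary.\n"
    "Data Dictionary\n"
    "Date of Birth is acquired from public and proprietary files.\n"
    "\n"
    "NOTE: Records coded with DOB are exclusive of Estimated Age (101E)\n"
    "Element Number\n"
    "0100\n"
    "Description\n"
    "Date Of Birth / Exact Age\n"
    "Data Dictionary\n"
    "\n"
    "Filler, three bytes\n"
    "Element Number\n"
    "0000\n"
)

pattern_text = r"\nData Dictionary\n(.*?)\nElement Number\n"

# Without DOTALL, '.' stops at a newline, so multi-line values never match.
default_matches = re.findall(pattern_text, sample)

# With DOTALL, '.' also matches '\n', so the lazy group can span several lines.
dotall_matches = re.findall(pattern_text, sample, flags=re.DOTALL)

print(default_matches)      # empty list
print(len(dotall_matches))  # two captured field values
```

The captured values keep their embedded newlines, so calling .strip() on each one may be useful before storing them.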


isosceleswheel · Answer · 2015-08-07 19:43:17Z

1

Try using the flag re.MULTILINE in your regex:

pattern = re.compile('\nData Dictionary\n(.*?)\nElement Number\n', re.MULTILINE)
re.findall(pattern,text)
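
One caveat worth checking against the Python re documentation: re.MULTILINE only changes where ^ and $ match, and it does not let . cross a newline. The sketch below uses a made-up two-line value (the sample string is hypothetical, not the question's text) and suggests that a value spanning several lines still needs re.DOTALL.

```python
import re

# Hypothetical sample: the field value spans two lines.
sample = "Data Dictionary\nline one\nline two\nElement Number\n0000\n"

pattern_text = r"Data Dictionary\n(.*?)\nElement Number"

# MULTILINE alone: '.' still cannot match '\n', so nothing is found.
multiline_matches = re.findall(pattern_text, sample, flags=re.MULTILINE)

# DOTALL: '.' matches '\n', so the whole two-line value is captured.
dotall_matches = re.findall(pattern_text, sample, flags=re.DOTALL)

# What MULTILINE does change: '^' now matches at the start of every line.
line_starts = re.findall(r"^line \w+", sample, flags=re.MULTILINE)
no_flag_starts = re.findall(r"^line \w+", sample)

print(multiline_matches)
print(dotall_matches)
print(line_starts, no_flag_starts)
```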


5 Comments

isosceleswheel Over a year ago

Using multiline mode, you might also just remove all the explicit \n and just use the two delimiting patterns as boundaries: 'Data Dictionary\n(.*?)\nElement Number'

user2517984 Over a year ago

Yours also works. Unfortunately I can only accept the earliest answer... but thanks very much.

isosceleswheel Over a year ago

@user2517984 if you like it, you can give it a +1 ;)

user2517984 Over a year ago

I wish I could, but I need 15 reputation to vote. I will get back to this after earning more reputation.

RB17 Over a year ago

I wanted to get "APPLICATION_NAME, ROLE_ID" from the following text: PRIMARY KEY (APPLICATION_NAME, ROLE_ID) USING INDEX APP_ROLES.SWR_PK. So I used: pattern2 = re.compile('\nPRIMARY KEY\n(.*?)\nUSING', flags=re.MULTILINE); x = re.findall(pattern2, text); print(x), but all I got was an empty list []. Can you suggest what I did wrong?
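
For the single-line case in this last comment, the literal \n in the pattern is the likely culprit: the quoted text has no newlines around PRIMARY KEY, and the parentheses have to be escaped to match literally. A possible sketch, assuming the input looks exactly like the line quoted in the comment:

```python
import re

# Assumed input: the single line quoted in the comment.
text = "PRIMARY KEY (APPLICATION_NAME, ROLE_ID) USING INDEX APP_ROLES.SWR_PK"

# \s* tolerates spaces or newlines; \( and \) match the literal parentheses.
pattern2 = re.compile(r"PRIMARY KEY\s*\((.*?)\)\s*USING")

x = pattern2.findall(text)
print(x)  # ['APPLICATION_NAME, ROLE_ID']
```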
