2

I am trying to extract data fields from text pulled out of a PDF using regex.

The text is:

"SAMPLE EXPERIAN CUSTOMER\n2288150 - EXPERIAN SAMPLE REPORTS\nData Dictionary Report\nFiltered By:\nCustom Selection\nMarketing Element:\nPage 1 of 284\n2014-11-11 21:52:01 PM\nExperian and the marks used herein are service marks or registered trademarks of Experian.\n© Experian 2014 All rights reserved. Confidential and proprietary.\n**Data Dictionary**\nDate of Birth is acquired from public and proprietary files. These sources provide, at a minimum, the year of birth; the month is provided where available. Exact date of birth at various levels of detail is available for \n\n\n\n\n\nNOTE: Records coded with DOB are exclusive of Estimated Age (101E)\n**Element Number**\n0100\nDescription\nDate Of Birth / Exact Age\n**Data Dictionary**\n\n\n\n\n\n\n\n\n\n\nFiller, three bytes\n**Element Number**\n0000\n**Description**\nEnhancement Mandatory Append\n**Data Dictionary**\n\n\nWhen there is insufficient data to match a customer's record to our enrichment master for estimated age, a median estimated age based on the ages of all other adult individuals in the same ZIP+4 area is provided. \n\n\n\n\n\n\n00 = Unknown\n**Element Number**\n0101E\n**Description**\nEstimated Age\n"

The field names are in bold. The text between two field names is the field value.

First, I tried to extract the 'Description' field using the following regex:

import re
pattern = re.compile('\nDescription\n(.*?)\nData Dictionary\n')
re.findall(pattern, text)

The results are correct:

['Date Of Birth / Exact Age', 'Enhancement Mandatory Append']

But using the same approach to extract the 'Data Dictionary' field gives an empty result:

pattern = re.compile('\nData Dictionary\n(.*?)\nElement Number\n')
re.findall(pattern,text)

Results:

[]

Any idea why?

1
  • You might want to use raw strings to define your pattern, by the way (e.g. re.compile(r'\nDescription...). Commented Aug 7, 2015 at 19:39
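To illustrate the raw-string suggestion, here is a minimal sketch. The sample strings are made up for illustration and are not from the question. For \n both spellings happen to behave the same, which is why the original patterns still work, but escapes such as \b differ:

```python
import re

sample = "a cat sat"  # hypothetical sample text

# In a normal string literal, '\b' is a backspace character, not a word boundary.
plain = re.findall('\bcat\b', sample)

# In a raw string the backslash reaches the regex engine, which reads \b as a word boundary.
raw = re.findall(r'\bcat\b', sample)

# For \n, the literal newline and the regex escape both match a newline.
newline_plain = re.findall('x\ny', 'x\ny')
newline_raw = re.findall(r'x\ny', 'x\ny')

print(plain, raw, newline_plain, newline_raw)
```

Using raw strings for every pattern avoids having to remember which escapes Python interprets before the regex engine sees them.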

2 Answers

4

The . metacharacter doesn't match newlines by default, and the 'Data Dictionary' values span several lines. Try:

pattern = re.compile('\nData Dictionary\n(.*?)\nElement Number\n', flags=re.DOTALL)
re.findall(pattern,text)

Notice how I passed re.DOTALL as the flags argument to re.compile.
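As a rough check of the difference, here is a small sketch. The text below is a shortened, hypothetical stand-in for the PDF text in the question, not the real report, and the final strip() step is my own addition to drop the leading blank lines:

```python
import re

# Hypothetical, shortened stand-in for the PDF text in the question.
text = (
    "\nData Dictionary\nline one\n\nline two\nElement Number\n0100"
    "\nDescription\nDate Of Birth\nData Dictionary\n\nFiller, three bytes"
    "\nElement Number\n0000\n"
)

pattern_src = r'\nData Dictionary\n(.*?)\nElement Number\n'

# Without the flag, . stops at the first newline inside the value, so nothing matches.
without_flag = re.findall(pattern_src, text)

# With re.DOTALL, . also matches newlines, so multi-line values are captured.
with_flag = re.findall(pattern_src, text, flags=re.DOTALL)

# Optional cleanup: trim surrounding whitespace and blank lines from each value.
cleaned = [value.strip() for value in with_flag]

print(without_flag)
print(cleaned)
```

The 'Description' pattern worked without the flag only because those values sit on a single line.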

1

Try using the flag re.MULTILINE in your regex:

pattern = re.compile('\nData Dictionary\n(.*?)\nElement Number\n', re.MULTILINE)
re.findall(pattern,text)
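One caveat worth checking before relying on this: re.MULTILINE only changes where ^ and $ match, and it does not make . match newlines. Whether it is enough therefore depends on whether each captured value fits on one line. A minimal sketch with a made-up four-line sample (not the text from the question):

```python
import re

text = "start\nmid one\nmid two\nend"  # hypothetical sample

# By default ^ and $ anchor only at the start and end of the whole string.
anchors_default = re.findall(r'^.+$', text)

# With re.MULTILINE they anchor at the start and end of every line.
anchors_multiline = re.findall(r'^.+$', text, flags=re.MULTILINE)

# re.MULTILINE does not affect ., so a value spanning two lines is still missed.
dot_multiline = re.findall(r'start\n(.*?)\nend', text, flags=re.MULTILINE)

# re.DOTALL is the flag that lets . cross line breaks.
dot_dotall = re.findall(r'start\n(.*?)\nend', text, flags=re.DOTALL)

print(anchors_default, anchors_multiline, dot_multiline, dot_dotall)
```

The two flags can be combined with re.MULTILINE | re.DOTALL if a pattern needs both behaviours.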

5 Comments

Using multiline mode, you might also just remove all the explicit \n and just use the two delimiting patterns as boundaries: 'Data Dictionary\n(.*?)\nElement Number'
Yours also works. Unfortunately I can only accept the earliest answer, but thanks very much.
@user2517984 if you like it, you can give it a +1 ;)
I wish I could, but I need 15 reputation to vote. I will get back to this after earning more reputation.
I wanted to get "APPLICATION_NAME, ROLE_ID" from the following text: PRIMARY KEY (APPLICATION_NAME, ROLE_ID) USING INDEX APP_ROLES.SWR_PK. So I used: pattern2 = re.compile('\nPRIMARY KEY\n(.*?)\nUSING', flags=re.MULTILINE); x = re.findall(pattern2, text); print(x), but all I got was an empty list []. Can you suggest what I did wrong?
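A possible explanation for the last comment, assuming the text really is the single line shown there: the pattern requires a newline on each side of PRIMARY KEY and before USING, but the sample has spaces and parentheses instead. A sketch of one way to write it (the \s* and the escaped parentheses are my assumption about the intended format):

```python
import re

# The text as quoted in the comment, assumed to be on one line.
text = "PRIMARY KEY (APPLICATION_NAME, ROLE_ID) USING INDEX APP_ROLES.SWR_PK"

# The pattern from the comment: it demands newlines that the text does not contain.
original = re.findall('\nPRIMARY KEY\n(.*?)\nUSING', text, flags=re.MULTILINE)

# Allow any whitespace (spaces or newlines) and match the literal parentheses.
pattern2 = re.compile(r'PRIMARY KEY\s*\((.*?)\)\s*USING', flags=re.DOTALL)
columns = pattern2.findall(text)

print(original)
print(columns)
```

If the column names are needed separately, splitting the captured string on commas and stripping each piece would be a natural next step.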
