Is it possible to extract specific text from a PDF using Python?

Test case: I have a PDF file of more than 10 pages, and I need to extract specific labels and the values associated with them. Example: user:value, user id:value. These values need to be extracted.

I am already able to read all the pages; now I want to pull out only that specific text.

  • Does this answer your question? How to extract text from pdf in python 3.7.3 Commented May 10, 2020 at 9:58
  • As a new user, please also take the tour and read How to Ask. In particular, questions that can be answered with yes or no are usually bad questions. Commented May 10, 2020 at 10:33
  • You could convert the PDF to XML or JSON and then use an XML or JSON library to extract whatever you want from it. Commented May 10, 2020 at 10:43

1 Answer

If you are already able to read the PDF and store the text into a string, you could do the following:

import re  # Python's regular expression module

# Sample text standing in for the string extracted from the PDF
pdf_text = """
user: John
user: Doe
user id: 2
user id: 4
"""

# re.findall returns a list of all substrings matching the pattern
results = re.findall(r'user:\s\w+', pdf_text)
print(results)  # ['user: John', 'user: Doe']

This basically means: find every match that starts with the literal string 'user:', followed by exactly one whitespace character '\s', followed by one or more ('+') word characters '\w', that is, letters, digits, and underscores. The match stops at the first character that is not a word character.

If you only want the "value" field back, you could use r'user:\s(\w+)', which tells the regex engine to capture the text matched by '\w+' in a group. When the pattern contains a group, findall returns a list of the group matches instead of the full matches, so the result would be:

results = re.findall(r'user:\s(\w+)', pdf_text)
print(results)  # ['John', 'Doe']
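
The question writes its pairs as user:value, with no space after the colon, and also asks for user id. As a rough sketch only, assuming each pair sits on its own line in the extracted text, a pattern with two groups can tolerate optional spaces and capture both labels. The sample text and the names pattern and pairs are made up for illustration; real extraction output may be laid out differently.

```python
import re

# Hypothetical sample text; spacing after the colon varies on purpose.
pdf_text = """
user:John
user id: 2
user: Doe
user id:4
"""

# re.MULTILINE makes ^ and $ match at the start and end of every line.
# Group 1 is the label ("user" or "user id"); group 2 is the value,
# with optional spaces or tabs around it left out of the capture.
pattern = re.compile(r'^(user(?: id)?):[ \t]*(\S.*?)[ \t]*$', re.MULTILINE)

# With two groups in the pattern, findall returns (label, value) tuples.
pairs = pattern.findall(pdf_text)
print(pairs)
# [('user', 'John'), ('user id', '2'), ('user', 'Doe'), ('user id', '4')]
```

Anchoring to whole lines keeps a value that contains spaces intact, which '\w+' alone would cut off at the first space.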

Take a look at the regex module documentation at: https://docs.python.org/3/library/re.html

Other functions such as re.finditer(), which yields match objects instead of plain strings, could also help if you need to do something more complex.
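
To illustrate finditer(), here is a minimal sketch that assumes the same "label: value" layout as the sample above. The named groups key and value and the records list are illustrative choices, not anything the task requires.

```python
import re

# Hypothetical sample text standing in for text extracted from the PDF.
pdf_text = "user: John\nuser id: 2\nuser: Doe\nuser id: 4\n"

# Named groups let each match be read by name rather than by index.
pattern = re.compile(r'(?P<key>user(?: id)?):\s(?P<value>\w+)')

records = []
for match in pattern.finditer(pdf_text):
    # A match object exposes the captured groups and the match position.
    records.append((match.group('key'), match.group('value'), match.start()))

print(records)
# [('user', 'John', 0), ('user id', '2', 11), ('user', 'Doe', 22), ('user id', '4', 32)]
```

The start offset is useful if you later need to work out which page or line a value came from.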

This regex guide could also be of help: https://www.regexbuddy.com/regex.html?wlr=1
