How to extract fields from a PDF in Python using pdfminer

Question

I have a PDF form from which I need to extract the email ID, the person's name, and other information such as skills, city, etc. How can I do that using pdfminer3? Please find a sample PDF attached.

Gaurav Sharma · Accepted Answer · 2019-11-15 08:21:07Z


First, use tika to convert the PDF to text.

# Install tika first. In a Jupyter notebook:
#   import sys
#   !{sys.executable} -m pip install tika
# (the "!" line is notebook syntax, not plain Python)
import re  # used for the regex step that follows
from tika import parser

file = 'filename with directory'  # path to the PDF
parsedPDF = parser.from_file(file)  # Parse data from the file
text = parsedPDF['content']  # Get the file's text content

Now extract the desired fields from text using regular expressions (Python's re module). Extensive regex tutorials are available online. If you have any problem implementing this, please ask here.
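As a rough illustration of the regex step, here is a minimal sketch. It assumes the extracted text contains an email address somewhere and "Label: value" lines for city and skills; the sample text, the labels, and the helper name extract_fields are all hypothetical, since the real layout depends on your PDF. It does not attempt to find the person's name, which has no label to match on.

```python
import re

# Hypothetical sample standing in for the `text` extracted from the PDF.
sample_text = """John Doe
Email: john.doe@example.com
City: Jaipur
Skills: Python, SQL, Machine Learning
"""

# A simple, permissive email pattern (not a full RFC 5322 validator).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Matches "City: ..." or "Skills: ..." on a single line.
# The labels are assumptions about the form layout.
FIELD_RE = re.compile(
    r"^[ \t]*(City|Skills)[ \t]*:[ \t]*(.+?)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_fields(text):
    """Return a dict of the fields found in the text."""
    fields = {}
    match = EMAIL_RE.search(text)
    if match:
        fields["email"] = match.group(0)
    for label, value in FIELD_RE.findall(text):
        fields[label.lower()] = value
    return fields


result = extract_fields(sample_text)
print(result)
```

With real data, pass the text obtained from tika to extract_fields instead of sample_text, and adjust the labels in FIELD_RE to whatever the form actually uses.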


1 Comment

nishtha vijay Over a year ago

I am able to read the PDF as text. Now how can I recognize which part is the person's name? The PDF has no keyword such as "name" to mark it.

Ramon Medeiros · Accepted Answer · 2019-11-15 08:06:03Z


Try the tika package:

from tika import parser

raw = parser.from_file('sample.pdf')
print(raw['content'])
