I have a PDF form from which I need to extract the email ID, the person's name, and other information such as skills, city, etc. How can I do that using pdfminer3?
Please find a sample PDF attached.
2 Answers
First, use tika to convert the PDF to text.
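import re
import sys
# Jupyter/IPython only: install tika into the current kernel's environment.
# From a regular shell, run `python -m pip install tika` instead.
!{sys.executable} -m pip install tika
from tika import parser
file = 'filename with directory'  # Path to the PDF file
parsedPDF = parser.from_file(file)  # Parse data from the file
text = parsedPDF['content']  # Get the file's text content (may be None if no text was extracted)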
Now extract the desired fields from the text using regular expressions. Extensive regex tutorials are available online. If you have trouble implementing this, please ask here.
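As a rough illustration of the regex step, here is a minimal sketch. The sample PDF is not available here, so the sample text, the "Label: value" layout, and the helper names `extract_emails` and `extract_labeled_field` are all assumptions for illustration; the patterns would need adapting to the real form.

```python
import re

# Hypothetical stand-in for the `text` returned by tika; the real layout may differ.
sample_text = """John Doe
Email: john.doe@example.com
City: Pune
Skills: Python, SQL
"""

# A deliberately simple email pattern; it does not cover every valid address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return all email-like strings found in the text."""
    return EMAIL_RE.findall(text or "")


def extract_labeled_field(text, label):
    """Return the value after 'label:' on its own line, or None if absent.

    Assumes the form prints fields as 'Label: value', which may not hold
    for every PDF.
    """
    pattern = re.compile(
        r"^\s*" + re.escape(label) + r"\s*:\s*(.+?)\s*$",
        re.IGNORECASE | re.MULTILINE,
    )
    match = pattern.search(text or "")
    return match.group(1) if match else None


emails = extract_emails(sample_text)
city = extract_labeled_field(sample_text, "City")
skills = extract_labeled_field(sample_text, "Skills")
```

Fields that carry a label (city, skills) can be matched this way. A field with no label, such as a name printed on its own line, cannot be found by a label-based pattern and needs some other cue, for example its position in the text.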
1 Comment
nishtha vijay
I am able to read the PDF as text. Now, how can I recognize which part is the person's name? There is no keyword such as "name" in the PDF.