Extracting Data from PDF with particular heading in python

Question

I want to parse a PDF file in Python. I have seen examples that use PDFMiner, but none of them cover my requirement.

For example, if I want to parse a resume, it contains various sections such as Summary, Experience and Hobbies.

I am interested in extracting only the Experience section. It may come first, second, or anywhere else in the document, so I need to identify where the Experience section is located and extract its data.

How can I do this?

Is it feasible to extract data by heading? If not, is there any other way to do this?
– Jack Daniel, Commented Jun 7, 2016 at 9:28
In the general case it cannot be done (short of rendering the PDF file and feeding the results into an OCR system). PDF is a display format and is not guaranteed to have any internal structure for defining fields, let alone any standardized structure. If you have a bunch of PDFs all generated by the exact same software stack, you may be able to parse them as a special case (one that will differ from other folks' special cases).
– nigel222, Commented Jun 7, 2016 at 11:03

Eugene · Accepted Answer · 2016-06-07 13:31:32Z

There are three viable approaches to extracting that field's data:

1. Search for a predefined keyword, such as Experience, to get its location. Then search for the next section's keyword (Hobbies), determine the coordinates of the text region between the two headings, and extract the text from that region.
2. If the PDFs are all produced by the same generator, you may be able to find the coordinates of the Experience section once and extract text from the same location every time.
3. (easiest) Convert the whole page into text and then parse the generated text using substring search or regular expressions. This is the easiest and simplest way, because all the work on the PDF format itself is left to a specialized tool.
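As a rough illustration of the first approach, the sketch below selects the text boxes that sit vertically between two heading boxes on a single page. The names TextBox, find_heading, text_between and load_boxes are hypothetical helpers, not part of any library. It assumes PDF-style coordinates (origin at the bottom-left, y growing upward), that each heading sits in its own text box, and that both headings are on the same page. The optional loader assumes pdfminer.six is installed; the rest of the code does not need it.

```python
from typing import List, NamedTuple, Optional


class TextBox(NamedTuple):
    """A piece of text with its bounding box (hypothetical helper type)."""
    text: str
    x0: float
    y0: float  # bottom edge
    x1: float
    y1: float  # top edge


def find_heading(boxes: List[TextBox], keyword: str) -> Optional[TextBox]:
    """Return the first box whose whole text equals the keyword, ignoring case."""
    for box in boxes:
        if box.text.strip().lower() == keyword.lower():
            return box
    return None


def text_between(boxes: List[TextBox], start_kw: str, end_kw: str) -> str:
    """Collect the text located below start_kw and above end_kw."""
    start = find_heading(boxes, start_kw)
    if start is None:
        return ""
    end = find_heading(boxes, end_kw)
    # Use the end heading only if it really sits below the start heading;
    # otherwise take everything down to the bottom of the page.
    if end is not None and end.y1 <= start.y0:
        lower_limit = end.y1
    else:
        lower_limit = float("-inf")
    selected = [b for b in boxes if b.y1 <= start.y0 and b.y0 >= lower_limit]
    selected.sort(key=lambda b: -b.y1)  # top-to-bottom reading order
    return "\n".join(b.text.strip() for b in selected)


def load_boxes(path: str, page_number: int = 0) -> List[TextBox]:
    """Read the text boxes of one page. Assumes pdfminer.six is installed."""
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for index, page in enumerate(extract_pages(path)):
        if index == page_number:
            return [
                TextBox(element.get_text(), *element.bbox)
                for element in page
                if isinstance(element, LTTextContainer)
            ]
    return []
```

A call such as text_between(load_boxes("resume.pdf"), "Experience", "Hobbies") would then return the section body, where resume.pdf is a placeholder file name. Real layouts often merge a heading and its body into one box or spread a section over several pages, so the matching rule would likely need adjusting per document family.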


1 Comment

Siddharth Das Over a year ago

Shortcomings of this approach: 1. In some cases "Career History", "Professional History" or something else may be written in place of "Experience". 2. The word "experience" can occur multiple times in the resume.
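One possible way to soften both shortcomings when parsing the converted text is to keep a list of alternative headings per section and to treat a line as a heading only when the whole line matches one of them, so that "experience" inside a sentence is ignored. The sketch below is an assumption-laden illustration: SECTION_ALIASES, classify_heading and extract_section are hypothetical names, the alias list is only an example and would need extending for real resumes, and the input is assumed to be plain text already produced by a PDF-to-text tool (for instance extract_text from pdfminer.six).

```python
import re
from typing import Dict, List, Optional

# Illustrative alias table; real resumes will need more entries.
SECTION_ALIASES: Dict[str, List[str]] = {
    "summary": ["summary"],
    "experience": ["experience", "career history", "professional history"],
    "hobbies": ["hobbies"],
}


def classify_heading(line: str) -> Optional[str]:
    """Return the section name if the entire line is a known heading, else None."""
    # Lower-case the line and squash punctuation, digits and spaces,
    # so "Career History:" and "EXPERIENCE" both normalize cleanly.
    normalized = re.sub(r"[^a-z]+", " ", line.lower()).strip()
    for section, aliases in SECTION_ALIASES.items():
        if normalized in aliases:
            return section
    return None


def extract_section(text: str, wanted: str) -> str:
    """Return the lines between the wanted heading and the next known heading."""
    collected: List[str] = []
    inside = False
    for line in text.splitlines():
        section = classify_heading(line)
        if section == wanted:
            inside = True  # start collecting after the heading line
            continue
        if section is not None and inside:
            break  # a different known heading ends the section
        if inside:
            collected.append(line)
    return "\n".join(collected).strip()
```

This still depends on headings appearing on their own line in the extracted text, which is not guaranteed for every PDF.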
