I want to parse a PDF file in Python. I have seen examples using PDFMiner, but they do not cover my requirement.

For example, a resume contains various sections such as Summary, Experience and Hobbies.

I am interested in extracting only the Experience section. It may come first, second, or anywhere else in the document, so I need to identify where it is located and then extract its content.

How can I do this?

  • What was done so far and what exactly goes wrong? Commented Jun 7, 2016 at 9:18
  • Is it feasible to extract data by heading? If not, is there any other way to do this? Commented Jun 7, 2016 at 9:28
  • In the general case it cannot be done (short of rendering the PDF file and feeding the results into an OCR system). PDF is a display format and is not guaranteed to have any internal structure for defining fields, let alone any standardized structure. If you have a bunch of PDFs all generated by the exact same software stack, you may be able to parse them as a special case (that will be different to other folks' special cases). Commented Jun 7, 2016 at 11:03

1 Answer

There are 3 viable approaches to extracting that field's data:

  1. Search for a predefined keyword such as Experience to get its location. Then search for the next section's keyword (for example Hobbies), determine the coordinates of the text region between the two headings, and extract the text from that region.

  2. If the PDFs are all produced by the same generator, you may be able to find the coordinates of the Experience section once and then extract text from the same location every time.

  3. (easiest) Convert the whole page to text and then parse that text using substring search or regular expressions. This is the simplest way, because all the work specific to the PDF format is left to a specialized tool.
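
As a rough illustration of approach 3, here is a minimal sketch that works on text already extracted from the PDF (for instance with `extract_text` from `pdfminer.high_level` in pdfminer.six, though any text extractor would do). The function name `extract_section`, the sample resume text and the list of section headings are all assumptions made up for this example. It also assumes each heading sits alone on its own line, which will not hold for every resume.

```python
import re


def extract_section(text, heading, next_headings):
    """Return the text between `heading` and the next known heading, or None.

    Hypothetical helper: a heading only counts when it stands alone on a
    line, optionally followed by a colon.
    """
    start = re.search(
        rf"^[ \t]*{re.escape(heading)}[ \t]*:?[ \t]*$",
        text,
        re.IGNORECASE | re.MULTILINE,
    )
    if start is None:
        return None

    # Everything after the heading line is a candidate for the section body.
    body = text[start.end():]

    if next_headings:
        alternatives = "|".join(re.escape(h) for h in next_headings)
        end = re.search(
            rf"^[ \t]*(?:{alternatives})[ \t]*:?[ \t]*$",
            body,
            re.IGNORECASE | re.MULTILINE,
        )
        if end is not None:
            # Cut the body off where the next section heading begins.
            body = body[: end.start()]

    return body.strip()


# Stand-in for text produced by a PDF-to-text tool (made-up sample data).
sample = """Summary
Python developer with an interest in document processing.

Experience
Acme Corp, 2012-2016
Built internal reporting tools.

Hobbies
Chess
"""

experience = extract_section(sample, "Experience", ["Summary", "Hobbies"])
print(experience)
```

Because the search runs on the whole text, it does not matter whether Experience is the first, second or last section. If the heading is the last one in the document, the body simply runs to the end of the text.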

1 Comment

Shortcomings of this approach: 1. In some cases "Career History", "Professional History" or some other heading is used in place of "Experience". 2. The word "experience" can occur multiple times in the resume.
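
Both shortcomings can be reduced, though not eliminated, by matching against a set of heading aliases and by accepting a match only when the whole line is a heading. The sketch below is one possible way to do that. The alias set and the names `EXPERIENCE_ALIASES`, `normalize` and `heading_line_numbers` are assumptions for illustration, and a real resume corpus would likely need a longer list.

```python
# Assumed alias list; extend it for the resumes you actually see.
EXPERIENCE_ALIASES = {
    "experience",
    "work experience",
    "career history",
    "professional history",
}


def normalize(line):
    """Lower-case a line and drop surrounding whitespace and a trailing colon."""
    return line.strip().rstrip(":").strip().lower()


def heading_line_numbers(text, aliases):
    """Return 0-based indexes of lines that consist only of a known heading."""
    return [
        index
        for index, line in enumerate(text.splitlines())
        if normalize(line) in aliases
    ]


# Made-up sample: "experience" appears inside a sentence and must be ignored.
sample = """Summary
Five years of experience in data engineering.

Career History:
Acme Corp, 2012-2016
"""

hits = heading_line_numbers(sample, EXPERIENCE_ALIASES)
print(hits)
```

Requiring the entire line to equal an alias is what keeps the sentence in the Summary from being mistaken for a heading. It will still miss headings that share a line with other text.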
