
I have a list of words that I am searching for in a PDF document using fitz (PyMuPDF) in Python. The code works for most words, but not for a few, such as "efficiency".

My code is given below:

    # re.escape keeps regex metacharacters in the phrase from changing the pattern
    if re.search(rf'\b{re.escape(phrase.casefold())}s?\b', mpage.casefold()):
        text_instances = page.search_for(phrase, quads=True)

For the word "efficiency", the if statement matches, but page.search_for returns no matches. In the image below, the word "efficiency" appears to use different fonts for the first and second "f". Is that why the word is not matched?

[Image: the word "efficiency" as it appears in the PDF]

Comments

  • The reason is that certain character combinations are not stored as separate characters, but as one, a so-called "ligature". The most frequent ones and their hex codes are: "ff" -> 0xFB00, "fi" -> 0xFB01, "fl" -> 0xFB02, "ffi" -> 0xFB03, "ffl" -> 0xFB04, "ſt" (long s + t) -> 0xFB05, "st" -> 0xFB06. You should use a text extraction package that is capable of disassembling those character codes, like PyMuPDF. Commented Dec 18, 2023 at 10:58
  • Text extraction is successful, but I also need to highlight the text in the PDF, hence page.search_for with quads=True. Commented Dec 18, 2023 at 11:18
  • Thanks. Your solution at github.com/pymupdf/PyMuPDF/issues/1503 helped me get the answer. Commented Dec 18, 2023 at 11:32
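
To make the ligature point above concrete, here is a minimal sketch in plain Python (no PDF library involved). The names `LIGATURES` and `expand_ligatures` are made up for illustration, and folding U+FB05 to plain "st" is an assumption of this sketch. It shows why a plain-text search for "efficiency" can miss a word whose "ffi" is stored as the single code point U+FB03.

```python
# Ligature code points from the Unicode "Alphabetic Presentation Forms" block
# and the plain letters they stand for.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t; folded to plain "st" here (an assumption)
    "\ufb06": "st",
}
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace each ligature code point with its component letters."""
    return text.translate(_TABLE)


# How the word may be stored in the PDF: "e" + U+FB03 + "ciency".
extracted = "e\ufb03ciency"

print("efficiency" in extracted)                    # False
print("efficiency" in expand_ligatures(extracted))  # True
```

This only mirrors the idea; the search inside PyMuPDF works on its own text page, not on a Python string.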

1 Answer

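I found the solution. By default, page.search_for keeps ligatures as single characters, so a search for the plain letters "ffi" does not match the ligature glyph. Passing flags=0 switches that behaviour off, so ligatures are expanded into their component letters before matching: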

    text_instances = page.search_for(phrase, flags=0, quads=True)
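
Since the goal was highlighting, a fuller sketch of how this call might be combined with an annotation is given here. The helper names `highlight_phrase` and `highlight_in_file`, and the file paths passed to the latter, are hypothetical; `search_for`, `add_highlight_annot`, `fitz.open` and `doc.save` are the PyMuPDF calls it relies on. I have not run it against the PDF from the question, so treat it as a starting point rather than a tested solution.

```python
def highlight_phrase(page, phrase):
    """Highlight every hit of `phrase` on a PyMuPDF page; return the hit count."""
    # flags=0 turns off ligature preservation, so "ffi" stored as one glyph
    # is matched by the plain letters f, f, i.
    quads = page.search_for(phrase, flags=0, quads=True)
    if quads:
        # One highlight annotation covering all hits of this phrase.
        page.add_highlight_annot(quads)
    return len(quads)


def highlight_in_file(src, dst, phrases):
    """Highlight all phrases in the PDF at `src` and save the result to `dst`."""
    import fitz  # PyMuPDF; imported here so highlight_phrase has no hard dependency

    doc = fitz.open(src)
    total = 0
    for page in doc:
        for phrase in phrases:
            total += highlight_phrase(page, phrase)
    doc.save(dst)
    doc.close()
    return total
```

A call such as highlight_in_file("input.pdf", "highlighted.pdf", ["efficiency"]) would then write a highlighted copy; both file names are placeholders.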

This link helped me find the solution: https://github.com/pymupdf/PyMuPDF/issues/1503

Thanks to @jorj-mckie: https://stackoverflow.com/users/4474869/jorj-mckie
