How to extract text, tables, and images from PDFs while maintaining their structure

Asked 7 months ago

Modified 7 months ago

I am using the open-source unstructured library to extract content from PDFs. When a PDF contains background images, decorative design patterns, or other XObjects, the library treats them as images too and stores a path for each one. How can I clean the PDFs, or filter the output, so that only the images that are actual figures in the document are stored?

from unstructured.partition.pdf import partition_pdf

# Element types that are collected as plain text.
TEXT_TYPES = {"NarrativeText", "Title", "ListItem", "Text", "FigureCaption", "UncategorizedText"}


class PDFProcessor:
    """Extract structured elements (text, images, tables, equations) from a PDF."""

    def process_pdf(self, path: str, max_characters: int = 30000):
        try:
            pdf_elements = partition_pdf(
                filename=path,
                extract_images_in_pdf=True,
                strategy="hi_res",
                infer_table_structure=True,
                extract_image_block_types=["Image"],
                # extract_image_block_to_payload=True,
                # Typo fixed: the keyword is `max_characters`, not `max_character`.
                # As far as I can tell it is a chunking option, so it likely only
                # has an effect when a chunking_strategy is also passed.
                max_characters=max_characters,
            )
            # Convert every element to a plain dict, then split by element type.
            chunks = [el.to_dict() for el in pdf_elements]
            text_data = [el for el in chunks if el["type"] in TEXT_TYPES]
            image_data = [el for el in chunks if el["type"] == "Image"]
            table_data = [el for el in chunks if el["type"] == "Table"]
            return {
                "text": text_data,
                "images": image_data,
                "tables": table_data,
            }
        except Exception as e:
            print(f"[ERROR] Cannot process the document: {e}")
            return {
                "text": [],
                "images": [],
                "tables": [],
                "error": str(e),
            }
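One possible way to separate real figures from backgrounds and repeated decorations is to post-filter the "Image" element dicts by geometry. The sketch below is only a heuristic, not something the unstructured library provides: the function names and every threshold are hypothetical, and it assumes each element dict carries `metadata.coordinates` with `points`, `layout_width`, and `layout_height`. It drops images that cover almost the whole page (likely backgrounds), images that are tiny (likely icons or pattern tiles), and images that recur at roughly the same position many times (likely logos or header art).

```python
from collections import Counter


def bbox_from_points(points):
    """Return (x0, y0, x1, y1) for a sequence of (x, y) corner points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def filter_figure_images(image_elements, min_area_ratio=0.02,
                         max_area_ratio=0.85, max_repeats=2, grid=20):
    """Heuristically keep only image elements that look like figures.

    All thresholds are illustrative assumptions and need tuning per corpus.
    """
    candidates = []
    for el in image_elements:
        coords = (el.get("metadata") or {}).get("coordinates") or {}
        points = coords.get("points")
        width = coords.get("layout_width")
        height = coords.get("layout_height")
        if not points or not width or not height:
            # No geometry available: keep the element rather than guess.
            candidates.append((None, el))
            continue
        x0, y0, x1, y1 = bbox_from_points(points)
        ratio = ((x1 - x0) * (y1 - y0)) / (width * height)
        if ratio < min_area_ratio or ratio > max_area_ratio:
            continue  # too small (icon, pattern tile) or too large (background)
        # Quantise the box to a coarse grid so near-identical positions match.
        key = (
            round(x0 / width * grid), round(y0 / height * grid),
            round(x1 / width * grid), round(y1 / height * grid),
        )
        candidates.append((key, el))

    # Count how often each quantised position occurs across the document.
    counts = Counter(key for key, _ in candidates if key is not None)
    return [el for key, el in candidates
            if key is None or counts[key] <= max_repeats]
```

It could be applied to the `"images"` list returned by `process_pdf`, for example `figures = filter_figure_images(result["images"])`. Full-page figures and very small diagrams would be lost with these defaults, so the limits are a starting point rather than a rule.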

edited Apr 22 at 11:36 by desertnaut

asked Apr 22 at 9:48 by Umair Ashraf

PyMuPDF4LLM converts PDFs into HTML or Markdown, but in my case I want to build an agent that extracts images, tables, and text from PDFs while preserving the structure of the tables and of images such as diagrams and figures, and then performs retrieval with MongoDB $vectorSearch. What I have tried works fine on simple PDFs that have no background images or design patterns, particularly on the front page, and no logos or other miscellaneous graphics inside the document. Which approach would you recommend for PDFs that do contain them?

– Umair Ashraf, commented 2025-04-23 07:50:25 +00:00
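For the retrieval side mentioned in the comment, a minimal sketch of a MongoDB Atlas `$vectorSearch` aggregation pipeline might look like the following. It only builds the pipeline as plain Python data. The function name, the index name `pdf_elements_index`, the `embedding` field, and the stored document shape (one document per extracted element with `type`, `text`, and `metadata`) are all assumptions for illustration, and filtering on `type` presumes that field is declared as a filter field in the vector index.

```python
def build_vector_search_pipeline(query_vector, element_type=None, limit=5,
                                 index_name="pdf_elements_index",
                                 path="embedding"):
    """Build an Atlas $vectorSearch pipeline (names are hypothetical)."""
    stage = {
        "index": index_name,            # assumed vector index name
        "path": path,                   # assumed field holding the embedding
        "queryVector": list(query_vector),
        # Over-fetch candidates relative to the final limit.
        "numCandidates": max(limit * 20, 100),
        "limit": limit,
    }
    if element_type is not None:
        # Restrict results to one element type, e.g. "Table" or "Image".
        stage["filter"] = {"type": {"$eq": element_type}}

    return [
        {"$vectorSearch": stage},
        {
            "$project": {
                "_id": 0,
                "type": 1,
                "text": 1,
                "metadata.page_number": 1,
                "metadata.text_as_html": 1,
                "metadata.image_path": 1,
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]
```

With pymongo, the result would typically be passed to `collection.aggregate(pipeline)`. Keeping the table HTML and the image path in each stored document is one way to hand the original structure back to the agent after retrieval.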
