I am using the open-source unstructured library. When I process a PDF that contains background images, design patterns, and XObjects, the library also treats those as images and stores their paths. How can I clean the PDFs so that only the actual figures in them are stored as images?

from unstructured.partition.pdf import partition_pdf

# Element types that are collected as text.
TEXT_TYPES = {"NarrativeText", "Title", "ListItem", "Text", "FigureCaption", "UncategorizedText"}

class PDFProcessor:
    """Extract structured elements (text, images, tables) from a PDF."""

    def process_pdf(self, path: str, max_characters: int = 30000):
        try:
            pdf_elements = partition_pdf(
                filename=path,
                extract_images_in_pdf=True,
                strategy='hi_res',
                infer_table_structure=True,
                extract_image_block_types=["Image"],
                # extract_image_block_to_payload=True,
                # max_characters is a chunking option; it likely has no effect
                # unless a chunking_strategy is also passed.
                max_characters=max_characters,
            )
            # Convert every element to a plain dict, then split by element type.
            chunks = [el.to_dict() for el in pdf_elements]
            text_data = [el for el in chunks if el["type"] in TEXT_TYPES]
            image_data = [el for el in chunks if el["type"] == "Image"]
            table_data = [el for el in chunks if el["type"] == "Table"]
            return {
                "text": text_data,
                "images": image_data,
                "tables": table_data
            }
        except Exception as e:
            # Report the failure and return empty results plus the error message.
            print(f"[ERROR] Cannot process the document: {e}")
            return {
                "text": [],
                "images": [],
                "tables": [],
                "error": str(e)
            }
  • PyMuPDF4LLM is for converting PDFs into HTML or Markdown, but in my case I want to build an agent that extracts the images, tables, and text from PDFs while preserving the structure of the tables and of images such as diagrams and figures, and then performs retrieval using MongoDB $vectorSearch. What I have tried works fine with simple PDFs that don't have background images or design patterns, specifically on the front page and within the PDF as logos or other miscellaneous elements. In that case, which approach do you think I should use? Commented Apr 23 at 7:50
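
One possible direction for the filtering problem described above, offered as a rough sketch rather than a tested solution: post-filter the "Image" elements by their bounding boxes. The sketch assumes each element dict carries `metadata.coordinates` with `points`, `layout_width`, and `layout_height`, which should be checked against the actual `to_dict()` output. The function names `bbox_of` and `filter_figures` and all thresholds are hypothetical and would need tuning per document set.

```python
from collections import Counter


def bbox_of(image_el):
    """Return (x0, y0, x1, y1, page_w, page_h), or None if coordinates are missing."""
    # Assumption: coordinates live under metadata["coordinates"] in the element dict.
    coords = (image_el.get("metadata") or {}).get("coordinates") or {}
    points = coords.get("points")
    page_w = coords.get("layout_width")
    page_h = coords.get("layout_height")
    if not points or not page_w or not page_h:
        return None
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys), page_w, page_h


def position_key(box):
    """Normalise a box to page-relative coordinates, rounded to 2 decimals."""
    x0, y0, x1, y1, w, h = box
    return (round(x0 / w, 2), round(y0 / h, 2), round(x1 / w, 2), round(y1 / h, 2))


def filter_figures(image_data, min_area=0.02, max_area=0.85, max_repeats=2):
    """Keep Image elements that look like figures (all thresholds are guesses).

    Drops images that cover almost the whole page (likely backgrounds),
    images that are tiny (likely logos or ornaments), and images that sit
    at the same relative position more than `max_repeats` times
    (likely repeated headers, footers, or watermarks).
    """
    boxes = [bbox_of(el) for el in image_data]
    repeats = Counter(position_key(b) for b in boxes if b is not None)

    kept = []
    for el, box in zip(image_data, boxes):
        if box is None:
            kept.append(el)  # no coordinates available: keep rather than guess
            continue
        x0, y0, x1, y1, w, h = box
        area_ratio = ((x1 - x0) * (y1 - y0)) / (w * h)
        if area_ratio < min_area or area_ratio > max_area:
            continue
        if repeats[position_key(box)] > max_repeats:
            continue
        kept.append(el)
    return kept
```

With the class from the question, this could be applied as `figures = filter_figures(result["images"])`, where `result` is the dict returned by `process_pdf`. Geometry alone cannot tell a decorative image from a genuine figure of similar size, so the kept set is worth checking by eye on a few sample PDFs.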
