Converting a PDF file to Base64 to index into Elasticsearch

Question

I need to index PDFs to Elasticsearch. For that, I need to convert the files to base64. I will be using the Attachment Mapping.

I used the following Python code to convert the file to a Base64-encoded string:

from elasticsearch import Elasticsearch
import base64
import constants

def index_pdf(pdf_filename):
    encoded = ""
    with open(pdf_filename) as f:
        data = f.readlines()
        for line in data:
            encoded += base64.b64encode(f.readline())
    return encoded

if __name__ == "__main__":
    encoded_pdf = index_pdf("Test.pdf")
    INDEX_DSL = {
        "pdf_id": "1",
        "text": encoded_pdf
    }
    constants.ES_CLIENT.index(
            index=constants.INDEX_NAME,
            doc_type=constants.TYPE_NAME,
            body=INDEX_DSL,
            id="1"
    )

Creating the index and indexing the document both work fine. The only issue is that I don't think the file has been encoded correctly. I tried encoding the file with online tools and got a completely different encoding, which is larger than the one I get using Python.

Here is the PDF file.

I tried querying the text data as suggested in the plugin's documentation.

GET index_pdf/pdf/_search
{
  "query": {
    "match": {
      "text": "piece text"
    }
  }
}

It gives me zero hits. How should I go about it?

keety · Accepted Answer · 2015-07-09 18:09:33Z

The encoding snippet is incorrect: it opens the PDF file in "text" mode.
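
As a side note, here is a minimal sketch of a second problem in the question's loop, independent of the file mode: `f.readlines()` consumes the whole file, so the later `f.readline()` calls return empty data and nothing gets encoded. The temporary file and its contents are made up for illustration and are not a real PDF.

```python
import base64
import os
import tempfile

# Hypothetical stand-in for a PDF: a small temporary binary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"%PDF-1.4\nfirst line\nsecond line\n")
    path = tmp.name

encoded = b""
with open(path, "rb") as f:
    lines = f.readlines()  # reads every line and leaves the file at EOF
    for _ in lines:
        # At EOF, readline() returns b"", and b64encode(b"") is also b"".
        encoded += base64.b64encode(f.readline())

os.remove(path)
print(len(lines), repr(encoded))
```

Encoding `line` instead of calling `f.readline()` again would still not match a whole-file encoding in general, because each line would be padded separately.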

Depending on the file size, you could just open the file in binary mode and encode its contents as Base64. Example:

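def pdf_encode(pdf_filename):
    # Requires `import base64`; str.encode("base64") only exists in Python 2.
    with open(pdf_filename, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")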
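
One way to sanity-check an encoder like this is a round trip: decode the Base64 string and compare it with the original bytes. The sketch below works on bytes in memory rather than a file; `pdf_encode_bytes` and the sample payload are hypothetical names and data used only for illustration.

```python
import base64

def pdf_encode_bytes(raw):
    # Hypothetical helper: Base64-encode bytes and return an ASCII string.
    return base64.b64encode(raw).decode("ascii")

# Made-up binary payload, not a real PDF.
sample = b"%PDF-1.4\n\x00\xff binary payload"
encoded = pdf_encode_bytes(sample)

# Decoding must give back exactly the bytes that went in.
assert base64.b64decode(encoded) == sample

# Base64 emits 4 characters for every 3 input bytes, rounded up.
assert len(encoded) == 4 * ((len(sample) + 2) // 3)
```

The length check also explains why a correct encoding is roughly a third larger than the file itself.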

Or, if the file is large, you may have to break the encoding into chunks. I did not check whether there is a module that does this, but it could be as simple as the example below:

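import base64

def chunk_24_read(pdf_filename):
    # Yield the file 3 bytes (24 bits) at a time; 3 bytes map to exactly 4 Base64 characters.
    with open(pdf_filename, "rb") as f:
        byte = f.read(3)
        while byte:
            yield byte
            byte = f.read(3)


def pdf_encode(pdf_filename):
    encoded = ""
    length = 0
    for data in chunk_24_read(pdf_filename):
        # Decode to str so the loop yields characters, not integers (Python 3).
        for char in base64.b64encode(data).decode("ascii"):
            # Start a new line after every 76 output characters.
            if length and length % 76 == 0:
                encoded += "\n"
                length = 0

            encoded += char
            length += 1
    return encoded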
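
The reason for reading in multiples of 3 bytes can be sketched as follows: chunks whose size is a multiple of 3 need no padding, so their encodings concatenate to the same string as a whole-file encoding, while other chunk sizes insert `=` padding mid-stream. `encode_in_chunks` and the sample bytes are hypothetical, chosen only to illustrate the point; a larger multiple of 3 would also mean fewer reads.

```python
import base64

def encode_in_chunks(raw, chunk_size):
    # Hypothetical helper: encode each chunk separately and join the results.
    parts = []
    for start in range(0, len(raw), chunk_size):
        parts.append(base64.b64encode(raw[start:start + chunk_size]))
    return b"".join(parts).decode("ascii")

sample = bytes(range(256)) * 4  # 1024 made-up bytes
whole = base64.b64encode(sample).decode("ascii")

# Chunk sizes that are multiples of 3 reproduce the whole-input encoding.
assert encode_in_chunks(sample, 3) == whole
assert encode_in_chunks(sample, 3 * 1024) == whole

# A chunk size that is not a multiple of 3 pads every chunk and differs.
assert encode_in_chunks(sample, 100) != whole
```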

1 Comment

Animesh Pandey Over a year ago

Sorry for the late reply. I tried this and it is now working. Thanks
