I need to index PDFs into Elasticsearch. For that, I need to convert the files to Base64. I will be using the Attachment Mapping.
I used the following Python code to convert the file to a Base64-encoded string:
from elasticsearch import Elasticsearch
import base64
import constants

def index_pdf(pdf_filename):
    encoded = ""
    with open(pdf_filename) as f:
        data = f.readlines()
        for line in data:
            encoded += base64.b64encode(f.readline())
    return encoded

if __name__ == "__main__":
    encoded_pdf = index_pdf("Test.pdf")
    INDEX_DSL = {
        "pdf_id": "1",
        "text": encoded_pdf
    }
    constants.ES_CLIENT.index(
        index=constants.INDEX_NAME,
        doc_type=constants.TYPE_NAME,
        body=INDEX_DSL,
        id="1"
    )
Creating the index and indexing the document both work fine. The only issue is that I don't think the file is being encoded correctly. When I encode the same file with online tools, I get a completely different string, and it is longer than the one I get from Python.
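For comparison, here is a minimal sketch of encoding the whole file in a single pass, which is presumably what the online tools do. It assumes the goal is one Base64 string of the file's raw bytes; the function name `encode_file_base64` is hypothetical and not part of any library. Two things differ from the loop above: the file is opened in binary mode, and `b64encode` is called once on the full contents. Note that after `f.readlines()` has consumed the file, each later `f.readline()` call returns an empty string, and encoding line by line would in any case not match a single encoding of the whole file.

```python
import base64


def encode_file_base64(path):
    """Return the Base64 encoding of the file's raw bytes as an ASCII string."""
    # Binary mode reads the bytes exactly as stored on disk
    # (no newline translation, no text decoding).
    with open(path, "rb") as f:
        raw = f.read()
    # Encode the full payload once; concatenating per-line encodings
    # would put padding in the middle and yield a different string.
    return base64.b64encode(raw).decode("ascii")
```

With this sketch, `encode_file_base64("Test.pdf")` could be used as the value of the `"text"` field, and decoding the result with `base64.b64decode` should give back the original bytes.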
Here is the PDF file.
I tried querying the text data as suggested in the plugin's documentation.
GET index_pdf/pdf/_search
{
  "query": {
    "match": {
      "text": "piece text"
    }
  }
}
It gives me zero hits. How should I go about this?