
I am trying to index PDF files in Elasticsearch 6.3.2 using Java. So far I have written the following code to save a PDF in ES. The code works and I am able to save the Base64-encoded string of my PDF in ES. I want to understand whether the approach I am following is correct, and whether there is a better way of doing it. Following is my code:

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Base64;

    import org.apache.commons.io.IOUtils;
    import org.apache.http.HttpEntity;
    import org.apache.http.entity.ContentType;
    import org.apache.http.nio.entity.NStringEntity;
    import org.json.JSONObject;

    // read the PDF and encode its raw bytes once; encoding the resulting
    // Base64 string a second time (as before) only inflates the payload
    try (InputStream inputStream = new FileInputStream(new File("mypdf.pdf"))) {
        byte[] fileByteStream = IOUtils.toByteArray(inputStream);
        String strEncoded = Base64.getEncoder().encodeToString(fileByteStream);

        JSONObject correspondenceNode = new JSONObject();
        correspondenceNode.put("data", strEncoded);

        String strJsonValues = correspondenceNode.toString();
        HttpEntity entity = new NStringEntity(strJsonValues, ContentType.APPLICATION_JSON);
        elasticrestClient.put("/2018/documents/1", entity);
    } catch (IOException e) {
        e.printStackTrace();
    }

Basically, what I am doing here is converting the PDF document into a Base64 string and saving it in ES; while reading, I convert it back.

The following is the code for decoding:

    String responseBody = elasticrestClient.get("/2018/documents/1");
    // some code to fetch the hits
    JSONObject h = hitsArray.getJSONObject(0);
    JSONObject source = h.getJSONObject("_source");
    String data = source.getString("data");
    // decode once, mirroring the single encode above
    byte[] decodedBytes = Base64.getDecoder().decode(data);

    try (FileOutputStream fos = new FileOutputStream("download.pdf")) {
        fos.write(decodedBytes);
    }

1 Answer

It might be correct to store BASE64 content in Elasticsearch this way, but a few pieces might be missing here:

  1. You are not "indexing" the PDF per se in Elasticsearch. If you want to do so, you need to define an ingest pipeline and use the ingest attachment plugin to extract the content from the PDF (see the sketch after this list).
  2. You did not mention the mapping you are using. If you "really" want to keep the binary content around, you might want to define the BASE64 field as a binary data type.
  3. It does not sound like a good idea to me to use Elasticsearch to store large blobs like this.
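
A minimal sketch of the first two points, reusing your elasticrestClient wrapper (its exact put() signature is an assumption on my side) and assuming the ingest-attachment plugin is installed:

    // 1. Define an ingest pipeline that extracts text and metadata from the
    //    Base64-encoded "data" field via the attachment processor.
    String pipelineJson = "{"
            + "\"description\": \"Extract PDF content and metadata\","
            + "\"processors\": [ { \"attachment\": { \"field\": \"data\" } } ]"
            + "}";
    elasticrestClient.put("/_ingest/pipeline/attachment",
            new NStringEntity(pipelineJson, ContentType.APPLICATION_JSON));

    // 2. Create the index with the raw field mapped as binary: stored,
    //    but neither indexed nor searchable.
    String indexJson = "{"
            + "\"mappings\": { \"documents\": { \"properties\": {"
            + "  \"data\": { \"type\": \"binary\" }"
            + "} } }"
            + "}";
    elasticrestClient.put("/2018",
            new NStringEntity(indexJson, ContentType.APPLICATION_JSON));

    // 3. Index the document through the pipeline so the processor runs
    //    ("entity" is the JSON document entity built as in your question).
    elasticrestClient.put("/2018/documents/1?pipeline=attachment", entity);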

Instead, I'd extract text and metadata and index that, plus a URL to the binary itself. Like:

    {
      "content": "Extracted text here",
      "meta": {
        // Meta data there
      },
      "url": "file://path/to/file"
    }
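
Here is a hedged sketch of that extraction step using Apache Tika (Tika itself, the file path, and the index/type names are my assumptions, not something from your question):

    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.sax.BodyContentHandler;

    try (InputStream stream = new FileInputStream("mypdf.pdf")) {
        // -1 disables the default write limit on the extracted text
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        new AutoDetectParser().parse(stream, handler, metadata);

        // copy Tika's metadata entries into the "meta" object
        JSONObject meta = new JSONObject();
        for (String name : metadata.names()) {
            meta.put(name, metadata.get(name));
        }

        JSONObject doc = new JSONObject();
        doc.put("content", handler.toString());        // extracted text
        doc.put("meta", meta);                         // extracted metadata
        doc.put("url", "file://path/to/mypdf.pdf");    // pointer to the binary

        elasticrestClient.put("/2018/documents/1",
                new NStringEntity(doc.toString(), ContentType.APPLICATION_JSON));
    } catch (Exception e) { // parse() also throws SAXException/TikaException
        e.printStackTrace();
    }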

You can also look at FSCrawler (including its code), which does basically that.


Comments

I do not want to do full-text search on the PDF. I just want to save and retrieve it based on other properties. I was just wondering if there is any alternative to what I was doing.
Then don't forget to mark it as binary in the mapping, as I wrote. But again, I'd not recommend storing big PDF files in Elasticsearch. Maybe a few kilobytes is OK, but I'd avoid storing megabytes of documents. Using a storage system like CouchDB, MapR, HDFS, S3... would be a better solution. But try it with Elasticsearch if it works for you...
Yes. The property which holds the Base64 value has type "type": "binary". The average document size we have is about 60~90 KB. Currently we save our documents in a JCR repository, but now we want to move to some other solution; that's why I am evaluating ES to check whether it can serve my purpose.
