
I have to implement full-text search over a PDF document using the Elasticsearch ingest attachment plugin. I get an empty hits array when I search for the word someword in the PDF document.

//Code for creating pipeline

PUT _ingest/pipeline/attachment
{
    "description" : "Extract attachment information",
    "processors" : [
      {
        "attachment" : {
        "field" : "data",
        "indexed_chars" : -1
        }
      }
    ]
}

//Code for indexing the document through the pipeline

PUT my_index/my_type/my_id?pipeline=attachment
{
   "filename" : "C:\\Users\\myname\\Desktop\\bh1.pdf",
   "title" : "Quick",
   "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="

}

//Code for searching the word in pdf 

GET /my_index/my_type/_search
{
    "query": {
        "match": {
            "data": {
                "query": "someword"
            }
        }
    }
}
  • If you open the PDF in a PDF viewer, are you able to search for "someword" in it and find a match? Commented Feb 8, 2017 at 14:39
  • @Alcanzar Yeah it searches for the word. Commented Feb 8, 2017 at 14:51
  • This looks like a duplicate of stackoverflow.com/questions/37861279/… Note that your PUT request sends a fixed "data" value rather than the contents of your file. You need to use curl or something similar to pass the actual file data. The "data" you are sending decodes to Lorem ipsum dolor sit amet, so if you search for Lorem, you'd find a result. Commented Feb 8, 2017 at 14:55
  • @Alcanzar I verified this by searching for Lorem, running the GET from the Kibana dashboard, but there are still no hits. Commented Feb 8, 2017 at 15:29
  • @Alcanzar Can you please explain the theory behind how Elasticsearch indexes unstructured data like PDF files? Commented Feb 8, 2017 at 15:58

1 Answer


When you index your document with the second command by passing the Base64 encoded content, the document then looks like this:

        {
           "filename": "C:\\Users\\myname\\Desktop\\bh1.pdf",
           "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
           "attachment": {
              "content_type": "application/rtf",
              "language": "ro",
              "content": "Lorem ipsum dolor sit amet",
              "content_length": 28
           },
           "title": "Quick"
        }

So your query needs to look into the attachment.content field, not the data field (which only serves to send the raw content during indexing).
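
If you want to check what was actually sent, you can decode the data value locally. This is just a small Python sketch using the standard library; the variable names are mine, and the Base64 string is copied from the indexing request in the question. It suggests that the payload is a short RTF snippet containing "Lorem ipsum dolor sit amet" (which matches the application/rtf content type above) and not the contents of bh1.pdf, so someword cannot match in any field.

```python
import base64

# The "data" value copied from the indexing request in the question.
data = "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="

# Decode the Base64 payload back into the raw bytes that were sent.
raw = base64.b64decode(data)

# The payload is plain ASCII here, so it can be shown as text.
decoded = raw.decode("ascii")
print(decoded)

# The word from the question is not part of the payload at all.
print("someword" in decoded)
```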

Modify your query to search attachment.content instead of data and it will work:

POST /my_index/my_type/_search
{
   "query": {
      "match": {
         "attachment.content": {
            "query": "lorem"
         }
      }
   }
}

PS: Use POST instead of GET when sending a payload
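
If you run the search from a script rather than the Kibana console, the same request could be built roughly like this. It is only a sketch using the Python standard library: the host http://localhost:9200 and the helper names build_content_query and build_search_request are assumptions for illustration, not part of your setup. The code only constructs the POST request; the commented lines show how it might be sent to a running cluster.

```python
import json
import urllib.request


def build_content_query(term):
    """Build a match query against the text extracted by the attachment processor."""
    return {"query": {"match": {"attachment.content": {"query": term}}}}


def build_search_request(base_url, index, doc_type, term):
    """Build (but do not send) a POST request carrying the query as a JSON body."""
    body = json.dumps(build_content_query(term)).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/{doc_type}/_search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local cluster address; adjust to your environment.
request = build_search_request("http://localhost:9200", "my_index", "my_type", "lorem")

# Sending it requires a running cluster:
# with urllib.request.urlopen(request) as response:
#     hits = json.load(response)["hits"]["hits"]
```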


5 Comments

Any idea how we can convert a PDF file to a Base64-encoded file using Elasticsearch?
I think this should be a new question as it is unrelated to this one.
Why did you use POST instead of GET? The latter works fine for me.
It depends on the HTTP client you're using, but you should NEVER send a payload via GET (= not HTTP compliant). See a more detailed example here: stackoverflow.com/questions/34795053/…
@Ashley afaik that is done before the data is sent to Elasticsearch, so you do the conversion with whatever method exists in the programming language you are using.
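
As an illustration of doing that conversion on the client side, here is a minimal Python sketch using only the standard library. The helper names (encode_bytes_for_attachment, encode_file_for_attachment, build_document) are hypothetical, and the document layout simply mirrors the filename, title, and data fields from the question.

```python
import base64


def encode_bytes_for_attachment(raw):
    """Return the Base64 text form of raw bytes, suitable for the "data" field."""
    return base64.b64encode(raw).decode("ascii")


def encode_file_for_attachment(path):
    """Read a file in binary mode and Base64-encode its full contents."""
    with open(path, "rb") as handle:
        return encode_bytes_for_attachment(handle.read())


def build_document(path, title):
    """Assemble a document body shaped like the one in the question."""
    return {
        "filename": path,
        "title": title,
        "data": encode_file_for_attachment(path),
    }


# Small in-memory check that encoding round-trips.
sample = encode_bytes_for_attachment(b"Lorem")
print(sample)
print(base64.b64decode(sample))
```

The dictionary returned by build_document would then be serialized to JSON and sent to the index with ?pipeline=attachment, as in the indexing request from the question.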
