I am trying to index Word/PDF documents. I wrote a Java utility that encodes my files as Base64, and now I want to index them in Elasticsearch.

The code below encodes a file as Base64 successfully, but I am not sure how to index the result in Elasticsearch.

Here is my Java code:

public static void main(String args[]) throws IOException {
    String filePath = "D:\\\\1SearchEngine\\testing.pdf";
    String encodedfile = null;
    RestHighLevelClient restHighLevelClient = null;
    File file = new File(filePath);
    try {
        FileInputStream fileInputStreamReader = new FileInputStream(file);
        byte[] bytes = new byte[(int) file.length()];
        fileInputStreamReader.read(bytes);
        encodedfile = new String(Base64.getEncoder().encodeToString(bytes));
        //System.out.println(encodedfile);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }

    try {
        if (restHighLevelClient != null) {
            restHighLevelClient.close();
        }
    } catch (final Exception e) {
        System.out.println("Error closing ElasticSearch client: ");
    }

    try {
        restHighLevelClient = new RestHighLevelClient(RestClient.builder(new HttpHost("localhost", 9200, "http"),
                new HttpHost("localhost", 9201, "http")));
    } catch (Exception e) {
        System.out.println(e.getMessage());
    }

    IndexRequest request = new IndexRequest( "attach_local", "doc", "103");   
    Map<String, Object> jsonMap = new HashMap<>();
    jsonMap.put("resume", "Karthikeyan");
    jsonMap.put("postDate", new Date());
    jsonMap.put("resume", encodedfile);
    try {
        IndexResponse response = restHighLevelClient.index(request);
    } catch(ElasticsearchException e) {
        if (e.status() == RestStatus.CONFLICT) {

        }
    }
}

I am using Elasticsearch 6.2.3, and I have installed the ingest-attachment plugin, version 6.3.0.

I am using the following dependency for the Elasticsearch client:

<dependency>
    <groupId>org.elasticsearch.client</groupId>
    <artifactId>elasticsearch-rest-high-level-client</artifactId>
    <version>6.1.2</version>
</dependency>

Here are my mapping and ingest pipeline definitions:

PUT attach_local
{
  "mappings" : {
    "doc" : {
      "properties" : {
        "attachment" : {
          "properties" : {
            "content" : {
              "type" : "binary"
            },
            "content_length" : {
              "type" : "long"
            },
            "content_type" : {
              "type" : "text"
            },
            "language" : {
              "type" : "text"
            }
          }
        },
        "resume" : {
          "type" : "text"
        }
      }
    }
  }
}

PUT _ingest/pipeline/attach_local
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "resume"
      }
    }
  ]
}

I now get the following error from Java when indexing the document:

Exception in thread "main" org.elasticsearch.action.ActionRequestValidationException: Validation Failed: 1: source is missing;2: content type is missing;
    at org.elasticsearch.action.ValidateActions.addValidationError(ValidateActions.java:26)
    at org.elasticsearch.action.index.IndexRequest.validate(IndexRequest.java:153)
    at org.elasticsearch.client.RestHighLevelClient.performRequest(RestHighLevelClient.java:436)
    at org.elasticsearch.client.RestHighLevelClient.performRequestAndParseEntity(RestHighLevelClient.java:429)
    at org.elasticsearch.client.RestHighLevelClient.index(RestHighLevelClient.java:312)
    at com.es.utility.DocumentIndex.main(DocumentIndex.java:82)
  • For one, the versions of ES and the plugin have to match. Then you'll have to set up a named pipeline in the index specification call, then use it in the indexing call. Post the code where you specify the mapping, along with your indexing code, and we can help you. Commented Jun 19, 2018 at 12:33
  • I have updated my mapping details. Commented Jun 19, 2018 at 12:43
  • I meant your index mapping. Anyway, you have to set the pipeline up first with a PUT _ingest/pipeline/attachment call before you can index docs. Also change PUT my_index/my_type/my_id to PUT employee/details/1?pipeline=attachment Commented Jun 19, 2018 at 15:52
  • If you are new, a good way to learn is to look at the ES GitHub repo and their unit tests; see here: github.com/elastic/elasticsearch/tree/… Commented Jun 19, 2018 at 16:07
  • The mapping is separate from the Java API. Just change your mapping and update it with curl. Then post your documents to Elasticsearch like this: elastic.co/guide/en/elasticsearch/client/java-api/current/… Commented Jun 20, 2018 at 11:03

1 Answer

I finally found the solution. Here is how to index a PDF/Word document in Elasticsearch via the Java API:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;
import org.apache.http.HttpHost;
import org.elasticsearch.ElasticsearchException;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.rest.RestStatus;

// The statements below go inside main(String[] args) throws IOException.

// Read the whole file and Base64-encode it; the attachment processor expects Base64 input.
String filePath = "D:\\\\1SearchEngine\\testing.pdf";
byte[] bytes = Files.readAllBytes(Paths.get(filePath));
String encodedfile = Base64.getEncoder().encodeToString(bytes);

Map<String, Object> jsonMap = new HashMap<>();
jsonMap.put("Name", "Karthikeyan");
jsonMap.put("postDate", new Date());
jsonMap.put("resume", encodedfile);

// Attach the source and name the ingest pipeline.
// Without .source(...), validation fails with "source is missing".
IndexRequest request = new IndexRequest("attach_local", "doc", "104")
        .source(jsonMap)
        .setPipeline("attach_local");

// try-with-resources closes the client once indexing is done.
try (RestHighLevelClient restHighLevelClient = new RestHighLevelClient(
        RestClient.builder(new HttpHost("localhost", 9200, "http"),
                new HttpHost("localhost", 9201, "http")))) {
    IndexResponse response = restHighLevelClient.index(request);
    System.out.println(response.getResult());
} catch (ElasticsearchException e) {
    if (e.status() == RestStatus.CONFLICT) {
        System.out.println("Version conflict: " + e.getMessage());
    }
}
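
If you want to check the encoding step on its own before involving Elasticsearch, something like the following might help. It is only a sketch using the JDK: the class name FileEncoder and the method encodeFile are names I made up for illustration, and the temporary file stands in for a real PDF.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

public class FileEncoder {
    // Hypothetical helper: reads the entire file into memory and returns its Base64 form.
    // Files.readAllBytes reads the whole file, unlike a single InputStream.read(byte[]) call,
    // which is allowed to return before the buffer is full.
    public static String encodeFile(Path path) throws IOException {
        byte[] bytes = Files.readAllBytes(path);
        return Base64.getEncoder().encodeToString(bytes);
    }

    public static void main(String[] args) throws IOException {
        // A three-byte temporary file stands in for a real document.
        Path tmp = Files.createTempFile("sample", ".bin");
        Files.write(tmp, new byte[] {1, 2, 3});
        System.out.println(encodeFile(tmp)); // prints "AQID"
        Files.delete(tmp);
    }
}
```

The returned string is what goes into the "resume" field of the document. Note that this loads the whole file into memory, so it is probably only suitable for reasonably small documents.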

Mapping details:

PUT attach_local
{
  "mappings" : {
    "doc" : {
      "properties" : {
        "attachment" : {
          "properties" : {
            "content" : {
              "type" : "binary"
            },
            "content_length" : {
              "type" : "long"
            },
            "content_type" : {
              "type" : "text"
            },
            "language" : {
              "type" : "text"
            }
          }
        },
        "resume" : {
          "type" : "text"
        }
      }
    }
  }
}


PUT _ingest/pipeline/attach_local
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "resume"
      }
    }
  ]
}
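
Once a document has gone through the pipeline, the extracted text ends up in attachment.content. A rough sketch of searching it from Java with only the JDK is below. Everything here is an assumption for illustration: the class AttachmentSearch and its methods are hypothetical names, the node is assumed to be at http://localhost:9200, and attachment.content is assumed to be mapped as text. With the mapping above it is declared binary, and binary fields are not searchable, so the mapping would need to change first.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class AttachmentSearch {
    // Builds a match query body. Only backslashes and double quotes are escaped,
    // which is a simplification; a JSON library would be safer for arbitrary input.
    public static String buildMatchQuery(String field, String text) {
        String escaped = text.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"query\":{\"match\":{\"" + field + "\":\"" + escaped + "\"}}}";
    }

    // POSTs the query to <baseUrl>/<index>/_search and returns the raw JSON response.
    public static String search(String baseUrl, String index, String text) throws IOException {
        URL url = new URL(baseUrl + "/" + index + "/_search");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        try (OutputStream out = conn.getOutputStream()) {
            String body = buildMatchQuery("attachment.content", text);
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }
        // Read the whole response stream as a single string.
        try (InputStream in = conn.getInputStream();
             Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
            return scanner.hasNext() ? scanner.next() : "";
        }
    }
}
```

For example, AttachmentSearch.search("http://localhost:9200", "attach_local", "java") would send a match query against attachment.content and return whatever JSON the node responds with.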

2 Comments

Good deal amigo
Thanks. The remaining challenge is searching on the content of the PDF.
