I have a 3-node Elasticsearch (version 6.2.4) dev cluster. All settings are at their defaults, including the shard count. I am trying to run searches that return millions of records, so I decided to use Scroll with the Java High-Level REST Client. My code looks like this:

MatchQueryBuilder matchQueryBuilder = new MatchQueryBuilder("galaxy", galaxyName);

SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
searchSourceBuilder.query(matchQueryBuilder);
searchSourceBuilder.size(scrollSize);

SearchRequest searchRequest = new SearchRequest();

searchRequest.indices(galaxyIndexName);
searchRequest.source(searchSourceBuilder);
searchRequest.scroll(TimeValue.timeValueSeconds(scrollTimeValue));

SearchResponse searchResponse = restHighLevelClient.search(searchRequest);

StarCollection starCollection = new StarCollection();

boolean moreResultsExist = true;

int resultCount = 0;

while (moreResultsExist) {

    String scrollId = searchResponse.getScrollId();

    for (SearchHit searchHit : searchResponse.getHits()) {

        Star star = objectMapper.readValue(searchHit.getSourceAsString(), Star.class);
        resultCount++;

        starCollection.addContentsItem(star);
    }

    if (resultCount >= searchResponse.getHits().getTotalHits()) {

        moreResultsExist = false;

        ClearScrollRequest request = new ClearScrollRequest();
        request.addScrollId(scrollId);
        restHighLevelClient.clearScroll(request);
    }

    SearchScrollRequest scrollRequest = new SearchScrollRequest(scrollId);
    scrollRequest.scroll(TimeValue.timeValueSeconds(scrollTimeValue));
    searchResponse = restHighLevelClient.searchScroll(scrollRequest);
}

Now, when I run a search that returns 1.5 million documents, it takes forever and my method never finishes. Sometimes I get an exception like:

org.elasticsearch.ElasticsearchException: Elasticsearch exception [type=search_context_missing_exception, reason=No search context found for id

So, I have the following questions:

  1. Is this the right way to use Scroll?
  2. What's the best way to run searches that return millions of records?
  • Maybe use a smaller scrollSize and also possibly increase the scrollTimeValue. Commented Jun 20, 2018 at 5:22
  • I tried with different scroll sizes, like 100, 500, 1000. It takes 1.5 seconds to get 100 documents which is still high. Commented Jun 20, 2018 at 5:28
  • Hard to tell without knowing what kind of boxes you have. Commented Jun 20, 2018 at 5:43

1 Answer

Is this the right way to use Scroll?

Yes. Scroll is the intended way to retrieve large result sets.

What's the best way to do searches which return millions of records?

First, ask why you need so many records. Are you exporting your documents? If not, retrieving that many results rarely makes sense. You can cap the number of results with the terminate_after setting on the search request (the limit applies per shard).

But if you really need all those records, break the query into smaller parts. For example, if the records have a date field, filter on it and iterate over it in smaller spans (for example, 5-minute steps).
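As a rough illustration of the "smaller spans" idea, here is a minimal sketch that splits a time range into half-open 5-minute windows using only java.time. The class and method names (TimeWindowSketch, Window, split) are made up for this example, and it assumes your documents have a date field you can filter on; each window would then become a range filter on that field in a separate search.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Elasticsearch client.
final class TimeWindowSketch {

    /** Half-open interval [from, to). */
    static final class Window {
        final Instant from;
        final Instant to;

        Window(Instant from, Instant to) {
            this.from = from;
            this.to = to;
        }
    }

    /** Splits [start, end) into consecutive windows of at most `step` length. */
    static List<Window> split(Instant start, Instant end, Duration step) {
        if (step.isZero() || step.isNegative()) {
            throw new IllegalArgumentException("step must be positive");
        }
        List<Window> windows = new ArrayList<>();
        Instant cursor = start;
        while (cursor.isBefore(end)) {
            Instant next = cursor.plus(step);
            if (next.isAfter(end)) {
                next = end; // the last window may be shorter than step
            }
            windows.add(new Window(cursor, next));
            cursor = next;
        }
        return windows;
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2018-06-20T00:00:00Z");
        Instant end = Instant.parse("2018-06-20T00:12:00Z");
        for (Window w : split(start, end, Duration.ofMinutes(5))) {
            // Assumption: each window is used as "from <= dateField < to" in its own query.
            System.out.println(w.from + " .. " + w.to);
        }
    }
}
```

Half-open windows avoid fetching a document twice when its timestamp falls exactly on a boundary.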

Finally, if the delay between two scroll requests exceeds scrollTimeValue, the search context expires and you get a search_context_missing_exception error.
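One more thing worth checking in the question's loop: once resultCount reaches the total, it calls clearScroll and then still issues one more searchScroll with the cleared ID before the while condition is re-evaluated, which is a likely source of that exception on its own. Below is a minimal sketch of a loop shape that avoids this: it stops when a page comes back empty and clears the scroll exactly once, in a finally block. ScrollClient, Page and FakeClient are hypothetical stand-ins so the example runs without a cluster; the comments say which client call from the question each one represents.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

final class ScrollLoopSketch {

    /** Hypothetical stand-in for one response: a scroll ID plus a page of hits. */
    static final class Page {
        final String scrollId;
        final List<String> hits;

        Page(String scrollId, List<String> hits) {
            this.scrollId = scrollId;
            this.hits = hits;
        }
    }

    /** Hypothetical stand-in for the three client calls used in the question. */
    interface ScrollClient {
        Page search();                // stands in for restHighLevelClient.search(searchRequest)
        Page scroll(String scrollId); // stands in for restHighLevelClient.searchScroll(scrollRequest)
        void clear(String scrollId);  // stands in for restHighLevelClient.clearScroll(request)
    }

    /** Reads every page, hands each hit to sink, and returns the number of hits seen. */
    static int drain(ScrollClient client, Consumer<String> sink) {
        Page page = client.search();
        String scrollId = page.scrollId;
        int count = 0;
        try {
            // An empty page means the scroll is exhausted; no total-hits bookkeeping needed.
            while (!page.hits.isEmpty()) {
                for (String hit : page.hits) {
                    sink.accept(hit);
                    count++;
                }
                page = client.scroll(scrollId);
                scrollId = page.scrollId; // always use the most recent scroll ID
            }
        } finally {
            client.clear(scrollId); // runs once, after the last scroll call
        }
        return count;
    }

    /** In-memory fake used only to exercise the loop. */
    static final class FakeClient implements ScrollClient {
        private final Iterator<List<String>> pages;
        boolean cleared = false;
        int scrollCallsAfterClear = 0;

        FakeClient(List<List<String>> pages) {
            this.pages = pages.iterator();
        }

        private Page next() {
            List<String> hits = pages.hasNext() ? pages.next() : Collections.<String>emptyList();
            return new Page("scroll-1", hits);
        }

        public Page search() {
            return next();
        }

        public Page scroll(String scrollId) {
            if (cleared) {
                scrollCallsAfterClear++;
            }
            return next();
        }

        public void clear(String scrollId) {
            cleared = true;
        }
    }

    public static void main(String[] args) {
        FakeClient client = new FakeClient(Arrays.asList(
                Arrays.asList("a", "b"), Arrays.asList("c")));
        List<String> out = new ArrayList<>();
        int n = drain(client, out::add);
        System.out.println(n + " hits, cleared=" + client.cleared);
    }
}
```

In the real loop, whatever happens to each hit (here the sink) also has to finish within the scroll timeout, since the timeout is renewed only on the next scroll request.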

1 Comment

Actually, my use case demands that kind of data in the response, and I cannot add any filter. But your suggestions gave me food for thought, and I came up with some strategies: 1) Dividing the index: the index itself can be broken down into multiple time-based indices, and search queries can be fired at all of them simultaneously using threads. 2) Returning paginated results: I can return a batch of results and a page ID in the response, and the consumer can request the next batch with that page ID.