I am using the Elasticsearch scroll API as documented here. As I understand it, each scroll request takes as input the scroll ID returned in the previous response. Once all the chunks have been scrolled, the last scroll ID needs to be cleared.
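
For context, the flow I mean looks roughly like the sketch below. It assumes a client object with `search`, `scroll`, and `clear_scroll` methods, as the official Python client exposes. `iter_scroll` is a name made up for illustration, and the keyword arguments forwarded to `search` depend on the client version.

```python
from typing import Any, Dict, Iterator, List


def iter_scroll(
    client: Any, keep_alive: str = "1m", **search_kwargs: Any
) -> Iterator[List[Dict[str, Any]]]:
    """Yield each chunk of hits, then clear the scroll context.

    `client` is assumed to expose search/scroll/clear_scroll like the
    official Elasticsearch Python client; this helper is hypothetical.
    """
    # The first request opens the scroll context and returns the first chunk.
    response = client.search(scroll=keep_alive, **search_kwargs)
    scroll_id = response["_scroll_id"]
    try:
        while response["hits"]["hits"]:
            yield response["hits"]["hits"]
            # Each follow-up request passes the id from the previous response.
            response = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            # The id may change between calls, so always keep the latest one.
            scroll_id = response["_scroll_id"]
    finally:
        # Release the server-side context instead of waiting for the keep-alive to expire.
        client.clear_scroll(scroll_id=scroll_id)
```

In my case the call would be something like `for chunk in iter_scroll(es, index="my-index", size=5000): ...`, where `es` and `my-index` are placeholders.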

Use Case

  • Consume a big data set (on the order of 0.1 to 2 million documents) matching a given query, in chunks of 5000. Individual chunk query performance is good.
  • The data is most likely to be queried from a single index and shard.
  • The data being queried never gets updated in real time.

Questions / Concerns

  • How does Elasticsearch maintain the scroll session or state internally? Are all the matching documents (or their IDs) stored or parked aside in memory and returned in subsequent scroll requests? Should I be concerned about the RAM/CPU currently allocated to the cluster?
  • Is there any performance penalty when using the scroll API? I understand that by default at most 500 scroll sessions can be open at a time. This default is acceptable in my case, as the number of requests per second is quite low.

1 Answer

During performance testing in my environment, with the scroll size set to 7,000, I observed GC pauses of up to 1.5 minutes and high CPU usage. (Obviously, this is also affected by the cluster configuration and the type of query that was run.)

From the documentation and an informative blog:

The results that are returned from a scroll request reflect the state of the data stream or index at the time that the initial search request was made, like a snapshot in time. Subsequent changes to documents (index, update or delete) will only affect later search requests.

The data matching the search request passed in the first scroll API call is kept aside in memory. Quoting from the blog mentioned above:

As I mentioned above, scrolling works by taking a "snapshot" of your data and then serving it to you in pieces. This means that Elasticsearch must "hold" all of that in memory.* Having to hold the scroll "snapshot" in memory while doing a lot of data updates can cause your memory to bloat. Memory bloat can lead to issues if you don't have a large surplus of memory to work with.

Short answer: yes, do consider heap and CPU usage when using the scroll API. Factors such as requests per second and the optimal scroll size should be weighed against the given cluster configuration.
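
One way to keep an eye on this is to watch how many search contexts each node is holding open. The sketch below is only an illustration: it assumes the response body of the nodes stats API (for example `GET /_nodes/stats/indices/search`) has already been parsed into a dict, and `open_search_contexts` is a hypothetical helper name, not part of any client library.

```python
from typing import Any, Dict


def open_search_contexts(node_stats: Dict[str, Any]) -> Dict[str, int]:
    """Map node name -> number of open search contexts.

    `node_stats` is assumed to be the parsed body of a nodes stats
    response; missing fields are treated as zero open contexts.
    """
    result: Dict[str, int] = {}
    for node_id, node in node_stats.get("nodes", {}).items():
        search = node.get("indices", {}).get("search", {})
        # Fall back to the node id when the response carries no node name.
        result[node.get("name", node_id)] = search.get("open_contexts", 0)
    return result
```

A count that keeps climbing during a test run would suggest that scroll contexts are not being cleared, or that the keep-alive is longer than it needs to be.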
