I am using the Elasticsearch scroll API as documented here. It's well understood that each scroll request takes as input the scroll id returned in the previous response. Once all the chunks have been scrolled, the last scroll id needs to be cleared.
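To make the flow concrete, here is a minimal sketch of that loop. It assumes a client object exposing `search`, `scroll`, and `clear_scroll` with the keyword arguments used by recent versions of the official Python client; the function name `scroll_chunks`, its parameters, and the `1m` keep-alive are illustrative choices, not part of my actual setup.

```python
from typing import Any, Dict, Iterator, List


def scroll_chunks(
    es: Any,
    index: str,
    query: Dict[str, Any],
    chunk_size: int = 5000,
    keep_alive: str = "1m",
) -> Iterator[List[Dict[str, Any]]]:
    """Yield one list of hits per scroll page (hypothetical helper)."""
    # The initial search opens the scroll context and returns the first page.
    resp = es.search(index=index, query=query, size=chunk_size, scroll=keep_alive)
    scroll_id = resp["_scroll_id"]
    try:
        while True:
            hits = resp["hits"]["hits"]
            if not hits:
                break  # an empty page means the result set is exhausted
            yield hits
            # Each scroll request passes the id from the previous response.
            resp = es.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = resp["_scroll_id"]
    finally:
        # Release the server-side context even if the consumer stops early.
        es.clear_scroll(scroll_id=scroll_id)
```

The `try`/`finally` is there so the scroll context is cleared rather than left to expire after the keep-alive, which matters if many consumers abandon a scroll halfway.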
Use Case
- Consume a big data set (on the order of 0.1 to 2 million documents) matching a given query in chunks of 5000. Individual chunk query performance is good.
- Data is most likely to be queried from a single index and shard.
- The data which is being queried never gets updated in real time.
Questions / Concerns
- How does Elasticsearch maintain the scroll session or state internally? Are all the matching documents (or their ids) stored in memory and returned in subsequent scroll requests? Should I be concerned about the RAM/CPU currently allocated to the cluster?
- Is there any performance penalty for using the scroll API? I understand that the default maximum number of open scroll sessions at a time is 500. This default is acceptable in my case, as the number of requests per second is quite low.
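One way I could watch for pressure from open scrolls is the node stats API. The sketch below assumes the response of the nodes stats endpoint (search section) has the usual `nodes -> <node id> -> indices -> search -> scroll_current` shape; the helper name `open_scroll_contexts` is hypothetical.

```python
from typing import Any, Dict


def open_scroll_contexts(nodes_stats: Dict[str, Any]) -> Dict[str, int]:
    """Map each node id to its number of currently open scroll contexts."""
    return {
        node_id: node["indices"]["search"]["scroll_current"]
        for node_id, node in nodes_stats["nodes"].items()
    }
```

With the official Python client this would presumably be fed from something like `es.nodes.stats(metric="indices", index_metric="search")`, and the per-node counts compared against the 500 limit mentioned above.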