
I am writing a service that will be creating and managing user records — 100+ million of them. For each new user, the service will generate a unique user id and write the record to the database. The database is sharded based on the unique user id that gets generated.

Each user record has several fields. One of the requirements is that the service be able to search whether a user exists with a matching field value, so those fields are declared as indexes in the database schema.

However, since the database is sharded on the primary key (the unique user id), I will need to search all shards to find a user record that matches a particular column.
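For context, the routing is just a hash of the generated id, so a lookup by any other field has no routing key and must fan out to every shard. A minimal sketch of what I mean (the shard count, hash choice, and lookup helper are all placeholders, not my actual implementation):

```python
import hashlib

NUM_SHARDS = 16  # placeholder shard count


def shard_for(user_id: str) -> int:
    """Route a record to a shard by hashing its unique user id."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


def find_by_email(shards, email):
    """Without the user id there is no routing key, so every shard
    must be queried (scatter-gather) until a match is found."""
    for shard in shards:
        record = shard.get(email)  # stands in for an indexed-column lookup
        if record is not None:
            return record
    return None


# A lookup by user id hits exactly one shard:
print(shard_for("user-42"))
```

A lookup by user id touches one shard; a lookup by email touches up to NUM_SHARDS, which is the cost I am trying to avoid.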

So, to make that lookup fast, one thing I am thinking of doing is setting up an ElasticSearch cluster. The service will write to the ES cluster every time it creates a new user record, and the ES cluster will index the user record based on the relevant fields.
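Concretely, the dual write I have in mind would look something like this. This is only a sketch: the field names are made up, the mapping uses the non-analyzed string convention of ES versions current at the time for exact matching, and real code would also need retry/repair logic so the two writes don't drift apart:

```python
import json

# Placeholder names for the five fields that need exact-match lookup.
LOOKUP_FIELDS = ["email", "phone", "username", "ssn_hash", "device_id"]


def user_mapping():
    """Mapping for the lookup fields: non-analyzed strings, so values
    are indexed verbatim and exact-match queries work as expected."""
    props = {f: {"type": "string", "index": "not_analyzed"}
             for f in LOOKUP_FIELDS}
    return {"mappings": {"user": {"properties": props}}}


def index_request(user):
    """Body the service would send to ES for a new user, right after
    the database write (the second half of the dual write)."""
    return {f: user[f] for f in LOOKUP_FIELDS if f in user}


print(json.dumps(user_mapping(), indent=2))
```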

My questions are:

-- What kind of performance can I expect from ES here? Assume I have 100+ million user records where 5 columns of each record need to be indexed. I know it depends on the hardware configuration as well, but please assume well-tuned hardware.

-- Here I am trying to use ES as a memcache alternative that supports lookups by multiple keys. So I want the whole dataset to be in memory, and it does not need to be durable. Is ES the right tool for that?

Any comments/recommendations based on experience with ElasticSearch for large datasets are very much appreciated.

3 Comments

  • I think you can use ES for this. 100M records is a normal number for ES. My data is around 80M records with 8 columns indexed, and it works fine. In ES everything is indexed and will be loaded into memory for faster searching. I suggest you read the documentation/presentations on elasticsearch.org and join the community to research how to implement it. Commented Dec 13, 2013 at 3:43
  • Hello Duc, does all your data reside in memory? What kind of read performance do you get? Also, what is your reason for using ES? Commented Dec 13, 2013 at 21:16
  • It depends on your query and your purpose. I let it cache in memory because I focus on performance, and I use it mainly for searching data. Commented Dec 16, 2013 at 14:41

1 Answer


ES is not explicitly designed to run completely in memory - you normally wouldn't want to do that with large unbounded datasets in a Java application (though you can using off-heap memory). Rather, it'll cache what it can and rely on the OS's disk cache for the rest.

100+ million records shouldn't be an issue at all, even on a single machine. I run an index consisting of 15 million records of ~100 small fields (no large text fields), amounting to 65GB of data on disk, on a single machine. Fairly complex queries that just return id/score execute in less than 500ms; queries that require loading the documents return in 1-1.5 seconds on a warmed-up VM against a single SSD. I tend to give the JVM 12-16GB of memory - any more and I find it's just better to scale out via a cluster than a single huge VM.
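For the exact-match lookups you describe, the cheap path is a term filter that skips scoring and returns only ids. A rough sketch of such a request body, assuming a hypothetical non-analyzed email field (the filtered-query syntax matches the ES versions current at the time):

```python
def lookup_query(field, value):
    """Build an exact-match lookup request body. Term filters bypass
    scoring and are cacheable, so this is far cheaper than the complex
    scored queries behind my 500ms figure."""
    return {
        "query": {"filtered": {"filter": {"term": {field: value}}}},
        "fields": [],  # return only document ids, skipping document loading
        "size": 1,
    }


print(lookup_query("email", "a@x.com"))
```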


2 Comments

Hey Bruce, thanks a lot for the reply. 500ms actually sounds like a large value. Can I control how ES shards my data? I tried to find a tech doc on how ES keeps its indexes but couldn't find one. Ideally I would not like data to reside outside memory, as that would require disk IO, swapping, etc. I am trying to use ES as a multi-key-value memcache.
500ms is large - but my query is large and complex too. Simpler queries will be faster, and plain GET-type requests are very, very quick. ES can use a memory store if you have the available memory (elasticsearch.org/guide/en/elasticsearch/reference/current/…)
