6

I am working with a team that uses two data sources:

  1. MSSQL as the primary data source for transactional calls.
  2. ES as a backup/read-only source of truth for viewing the data.

For example, when I place an order, the order is inserted into the DB, and then a RabbitMQ listener/batch job synchronizes the data from the DB to ES.

Somehow this system fails with even just a million records. By "fails" I mean that records are not updated in ES in a timely fashion. For example, say I create a coupon: the coupon is generated in the DB, and the customer tries to redeem it immediately, but ES doesn't have the information about the coupon yet, so the redemption fails. Of course there are options such as RabbitMQ's priority queues, but the questions I have are very basic.

I have a few questions, which I put to the team, and I still haven't received satisfactory answers:

  1. What minimum load should be expected before using Elasticsearch, and isn't it overkill if we have just 1M records?
  2. Does it really make sense to use ES as the source of truth for real-time data?
  3. Is ES designed to handle relational-like data, and data that gets continuously updated? AFAIK such search-optimized databases are the write-once, read-many kind.
  4. If we are doing this to handle load, how is it different from making a cluster of MSSQL databases the source of truth and using ES just for analytics?

The main question I have in mind is: how can we optimize this architecture so that it scales better?

PS: When I asked about minimum load, what I really meant was: what is the number of records/transactions beyond which we can say ES will be faster than a conventional relational database? Or is there no such threshold at all?

2 Answers

3
  1. What minimum load should be expected before using Elasticsearch, and isn't it overkill if we have just 1M records?

Answer: The load it can handle depends on the capabilities of your servers.

  2. Does it really make sense to use ES as the source of truth for real-time data?

From ES website: "Elasticsearch is a distributed, RESTful search and analytics engine capable of solving a growing number of use cases. As the heart of the Elastic Stack, it centrally stores your data so you can discover the expected and uncover the unexpected."

So yes, it can be your source of truth. That said, it is "eventually consistent", which raises the question of how soon counts as "real-time", and there is no way to answer that without testing and measuring your system.
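One way to do that measuring is to time how long a freshly written record takes to become visible on the read side. The following is only a minimal sketch: `measure_visibility_lag` and its parameters are hypothetical names, and `write` and `is_visible` are placeholders you would wire to your own DB insert and ES lookup.

```python
import time
from typing import Callable, Optional


def measure_visibility_lag(
    write: Callable[[], None],
    is_visible: Callable[[], bool],
    timeout_s: float = 10.0,
    poll_interval_s: float = 0.05,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[float]:
    """Return the seconds between write() returning and the first time
    is_visible() is true, or None if the timeout elapses first.

    write and is_visible are assumptions: supply your own functions,
    e.g. "insert coupon into MSSQL" and "look the coupon up in ES".
    """
    write()
    start = clock()
    while True:
        if is_visible():
            return clock() - start
        if clock() - start >= timeout_s:
            return None
        sleep(poll_interval_s)
```

The result is only accurate to within `poll_interval_s`. Running this many times under realistic load and looking at the distribution (not just the average) tells you whether the lag fits your definition of "real-time".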

  3. Is ES designed to handle relational-like data, and data that gets continuously updated? AFAIK such search-optimized databases are the write-once, read-many kind.

That's a good point: like any eventually consistent system, it is indeed NOT optimized for a series of modifications to the same data!
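If the same record is modified several times between two sync runs, one way to reduce the write pressure on ES is to collapse those modifications before indexing. This is a rough sketch, not part of any library: `coalesce_updates` is a hypothetical helper, and it assumes each event carries a full snapshot of the document and that events arrive in order.

```python
from typing import Any, Dict, Iterable, List, Tuple

Event = Tuple[str, Dict[str, Any]]  # (document id, full document snapshot)


def coalesce_updates(events: Iterable[Event]) -> List[Event]:
    """Keep only the most recent snapshot per document id.

    Assumption: events are ordered oldest to newest and each one holds
    the whole document, so earlier snapshots can be safely discarded.
    """
    latest: Dict[str, Dict[str, Any]] = {}
    for doc_id, doc in events:
        latest[doc_id] = doc  # a later event overwrites an earlier one
    return list(latest.items())
```

For example, three events touching two orders collapse into two index operations, which could then be sent to ES in a single bulk request.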

  4. If we are doing this to handle load, how is it different from making a cluster of MSSQL databases the source of truth and using ES just for analytics?

It won't. Bear in mind that ES, as quoted above, was built to accommodate the requirements of search and analysis. If that's not what you intend to do with it, you should consider another tool. Use the right tool for the right job.


4 Comments

Regarding 2: The official stance of the ES folks is to never use ES as a primary source of truth, since ES is not a database but a search engine. Just my 2 cents...
@Val Agreed! ES used to be less reliable and not so partition-tolerant (we experienced splits in the cluster a few times). But I hear there have been some great improvements over the past year. That said, I agree that it was not meant to be a datastore, and ideally one should use the right tool for the right job (which is why I wrote that under #4).
ES does not have WAL so it cannot be a source of truth system.
@user1870400 no argument about that, see the last 3 lines of the answer as well as the comment above your comment ;)
2

1) There isn't a minimum expected load. You can have 2 small nodes (master & data) with 2 shards per index (1 primary + 1 replica).
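As an illustration, the request body for creating such an index could be built like this. The `number_of_shards` and `number_of_replicas` settings are real Elasticsearch index settings; the helper names are made up for this sketch, and you would send the body with a `PUT` to your own index name.

```python
import json
from typing import Any, Dict


def index_settings(primaries: int = 1, replicas: int = 1) -> Dict[str, Any]:
    """Body for a create-index request with the given shard layout."""
    return {
        "settings": {
            "index": {
                "number_of_shards": primaries,    # primary shards
                "number_of_replicas": replicas,   # replica copies per primary
            }
        }
    }


def total_shards(body: Dict[str, Any]) -> int:
    """Total shard copies the cluster has to host for this index."""
    s = body["settings"]["index"]
    return s["number_of_shards"] * (1 + s["number_of_replicas"])


if __name__ == "__main__":
    body = index_settings()
    print(json.dumps(body, indent=2))
    print("total shards:", total_shards(body))
```

With one primary and one replica, each of the two nodes ends up holding one copy of the data.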

You can also split your data into multiple indices if it makes sense from a functional point of view (i.e. how data is searched).

2) In my experience, the main benefits you get from Elasticsearch are:

  • Near linear scalability.
  • Lucene-based text search.
  • Many ways to put your data to work: RESTful query API, Kibana...
  • Easy administration (compared to your typical RDBMS).

If your project doesn't get these benefits, then most probably ES is not the right tool for the job.

3) Elasticsearch doesn't like data that is updated frequently. Its best use case is read-only (or rarely changing) data.

Anyway, this doesn't explain the high latency you are getting; your problem must lie in RabbitMQ or the network.
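One ES-side factor that may still be worth ruling out: a newly indexed document only becomes searchable after the next index refresh, which by default happens about once per second. The index API accepts `refresh=wait_for`, which makes the call return only once the document is searchable. The sketch below assumes the official Python client in a version whose `index()` accepts a `document` argument (older versions use `body` instead); the function name and the `coupons` index are hypothetical.

```python
from typing import Any, Dict


def index_coupon_and_wait(
    es: Any,
    coupon: Dict[str, Any],
    index_name: str = "coupons",  # hypothetical index name
) -> Any:
    """Index one coupon and block until it is visible to searches.

    es is assumed to be an Elasticsearch client instance created
    elsewhere; refresh="wait_for" trades write latency for visibility.
    """
    return es.index(
        index=index_name,
        id=coupon["code"],
        document=coupon,
        refresh="wait_for",
    )
```

This does not help if the delay comes from the queue or the batch job, which is why measuring each hop separately matters.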

4) Indeed, that's what I would do: MSSQL cluster for application data and ES for analytics.
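Applied to the coupon example from the question, that split could look roughly like this. It is a toy sketch with in-memory dictionaries standing in for MSSQL and ES; `CouponService` and all of its methods are hypothetical names, not an existing API.

```python
from typing import Dict, List


class CouponService:
    """Transactional reads and writes hit the primary store; only
    reporting queries read from the (possibly stale) search index."""

    def __init__(self) -> None:
        self.primary: Dict[str, dict] = {}       # stand-in for MSSQL
        self.search_index: Dict[str, dict] = {}  # stand-in for ES

    def create(self, code: str, discount: int) -> None:
        self.primary[code] = {"code": code, "discount": discount, "redeemed": False}

    def sync(self) -> None:
        # Stand-in for the RabbitMQ listener / batch job.
        for code, row in self.primary.items():
            self.search_index[code] = dict(row)

    def redeem(self, code: str) -> bool:
        # Reads the primary store, so it works even before sync() has run.
        row = self.primary.get(code)
        if row is None or row["redeemed"]:
            return False
        row["redeemed"] = True
        return True

    def report(self, min_discount: int) -> List[str]:
        # Analytics can tolerate lag, so it reads the search index.
        return sorted(
            code for code, row in self.search_index.items()
            if row["discount"] >= min_discount
        )
```

With this routing, a coupon can be redeemed immediately after creation, and the sync delay only affects how fresh the reports are.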
