
I have a set of databases distributed across multiple locations in the network, and, for example, one client that needs to store some data in those databases.

I need to make sure my data will always be stored.

I can't organize a replica set with sync/async replication, as that would force me to connect to a single master, which is a point of failure. Instead, I send data from the client to all databases I know about. Any one database can fail to store a write, so I rely on the writes to the other databases. In the end, the databases hold different but overlapping data sets (e.g. DB1 -> [1, 2, 3], DB2 -> [1, 3], DB3 -> [2, 3, 4]).

How can I get consistent data when reading from these DBs? What techniques should I apply on the client that writes data and on the client that reads it, so the reader can merge the data sets successfully (ending up with [1, 2, 3, 4])? A minimal sketch of the setup follows below.
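To make this concrete, here is a minimal sketch of what my setup looks like (Python, with a hypothetical in-memory Database standing in for the real stores; all names are illustrative):

    from typing import Iterable, Set

    class Database:
        """Hypothetical in-memory stand-in for one remote database."""
        def __init__(self, name: str):
            self.name = name
            self.items: Set[int] = set()
            self.available = True

        def write(self, item: int) -> None:
            if not self.available:
                raise ConnectionError(f"{self.name} is unreachable")
            self.items.add(item)

        def read_all(self) -> Set[int]:
            if not self.available:
                raise ConnectionError(f"{self.name} is unreachable")
            return set(self.items)

    def fan_out_write(dbs: Iterable[Database], item: int) -> int:
        """Send the item to every known database, tolerating failures."""
        stored = 0
        for db in dbs:
            try:
                db.write(item)
                stored += 1
            except ConnectionError:
                continue  # one store failing must not fail the whole write
        return stored

    def merged_read(dbs: Iterable[Database]) -> Set[int]:
        """Union the overlapping sets from all reachable databases."""
        result: Set[int] = set()
        for db in dbs:
            try:
                result |= db.read_all()
            except ConnectionError:
                continue
        return result

With the example sets above, merged_read would return [1, 2, 3, 4] as long as every item survived on at least one node, but nothing in this scheme yet guarantees that.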

2 Answers


What you're asking is basically an entire branch of computer science. It is very much a non-trivial problem and you will find that a surprising number of things are impossible.

Also note that simply saying "consistent" is not a sufficient definition: there are all sorts of levels of consistency (read-your-own-writes, reads-follow-writes, monotonic reads, linearizable, causal, etc.). I think you likely mean (in a very loose sense) consistency similar to what you get when you use just one database.

To answer your question directly, you want to decide on a read quorum size and a write quorum size. These sizes must be selected such that reads and writes will overlap by at least one database instance. If you want to optimize for write latency, use a smaller write quorum and do the opposite if you want to optimize for read latency.
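As a rough sketch of that rule (the concrete numbers are illustrative, not from any particular product, and the replicas are assumed to expose the same write()/read_all() interface as the Database stub in your question): with N instances, picking W and R so that R + W > N forces every read quorum to intersect every write quorum in at least one instance.

    # N, W and R are illustrative; the invariant is R + W > N.
    N = 5   # total database instances
    W = 3   # write quorum (smaller W -> lower write latency)
    R = 3   # read quorum  (smaller R -> lower read latency)
    assert R + W > N, "read and write quorums must overlap"

    def quorum_write(replicas, item, w=W):
        """Write everywhere, but only report success on >= w acks."""
        acks = 0
        for replica in replicas:
            try:
                replica.write(item)
                acks += 1
            except ConnectionError:
                continue
        if acks < w:
            raise RuntimeError(f"write not durable: {acks} < {w} acks")
        return acks

    def quorum_read(replicas, r=R):
        """Merge responses from >= r replicas; the overlap guarantees at
        least one of them saw every successfully quorum-written item."""
        responses = []
        for replica in replicas:
            try:
                responses.append(replica.read_all())
            except ConnectionError:
                continue
        if len(responses) < r:
            raise RuntimeError(f"read failed: {len(responses)} < {r} replies")
        merged = set()
        for s in responses:
            merged |= s
        return merged

Note that a write which cannot reach W instances must be rejected (or retried); silently accepting a sub-quorum write is what breaks the overlap guarantee.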

A more detailed exposition of overlapping read/write quorums can be found in Weighted Voting for Replicated Data. This is considered a seminal work in the field of replication.

Also be careful about the behavior of your overlapping quorums when adding or removing a database instance. It sounds like you have a relatively static topology, but if that is not the case, then an entirely different set of choices needs to be made.

Lastly - and here's the real kick in the teeth - what I have described doesn't actually give you consistency (by any definition) in some cases (I like Daniel Abadi's explanation of when and why), but for many systems it gives you good enough consistency. It's up to you to decide exactly what level of consistency you need.


2 Comments

Thanks for a good reply. I've read Werner's article about the types of eventual consistency, where he talks about read-your-own-writes, monotonic reads, etc. That is all about eventual consistency within a replica set, so that each instance eventually holds the same data as the others. I probably don't need that, as the main purpose of this storage is low write latency; I just want to be able to read from all nodes and recreate the data set that was sent from the client. For now I see a solution where I assign a number to each batch I send out to the nodes, so that later the reader can detect unique data items.
I see. I mistook the thrust of your question a bit. You might look into using vector clocks as a method of detecting conflicts. The question then becomes fashioning a deterministic merge function to apply at read time (i.e. read-resolution).
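To illustrate the vector-clock suggestion in the comment above, here is a hedged sketch (all names illustrative): each writer keeps a counter per node, a version whose clock is dominated by another can be discarded, and the remaining concurrent versions go through a deterministic merge at read time. For a set of integers, plain union is already deterministic and order-independent, so it serves as the resolver here.

    from typing import Dict, List, Set, Tuple

    VectorClock = Dict[str, int]  # node id -> per-node event counter

    def dominates(a: VectorClock, b: VectorClock) -> bool:
        """True if `a` has seen everything `b` has (b <= a componentwise)."""
        return all(b.get(n, 0) <= a.get(n, 0) for n in set(a) | set(b))

    def resolve(versions: List[Tuple[VectorClock, Set[int]]]) -> Set[int]:
        """Discard dominated versions, then deterministically merge the
        remaining concurrent ones (here: set union)."""
        survivors = [
            (vc, data) for vc, data in versions
            if not any(dominates(other, vc) and other != vc
                       for other, _ in versions)
        ]
        merged: Set[int] = set()
        for _, data in survivors:
            merged |= data
        return merged

    # Concurrent versions from three nodes, as in the question:
    print(resolve([
        ({"db1": 2}, {1, 2, 3}),
        ({"db2": 1}, {1, 3}),
        ({"db3": 2}, {2, 3, 4}),
    ]))  # -> {1, 2, 3, 4}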

There is two-way/three-way replication software that does not require a "master". You can also use transaction-log-based replication.

What you can use, and how, will depend on your database product.

HTH
