Background
Semi-automated file merging is a critical part of version control systems (all the ones I've used, anyway!). It is usually performed on textual source code without the tool understanding the underlying language – although semantic merging also exists.
I'm interested in whether merging relational data has been researched, and what the findings have been.
Specifically, I'm imagining that a single relational database is "forked" into two (or more) independent copies, that independent changes are made to these copies, and that we then want to recombine these databases into one again, reflecting the changes made in each.
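To make the shape of the problem concrete, here is a minimal sketch in Python (with invented table contents) of a three-way merge over a single table treated as a set of rows; it assumes the common base version is still available:

```python
def merge_tables(base, fork_a, fork_b):
    """Three-way merge of one table, modelled as a set of rows.

    A base row survives unless some fork deleted it, and rows
    inserted by either fork are kept. An *update* shows up as a
    delete plus an insert, so conflicting updates to the same row
    leave both versions in the result, to be reconciled separately.
    """
    deleted = (base - fork_a) | (base - fork_b)
    inserted = (fork_a - base) | (fork_b - base)
    return (base - deleted) | inserted

# Invented example data: each row is an (id, name) tuple.
base   = {(1, "alice"), (2, "bob")}
fork_a = base | {(3, "carol")}    # fork A inserts carol
fork_b = base - {(2, "bob")}      # fork B deletes bob

print(merge_tables(base, fork_a, fork_b))
# -> {(1, 'alice'), (3, 'carol')}
```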
What this question isn't
I'm not asking a question about specific technologies here, although that would be interesting. I'm definitely not asking about SQL-based technology in particular. I'm also specifically interested in relational data, not tree-structured data like XML, JSON, or ASTs.
I'm also not asking about integrating heterogeneous databases or information stores, which seems to be the subject of Data integration. For the purposes of this question, it can be assumed that the databases share the same schema and that the schema doesn't change.
Some ideas about how this might work
Relational data is, basically, sets of tuples. Sets, in principle, have (several) sensible merge semantics, even when the common base isn't known, such as those of a commutative replicated data type (CRDT) set.
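As a sketch of what such merge semantics look like, here is a toy two-phase set (2P-set), one of the simplest CRDT sets; the row values are invented:

```python
from dataclasses import dataclass, field

@dataclass
class TwoPhaseSet:
    """Toy 2P-set CRDT: adds and removes are tracked separately."""
    added: set = field(default_factory=set)    # every row ever inserted
    removed: set = field(default_factory=set)  # tombstones for deletions

    def contents(self):
        # A row is present iff it was added and never removed.
        return self.added - self.removed

    def merge(self, other):
        # Element-wise union, so merging is commutative, associative,
        # and idempotent: replicas can exchange state in any order,
        # and no common base version is needed.
        return TwoPhaseSet(self.added | other.added,
                           self.removed | other.removed)

a = TwoPhaseSet(added={("alice", 1)})
b = TwoPhaseSet(added={("alice", 1), ("bob", 2)}, removed={("alice", 1)})
print(a.merge(b).contents())   # -> {("bob", 2)}
```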
There are obviously some basic issues, such as generating identities. If identities are simple incrementing numbers, then independent forks will hand out the same values, making the results extremely hard to merge. This problem seems to be solved by using globally unique identifiers.
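A quick illustration of the collision, and of the usual fix (the row contents are invented):

```python
import uuid

# Both forks were created when the highest existing id was 2, so each
# independently assigns id 3 to its next insert.
fork_a = {(3, "carol")}
fork_b = {(3, "dave")}

merged = fork_a | fork_b
ids = [row_id for row_id, _ in merged]
assert len(ids) != len(set(ids))   # two distinct rows now share id 3

# With globally unique identifiers, independent inserts can't collide.
fork_a = {(uuid.uuid4(), "carol")}
fork_b = {(uuid.uuid4(), "dave")}
merged = fork_a | fork_b           # two rows, two distinct ids
```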
Incrementing identifiers seem to be a specific case of a more general problem: they are an example of inserting a new fact into the database that has been derived from the current state of the database.
In general, I suspect that any fact added to the database that is derived from the existing facts in the database, rather than being genuinely new information collected from the outside world, is a potential source of difficulty during a merge. I wonder whether this is a huge problem, or one that can mostly be avoided by something like normalisation?
For example: if we adjust a balance by looking up the current balance and adding or subtracting an amount, then the result is hard or impossible to merge. However, if we store the individual increments and decrements, this information is trivial to merge. When we need the balance, we compute it from the facts we've got.
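A sketch of that second approach, with invented transaction facts: each increment is an immutable fact with its own unique id, so the facts form a set that merges by plain union:

```python
# Each fact: (transaction id, account, delta). Ids are invented.
fork_a = {("tx1", "acct-1", +100), ("tx2", "acct-1", -30)}
fork_b = {("tx1", "acct-1", +100), ("tx3", "acct-1", +50)}

merged = fork_a | fork_b   # plain set union; nothing can conflict

def balance(facts, account):
    # The balance is derived on demand, never stored as a fact itself.
    return sum(delta for _, acct, delta in facts if acct == account)

print(balance(merged, "acct-1"))   # -> 120
```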
Problems also arise for integrity constraints. Following naive set merging, a primary key may no longer be a primary key: both forks may have inserted different rows with the same key value. However, it seems like it might be possible to automatically determine where this could happen from the database schema, and then have the user choose a resolution scheme, or default to some sensible one if such a scheme exists. Note: it is possible to keep an audit trail, show conflicts as they arise, and have a user revert a decision later, as is done with version control systems.
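For instance, a merge tool could walk the schema's key declarations and flag violations after the union, roughly like this sketch (table layout and resolution hook are invented):

```python
from collections import defaultdict

def key_violations(rows, key_index=0):
    """Group merged rows by primary-key value; any group containing
    more than one row is a conflict needing a resolution scheme."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row)
    return {key: group for key, group in groups.items() if len(group) > 1}

merged = {(3, "carol"), (3, "dave"), (4, "erin")}
print(key_violations(merged))   # -> {3: [(3, 'carol'), (3, 'dave')]}

# A resolution scheme could now keep one row, re-key one of them, or
# record the conflict for a later, revertable user decision.
```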