
My database occasionally ends up with entries that are wrong, but instead of altering the data directly I'd like the ability to keep a revision history of the changes.

These changes occur very rarely.

Ideally something like this: -

 (original table fields) | revision_version | origin | user | timestamp

So say I had a table called posts with the following schema: -

title | description | timestamp | author

An additional table called posts_revisions would be created thusly: -

title | description | timestamp | author | revision_version | origin | user | timestamp
  • origin being the source of the change, be it a bot, user generated or what have you.
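A rough sketch of what I'm imagining, in SQL (column types are guesses; I've renamed the revision's user and timestamp columns to changed_by and revision_timestamp to avoid the reserved word and the duplicate column name):

```sql
-- Existing table (sketch; types are assumptions)
CREATE TABLE posts (
    id          INT PRIMARY KEY,
    title       VARCHAR(255),
    description TEXT,
    timestamp   DATETIME,
    author      VARCHAR(100)
);

-- Proposed revisions table: the original fields plus revision metadata
CREATE TABLE posts_revisions (
    post_id            INT,            -- which post this revision belongs to
    title              VARCHAR(255),
    description        TEXT,
    timestamp          DATETIME,       -- the post's own timestamp at that revision
    author             VARCHAR(100),
    revision_version   INT,            -- incremented per change
    origin             VARCHAR(50),    -- source of the change: bot, user, etc.
    changed_by         VARCHAR(100),   -- the "user" column from above
    revision_timestamp DATETIME        -- when the revision was recorded
);
```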

As you can imagine, this is a rather large change to the existing database; my current concern is the performance hit of checking the _revisions tables for every query. Is this best practice for this sort of thing?

  • Don't be afraid to duplicate origin, user and timestamp in both tables. You might want to delete revisions in a background job. Delete all revisions whose post doesn't exist. In theory you could even lazy-create the revisions with log mining. Bigger transactions and lower amortized cost. Commented Aug 2, 2012 at 13:00
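A minimal sketch of that cleanup job (assumed table and column names):

```sql
-- Background job: remove revisions whose post no longer exists.
DELETE FROM posts_revisions
WHERE NOT EXISTS (SELECT 1 FROM posts p WHERE p.id = posts_revisions.post_id);
```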

2 Answers


For this type of problem, I keep a current table and a history table.

The history table has the following additional columns:

  • HistoryID
  • EffectiveDate
  • EndDate
  • VersionNumber
  • CreatedBy
  • CreatedAt

The effective and end dates are the time span during which the values are valid. The version number is simply incremented every time a record changes. The id, CreatedAt, and CreatedBy are columns I put into almost every table in the database.
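For a hypothetical posts table, the history table might look roughly like this (SQL Server-flavoured sketch; names and types are assumptions):

```sql
CREATE TABLE posts_history (
    HistoryID     INT IDENTITY(1,1) PRIMARY KEY, -- surrogate key for the history row
    PostID        INT NOT NULL,                  -- key of the row in the current table
    Title         VARCHAR(255),
    Description   VARCHAR(MAX),
    Author        VARCHAR(100),
    EffectiveDate DATETIME NOT NULL,             -- when these values became valid
    EndDate       DATETIME NULL,                 -- when they stopped being valid (NULL = still current)
    VersionNumber INT NOT NULL,                  -- incremented on every change to the record
    CreatedBy     VARCHAR(100) NOT NULL,
    CreatedAt     DATETIME NOT NULL
);
```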

Generally, I keep the history table up to date with nightly jobs that compare the tables and then use MERGE to combine the data. An alternative is to wrap all changes in stored procedures and update both tables there. Another alternative is to use triggers that detect when a change occurs. However, I shy away from triggers, preferring the first two alternatives.
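A rough sketch of what the nightly comparison does (T-SQL-flavoured; table and column names are assumptions, and it is written as an UPDATE plus an INSERT rather than a single MERGE to keep it short):

```sql
-- Step 1: close out open history rows whose values no longer match the current table
-- (NULL-to-value changes would need a NULL-safe comparison; omitted for brevity).
UPDATE h
SET    h.EndDate = GETDATE()
FROM   posts_history AS h
JOIN   posts AS p ON p.PostID = h.PostID
WHERE  h.EndDate IS NULL
  AND (h.Title <> p.Title OR h.Description <> p.Description OR h.Author <> p.Author);

-- Step 2: open a new history row for every current row that has no open version.
INSERT INTO posts_history (PostID, Title, Description, Author,
                           EffectiveDate, EndDate, VersionNumber, CreatedBy, CreatedAt)
SELECT p.PostID, p.Title, p.Description, p.Author,
       GETDATE(), NULL,
       COALESCE((SELECT MAX(h.VersionNumber)
                 FROM posts_history AS h
                 WHERE h.PostID = p.PostID), 0) + 1,
       SUSER_SNAME(), GETDATE()
FROM   posts AS p
WHERE  NOT EXISTS (SELECT 1 FROM posts_history AS h
                   WHERE h.PostID = p.PostID AND h.EndDate IS NULL);
```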

I must admit that disk space is not a big consideration for these tables, so there is no problem storing the data twice: once in the current table and once in the history table. It would be only a minor tweak to store nothing but prior versions in the history table and keep the current records solely in the "current" table.

One downside to this approach is changing the structure of the base table. If you want to add a column, you need to add it to the history table as well as the base table.



If the tables are used for summary purposes (especially by business users, if they have some SQL access), I think it is best to remove the old data and place it into another table. While flags and revision numbers are sometimes fine, once every summary has to dig out the latest revision per record, along the lines of select sum(some_value) where revision_version = (select max(revision_version) for that id), it really gets beyond simple.
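For example (sketch, with made-up column names): with revisions kept in the main table, even a simple sum has to pick the latest version of each row, whereas moving old rows out keeps it trivial.

```sql
-- Revisions kept in the main table: every summary has to find the latest version per id.
SELECT SUM(p.some_value)
FROM   posts AS p
WHERE  p.revision_version = (SELECT MAX(p2.revision_version)
                             FROM   posts AS p2
                             WHERE  p2.id = p.id);

-- Old rows moved out to posts_revisions: the main table only holds current data.
SELECT SUM(some_value) FROM posts;
```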

If you have a table that is being used for quick and nasty data collection, replace the data in place and, if needed, move the old data into a revisions table. If only some application will access it and it isn't a performance issue, then keep it in the main table.

