0

I've seen this pattern in multiple projects I wrote: I create an SQL model for a certain type of entity, and at some point we realize that we need to store multiple types of the same entity. Usually I introduce a "status" column to the table, because the types are mutually exclusive. For example, a person can be "dead" or "alive", and a chat_message can be "to_send", "sent", "to_edit", "to_delete" or "deleted".

The problem is that when I introduce such a column, I need to retrace every query I have written and consider whether it is still valid for all statuses; if it isn't, I need to specify the status in the query. It's very easy to introduce a bug at this point, which makes me wonder: is this a common pattern in software engineering? Which approaches would help me avoid it?

If it weren't for database normalization, my perfect solution would be to copy the schema of the SQL table for each possible value of the "status" field. For instance, when I need to introduce "status" to "people", I would drop the "people" table and instead create variants for "alive people" and "dead people". I don't think I've ever seen this kind of solution, though, and it feels like tech debt to me. What other options do I have?

2
  • 3
    Where a new status is added for something, there is probably no design solution that avoids you having to review the correctness of all queries which depend on that thing. An exact solution may require you to balance performance, convenience, and flexibility, but if you need an audit trail for the transitions between each status (which is not an uncommon requirement), it may be best to start by thinking about a separate status table (rather than a status column, which can only record current statuses, or separate tables for each status which may become extremely numerous). Commented Jul 14, 2020 at 18:42
  • If you find my response helpful please check it as accepted. Thank you. Commented Jul 22, 2020 at 23:59

4 Answers

1

For instance, when I need to introduce "status" to "people", I would drop the "people" table and instead create variants for "alive people" and "dead people".

You can instead create a dead_people view and a living_people view. Anyway, your code should query views and not tables to begin with.
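A minimal sketch of the view idea, assuming SQLite through Python's built-in sqlite3 module. The table layout, the column names and the sample rows are made up for illustration; the question does not show a real schema.

```python
import sqlite3

# Hypothetical schema: one "people" table with a constrained status column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (
        id     INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        status TEXT NOT NULL CHECK (status IN ('alive', 'dead'))
    );
    -- Each view bakes in one status filter, so callers cannot forget it.
    CREATE VIEW living_people AS SELECT id, name FROM people WHERE status = 'alive';
    CREATE VIEW dead_people   AS SELECT id, name FROM people WHERE status = 'dead';
""")
conn.executemany(
    "INSERT INTO people (name, status) VALUES (?, ?)",
    [("Ada", "alive"), ("Bob", "dead"), ("Cyd", "alive")],
)

# Application code reads from the views, never from the table directly.
living = [row[0] for row in conn.execute("SELECT name FROM living_people ORDER BY name")]
dead = [row[0] for row in conn.execute("SELECT name FROM dead_people ORDER BY name")]
```

With this layout the data still lives in a single table, so nothing is duplicated; only the filter is moved out of the individual queries.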

I understand the overhead of checking everything everywhere, but with modern MVC approaches and separation of concerns, the places where a certain table or view is queried should be very limited and not widespread. Is it possible you are talking about legacy code? Been there, done that. Creating views, especially for the more widely used cases, helps. What helps even more is creating functions or business objects that return the stuff you want and that are the only places where the database is queried.
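One way such a single point of access might look, sketched with Python's sqlite3 module. The class name PeopleRepository, its method and the schema are hypothetical and only illustrate the idea of keeping the status filter in one place.

```python
import sqlite3


class PeopleRepository:
    """Hypothetical single entry point for every query on the people table."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def find_by_status(self, status: str = "alive") -> list[str]:
        # The status filter lives here and nowhere else in the code base,
        # so a new status means reviewing this class rather than every caller.
        rows = self._conn.execute(
            "SELECT name FROM people WHERE status = ? ORDER BY name", (status,)
        )
        return [name for (name,) in rows]


# Illustrative setup with made-up data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT NOT NULL, status TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO people (name, status) VALUES (?, ?)",
    [("Ada", "alive"), ("Bob", "dead")],
)
repo = PeopleRepository(conn)
```

Calling `repo.find_by_status()` falls back to the default status, which mirrors the situation before the column existed; callers that care about other statuses have to say so explicitly.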

1
  • 1
    +1 for Views. I'd first create the View that represents the 'default' status ('alive', for example), then I'd switch out all my existing queries to hit the View instead of the Table, which is a simple find+replace. Then, create new Views to satisfy any new queries that take into account the non-default status. Commented Aug 18, 2020 at 15:58
0

It is very common to use a “status” column in tables even if it might add overhead. The added benefit is you can do “logical or soft” deletes in such cases as you might use a “status” column as representing the status of your row data.

I (almost) always include some kind of “status” or “flag” column, but that may be due to the types of applications I have worked on.

I have also encountered (but less so) each row being given a start and end date with the “active” record having no end date. But that will only assist you when dealing with “active record status”. The benefit here is that record history is kept in the table. An obvious drawback would be increased space consumption.
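A rough sketch of the start/end date variant, assuming SQLite via Python's sqlite3 module. The table person_status_history, its columns and the sample dates are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person_status_history (
        person_id  INTEGER NOT NULL,
        status     TEXT NOT NULL,
        start_date TEXT NOT NULL,   -- ISO 8601 dates compare correctly as text
        end_date   TEXT             -- NULL marks the active record
    );
""")
conn.executemany(
    "INSERT INTO person_status_history VALUES (?, ?, ?, ?)",
    [
        (1, "alive", "1950-01-01", "2020-07-14"),
        (1, "dead", "2020-07-14", None),
        (2, "alive", "1990-05-05", None),
    ],
)


def current_status(person_id):
    """Return the status of the row that has no end date, or None."""
    row = conn.execute(
        "SELECT status FROM person_status_history "
        "WHERE person_id = ? AND end_date IS NULL",
        (person_id,),
    ).fetchone()
    return row[0] if row else None


def status_on(person_id, day):
    """Return the status valid on the given ISO date (end date exclusive), or None."""
    row = conn.execute(
        "SELECT status FROM person_status_history "
        "WHERE person_id = ? AND start_date <= ? "
        "AND (end_date IS NULL OR end_date > ?)",
        (person_id, day, day),
    ).fetchone()
    return row[0] if row else None
```

The history stays queryable through `status_on`, at the cost of one extra row per status change, which is the space trade-off mentioned above.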

As mentioned in the comments you would also probably be best served with a lookup table that contains all your statuses and related codes. But be careful not to create a Massively Unified Code-Key (MUCK) table.

Which brings me to the conclusion that there’s no silver bullet / one size fits all answer here since the need for a “status” column is driven by the needs of the application.

0

Moving people to dead_people rarely makes sense. It is a form of archiving cleanup that keeps the original table performing optimally. One use is having a small, fast user table for mass event checks (user rights) alongside a user table with full info.

With O/R mappings I would suggest discriminator columns, and database inheritance when a "status" column determines which other columns are used. That way the queries can remain maintainable.

Say you have a query on people and now introduce dead or alive people. Then the business rule determines the scope of the query: what is the goal of that piece of code.

The obvious advice is: do not repeat yourself. Queries often differ only in some criteria, and measures to follow DRY can be implemented in JPA (Java).


A concrete example:

A data logger receives binary messages from devices. Such a message consists of bytes containing several signals (bit groups). There is a table that defines how such a binary message is parsed. The signal is an entity.

Then comes a time when custom signals, user-defined as expressions over signals, are introduced. A signal becomes either a message-specified original signal or an expression-based custom signal. Consider that a custom signal can now also be defined from other custom signals, not merely from the original hardware signals.

This is a semantic change: Signal = HardwareSignal xor CustomSignal. A discriminator column and separate O/R mappings would do. This gives a clean separation of the corresponding fields: the reference to a MessageSpecification/SignalSpecification exists for HardwareSignal only. There are no unused fields, so no field can be accessed unintentionally.
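The answer refers to JPA; as a language-neutral illustration, here is a hand-rolled sketch of the discriminator idea using Python dataclasses and sqlite3 instead of a real O/R framework. The column names (kind, message_spec, expression) and the sample signals are assumptions made for this example.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class HardwareSignal:
    name: str
    message_spec: str  # only hardware signals reference a message specification


@dataclass
class CustomSignal:
    name: str
    expression: str  # only custom signals carry an expression


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE signal (
        id           INTEGER PRIMARY KEY,
        kind         TEXT NOT NULL CHECK (kind IN ('hardware', 'custom')),  -- discriminator
        name         TEXT NOT NULL,
        message_spec TEXT,
        expression   TEXT,
        -- the discriminator decides which of the optional columns must be set
        CHECK ((kind = 'hardware' AND message_spec IS NOT NULL AND expression IS NULL)
            OR (kind = 'custom'   AND expression IS NOT NULL AND message_spec IS NULL))
    );
""")
conn.executemany(
    "INSERT INTO signal (kind, name, message_spec, expression) VALUES (?, ?, ?, ?)",
    [
        ("hardware", "rpm", "engine_msg", None),
        ("custom", "rpm_doubled", None, "rpm * 2"),
    ],
)


def load_signals():
    """Map each row to the class selected by its discriminator value."""
    signals = []
    query = "SELECT kind, name, message_spec, expression FROM signal ORDER BY id"
    for kind, name, spec, expr in conn.execute(query):
        if kind == "hardware":
            signals.append(HardwareSignal(name, spec))
        else:
            signals.append(CustomSignal(name, expr))
    return signals


signals = load_signals()
```

Because CustomSignal has no message_spec attribute at all, code that handles custom signals cannot read the hardware-only field by accident.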

That would be sufficient for good software maintenance.

All this is not optimal, but I still have not encountered a database + O/R framework that does this all gracefully. Despite Domain Driven Development and more. But who knows.

0

For future people:

The issue here is about data integrity, maintainability, and efficiency.

With free-text status values, you can accidentally insert "Dead", "DEAD", "deceased", etc. - leading to inconsistent data. Storing some variable length repeating text thousands of times typically wastes space compared to storing a foreign key UUID. If you need to rename a status or add metadata (like status descriptions, colors for UI, etc.), you'd need to update potentially millions of records vs 1 or 10. And joins on UUIDs are faster than string comparisons.

Duplicate tables... creating separate tables like alive_people and dead_people would violate several design principles. For example... DRY (Don't Repeat Yourself): you're duplicating the entire schema. This creates a maintenance nightmare, because schema changes need to be applied to multiple tables. It also makes queries complex: getting all people requires UNION operations. And it violates the principle of having one source of truth: Person data is scattered across multiple tables.

Having a text "status" field in a table goes against fundamental database design principles about schema stability and logical data organization.

This is "What I Would Do" (oh, I can make bullet points! I forgot about that!)

  • Create views like alive_people and dead_people that filter the main table (someone already said this! It's worth repeating!)
  • Use repository patterns or ORM scopes
  • When adding status columns, audit existing queries and update them to use the view when possible (and index the view! you can do that now!)
  • Design your application to handle missing status filters gracefully OR make the foreign key field not null.

The foreign key approach to a status lookup table remains the best practice for data integrity and maintainability, even though it requires some upfront planning for query updates.
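A small sketch of the lookup-table approach, assuming SQLite through Python's sqlite3 module. Integer keys are used here only to keep the example short, and the table and column names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE person_status (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE person (
        id        INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        status_id INTEGER NOT NULL REFERENCES person_status(id)
    );
    INSERT INTO person_status (id, name) VALUES (1, 'alive'), (2, 'dead');
    INSERT INTO person (name, status_id) VALUES ('Ada', 1);
""")


def try_insert(name, status_id):
    """Insert a person; return False if the status does not exist."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO person (name, status_id) VALUES (?, ?)", (name, status_id)
            )
        return True
    except sqlite3.IntegrityError:
        return False


accepted = try_insert("Bob", 2)    # status 2 exists in the lookup table
rejected = try_insert("Cyd", 99)   # no such status, so the insert fails

# Renaming a status touches one lookup row, not every person row.
with conn:
    conn.execute("UPDATE person_status SET name = 'deceased' WHERE id = 2")
bob_status = conn.execute(
    "SELECT s.name FROM person p JOIN person_status s ON s.id = p.status_id "
    "WHERE p.name = 'Bob'"
).fetchone()[0]
```

The NOT NULL constraint on status_id covers the last bullet above: a row cannot be stored without a status, so queries never have to handle a missing one.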

If you have multiple tables that need status, you can make the status tables parent-table specific, because status for one type of entity isn't actually the same thing as status for another type of entity. Examples:

  • Person (records have all the data about a person) - Person_Status (has alive and dead)
  • Project (records for every project) - Project_Status (has active, inactive, completed, et al)
  • Customer - Customer_Status (active, dormant, no longer operating)

Why you should use UUIDs for keys and not ints is a topic for another post... but we all know you're probably just gonna use ints. Still, this is a good opportunity to explain one of the reasons not to use ints: misaligned joins. Accidentally joining Projects to Person_Status will return rows, especially if Project_Status and Person_Status both have a [name] field... but the results won't make sense. This is something a developer or DBA could easily miss. Using UUIDs (among other things) ensures that joins either work or return nothing, which is a strong indicator that you got something wrong. Another reason is the rowguid: if you ever need to scale out across multiple servers, every record must have a rowguid, and if you use a UUID as your PK then all your rows already have one. I could write six more paragraphs on all the reasons to never use ints (and especially never auto-incrementing ints!) but I'm already off topic so I'll stop. :)

1
  • I'm not sure the OP was suggesting that the status field would be free text rather than a fixed list. Also, there is no reason in principle why short ASCII codes as keys should be less performant than integers as keys. Commented Aug 13 at 17:02
