
Assume we have columns in a relational db table for measured characteristics A, B, C, D, ..., Z. Each of them has 3 columns: name, value, error. Samples are rows; one sample has one or zero measurements for each characteristic. Data for the A and B columns are 90% filled, but C, D, ..., Z are very sparse (approximately 10% of cells contain non-null values in each).

What is the best way to store these data in PostgreSQL with JSON?

My variants (new table has 2 columns: serial ID and JSON)

  1. Store JSON array of sample in one cell (one original row matches one new row).

  2. Break JSON array of sample into several rows (one array element in one row; thus, one original row matches a few new rows).

  3. Use relational columns :)
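
For comparison, variants 1 and 2 might be declared like this (table and column names are placeholders, not part of the question):

    -- Variant 1: one original row per new row; the whole measurement array in one cell
    create table samples_v1 (
        id   serial primary key,
        data jsonb  -- e.g. [{"name":"A","value":3.3,"err":1.2}, ...]
    );

    -- Variant 2: one array element per row; several new rows share one sample_id
    create table samples_v2 (
        id        serial primary key,
        sample_id integer,
        data      jsonb  -- e.g. {"name":"A","value":3.3,"err":1.2}
    );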

Example: 2 original rows provide these JSON strings:

----------row 1----------
[
    {
        "name" : "A",
        "value" : 3.300000,
        "err" : 1.200000,
    },
    {
        "name" : "B",
        "value" : 730.000000,
        "err" : 112.000000,
    },
    {
        "name" : "E",
        "value" : 22.600000,
        "err" : 4.700000,
    },
    {
        "name" : "H",
        "value" : 58.300000,
        "err" : 11.100000,
    }
]
----------row 2----------
[
    {
        "name" : "A",
        "value" : 2.100000,
        "err" : 1.400000,
    },
    {
        "name" : "J",
        "value" : 266.000000,
        "err" : 65.000000,
    },
    {
        "name" : "K",
        "value" : 14.700000,
        "err" : 3.800000,
    }
 ]

Which one should I use?

And how to import this dataset to PostgreSQL if I have text file with records (JSON array for each row of original table) as mentioned in example?


2 Answers


Can you be clear about what problem you are trying to solve with this approach?

A null value in a column only uses 1 bit of space.

https://www.postgresql.org/docs/current/storage-page-layout.html

It seems unlikely that your JSON (or JSONB) column is going to use less space, and it is going to cost you in terms of code clarity and ability to use conventional query techniques.

I would go for a regular table design, and use the relational database as a relational database, unless doing otherwise solves an identifiable problem.
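
As a rough sanity check, you can compare on-disk tuple sizes directly (a sketch, not a benchmark; pg_column_size() on a whole row reports its tuple size):

    create table null_demo (a numeric, b numeric, c numeric, d numeric);
    insert into null_demo values (3.3, null, null, null);
    insert into null_demo values (3.3, 730.0, 22.6, 58.3);

    -- The mostly-null row is substantially smaller: nulls cost
    -- only a bit each in the tuple's null bitmap, not a full field.
    select pg_column_size(null_demo.*) from null_demo;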


2 Comments

> A null value in a column only uses 1 bit of space. This depends on how many columns you have.
Sure, it can be a bit more, but it's not like one byte per column or anything.

Import:

Import your data as plain text first using COPY, in case it's not valid JSON (which is the case here: the objects have trailing commas). You can use pattern matching for some basic cleaning, then parse it with a simple type cast. If the file is not easily reachable from the db server, look into psql's \copy as well as its PGAdmin wrapper.

create table public.measurement_samples_raw (sample_row text);

/* /home/username/measurement_samples.json:
[{"name":"A","value":3.300000,"err":1.200000,},{"name":"B","value":730.000000,"err":112.000000,},{"name":"E","value":22.600000,"err":4.700000,},{"name":"H","value":58.300000,"err":11.100000,}]
[{"name":"A","value":2.100000,"err":1.400000,},{"name":"J","value":266.000000,"err":65.000000,},{"name":"K","value":14.700000,"err":3.800000,}] */
copy public.measurement_samples_raw (sample_row) 
    from '/home/your_username/measurement_samples.json';

update public.measurement_samples_raw set 
    sample_row = regexp_replace(
                    '{"top_key":'||sample_row||'}', --unnamed lists aren't supported, so adding a key and wrapping in {...}
                    ',\s*}', '}',--pattern and replacement to remove trailing commas
                    'g' --forces the function to replace all instances of the pattern
                    );

alter table public.measurement_samples_raw 
    add column sample_id serial,--so that each measurement sample has an ID
    add column sample_row_jsonb jsonb;
    
update public.measurement_samples_raw 
    set sample_row_jsonb=sample_row::jsonb;

Since Postgres 16, you can also use pg_input_is_valid() to filter incoming values based on whether they match a given type's accepted input format. For versions 15 and earlier, you can emulate it with a function that attempts the cast and catches the error.
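
For example, applied to the raw input here (where trailing commas make the JSON invalid):

    -- returns false: trailing comma before the closing brace
    select pg_input_is_valid('[{"name":"A","value":3.3,"err":1.2,}]', 'jsonb');

    -- or use it to convert only the rows that will cast cleanly
    update public.measurement_samples_raw
        set sample_row_jsonb = sample_row::jsonb
        where pg_input_is_valid(sample_row, 'jsonb');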

Since Postgres 17, COPY also offers ON_ERROR and REJECT_LIMIT options:

ON_ERROR
Specifies how to behave when encountering an error converting a column's input value into its data type. An error_action value of stop means fail the command, while ignore means discard the input row and continue with the next one. The default is stop.

The ignore option is applicable only for COPY FROM when the FORMAT is text or csv.

A NOTICE message containing the ignored row count is emitted at the end of the COPY FROM if at least one row was discarded. When LOG_VERBOSITY option is set to verbose, a NOTICE message containing the line of the input file and the column name whose input conversion has failed is emitted for each discarded row. When it is set to silent, no message is emitted regarding ignored rows.

REJECT_LIMIT
Specifies the maximum number of errors tolerated while converting a column's input value to its data type, when ON_ERROR is set to ignore. If the input causes more errors than the specified value, the COPY command fails, even with ON_ERROR set to ignore. This clause must be used with ON_ERROR=ignore and maxerror must be positive bigint. If not specified, ON_ERROR=ignore allows an unlimited number of errors, meaning COPY will skip all erroneous data.
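
Combined, the two options could look like this on Postgres 17+ (a sketch; the target table name is illustrative, and note that with the sample file above every row would be rejected until the trailing commas are cleaned, so this complements rather than replaces the text-first approach):

    -- Postgres 17+: load straight into a jsonb column,
    -- tolerating up to 10 rows that fail the jsonb input conversion
    create table public.measurement_samples_direct (sample_row jsonb);

    copy public.measurement_samples_direct (sample_row)
        from '/home/your_username/measurement_samples.json'
        with (on_error ignore, reject_limit 10);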


Storage:

You can normalize the structure and populate it by extraction using json type functions and operators

drop table if exists public.measurement_samples;
create table public.measurement_samples (
    id    serial,
    name  char(1),
    value numeric,
    err   numeric,
    constraint measurement_samples_pk primary key (id, name) --I assume you don't want >=2 values for characteristic 'A' in a single measurement row
);

insert into public.measurement_samples (id, name, value, err)
select  sample_id,
        (single_measurement_in_sample->>'name')::text,
        (single_measurement_in_sample->>'value')::numeric,
        (single_measurement_in_sample->>'err')::numeric
from (        
    select  sample_id,
            json_array_elements(
                sample_row::json->'top_key'
            ) as single_measurement_in_sample
    from public.measurement_samples_raw
    ) raw_input;

Which gives you a gapless structure:

table public.measurement_samples;

 id | name |   value    |    err
----+------+------------+------------
  1 | A    |   3.300000 |   1.200000
  1 | B    | 730.000000 | 112.000000
  1 | E    |  22.600000 |   4.700000
  1 | H    |  58.300000 |  11.100000
  2 | A    |   2.100000 |   1.400000
  2 | J    | 266.000000 |  65.000000
  2 | K    |  14.700000 |   3.800000

Which you can rearrange to your needs, without wasting space, at the cost of performance - unless you make the view materialized:

create view public.v_measurement_samples as
select  id, 
        sum(value) filter (where name='A') as "A_value",
        sum(err) filter (where name='A') as "A_err",
        sum(value) filter (where name='B') as "B_value",
        sum(err) filter (where name='B') as "B_err",
        sum(value) filter (where name='C') as "C_value",
        sum(err) filter (where name='C') as "C_err",
        sum(value) filter (where name='D') as "D_value",
        sum(err) filter (where name='D') as "D_err",
        sum(value) filter (where name='E') as "E_value",
        sum(err) filter (where name='E') as "E_err"
from    public.measurement_samples
group by id
order by id;

table public.v_measurement_samples;

 id | A_value  |  A_err   |  B_value   |   B_err    | C_value | C_err | D_value | D_err |  E_value  |  E_err
----+----------+----------+------------+------------+---------+-------+---------+-------+-----------+----------
  1 | 3.300000 | 1.200000 | 730.000000 | 112.000000 |         |       |         |       | 22.600000 | 4.700000
  2 | 2.100000 | 1.400000 |            |            |         |       |         |       |           |
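
If the pivoted layout is queried often, the same SELECT can back a materialised view instead (a sketch, trimmed to two characteristics; it trades freshness for speed and must be refreshed after each load):

    create materialized view public.mv_measurement_samples as
    select  id,
            sum(value) filter (where name='A') as "A_value",
            sum(err)   filter (where name='A') as "A_err",
            sum(value) filter (where name='B') as "B_value",
            sum(err)   filter (where name='B') as "B_err"
    from    public.measurement_samples
    group by id;

    -- after each bulk import:
    refresh materialized view public.mv_measurement_samples;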

Working dbfiddle example.

2 Comments

The normalisation to treat the name as a value is fine (it's close to an entity-attribute-value (EAV) model), unless you want to efficiently run queries comparing or selecting values within a single sample id, for which the materialised view is likely to be essential. If the MV proves to be important, why not just save the data like that in the first place? So it feels like either (i) use the EAV model, or (ii) save it as a regular relational table.
I agree. The initial structure is supposed to address the problem of sparse input data, while being an easy target to import into, pretty much directly reflecting the presented input layout. The end of my answer points away from that initial structure without assuming too much about how the data will be processed further - it's not specified by the question but would logically be the next problem to look into.
