
Assume we have columns in a relational db table for measured characteristics A, B, C, D, ..., Z. Each of them has 3 columns: name, value, error. Samples are rows; one sample has one or zero measurements for each characteristic. Data for the A and B columns are 90% filled, but C, D, ..., Z are very sparse (approximately 10% of cells contain non-null values in each).

What is the best way to store these data in PostgreSQL with JSON?

My variants (new table has 2 columns: serial ID and JSON)

  1. Store JSON array of sample in one cell (one original row matches one new row).

  2. Break JSON array of sample into several rows (one array element in one row; thus, one original row matches a few new rows).

  3. Use relational columns :)
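
For comparison, variants 1 and 2 might be declared like this (table and column names are placeholders, not part of the question):

    -- Variant 1: one original row per new row; the whole measurement array in one cell
    create table samples_v1 (
        id   serial primary key,
        data jsonb  -- e.g. [{"name":"A","value":3.3,"err":1.2}, ...]
    );

    -- Variant 2: one array element per row; several new rows share one sample_id
    create table samples_v2 (
        id        serial primary key,
        sample_id integer,
        data      jsonb  -- e.g. {"name":"A","value":3.3,"err":1.2}
    );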

Example: 2 original rows provide these JSON strings:

----------row 1----------
[
    {
        "name" : "A",
        "value" : 3.300000,
        "err" : 1.200000,
    },
    {
        "name" : "B",
        "value" : 730.000000,
        "err" : 112.000000,
    },
    {
        "name" : "E",
        "value" : 22.600000,
        "err" : 4.700000,
    },
    {
        "name" : "H",
        "value" : 58.300000,
        "err" : 11.100000,
    }
]
----------row 2----------
[
    {
        "name" : "A",
        "value" : 2.100000,
        "err" : 1.400000,
    },
    {
        "name" : "J",
        "value" : 266.000000,
        "err" : 65.000000,
    },
    {
        "name" : "K",
        "value" : 14.700000,
        "err" : 3.800000,
    }
 ]

Which one should I use?

And how to import this dataset to PostgreSQL if I have text file with records (JSON array for each row of original table) as mentioned in example?


2 Answers


Can you be clear about what problem you are trying to solve with this approach?

A null value in a column only uses 1 bit of space.

https://www.postgresql.org/docs/current/storage-page-layout.html

It seems unlikely that your JSON (or JSONB) column is going to use less space, and it is going to cost you in terms of code clarity and ability to use conventional query techniques.

I would go for a regular table design, and use the relational database as a relational database, unless doing otherwise solves an identifiable problem.
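
As a rough sanity check, you can compare on-disk tuple sizes directly (a sketch, not a benchmark; pg_column_size() on a whole row reports its tuple size):

    create table null_demo (a numeric, b numeric, c numeric, d numeric);
    insert into null_demo values (3.3, null, null, null);
    insert into null_demo values (3.3, 730.0, 22.6, 58.3);

    -- The mostly-null row is substantially smaller: nulls cost
    -- only a bit each in the tuple's null bitmap, not a full field.
    select pg_column_size(null_demo.*) from null_demo;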


2 Comments

> A null value in a column only uses 1 bit of space. This depends on how many columns you have.
Sure, it can be a bit more, but it's not like one byte per column or anything.

Import:

Import your data as plain text first using COPY, in case it's not valid JSON (which is the case here: the objects have trailing commas). You can use pattern matching for some basic cleaning, then parse it with a simple type cast. If the file is not easily reachable from the db server, look into psql's \copy as well as its PGAdmin wrapper.

create table public.measurement_samples_raw (sample_row text);

/* /home/username/measurement_samples.json:
[{"name":"A","value":3.300000,"err":1.200000,},{"name":"B","value":730.000000,"err":112.000000,},{"name":"E","value":22.600000,"err":4.700000,},{"name":"H","value":58.300000,"err":11.100000,}]
[{"name":"A","value":2.100000,"err":1.400000,},{"name":"J","value":266.000000,"err":65.000000,},{"name":"K","value":14.700000,"err":3.800000,}] */
copy public.measurement_samples_raw (sample_row) 
    from '/home/your_username/measurement_samples.json';

update public.measurement_samples_raw set 
    sample_row = regexp_replace(
                    '{"top_key":'||sample_row||'}', --unnamed lists aren't supported, so adding a key and wrapping in {...}
                    ',\s*}', '}',--pattern and replacement to remove trailing commas
                    'g' --forces the function to replace all instances of the pattern
                    );

alter table public.measurement_samples_raw 
    add column sample_id serial,--so that each measurement sample has an ID
    add column sample_row_jsonb jsonb;
    
update public.measurement_samples_raw 
    set sample_row_jsonb=sample_row::jsonb;

Since Postgres 16, you can also use pg_input_is_valid() to filter incoming values based on whether they match a given type's accepted input format. For versions 15 and earlier, you can emulate it with a function that attempts the cast and catches the error.
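
For example, applied to the raw input here (where trailing commas make the JSON invalid):

    -- returns false: trailing comma before the closing brace
    select pg_input_is_valid('[{"name":"A","value":3.3,"err":1.2,}]', 'jsonb');

    -- or use it to convert only the rows that will cast cleanly
    update public.measurement_samples_raw
        set sample_row_jsonb = sample_row::jsonb
        where pg_input_is_valid(sample_row, 'jsonb');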

Since Postgres 17, COPY also offers ON_ERROR and REJECT_LIMIT options:

ON_ERROR
Specifies how to behave when encountering an error converting a column's input value into its data type. An error_action value of stop means fail the command, while ignore means discard the input row and continue with the next one. The default is stop.

The ignore option is applicable only for COPY FROM when the FORMAT is text or csv.

A NOTICE message containing the ignored row count is emitted at the end of the COPY FROM if at least one row was discarded. When LOG_VERBOSITY option is set to verbose, a NOTICE message containing the line of the input file and the column name whose input conversion has failed is emitted for each discarded row. When it is set to silent, no message is emitted regarding ignored rows.

REJECT_LIMIT
Specifies the maximum number of errors tolerated while converting a column's input value to its data type, when ON_ERROR is set to ignore. If the input causes more errors than the specified value, the COPY command fails, even with ON_ERROR set to ignore. This clause must be used with ON_ERROR=ignore and maxerror must be positive bigint. If not specified, ON_ERROR=ignore allows an unlimited number of errors, meaning COPY will skip all erroneous data.
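
Combined, the two options could look like this on Postgres 17+ (a sketch; the target table name is illustrative, and note that with the sample file above every row would be rejected until the trailing commas are cleaned, so this complements rather than replaces the text-first approach):

    -- Postgres 17+: load straight into a jsonb column,
    -- tolerating up to 10 rows that fail the jsonb input conversion
    create table public.measurement_samples_direct (sample_row jsonb);

    copy public.measurement_samples_direct (sample_row)
        from '/home/your_username/measurement_samples.json'
        with (on_error ignore, reject_limit 10);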


Storage:

You can normalize the structure and populate it by extraction using json type functions and operators

drop table if exists public.measurement_samples;
create table public.measurement_samples (
    id    serial,
    name  char(1),
    value numeric,
    err   numeric,
    constraint measurement_samples_pk primary key (id, name) --I assume you don't want >=2 values for characteristic 'A' in a single measurement row
);

insert into public.measurement_samples (id, name, value, err)
select  sample_id,
        (single_measurement_in_sample->>'name')::text,
        (single_measurement_in_sample->>'value')::numeric,
        (single_measurement_in_sample->>'err')::numeric
from (        
    select  sample_id,
            json_array_elements(
                sample_row::json->'top_key'
            ) as single_measurement_in_sample
    from public.measurement_samples_raw
    ) raw_input;

Which gives you a gapless structure:

table public.measurement_samples;

 id | name |   value    |    err
----+------+------------+------------
  1 | A    |   3.300000 |   1.200000
  1 | B    | 730.000000 | 112.000000
  1 | E    |  22.600000 |   4.700000
  1 | H    |  58.300000 |  11.100000
  2 | A    |   2.100000 |   1.400000
  2 | J    | 266.000000 |  65.000000
  2 | K    |  14.700000 |   3.800000

Which you can rearrange to your needs, without wasting space, at the cost of performance - unless you make the view materialized:

create view public.v_measurement_samples as
select  id, 
        sum(value) filter (where name='A') as "A_value",
        sum(err) filter (where name='A') as "A_err",
        sum(value) filter (where name='B') as "B_value",
        sum(err) filter (where name='B') as "B_err",
        sum(value) filter (where name='C') as "C_value",
        sum(err) filter (where name='C') as "C_err",
        sum(value) filter (where name='D') as "D_value",
        sum(err) filter (where name='D') as "D_err",
        sum(value) filter (where name='E') as "E_value",
        sum(err) filter (where name='E') as "E_err"
from    public.measurement_samples
group by id
order by id;

table public.v_measurement_samples;

 id | A_value  |  A_err   |  B_value   |   B_err    | C_value | C_err | D_value | D_err |  E_value  |  E_err
----+----------+----------+------------+------------+---------+-------+---------+-------+-----------+----------
  1 | 3.300000 | 1.200000 | 730.000000 | 112.000000 |         |       |         |       | 22.600000 | 4.700000
  2 | 2.100000 | 1.400000 |            |            |         |       |         |       |           |
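
If the pivoted layout is queried often, the same SELECT can back a materialised view instead (a sketch, trimmed to two characteristics; it trades freshness for speed and must be refreshed after each load):

    create materialized view public.mv_measurement_samples as
    select  id,
            sum(value) filter (where name='A') as "A_value",
            sum(err)   filter (where name='A') as "A_err",
            sum(value) filter (where name='B') as "B_value",
            sum(err)   filter (where name='B') as "B_err"
    from    public.measurement_samples
    group by id;

    -- after each bulk import:
    refresh materialized view public.mv_measurement_samples;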

Working dbfiddle example.

2 Comments

The normalisation to treat the name as a value is fine (it's close to an entity-attribute-value (EAV) model), unless you want to efficiently run queries comparing or selecting values within a single sample id, for which the materialised view is likely to be essential. If the MV proves to be important, why not just save the data like that in the first place? So it feels like either (i) use the EAV model, or (ii) save it as a regular relational table.
I agree. The initial structure is supposed to address the problem of sparse input data, while being an easy target to import into, pretty much directly reflecting the presented input layout. The end of my answer points away from that initial structure without assuming too much about how the data will be processed further - it's not specified by the question but would logically be the next problem to look into.
