Using bit strings for modeling stable hierarchies in tables, is the faster solution?

Question

Purely relational databases (SQL) seem to suffer to nowadays from the lack of a good solution for search/indexing hierarchies. Here a systematic review of the subject: https://stackoverflow.com/q/4048151

What is surprising is that no "Lineage Column" solution makes use of a bit string column, in particular adopting this column as the primary key. Maybe this article, which called the strategy "Sort Path", suggests something along those lines... But I didn't see anything more explicit.

Illustrating

Fruits can form a stable (static) taxonomy for the owner of a grocery store. Below, each fruit in the grocery store's inventory received a bitString identifier. Apples were labeled with the prefix 0, oranges with the prefix 1. Within the set of apples, green apples receive the prefix 01.

When using SQL bitstring the "ORDER BY" use the lexicographic order. For example ordering all bitstrings, from 1 to 4 bits:

 00       apple "0"
 000      apple "00"
 0000     apple "000"
 01       green apple ""
 010      green apple "0"
 011      green apple "1"
 10       orange "0"
 101      orange "01"

As we can see, the lexicographical order is grouping taxons through intervals ("all apples" set, green apples subset, oranges set).

BitString labels are not numbers, but can be counters using successor function, so is valid as Primary Keys. For example the next green apple will be 0011 and the next orange 110.

In a finite tree, suppose 5 bits, we can define the function MaxCount5bits(x), to check the last child. For example for oranges MaxCount5bits('1') is 11111, for green apples MaxCount5bits('01') is 01111. That is, using standard bitString operators, MaxCount5bits(x) = substring(x || '11111',1,5).

The fruit table can be something like CREATE TABLE fruit (id bitString PRIMARY KEY, info text), so we can do any tree-search operation:

find all oranges: SELECT * FROM fruit WHERE id BETWEEN '1' AND MaxCount5bits('1')
find all green apples: SELECT * FROM fruit WHERE id BETWEEN '01' AND MaxCount5bits('01')
find all apples: SELECT * FROM fruit WHERE id BETWEEN '0' AND MaxCount5bits('0')
find the parent of the apple 01: no search is necessary, it is the prefix 0, that is something as substring(x,1,length(x)-1).
find the parent of the orange 101: 10 or 1 (because it is a two-bit-prefix taxonomy).
... reduced the taxonomic analysis to a simple "check BETWEEN at b-tree" operation, and "check ID prefix" operation.

And, since it is indexed as a primary key, it is fast (the search engine uses a b-tree).

Question

In view of the illustrated, the question is: is it the faster solution?
That is, best performance in any SQL-database that supports bitstrings and its lexicographical order b-tree indexation.

Seems also the the most economical solution: all operations are ID-based, no extra column, and with all integrity (reference and primary key) guaranteed.

PS: perhaps a more useful variant of the model is to adopt the first bits as taxonomic control and the remaining bits as a serial-integer counter. In the illustration, there are 2 taxonomic bits.

2 revs · Accepted Answer · 2025-03-07 23:44:55Z

Let's compare, using PostgreSQL v16 in a Notebook with Intel i7:
LTREE module vs "VARBIT as pure bit strings".

Sources: https://GIST.github.com/ppKrauss/226fd3fb0dca0a6c6d4f875cb93aa2db

Testing that is the same logic

As the same as SQL Server's IsDescendantOf. At GIST all other @Charlieface examples.

CREATE TABLE treetest.fruit (
    id          bigint  NOT NULL PRIMARY KEY,
    binpath     varbit  NOT NULL,
    charpath    ltree   NOT NULL, 
    info        text
);
--find all apples:
SELECT * FROM treetest.fruit 
-- WHERE  charpath <@ '0'::ltree;
WHERE binpath BETWEEN b'0' AND treetest.MaxBits(b'0');

id	binpath	charpath	info
4	00	0.0	apple "0"
8	000	0.0.0	apple "00"
16	0000	0.0.0.0	apple "000"
5	01	0.1	green apple ""
10	010	0.1.0	green apple "0"
11	011	0.1.1	green apple "1"

Benchmark

Comparing performances by psql's \timing command. The function treetest.generate_vbit_series(m) generates all bit strings, from empty to m bits. Using m in {16, 20, 22). This and all other functions in the cited GIST. Example populating a table, see all at GIST:

CREATE TABLE treetest.fruit_vbit (
    id          bigint  NOT NULL PRIMARY KEY,
    binpath     varbit  NOT NULL
);
INSERT INTO treetest.fruit_vbit
 SELECT treetest.vBit_to_hiddenBig(x), x 
 FROM treetest.generate_vbit_series(22) t(x)
; -- 8388606 rows for 22 bits.
  -- 2097150 rows for 20 bits; 131070 rows for 16 bits.

CPU time

Here the main performance test, repeated for each m value (14 to 22).

SELECT COUNT(*) n_t1_r1 FROM treetest.fruit_ltree
WHERE  charpath <@ '1.0'::ltree;

SELECT COUNT(*) n_t1_r2 FROM treetest.fruit_vbit
WHERE binpath BETWEEN b'10' AND treetest.MaxBits(b'10');

------
SELECT COUNT(*) n_t2_r1 FROM treetest.fruit_ltree
WHERE  charpath <@ '1.0.1.1.0.1.0.1.1.0.0.1'::ltree;

SELECT COUNT(*) n_t2_r2 FROM treetest.fruit_vbit
WHERE binpath BETWEEN b'101101011001' AND treetest.MaxBits(b'101101011001');

m bits	test	n result	LTree	vBit
14	1	n_t1=8191	14 ms	9 ms
14	2	n_t2=7	12 ms	8 ms
16	1	n_t1=32767	34 ms	27 ms
16	2	n_t2=31	21 ms	10 ms
22	1	n_t1=2097151	1500 ms	300 ms
22	2	n_t2=2047	900 ms	700 ms

--- INDEXING:
CREATE INDEX fruit_ltree_idx1 ON treetest.fruit_ltree USING GIST (charpath);
CREATE UNIQUE INDEX fruit_vbit_idx1 ON treetest.fruit_vbit (binpath);

m bits	test	n result	LTree	vBit
22	1	n_t1=2097151	1080 ms	260 ms
22	2	n_t2=2047	0.78 ms	0.53 ms

Disc usage

relname	size	size_bytes
fruit_ltree_idx1	3380 MB	3544358912
fruit_vbit_idx1	180 MB	188440576

fruit_ltree	1720 MB	1803550720
fruit_vbit	354 MB	371458048

fruit_vbit_pkey	252 MB	264232960
fruit_ltree_pkey	252 MB	264232960

CONCLUSION: vbit indexed is the best, for performance and disc usage. vbit was 4 times faster in large retrieve (1.5 faster in small). The vbit index use 18 times less disc, and the vbit column consumes 4 times less.

Charlieface · Accepted Answer · 2025-03-04 18:03:36Z

SQL Server already has this with the hierarchyid type, many of whose functions are understood by the optimizer and converted into seeks. Postgres seems to have similar functionality using the ltree module, although I have no experience with it.

CREATE TABLE Fruit (
    id hierarchyid PRIMARY KEY,
    info varchar(100) NOT NULL
);

INSERT Fruit (id, info)
VALUES
 ('/0/0/',  'apple "0"'),
 ('/0/0/0/',   'apple "00"'),
 ('/0/0/0/0/',    'apple "000"'),
 ('/0/1/',  'green apple ""'),
 ('/0/1/0/',   'green apple "0"'),
 ('/0/1/1/',   'green apple "1"'),
 ('/1/0/',  'orange "0"'),
 ('/1/0/1/',   'orange "01"');

--find all oranges:
SELECT *
FROM fruit
WHERE id.IsDescendantOf('/1/') = 1;
  
--find all green apples:
SELECT *
FROM fruit
WHERE id.IsDescendantOf('/0/1/') = 1;
  
--find all apples:
SELECT *
FROM fruit
WHERE id.IsDescendantOf('/0/') = 1;
  
--find the parent of the apple 01:
SELECT id.GetAncestor(1) parent
FROM fruit
WHERE id = '/0/1/';
  
--find the parent of the orange 101:
SELECT id.GetAncestor(1) parent
FROM fruit
WHERE id = '/1/0/1/';

You can see from this fiddle that these are all seeks.

A more correct solution would store the whole tree including the missing root rows. And add a self-referencing foreign key so that an item can only exist if its parent also does.

CREATE TABLE Fruit (
    id hierarchyid PRIMARY KEY,
    info varchar(100) NOT NULL,
    parentId AS (id.GetAncestor(1)) PERSISTED REFERENCES Fruit (id),
    INDEX ix_parentId (parentId, id) INCLUDE (info)
);

INSERT Fruit (id, info)
VALUES
 ('/', 'root'),
 ('/0/', 'all apples'),
 ('/0/0/',  'apple "0"'),
 ('/0/0/0/',   'apple "00"'),
 ('/0/0/0/0/',    'apple "000"'),
 ('/0/1/',  'green apple ""'),
 ('/0/1/0/',   'green apple "0"'),
 ('/0/1/1/',   'green apple "1"'),
 ('/1/',   'all oranges'),
 ('/1/0/',  'orange "0"'),
 ('/1/0/1/',   'orange "01"');

Thanks @Charlieface, good concrete examples (!), and I am reusing them at Wiki answer. You showed what hierarchies article named "Sort Path" (and this answer named "Lineage Column" or "Materialized Path" or "Path Enumeration"). Yes, LTREE has same functionality... But the question was "is it the faster solution?", so we need a benchmark: I do something also in the Wiki answer, and you can edit there. — Peter Krauss
– Peter Krauss, Commented Mar 7 at 23:27

Stack Exchange Network

Using bit strings for modeling stable hierarchies in tables, is the faster solution?

Illustrating

Question

2 Answers 2

Testing that is the same logic

Benchmark

CPU time

Disc usage

Your Answer

Hot Network Questions

Using bit strings for modeling stable hierarchies in tables, is the faster solution?

Illustrating

Question

2 Answers 2

Testing that is the same logic

Benchmark

CPU time

Disc usage

Your Answer

Sign up or log in

Post as a guest

Related

Hot Network Questions