What is the difference between the smallint type and the bool type for storing boolean values?
This question arose in the comments to a question on Geographic Information Systems Stack Exchange.
What is the difference between the smallint type and the bool type for storing boolean values?
This question arose in the comments to a question on Geographic Information Systems Stack Exchange.
Always store boolean data as boolean. Only exotic exception imaginable.
Just to address the storage angle in addition to what you posted as answer:
boolean requires 1 byte on disk, smallint requires 2. But that's not the whole story.
smallint (like other integer types and unlike boolean) also has special needs for alignment padding. It can only start at an even offset from the start of the tuple data. So another byte is consumed every odd time (literally).
In a worst case scenario, when mixing with types that require 8-byte alignment like bigint or timestamp / timestamptz:
SELECT pg_column_size(row("char" 'a', FALSE )) AS char_bool
, pg_column_size(row("char" 'a', int2 '1')) AS char_int2
, pg_column_size(row(text 'abcdef', TRUE , now())) AS text7_bool_ts
, pg_column_size(row(text 'abcdef', int2 '1', now())) AS text7_int2_ts; -- worst case
char_bool | char_int2 | text7_bool_ts | text7_int2_ts
-----------+-----------+---------------+---------------
26 | 28 | 40 | 48
Details:
If you have many boolean NOT NULL values and want to optimize space on disk:
First, the size of a smallint is two bytes which is twice the size of a bool:
SELECT
pg_column_size(1::bool) AS "bool"
, pg_column_size(1::smallint) AS "si";
bool | si
------+-----
1 | 2
Let's create a small table with 1000 rows and check it out:
CREATE TABLE foo AS
SELECT
CASE WHEN x>0.5 THEN 1::smallint ELSE 0::smallint END AS si,
CASE when x>0.5 THEN true ELSE false END AS b
FROM ( SELECT random() AS x FROM generate_series(1,1e3) ) AS t;
CREATE INDEX si ON foo (si);
CREATE INDEX b ON foo (b);
Now we can see the table with 40kB:
test=# \dt+ foo
List of relations
Schema | Name | Type | Size
--------+------+-------+-------
public | foo | table | 40 kB
Both indexes are the same size (40kB). And all self-joins (returning 500050 rows in my case) using a seq scan and index scan (with SET enable_seqscan=off) complete in the same amount of time:
EXPLAIN ANALYZE SELECT * FROM foo JOIN foo AS f2 USING (b);
EXPLAIN ANALYZE SELECT * FROM foo JOIN foo AS f2 USING (si);
So really the benefits are that:
3.WHERE isEmpty rather than WHERE empty = 1(int4, int2, int2, int2) vs an (int4, bool, bool, bool) index and you'll see a difference.
bool int2 bool int2 is the same size as int2 int2 int2 int2 but bool bool int int is smaller than all of them. iirc.
int4-int2-int2-int2 index needs 10 bytes per row (which is larger than 8. int4-bool-bool-bool needs 7 bytes (<= 8). I'm not entirely sure but the packing is in multiples of 8. Or it could be page splits, as you say. The order of columns also plays a role, at least in tables. Not sure about indexes.