
I'm trying to calculate hamming distance for pairs of long integers (20 digits each) in a Django app using the pg_similarity extension for Postgres, and am having a hard time figuring out how to do this. Django does not seem to have a current BitString field (which would be ideal, but django_postgres seems to be defunct), so I was trying to just cast the integers into bitstrings in the SQL query itself. My current query:

    sql = ''' SELECT id, hamming(
        "HashString"::BIT(255),
        %s::BIT(255)
    ) as hamming_distance
    FROM images
    HAVING hamming_distance < %s
    ORDER BY hamming_distance;'''

is throwing a DB error: cannot cast type numeric to bit. What am I doing wrong? What else could I try?

  • Try converting your integer to a BitString with this package: pypi.python.org/pypi/bitstring/3.1.3 Commented Oct 9, 2014 at 20:14
  • I tried that, but the Python BitString appears to cut off at 66 digits. The problem seems to be that Postgres bit fields can't be longer than some maximum (since the code above works fine with smaller integers). Is there a good way around that? Commented Oct 9, 2014 at 20:35
  • In that case I would do it in Python: get the images, iterate through them, and check the Hamming distance. Commented Oct 9, 2014 at 21:00
  • Because your 20-digit integers are out of range for a 64-bit integer (int8 or bigint) value, you're using numeric, but there's no cast from numeric to bit. This will be "interesting" as there's also no bitshift support for numerics, as they're arbitrary precision decimal floating point. Commented Oct 10, 2014 at 2:00
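The pure-Python fallback suggested in the comments works for integers of any length, since Python ints are arbitrary precision. A minimal sketch (the `hamming` helper here is hypothetical, not part of any library):

```python
def hamming(a, b):
    # XOR leaves a 1 bit wherever the two values differ,
    # so the distance is the popcount of a ^ b
    return bin(a ^ b).count('1')

# works fine on 20-digit integers; an even n and n + 1 differ
# only in the lowest bit, so the distance is 1
print(hamming(12345678901234567890, 12345678901234567891))  # → 1
```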

4 Answers


Per the manual, casting is the correct approach if your "long integer" is actually a "long integer" i.e. bigint / int8:

regress=> SELECT ('1324'::bigint)::bit(64);
                               bit                                
------------------------------------------------------------------
 0000000000000000000000000000000000000000000000000000010100101100
(1 row)

but (edit) you're actually asking how to cast an integer-only numeric to bit. Not so simple, hold on.

You can't bitshift numeric either, so you can't easily bitshift it into 64-bit chunks, convert, and reassemble.

You'll have to use division and modulus instead.
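For non-negative integers, division and modulus by a power of two are exactly equivalent to shifting and masking, which is why the divide-and-modulus chunking works at all. A quick sanity check of that identity in Python (values chosen here purely for illustration):

```python
n = 1792913810350008736973055638379610855835  # a 40-digit value
M = 2 ** 63

# floor division by 2**63 is a 63-bit right shift...
assert n // M == n >> 63
# ...and modulus by 2**63 keeps only the low 63 bits
assert n % M == n & (M - 1)
```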

Given:

SELECT '1792913810350008736973055638379610855835'::numeric(40,0);

you can get it in bigint-sized chunks that, when each is multiplied by max-long (9223372036854775807) raised to its place value and the results are summed, produce the original value.

e.g. this extracts one chunk:

SELECT ('1792913810350008736973055638379610855835'::numeric(40,0) / '9223372036854775807'::numeric(256,0)) % '9223372036854775807'::numeric(40,0);

and this gets all the chunks, with their exponents, for a given value of up to 256 digits:

WITH numval(v) AS (VALUES ('1792913810350008736973055638379610855835'::numeric(40,0)))
SELECT exponent, floor(v / ('9223372036854775807'::numeric(256,0) ^ exponent) % '9223372036854775807'::numeric(40,0)) from numval, generate_series(1,3) exponent;

You can reassemble this into the original value:

WITH
  numval(v) AS (
    VALUES ('1792913810350008736973055638379610855835'::numeric(40,0))
  ),
  chunks (exponent, chunk) AS (
     SELECT exponent, floor(v / ('9223372036854775807'::numeric(40,0) ^ exponent) % '9223372036854775807'::numeric(40,0))::bigint from numval, generate_series(1,3) exponent
  )
SELECT floor(sum(chunk::numeric(40,0) * ('9223372036854775807'::numeric(40,0) ^ exponent))) FROM chunks;

so we know it's decomposed correctly.
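The same divide-and-modulus decomposition is easy to mirror in Python as a sanity check. Note this sketch uses base 2**63 (one more than max-long) and starts the exponent at 0 so the lowest chunk is included; the follow-up answer below applies the same two fixes to the SQL:

```python
M = 2 ** 63  # one more than the largest signed bigint

def chunks(v, n=3):
    # floor(v / M**e) % M extracts the base-M digit at place e
    return [(v // M ** e) % M for e in range(n)]

def reassemble(cs):
    # sum chunk * M**e over each place to recover the original
    return sum(c * M ** e for e, c in enumerate(cs))

v = 1792913810350008736973055638379610855835
assert reassemble(chunks(v)) == v  # decomposes and reassembles losslessly
```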

Now that we're working with a series of 64-bit integers, we can convert each into a bitfield. Because we're using signed integers, each has only 63 significant bits, so:

WITH
  numval(v) AS (
    VALUES ('1792913810350008736973055638379610855835'::numeric(40,0))
  ),
  chunks (exponent, chunk) AS (
     SELECT exponent, floor(v / ('9223372036854775807'::numeric(40,0) ^ exponent) % '9223372036854775807'::numeric(40,0))::bigint from numval, generate_series(1,3) exponent
  )
SELECT
  exponent,
  chunk::bit(63)
FROM chunks;

gives us the bit values for each 63-bit chunk. We can then reassemble them. There's no bitfield concatenation operator, but we can shift and bit_or, then wrap it into an SQL function, producing the monstrosity:

CREATE OR REPLACE FUNCTION numericint40_to_bit189(numeric(40,0)) RETURNS bit(189)
LANGUAGE sql
AS
$$
    WITH
      chunks (exponent, chunk) AS (
         SELECT exponent, floor($1 / ('9223372036854775807'::numeric(40,0) ^ exponent) % '9223372036854775807'::numeric(40,0))::bigint 
         FROM generate_series(1,3) exponent
      )
    SELECT
      bit_or(chunk::bit(189) << (63*(exponent-1)))
    FROM chunks;
$$;

which can be seen in use here:

regress=> SELECT numericint40_to_bit189('1792913810350008736973055638379610855835');
                                                                                    numericint40_to_bit189                                                                                     
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010101000100110101101010001110110110101001111100011100011110000010110
(1 row)


Thanks for the initial answer, Craig Ringer! Here is a corrected version of the function. It supports up to 300 bits and can be expanded if needed.

CREATE OR REPLACE FUNCTION numeric_to_bit(NUMERIC)
  RETURNS BIT VARYING AS $$
DECLARE
  num ALIAS FOR $1;
  -- 1 + largest positive BIGINT --
  max_bigint NUMERIC := '9223372036854775808' :: NUMERIC(19, 0);
  result BIT VARYING;
BEGIN
  WITH
      chunks (exponent, chunk) AS (
        SELECT
          exponent,
          floor((num / (max_bigint ^ exponent) :: NUMERIC(300, 20)) % max_bigint) :: BIGINT
        FROM generate_series(0, 5) exponent
    )
  SELECT bit_or(chunk :: BIT(300) :: BIT VARYING << (63 * (exponent))) :: BIT VARYING
  FROM chunks INTO result;
  RETURN result;
END;
$$ LANGUAGE plpgsql;
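The corrected function's shift-and-bit_or assembly can be mirrored in Python, which makes it easy to check that the resulting bit pattern equals the number's ordinary binary representation (a property the base-2**63 chunking paired with 63-bit shifts guarantees):

```python
M = 2 ** 63  # 1 + largest positive BIGINT, as in max_bigint above

def numeric_to_bits(n, nchunks=6):
    # extract base-2**63 chunks and OR each back in at 63 bits
    # per place, mirroring bit_or(chunk :: BIT(300) << 63 * exponent)
    result = 0
    for exponent in range(nchunks):
        chunk = (n // M ** exponent) % M
        result |= chunk << (63 * exponent)
    return result

n = 1792913810350008736973055638379610855835
assert numeric_to_bits(n) == n  # the bit pattern round-trips exactly
```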


Here's something that will help convert arbitrary-length numerics to varbits in PostgreSQL.

-- function to convert numerics to bit(n)
create or replace function numeric2bit(n numeric, typemod integer) returns varbit as
$$
declare
    r varbit := b'';
    width int := 8;
    modulo numeric := 2 ^ width;
    remainder bigint := 0;
begin
    while typemod > 8 loop
        remainder := n % modulo;
        -- make sure to adjust the size of the bit(n) type to match the width var
        r := remainder::bit(8) || r;
        n := (n - remainder) / modulo;
        typemod := typemod - 8;
    end loop;

    if typemod > 0 then
        while typemod != 0 loop
            remainder := n % 2;
            -- make sure to adjust the size of the bit(n) type to match the width var
            r := remainder::bit || r;
            n := (n - remainder) / 2;
            typemod := typemod - 1;
        end loop;
    end if;

    return r;
end
$$ language plpgsql;

-- function to convert numerics to varbit
create or replace function numeric2varbit(n numeric) returns varbit as
$$
begin
    return numeric2bit(n, ceil(log(2, n))::integer);
end
$$ language plpgsql;

-- casts for numeric::bit(n)
create cast (numeric as bit) with function numeric2bit(numeric, integer);

-- casts for numeric::varbit
create cast (numeric as varbit) with function numeric2varbit;

do
$$
begin
    assert 2482139123891291248192194912049102::numeric::varbit =
           b'111101001100001000000000001000011110001001101100011101100001101111111111100000001101110010011011101011111001110';
    assert 255::numeric::bit(16) = b'0000000011111111';
end
$$;
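The byte-at-a-time loop translates almost line for line into Python, which is a handy way to check the expected outputs. (One caveat with the `numeric2varbit` wrapper above: `ceil(log(2, n))` undercounts by one bit when `n` is an exact power of two; Python's `n.bit_length()` sidesteps that.)

```python
def numeric2bit(n, typemod):
    # peel off 8 bits at a time while more than a byte remains...
    r = ''
    while typemod > 8:
        remainder = n % 256
        r = format(remainder, '08b') + r
        n = (n - remainder) // 256
        typemod -= 8
    # ...then single bits for what's left
    while typemod > 0:
        remainder = n % 2
        r = str(remainder) + r
        n = (n - remainder) // 2
        typemod -= 1
    return r

assert numeric2bit(255, 16) == '0000000011111111'
n = 2482139123891291248192194912049102
assert numeric2bit(n, n.bit_length()) == bin(n)[2:]
```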


Try it with Python:

    results = sorted(
        hamming(image.HashString, target_hash)
        for image in Image.objects.all()
        if hamming(image.HashString, target_hash) < threshold
    )

(where target_hash and threshold stand in for the query parameters)

Comments

  • Isn't that going to be awfully expensive?
  • It is O(n), sure. But why recalculate the Hamming distance on every request? In this case I would use database storage, or a table with already-calculated Hamming distances. Adding a new image can trigger calculating the Hamming distance and storing the result in the db.
  • Pulling all the data out of the database just to compute a Hamming distance is what will probably be expensive. And a Hamming distance is the distance between two values, so precomputing and storing all the possible distances isn't going to be realistic. Also, your syntax seems to be mixing Python and SQL in a strange way; I'm not a Python guy but it looks wrong to me.
  • Is this Django?
