0

I'm writing a hash function to create hashes of some given size (e.g. 20 bits).

I have learnt how to write the hashes to files in a binary form (see my related question here), but now I would like to handle these hashes in Python (2.7) using the minimum memory allocation. Right now they are typed as int, so they are allocated 24 bytes each, which is huge for a 20 bits object.

How can I create a custom Python object of arbitrary size (e.g. in my case 3 bytes)?

2
  • 2
    You can't; not at the Python level, at any rate. Commented Apr 27, 2015 at 11:12
  • If you want such low-level control over your memory, Python is the wrong language. It is just not designed for that sort of use-case. Commented Apr 27, 2015 at 11:57

1 Answer 1

3

You could do something like you want by packing the bits for each object into a packed array of bit (or boolean) values. There are a number of existing Python bitarray extension modules available. Implementing a higher level "array of fixed bit width integer values" with one is a relatively straight-forward process.

Here's an example based on one in pypi that's implemented in C for speed. You can also download an unofficial pre-built Windows version of it, created by Christoph Gohlke, from here.

Updated — Now works in Python 2.7 & 3.x.

from __future__ import print_function
# uses https://pypi.python.org/pypi/bitarray
from bitarray import bitarray as BitArray
try:
    from functools import reduce  # Python 3.
except:
    pass


class PackedIntArray(object):
    """ Packed array of unsigned fixed-bit-width integer values. """
    def __init__(self, array_size, item_bit_width, initializer=None):
        self.array_size = array_size
        self.item_bit_width = item_bit_width
        self.bitarray = BitArray(array_size * item_bit_width)
        if initializer is not None:
            try:
                iter(initializer)
            except TypeError:  # not iterable
                self.bitarray.setall(initializer) # set all to bool(initializer)
            else:
                for i in xrange(array_size):
                    self[i] = initializer[i]  # must be same length as array

    def __getitem__(self, index):
        offset = index * self.item_bit_width
        bits = self.bitarray[offset: offset+self.item_bit_width]
        return reduce(lambda x, y: (x << 1) | y, bits, 0)

    def __setitem__(self, index, value):
        bits = BitArray('{:0{}b}'.format(value, self.item_bit_width))
        offset = index * self.item_bit_width
        self.bitarray[offset: offset+self.item_bit_width] = bits

    def __len__(self):
        """ Return the number of items stored in the packed array.. """
        return self.array_size

    def length(self):
        """ Return the number of bits stored in the bitarray.. """
        return self.bitarray.length()

    def __repr__(self):
        return('PackedIntArray({}, {}, ('.format(self.array_size,
                                                    self.item_bit_width) +
               ', '.join((str(self[i]) for i in xrange(self.array_size))) +
               '))')


if __name__ == '__main__':
    from random import randrange

    # hash function configuration
    BW = 8, 8, 4  # bit widths of each integer
    HW = sum(BW)  # total hash bit width

    def myhash(a, b, c):
        return (((((a & (2**BW[0]-1)) << BW[1]) |
                    b & (2**BW[1]-1)) << BW[2]) |
                    c & (2**BW[2]-1))

    hashes = PackedIntArray(3, HW)

    print('hash bit width: {}'.format(HW))
    print('length of hashes array: {:,} bits'.format(hashes.length()))
    print()
    print('populate hashes array:')
    for i in range(len(hashes)):
        hashed = myhash(*(randrange(2**bit_width) for bit_width in BW))
        print('  hashes[{}] <- {:,} (0b{:0{}b})'.format(i, hashed, hashed, HW))
        hashes[i] = hashed
    print()
    print('contents of hashes array:')
    for i in range(len(hashes)):
        print(('  hashes[{}]: {:,} '
                '(0b{:0{}b})'.format(i, hashes[i], hashes[i], HW)))

Sample output:

hash bit width: 20
length of hashes array: 60 bits

populate hashes array:
  hashes[0] <- 297,035 (0b01001000100001001011)
  hashes[1] <- 749,558 (0b10110110111111110110)
  hashes[2] <- 690,468 (0b10101000100100100100)

contents of hashes array:
  hashes[0]: 297,035 (0b01001000100001001011)
  hashes[1]: 749,558 (0b10110110111111110110)
  hashes[2]: 690,468 (0b10101000100100100100)

Note: bitarray.bitarray objects also have methods to write and read their bits to and from files. These could be used to also provide similar functionality to the PackedIntArray class above.

Sign up to request clarification or add additional context in comments.

2 Comments

Great, thank you. One detail: I guess the last argument of the myhash call should be randrange(16).
Technically, yes, it should have been randrange(16), but the third value passed to myhash() is masked to 4-bits, so it didn't cause a noticeable problem. Fixed it anyway and optimized myhash() function, as well.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.