I think the important question is whether you need to have access to all the data at the same time.
If you only need to access one chunk of data at a time
If you only need to access one array at a time, one Pythonic possibility is to use NumPy arrays with data type uint8 and width as needed. When you need to operate on the data, you expand the compressed data (here 3-octet numbers into uint32):
import numpy as np
# in this example `compressed` is an N-by-3 array of octets (`uint8`)
expanded = np.empty((compressed.shape[0], 4), dtype=np.uint8)
expanded[:, :3] = compressed
expanded[:, 3] = 0
expanded = expanded.view('uint32').reshape(-1)
Then the operations are performed on expanded, which is a 1-d vector of N uint32 values. (This assumes a little-endian machine, where the low-order octet comes first.)
After we are done, the data can be saved back:
# recompress
compressed[:] = expanded.view('uint8').reshape(-1,4)[:,:3]
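Putting the two directions together, here is a self-contained round-trip sketch (the array size and the increment operation are mine, for illustration; as above, this assumes the 3-octet values are stored low octet first on a little-endian machine):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000

# N packed 3-octet values (low octet first)
compressed = rng.integers(0, 256, size=(N, 3), dtype=np.uint8)

# expand: pad each 3-octet value with a zero high octet
expanded = np.empty((N, 4), dtype=np.uint8)
expanded[:, :3] = compressed
expanded[:, 3] = 0
values = expanded.view('uint32').reshape(-1)

# operate on the uint32 view, e.g. increment modulo 2**24
values = (values + 1) % (1 << 24)

# recompress: drop the high octet again
compressed[:] = values.view('uint8').reshape(-1, 4)[:, :3]
```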
The time taken for each direction is (on my machine, with Python) approximately 8 ns per element for the example above. Using Cython is unlikely to give much performance advantage here, because almost all of the time is spent copying the data between buffers somewhere in the dark depths of NumPy.
This is a high one-time cost, but if you plan to access each element at least once, paying it once is probably cheaper than paying a similar cost on every operation.
Of course, the same approach can be taken in C:
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <sys/resource.h>
#define NUMITEMS 10000000
int main(void)
{
    uint32_t *expanded;
    uint8_t *compressed, *exp_as_octets;
    struct rusage ru0, ru1;
    uint8_t *ep, *cp, *end;
    double time_delta;

    // create some compressed data
    compressed = (uint8_t *)malloc(NUMITEMS * 3);

    getrusage(RUSAGE_SELF, &ru0);
    // allocate the expanded buffer and copy the data
    exp_as_octets = (uint8_t *)malloc(NUMITEMS * 4);
    end = exp_as_octets + NUMITEMS * 4;
    ep = exp_as_octets;
    cp = compressed;
    while (ep < end)
    {
        // copy three octets, zero the fourth
        *ep++ = *cp++;
        *ep++ = *cp++;
        *ep++ = *cp++;
        *ep++ = 0;
    }
    expanded = (uint32_t *)exp_as_octets;
    getrusage(RUSAGE_SELF, &ru1);

    printf("Uncompress\n");
    time_delta = ru1.ru_utime.tv_sec + ru1.ru_utime.tv_usec * 1e-6
               - ru0.ru_utime.tv_sec - ru0.ru_utime.tv_usec * 1e-6;
    printf("User: %.6lf seconds, %.2lf nanoseconds per element\n",
           time_delta, 1e9 * time_delta / NUMITEMS);
    time_delta = ru1.ru_stime.tv_sec + ru1.ru_stime.tv_usec * 1e-6
               - ru0.ru_stime.tv_sec - ru0.ru_stime.tv_usec * 1e-6;
    printf("System: %.6lf seconds, %.2lf nanoseconds per element\n",
           time_delta, 1e9 * time_delta / NUMITEMS);

    getrusage(RUSAGE_SELF, &ru0);
    // compress back: copy three octets, skip the padding octet
    ep = exp_as_octets;
    cp = compressed;
    while (ep < end)
    {
        *cp++ = *ep++;
        *cp++ = *ep++;
        *cp++ = *ep++;
        ep++;
    }
    getrusage(RUSAGE_SELF, &ru1);

    printf("Compress\n");
    time_delta = ru1.ru_utime.tv_sec + ru1.ru_utime.tv_usec * 1e-6
               - ru0.ru_utime.tv_sec - ru0.ru_utime.tv_usec * 1e-6;
    printf("User: %.6lf seconds, %.2lf nanoseconds per element\n",
           time_delta, 1e9 * time_delta / NUMITEMS);
    time_delta = ru1.ru_stime.tv_sec + ru1.ru_stime.tv_usec * 1e-6
               - ru0.ru_stime.tv_sec - ru0.ru_stime.tv_usec * 1e-6;
    printf("System: %.6lf seconds, %.2lf nanoseconds per element\n",
           time_delta, 1e9 * time_delta / NUMITEMS);

    (void)expanded;
    return 0;
}
This reports:
Uncompress
User: 0.022650 seconds, 2.27 nanoseconds per element
System: 0.016171 seconds, 1.62 nanoseconds per element
Compress
User: 0.011698 seconds, 1.17 nanoseconds per element
System: 0.000018 seconds, 0.00 nanoseconds per element
The code was compiled with gcc -Ofast and is probably relatively close to the optimal speed. The system time is spent in malloc. To my eye this looks pretty fast, as we are doing memory reads at 2-3 GB/s. (Which also means that while making the code multi-threaded would be easy, there may not be much speed benefit.)
If you want the best performance, you will need to code the compression/decompression routines separately for each data width. (I do not promise the C code above is the absolute fastest on any machine; I did not look at the machine code.)
If you need random access to separate values
If you instead need to access only one value here and another there, Python will not offer any reasonably fast methods, as the per-element lookup overhead is huge.
In this case I suggest you create C routines to fetch the data and put it back. See technosaurus's answer. There are many tricks, but the alignment problems cannot be avoided.
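For contrast, a pure-Python fetch/put pair for single values looks roughly like this (a sketch with names of my own; values stored low octet first, as elsewhere in this answer). Every call pays the interpreter overhead that the C routines avoid:

```python
def fetch(buf, n):
    """Read the n-th packed 3-octet (little-endian) value."""
    return int.from_bytes(buf[3 * n:3 * n + 3], 'little')

def put(buf, n, value):
    """Write the n-th packed 3-octet (little-endian) value."""
    buf[3 * n:3 * n + 3] = value.to_bytes(3, 'little')

buf = bytearray(3 * 4)      # room for four packed values
put(buf, 2, 0x123456)
print(hex(fetch(buf, 2)))   # -> 0x123456
```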
One useful trick when reading the odd-sized array might be (here reading 3 octets from an octet array into a uint32_t value):
value = *(uint32_t *)&compressed[3 * n] & 0x00ffffff;
Then someone else will take care of the possible misalignment, and the one octet of garbage read past the value is masked off. Unfortunately this cannot be used when writing the values. And, again, this may or may not be faster than any of the other alternatives.
C compilers also offer packed structures (__attribute__((packed)) or such) to enforce speed-inefficient, but space-efficient code/data. Venture carefully into those dark packed woods, though; you may not come out.