Python loading image into memory (numpy arrays) from database bytes field fast

UPDATE: Answers to Reinderein's questions: the images are photographs of documents that will eventually be OCR'd. I'm not sure how lossy compression would affect the OCR quality. DPI is 320; size on disk is ~800 KB each.


I am looking for feedback on the function below, which loads a PNG image stored as a bytes field in MongoDB into NumPy arrays.

import io

from PIL import Image
import numpy as np


def bytes_to_matricies(image_bytes):
    """Decode image bytes (as read from MongoDB) into greyscale and RGB
    numpy matrices, plus the image height and width.
    """
    raw_image = Image.open(io.BytesIO(image_bytes))
    greyscale_matrix = np.array(raw_image.convert("L"))
    color_matrix = np.array(raw_image.convert("RGB"))

    n, m = greyscale_matrix.shape  # image height and width in pixels
    return greyscale_matrix, color_matrix, n, m

I have profiled my code with cProfile and found this function to be a major bottleneck. Any way to optimise it would be great. Note that I have compiled most of the project with Cython, which is why you'll see .pyx files in the profile; it hasn't made much difference.
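For reference, the profile below was produced along these lines (main() is just a stand-in for my actual driver function):

import cProfile
import pstats

cProfile.run("main()", "profile.out")  # main() stands in for the real entry point
stats = pstats.Stats("profile.out")
stats.sort_stats("tottime").print_stats(15)  # sorts by internal time, as shown below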

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       72  331.537    4.605  338.226    4.698 cleaner.pyx:154(clean_image)
        1  139.401  139.401  139.401  139.401 {built-in method builtins.input}
      356   31.144    0.087   31.144    0.087 {method 'recv_into' of '_socket.socket' objects}
    11253   15.421    0.001   15.421    0.001 {method 'encode' of 'ImagingEncoder' objects}
      706   10.561    0.015   10.561    0.015 {method 'decode' of 'ImagingDecoder' objects}
       72    5.044    0.070    5.047    0.070 {built-in method scipy.ndimage._ni_label._label}
     7853    0.881    0.000    0.881    0.000 cleaner.pyx:216(is_period)
       72    0.844    0.012    1.266    0.018 cleaner.pyx:349(get_binarized_matrix)
       72    0.802    0.011    0.802    0.011 {method 'convert' of 'ImagingCore' objects}
       72    0.786    0.011   13.167    0.183 cleaner.pyx:57(bytes_to_matricies)
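One idea I have sketched but not yet benchmarked: since the decode/convert calls dominate, decode the PNG once to RGB and derive the greyscale matrix with NumPy, using the same ITU-R 601-2 luma weights Pillow applies in "L" mode. Results can differ from Pillow's by one grey level per pixel due to rounding:

import io

import numpy as np
from PIL import Image


def bytes_to_matricies_fast(image_bytes):
    """Sketch: decode once, then compute greyscale from the RGB array."""
    raw_image = Image.open(io.BytesIO(image_bytes))
    color_matrix = np.asarray(raw_image.convert("RGB"))

    # ITU-R 601-2 luma transform: the same weights Pillow uses for "L" mode.
    weights = np.array([0.299, 0.587, 0.114])
    greyscale_matrix = (color_matrix @ weights).astype(np.uint8)

    n, m = greyscale_matrix.shape
    return greyscale_matrix, color_matrix, n, m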

If you are wondering how the images are encoded before being written to MongoDB, here is that code:

import io

def get_encoded_image(filename: str):
    """Encode an image file as PNG bytes."""
    image = filesystem_io.read_as_pillow(filename)  # reads the file on disk into a Pillow Image
    stream = io.BytesIO()
    image.save(stream, format='PNG')

    encoded_string = stream.getvalue()
    return encoded_string  # This will be written to MongoDB
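And the read side is essentially just pulling that bytes field back out of a document, something like this (the database, collection, and field names here are simplified stand-ins for mine):

from pymongo import MongoClient

client = MongoClient()                # connection details omitted
collection = client["mydb"]["pages"]  # stand-in database/collection names

doc = collection.find_one({"filename": "page_001.png"})    # stand-in query
greyscale, color, n, m = bytes_to_matricies(doc["image"])  # "image" is a stand-in field name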

Things I have tried:

  1. As mentioned above, compiling with Cython.
  2. The lycon library, but I could not see how to load from bytes.
  3. Pillow-SIMD. It made things slower.
  4. Multiprocessing (see the sketch after this list). But I want to optimise the function itself before I parallelize it.
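For item 4, the parallel version I have in mind is roughly this (untested sketch; the worker count is arbitrary):

from multiprocessing import Pool


def load_all_images(image_bytes_list, processes=4):
    """Decode a batch of images in parallel, one PNG per worker task."""
    with Pool(processes=processes) as pool:
        return pool.map(bytes_to_matricies, image_bytes_list)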

Thank you!