Optimizing a for loop for changing pixels values using lookup table

Question

I tried to parallelize the loop and got a good result, but it is still not enough. This post is a follow-up to a recent one where I optimized other parts of the code using a lookup table and spatial and temporal relationships. Those parts are left out of the following code to keep it simple.

The loop in question is in the hist function. Do you have any suggestions for optimizing the loop so that it runs faster?

I think it is important to mention the hardware I'll be using: Ambarella's CV25. I know there are hardware-level optimizations such as SIMD. I'm not very familiar with that kind of low-level programming, but I'm open to any solution.

Here are more details about the hardware:

#include <opencv2/opencv.hpp>
#include <iostream>
#include <vector>

// Structure to hold cached parameters
struct Cache {
    std::vector<int> data_b;
    std::vector<int> data_g;
    std::vector<int> data_r;
    std::vector<uchar> lut_b;
    std::vector<uchar> lut_g;
    std::vector<uchar> lut_r;
};

// Function to compute simple example data and lookup tables
void compute_data(const cv::Mat& image, Cache& cache)
{
    // Simple example to initialize data
    cache.data_b.assign(256, 1);
    cache.data_g.assign(256, 2);
    cache.data_r.assign(256, 3);

    // Compute lookup tables
    cache.lut_b.resize(256);
    cache.lut_g.resize(256);
    cache.lut_r.resize(256);

    for (int i = 0; i < 256; i++) {
        cache.lut_b[i] = static_cast<uchar>(i);
        cache.lut_g[i] = static_cast<uchar>(i);
        cache.lut_r[i] = static_cast<uchar>(i);
    }
}

void hist(cv::Mat& image, Cache& cache, bool use_cache)
{
    if (!use_cache) {
        compute_data(image, cache);
    }

    // Apply transformation using lookup tables in parallel
    cv::parallel_for_(cv::Range(0, image.rows), [&](const cv::Range& range) {
        for (int i = range.start; i < range.end; ++i)
        {
            cv::Vec3b* row = image.ptr<cv::Vec3b>(i);
            for (int j = 0; j < image.cols; ++j)
            {
                cv::Vec3b& pxi = row[j];
                pxi[0] = cache.lut_b[pxi[0]];
                pxi[1] = cache.lut_g[pxi[1]];
                pxi[2] = cache.lut_r[pxi[2]];
            }
        }
    });
}

int main(int argc, char** argv)
{
    // Open the video file
    cv::VideoCapture cap("../video.mp4");
    if (!cap.isOpened()) {
        std::cerr << "Error opening video file" << std::endl;
        return -1;
    }

    // Get the frame rate of the video
    double fps = cap.get(cv::CAP_PROP_FPS);
    int delay = static_cast<int>(1000 / fps);

    // Create a window to display the video
    cv::namedWindow("Processed Video", cv::WINDOW_NORMAL);

    cv::Mat frame;
    Cache cache;
    int frame_count = 0;
    int recompute_interval = 5; // Recompute every 5 frames

    while (true) {
        cap >> frame;
        if (frame.empty()) {
            break;
        }

        // Determine whether to use the cache or recompute the data
        bool use_cache = (frame_count % recompute_interval != 0);

        // Process the frame using cached or recomputed parameters
        hist(frame, cache, use_cache);

        // Display the processed frame
        cv::imshow("Processed Video", frame);

        // Break the loop if 'q' is pressed
        if (cv::waitKey(delay) == 'q') {
            break;
        }

        frame_count++;
    }

    cap.release();
    cv::destroyAllWindows();

    return 0;
}

What is "faster"? Do you want it to run 10% faster, or 10x as fast?
– Cris Luengo, Commented Jul 22, 2024 at 19:41
@CrisLuengo I just want to go as far as I can with the optimization, but x10 is good enough
– Ja_cpp, Commented Jul 22, 2024 at 20:08
What parallelization method is your OpenCV built with? It does many different ones, maybe some are better than others? Some might start threads every time you start the parallel loop, instead of starting threads only once at the beginning of the program. Have you tried a different parallel model?
– Cris Luengo, Commented Jul 22, 2024 at 20:49
You can use any multi-threading library, or OpenMP, to simplify your work. Thread management is not trivial, if you can leave it to a library instead of starting with the low-level stdlib functionality you'll be better off. The only thing that parallel_for_ does is create threads, split the range into the number of threads, and call your worker function once within each thread. Your task is to move that thread creation to the start of the program.
– Cris Luengo, Commented Jul 23, 2024 at 20:52

G. Sliepen · Accepted Answer · 2024-07-24 15:26:13Z

Use `std::array` instead of `std::vector`

Since your lookup tables (LUTs) will have exactly 256 entries, just use std::array<…, 256> instead of std::vector for them. This avoids some pointer indirection and memory allocations.

Remove unused variables

Are the vectors data_* ever used? The code you have posted initializes them but doesn't actually use them for anything else. Just remove them.

Naming things

Why is the struct holding the lookup table named a Cache? I would just name it LUT or LookupTable.

The parameter use_cache is named deceptively. It doesn't tell whether to use the cache or not, since the function hist() will always use cache to transform the image. Instead, it determines whether to (re)calculate the lookup table. So I would rather rename it to recalculate_lut, but even better would be to remove that entirely, and if the caller wants the lookup table to be (re)calculated, it can call compute_data() itself.

compute_data() is also a very generic name. The function as it is now will create a linear lookup table, so perhaps rename it to compute_linear_lut().

Move functionality into the lookup table itself

Your struct Cache just holds data, nothing else. Consider turning it into a class LUT which also has functions to initialize the lookup tables and apply them to a pixel:

#include <array>
#include <cstdint>
#include <numeric> // For std::iota

class LUT {
    std::array<std::uint8_t, 256> r;
    std::array<std::uint8_t, 256> g;
    std::array<std::uint8_t, 256> b;

public:
    LUT() {
        std::iota(r.begin(), r.end(), 0);
        std::iota(g.begin(), g.end(), 0);
        std::iota(b.begin(), b.end(), 0);
    }

    // const, so that it can be called through a const LUT&
    cv::Vec3b operator()(cv::Vec3b input) const {
        return {b[input[0]], g[input[1]], r[input[2]]};
    }
};

By overloading operator(), you can apply the LUT like this:

void hist(…, const LUT& lut)
{
    …
    for (int j = 0; j < image.cols; ++j)
    {
        cv::Vec3b& pxi = row[j];
        pxi = lut(pxi);
    }
    …
}

It has some other advantages as well, as I'll show below.

You don't care about rows and columns

You are iterating over rows and columns, but applying a LUT is just a per-pixel operation that doesn't care about which row or column it is in. You can just iterate over all the elements of a cv::Mat. While you can still parallelize that using cv::parallel_for_, it's also possible to use C++'s own parallelization features. For example, you could write:

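// Needs #include <algorithm> and #include <execution>
void hist(cv::Mat& image, const LUT& lut)
{
    // Source and destination are the same range, so the image is transformed in place
    std::transform(std::execution::par, image.begin<cv::Vec3b>(), image.end<cv::Vec3b>(), image.begin<cv::Vec3b>(), lut);
}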

This makes use of the parallel form of std::transform(), which like OpenCV will automatically create threads to split the work amongst. While lut is a variable of type LUT, since it has an operator(), it can work like a function, so you can pass it to std::transform() here without having to wrap it into a lambda.
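To make the "function object instead of lambda" point concrete without needing OpenCV, here is a minimal sketch. It rests on assumptions that are not in the original code: `Pixel` is a stand-in for cv::Vec3b, `InvertLUT` and `apply_lut` are hypothetical names, and the table inverts each channel so the effect is visible. It uses the sequential overload of std::transform; an execution policy such as std::execution::par could be passed as an extra first argument.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <vector>

using Pixel = std::array<std::uint8_t, 3>;   // stand-in for cv::Vec3b, in B, G, R order

// Hypothetical table type: one 256-entry table per channel plus a call operator.
struct InvertLUT {
    std::array<std::uint8_t, 256> b{}, g{}, r{};

    InvertLUT() {
        for (int i = 0; i < 256; ++i) {
            b[i] = g[i] = r[i] = static_cast<std::uint8_t>(255 - i);
        }
    }

    // const, so the table can be applied through a const reference
    Pixel operator()(Pixel in) const {
        return {b[in[0]], g[in[1]], r[in[2]]};
    }
};

// The object itself is the unary operation; input and output ranges coincide,
// so the pixels are rewritten in place.
void apply_lut(std::vector<Pixel>& pixels, const InvertLUT& lut)
{
    std::transform(pixels.begin(), pixels.end(), pixels.begin(), lut);
}
```

Note that std::transform takes the function object by value, so the tables are copied once per call; at 768 bytes that should be negligible next to the per-pixel work.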

Framerate issues

It could very well be that hist() is not fast enough, depending on whether you compiled your code with optimizations enabled or not, and how large the frames of your movie are. However, regardless of how fast it is, your main() function will never display the processed frames at the right framerate. The problem is that in the while-loop, you read a frame, process it, display it, which will all take some amount of time, and only then will you wait for delay time. So each iteration of the loop will take more than delay time.

You will need to check the actual time (using std::chrono::steady_clock::now()), and then choose a delay value that compensates for the time already spent doing the other processing.
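One way to do that compensation is sketched below. The helper name `remaining_delay_ms` and its parameters are hypothetical; the current time is passed in as an argument only so that the function is easy to test. The result is clamped to 1 because cv::waitKey() treats a delay of 0 as "wait indefinitely".

```cpp
#include <algorithm>
#include <chrono>

// Hypothetical helper: how many milliseconds are left to wait so that one loop
// iteration lasts `frame_period` in total, given when the iteration started.
int remaining_delay_ms(std::chrono::steady_clock::time_point frame_start,
                       std::chrono::steady_clock::time_point now,
                       std::chrono::milliseconds frame_period)
{
    using std::chrono::milliseconds;
    const auto elapsed = std::chrono::duration_cast<milliseconds>(now - frame_start);
    const auto remaining = frame_period - elapsed;
    // Never return 0 or a negative value: processing may already have taken too long.
    return static_cast<int>(std::max<milliseconds::rep>(remaining.count(), 1));
}
```

In the while-loop you would record `std::chrono::steady_clock::now()` before `cap >> frame`, and then call `cv::waitKey(remaining_delay_ms(frame_start, std::chrono::steady_clock::now(), std::chrono::milliseconds(delay)))` instead of `cv::waitKey(delay)`.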

Faster image processing

It is likely that you could get better performance using the appropriate Arm NEON instructions, for example by making use of the TBL instruction. You could use compiler intrinsics instead of having to write assembly, but it will still require a good understanding of the Arm instruction set.

The Ambarella CV25S SoC seems to have dedicated hardware to do image processing, including support for color correction, which very likely is done using lookup tables, similar to your code. If you can find out how to make use of those hardware blocks, you can off-load the CPU. If the input video is in H.264 or H.265 format, then that SoC can also do the decoding of that for you. Maybe OpenCV already makes use of that, but if not, then it will have to do it on the CPU, which might be a bit much for a Cortex-A53.

Thank you for the detailed answer, very informative. I've tried to implement your remarks in my answer. As for the intrinsics solution, I need to learn that first; I have no experience with it yet.
– Ja_cpp, Commented Jul 29, 2024 at 19:22

Cris Luengo · Accepted Answer · 2024-07-30 15:49:49Z

This is not a code review, I just wanted to show a way to create threads only once at the start of the program.

I'm using OpenMP for parallelism here, because it's the system I know best. It is very easy to use, but also doesn't allow for very fancy stuff. OpenMP is implemented by your compiler. You need to enable OpenMP both in the compilation and the linking step. The compiler will ignore the OpenMP pragmas if you don't enable it, making the program single-threaded.

This is the exact same code as in the OP, I didn't bother to change anything except adding the OpenMP pragmas. I also had to move the code from hist() into main(), I don't know if the original logic is possible using OpenCV. I have not even tried to compile the code, things might not work as advertised, but this is more or less what it would look like:

int main( int argc, char** argv ) {
   // Open the video file
   cv::VideoCapture cap( "../video.mp4" );
   if( !cap.isOpened() ) {
      std::cerr << "Error opening video file" << std::endl;
      return -1;
   }

   // Get the frame rate of the video
   double fps = cap.get( cv::CAP_PROP_FPS );
   int delay = static_cast< int >( 1000 / fps );

   // Create a window to display the video
   cv::namedWindow( "Processed Video", cv::WINDOW_NORMAL );

   cv::Mat frame;
   Cache cache;
   int frame_count = 0;
   int recompute_interval = 5; // Recompute every 5 frames
   bool no_more_frames = false; // Shared flags, written only by the master thread: a `break` may not
   bool quit_requested = false; // jump out of an OpenMP block, so the block sets a flag instead

   #pragma omp parallel
   while( true ) {                                 // The parallel section starts here, we've got all threads running now

      #pragma omp master                           // The next code block is run only by the master thread
      {
         cap >> frame;
         if( frame.empty() ) {
            no_more_frames = true;
         } else if( frame_count % recompute_interval == 0 ) {
            // Recompute the data every few frames
            compute_data( frame, cache );          // You'll have to figure out how to do this one in parallel too
         }
      }
      #pragma omp barrier                          // The other threads wait until the master thread is done with the code above
      if( no_more_frames ) {
         break;                                    // All threads read the same flag, so they all leave the loop together
      }

      // Process the frame using cached or recomputed parameters
      #pragma omp for
      for( int i = 0; i < frame.rows; ++i ) {      // This loop is run in parallel, OpenMP figures out how to split it among the threads
         cv::Vec3b* row = frame.ptr<cv::Vec3b>(i);
         for( int j = 0; j < frame.cols; ++j ) {   // This loop is not parallelized
            cv::Vec3b& pxi = row[j];
            pxi[0] = cache.lut_b[pxi[0]];
            pxi[1] = cache.lut_g[pxi[1]];
            pxi[2] = cache.lut_r[pxi[2]];
         }
      }                                            // Implicit barrier: the whole frame is processed before anyone continues

      #pragma omp master                           // Again, only the master thread does this part
      {
         // Display the processed frame
         cv::imshow( "Processed Video", frame );

         // Request to leave the loop if 'q' is pressed
         if( cv::waitKey( delay ) == 'q' ) {
            quit_requested = true;
         }

         frame_count++;
      }
      #pragma omp barrier                          // All threads complete this loop iteration at the same time
      if( quit_requested ) {
         break;
      }
   }                                               // This is the end of the parallel section

   cap.release();
   cv::destroyAllWindows();

   return 0;
}

You can use any other multithreading library for this. Other libraries might allow you to write more modular or pretty code. But the idea is always the same: don't create threads anew for every image you process, create threads once at the start of the program, and have them do the work of processing each image in parallel. Creating threads takes a bit of time.
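The same idea can be tried out without OpenCV. The following is only a sketch under assumptions of my own: `Frame` and `process_all` are hypothetical names, a frame is a flat byte buffer, and one table is shared by all channels. The point it illustrates is that the parallel region is entered once, outside the frame loop, while each frame's bytes are divided among the already-running threads. If OpenMP is not enabled at compile time the pragmas are ignored and the code runs single-threaded with the same result.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

using Frame = std::vector<std::uint8_t>;   // stand-in for the pixel data of one image

// Applies `lut` to every byte of every frame. The threads are started once,
// before the frame loop, and are reused for each frame.
void process_all(std::vector<Frame>& frames, const std::array<std::uint8_t, 256>& lut)
{
    #pragma omp parallel
    for (std::size_t f = 0; f < frames.size(); ++f) {   // each thread has its own f and walks the same frames
        Frame& frame = frames[f];
        const long n = static_cast<long>(frame.size());

        #pragma omp for                                  // the bytes of this frame are split among the threads
        for (long i = 0; i < n; ++i) {
            frame[i] = lut[frame[i]];
        }
        // Implicit barrier here: all threads finish frame f before any starts frame f + 1
    }
}
```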

Okay thank you very much, I didn't know that right after "#pragma omp parallel" I get all the threads ready.
– Ja_cpp, Commented Jul 30, 2024 at 20:15
@Ja_cpp The one statement or block (enclosed in {}) after #pragma omp parallel is executed in parallel on all threads. So the whole while() block, in this case, is run in parallel. Each thread runs the same code, except where another #pragma omp tells them to do something different.
– Cris Luengo, Commented Jul 30, 2024 at 20:43

Ja_cpp · Accepted Answer · 2024-07-29 19:25:50Z

Inspired by the advice and solutions from @CrisLuengo and @G.-Sliepen, I implemented a class that wraps the call to cv::parallel_for_. The results are 1810 fps vs 1974 fps, which is already a good start. The fps were computed using std::chrono by iterating 20 times over each frame and averaging over 180 frames of the video. In the plot, the blue curve is the result after optimization:

Thank you. Here is the working code:

#include <opencv2/opencv.hpp>
#include <algorithm> // For std::min
#include <array>
#include <iostream>
#include <vector>
#include <thread>
#include <numeric> // For std::iota

// Structure to hold cached parameters
struct LookupTable {
    std::array<uchar, 256> lut_b;
    std::array<uchar, 256> lut_g;
    std::array<uchar, 256> lut_r;
};

// ParallelExecutor class
class ParallelExecutor {
public:
    ParallelExecutor(int numThreads) : numThreads(numThreads) {}

    template<typename Func, typename... Args>
    void parallelFor(int start, int end, Func func, Args&&... args) {
        int rangeSize = end - start;
        int chunkSize = (rangeSize + numThreads - 1) / numThreads;

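        auto parallelLambda = [&](const cv::Range& range) {
            // OpenCV may pass several chunk indices in one call, so visit each of them
            for (int chunk = range.start; chunk < range.end; ++chunk) {
                int localStart = start + chunk * chunkSize;
                int localEnd = std::min(localStart + chunkSize, end);
                if (localStart < localEnd) {
                    func(cv::Range(localStart, localEnd), args...);
                }
            }
        };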

        cv::parallel_for_(cv::Range(0, numThreads), parallelLambda);
    }

private:
    int numThreads;
};

// Function to compute simple example data and lookup tables
void compute_data(const cv::Mat& image, LookupTable& lut) {
    for (int i = 0; i < 256; i++) {
        lut.lut_b[i] = static_cast<uchar>(i);
        lut.lut_g[i] = static_cast<uchar>(i);
        lut.lut_r[i] = static_cast<uchar>(i);
    }
}

void hist_worker(const cv::Range& range, cv::Mat& image, LookupTable& lut) {
    for (int i = range.start; i < range.end; ++i) {
        cv::Vec3b* row = image.ptr<cv::Vec3b>(i);
        for (int j = 0; j < image.cols; ++j) {
            cv::Vec3b& pxi = row[j];
            pxi[0] = lut.lut_b[pxi[0]];
            pxi[1] = lut.lut_g[pxi[1]];
            pxi[2] = lut.lut_r[pxi[2]];
        }
    }
}

void hist(cv::Mat& image, LookupTable& lut, bool use_cache, ParallelExecutor& executor) {
    if (!use_cache) {
        compute_data(image, lut);
    }

    // Apply transformation using lookup tables in parallel
    executor.parallelFor(0, image.rows, hist_worker, image, lut);
}

int main(int argc, char** argv) {
    // Open the video file
    cv::VideoCapture cap("video.mp4");
    if (!cap.isOpened()) {
        std::cerr << "Error opening video file" << std::endl;
        return -1;
    }

    // Get the frame rate of the video
    double fps = cap.get(cv::CAP_PROP_FPS);
    int delay = static_cast<int>(1000 / fps);

    // Create a window to display the video
    cv::namedWindow("Processed Video", cv::WINDOW_NORMAL);

    cv::Mat frame;
    LookupTable lut;
    int frame_count = 0;
    int recompute_interval = 5; // Recompute every 5 frames

    ParallelExecutor executor(24); // Assuming 24 threads

    while (true) {
        cap >> frame;
        if (frame.empty()) {
            break;
        }

        // Determine whether to use the lut or recompute the data
        bool use_cache = (frame_count % recompute_interval != 0);

        // Process the frame using cached or recomputed parameters
        hist(frame, lut, use_cache, executor);

        // Display the processed frame
        cv::imshow("Processed Video", frame);

        // Break the loop if 'q' is pressed
        if (cv::waitKey(delay) == 'q') {
            break;
        }

        frame_count++;
    }

    cap.release();
    cv::destroyAllWindows();

    return 0;
}
