Steve-o's answer is good for optimizing your code efficiency. I recommend adding some logic to monitor execution times to help you identify where to spend efforts optimizing.
OpenCV logic for time monitoring (Python):
import cv2 as cv

startTime = cv.getTickCount()
# your code execution
elapsedSec = (cv.getTickCount() - startTime) / cv.getTickFrequency()
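If you want a framework-agnostic timer, Python's standard library works too. Here's a minimal sketch of a reusable context-manager timer built on time.perf_counter (the stage label "preprocess" and the dummy workload are just placeholders for your own code):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label):
    # Records wall-clock time around a block of code and prints it
    start = time.perf_counter()
    yield
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed * 1000:.2f} ms")

# Wrap each stage you want to measure
with timed("preprocess"):
    total = sum(i * i for i in range(100_000))  # stand-in workload
```

Wrapping each stage in its own `with timed(...)` block makes it easy to compare where the per-frame time actually goes.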
Boost logic for time monitoring (C++):
#include <boost/date_time/posix_time/posix_time.hpp>
#include <iostream>

boost::posix_time::ptime start = boost::posix_time::microsec_clock::local_time();
// do something time-consuming
boost::posix_time::ptime end = boost::posix_time::microsec_clock::local_time();
boost::posix_time::time_duration timeTaken = end - start;
std::cout << timeTaken << std::endl;
How you configure your OpenCV build matters a lot in my experience. IPP isn't the only option to give you better performance. It really is worth kicking the tires on your build to get better hardware utilization.
The other areas to look at are CPU and memory utilization. If you watch your CPU and/or memory utilization, you'll probably find that 10% of your code is working hard and the rest of the time things are largely idle.
- Consider restructuring your logic as a pipeline of threads so you can process multiple images at once. If you're tracking and need results from previous frames, break your code into stages (e.g. preprocessing and analysis) and buffer between them with a std::queue. Note that imshow won't work from worker threads, so push result images into a queue and call imshow from the main thread.
- Consider using persistent/global objects for things like kernels/detectors that don't need to get recreated each time
- Is your throughput slowing down the longer your program runs? You may need to look at how you are handling disposing of images/variables within the main loop's scope
- Segmenting your code into functions makes it more readable, easier to benchmark, and descopes variables earlier (temporary Mat and result variables free up memory when descoped)
- If you're doing low-level processing on Mat pixels where you iterate over a large portion of the image, use a single parallel for (avoid nesting them) and avoid writing to state shared across the parallel ranges
- Depending on how you are running your code, you may be able to disable debugging (e.g. run a Release build) to get better performance
- If you're streaming and discarding frames you don't need, prefer throttling the streaming rate in the camera settings instead of dumping frames in software
- If you're converting from 12 to 8 bits or only using a region of your image, prefer doing this at the camera hardware level
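To make the pipeline bullet above concrete, here's a minimal two-stage sketch using stdlib Python (threading + queue). The "preprocessing" here is a stand-in for your real per-frame work, and results are drained on the main thread, the same place you'd call imshow:

```python
import queue
import threading

raw_q = queue.Queue(maxsize=8)   # buffers frames between stages
result_q = queue.Queue()

_SENTINEL = object()             # signals end-of-stream to each stage

def preprocess_stage():
    # Worker thread: pull raw frames, process, push results downstream
    while True:
        frame = raw_q.get()
        if frame is _SENTINEL:
            result_q.put(_SENTINEL)
            break
        result_q.put(frame * 2)  # stand-in for real preprocessing

worker = threading.Thread(target=preprocess_stage, daemon=True)
worker.start()

for frame in range(5):           # stand-in for a capture loop
    raw_q.put(frame)
raw_q.put(_SENTINEL)

# Drain results on the main thread (where imshow is safe to call)
results = []
while True:
    item = result_q.get()
    if item is _SENTINEL:
        break
    results.append(item)
worker.join()
print(results)  # [0, 2, 4, 6, 8]
```

The bounded queue (maxsize=8) also gives you backpressure: if analysis falls behind, capture blocks instead of memory growing without limit.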
Here's an example of a parallel for loop:
cv::parallel_for_(cv::Range(0, img.rows * img.cols), [&](const cv::Range& range)
{
    for (int r = range.start; r < range.end; r++)
    {
        // Row-major indexing keeps memory access contiguous within each range
        int y = r / img.cols;
        int x = r % img.cols;
        uchar pixelVal = img.at<uchar>(y, x);
        // do work here
    }
});
If you're hardware constrained (i.e. fully utilizing CPU and/or memory), then you need to look at prioritizing your process, OS performance optimizations, freeing system resources, or upgrading your hardware:
- Increase the priority of the process to be more greedy with respect to other programs running on the computer (in Linux you have nice(int inc) in unistd.h; in Windows, SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS) in windows.h, though HIGH_PRIORITY_CLASS is usually safer than REALTIME_PRIORITY_CLASS)
- Optimize your power settings for maximum performance in general
- Disable CPU core parking
- Optimize your acquisition hardware settings (increase rx/tx buffers, etc) to offload work from your CPU
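The priority bullet above can be sketched from Python on POSIX systems via os.nice (the set_greedier helper is hypothetical, not part of any library; on Windows you'd call SetPriorityClass through ctypes or pywin32 instead):

```python
import os

def set_greedier(delta=-5):
    """Try to lower the nice value (i.e. raise scheduling priority).

    Lowering nice below its current value normally requires elevated
    privileges, so fall back to just reporting the current value.
    """
    try:
        return os.nice(delta)   # negative delta = higher priority
    except PermissionError:
        return os.nice(0)       # os.nice(0) queries without changing

print("nice value:", set_greedier())
```

Keep in mind that raising priority only helps when other processes are actually competing for the CPU; it does nothing for a single-threaded bottleneck in your own code.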