I was testing the efficiency of my simple shared C library and comparing it with the numpy implementation.
Library creation: The following function is defined in sum_function.c:
float sum_vector(float* data, int num_row){
    float value = 0.0;
    for (int i = 0; i < num_row; i++){
        value += data[i];
    }
    return value;
}
Library compilation: the shared library sum.so is created by
clang -c sum_function.c
clang -shared -o sum.so sum_function.o
Measurement: a simple numpy array is created and the sum of its elements is calculated using the above function.
from ctypes import *
import numpy as np
N = int(1e7)
data = np.arange(N, dtype=np.float32)
libc = cdll.LoadLibrary("sum.so")
libc.sum_vector.restype = c_float
libc.sum_vector(data.ctypes.data_as(POINTER(c_float)),
                c_int(N))
The above function takes 30 ms. However, if I use numpy.sum, the execution time is only 4 ms.
So my question is: what makes numpy so much faster than my C implementation? I cannot think of any algorithmic improvement for calculating the sum of a vector.
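For reference, the numpy side of the comparison can be timed on its own with the standard timeit module; a minimal sketch (the absolute numbers will of course vary by machine):

```python
import timeit

import numpy as np

N = int(1e7)
data = np.arange(N, dtype=np.float32)

# Run np.sum in batches of 10 calls, 5 times, and keep the best
# per-call time; the minimum is the least noisy estimate.
best = min(timeit.repeat(lambda: np.sum(data), number=10, repeat=5)) / 10
print(f"np.sum over {N} float32 elements: {best * 1e3:.2f} ms per call")
```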
Comments:

clang -c sum_function.c -- Do those command-line parameters enable optimizations? If not, then your timings are meaningless.

#include <numeric>
float sum_vector(float* data, int num_row) { return std::accumulate(data, data + num_row, 0.0f); }
-- That is a one-liner that you should measure with the proper optimization settings when compiling your code.

The numpy library is optimized Fortran code. The clang-generated code is likely not optimized, and may even be debug code depending on your settings.
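The std::accumulate one-liner from the comment, expanded into a complete translation unit. The extern "C" wrapper is an assumption about how it would be wired up: it prevents C++ name mangling so the ctypes code above can still look up sum_vector in the shared library.

```cpp
#include <numeric>

// extern "C" keeps the unmangled symbol name "sum_vector" so
// ctypes can find it after building with a C++ compiler
// (assumption about the intended setup, not from the original post).
extern "C" float sum_vector(float* data, int num_row) {
    // Sum the half-open range [data, data + num_row) starting from 0.0f
    return std::accumulate(data, data + num_row, 0.0f);
}
```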