Vectorizing a function of a class with two arrays as inputs in cython

Question

I am struggling with the problem of optimizing my cython code in order to improve its speed as much as possible. One of the challenges that I could not still figure out how it should be done in cython is mapping an array on a function like what is done in numpy.vectorize function.

The simplify version of my problem is

from __future__ import division
import numpy as np
cimport numpy as np
cimport cython
cdef class Test(object):
    cdef public double M, c, z
    cdef public double[::1] ks, zs, pos

    @cython.boundscheck(False)
    @cython.cdivision(True)
    @cython.wraparound(False)
    @cython.nonecheck(False)
    def __cinit__(self, M, c, z, pos, ks, zs=None):


        if path is None:
           raise ValueError("Could not find a path to the file which contains the table of angular diameter distances")

        self.M = M
        self.c = c
        self.z = z
        self.pos = pos

        if zs is None:
           raise ValueError("You must give an array which contains the steps where the redshift probability distribution are computed!")
        self.zs=zs
        self.ks=ks
    @cython.cdivision(True)    
    @cython.boundscheck(False)
    @cython.wraparound(False)
    cpdef np.ndarray[double, ndim=1, mode='c'] __kappa(self, np.ndarray[double, ndim=1, mode='c'] x, double ks):
        cdef Py_ssize_t N = x.shape[0]
        cdef np.ndarray[np.int64_t, ndim=1, mode='c'] mask


        cdef np.ndarray[double, ndim=1, mode='c'] out  = np.zeros(N, dtype=np.float64 , order='C')

        mask = np.where(x < 0.999)[0]
        out[mask] = 2*ks/(x[mask]**2 - 1) * \
                (1 - np.log((1 + ((1 - x[mask])/(x[mask] + 1))**0.5)/(1 - ((1 - x[mask])/(x[mask] + 1))**0.5))/(1 - x[mask]**2)**0.5)

        mask = np.where(x > 1.001)[0]
        out[mask] = 2*ks/(x[mask]**2 - 1) * \
                (1 - 2*np.arctan(((x[mask] - 1)/(x[mask] + 1))**0.5)/(x[mask]**2 - 1)**0.5)


        mask = np.where((x >= 0.999) & (x <= 1.001))[0]
        out[mask] = ks*(22./15. - 0.8*x[mask])           

        return out

    @cython.cdivision(True)    
    @cython.boundscheck(False)
    @cython.wraparound(False)
    cpdef np.ndarray[double, ndim=1, mode='c'] __gamma(self, np.ndarray[double, ndim=1, mode='c'] x, double ks):
        cdef Py_ssize_t N=len(x)
        cdef np.ndarray[np.int64_t, ndim=1, mode='c'] mask 
        cdef np.ndarray[double, ndim=1, mode='c'] out = np.zeros(N, dtype=np.float64 , order='C')


        mask = np.where(x > 0.01)[0]
        out[mask] = 4*ks*(np.log(x[mask]/2) + 2* \
                x[mask]**(-2) - self.__kappa(x[mask], ks)


        mask = np.where(x <= 0.01)[0]
        out[mask] = 4*ks*(0.25 + 0.125 * x[mask]**2 * (3.25 + 3.0*np.log(x[mask]/2)))

        return out

    cpdef tuple getSh(self, np.ndarray[double, ndim=2, mode='c'] gpos, np.ndarray[double, ndim=2, mode='c'] pdf_z):
        # Convert to numpy arrays for internal usage:
        cdef np.ndarray[double, ndim=1, mode='c'] g, kappa, r, ks, wg
        cdef np.ndarray[double, ndim=1, mode='c'] pos_x, pos_y 
        if not gpos[:,0].flags.c_contiguous:
           pos_x = gpos[:,0].copy(order='C')
        else:
           pos_x = gpos[:,0]
        if not gpos[:,1].flags.c_contiguous:
           pos_y = gpos[:,1].copy(order='C')
        else:
           pos_y = gpos[:,1]
        cdef Py_ssize_t i, mask, N

        r = ((pos_x - self.pos[0])**2 + (pos_y - self.pos[1])**2)**0.5

        ks  = np.ascontiguousarray(self.ks)
        N   = len(ks)
        mask= np.where(np.ascontiguousarray(self.zs)>(self.z+0.1))[0][0]

        wg  = np.zeros(len(r), dtype=np.float64 , order='C')

        for i from N > i >= 0:  
            g = self.__gamma(r, ks[i])

            kappa = self.__kappa(r, ks[i])
            g /= 1 - kappa
            wg+=g*pdf_z[:,mask+i]

        cdef np.ndarray[double, ndim=1, mode='c'] dx, dy, drsq, cos2phi, sin2phi, g1, g2
        dx = pos_x - self.halo_pos[0]
        dy = pos_y - self.halo_pos[1]
        drsq = dx*dx+dy*dy
        drsq[drsq==0.] = 1. # Avoid division by 0
        cos2phi = (dx*dx-dy*dy)/drsq
        sin2phi = 2*dx*dy/drsq
        g1 = -wg*cos2phi
        g2 = -wg*sin2phi

        return g1, g2

I am wondering whether there is a way that I can vectorize getSh method of Test class over ks array and avoid using the loop by using something that makes my code faster?

Saullo G. P. Castro · Accepted Answer · 2015-04-26 19:28:21Z

2

The vectorization of your code would be accomplished if you could pass the whole array ks to the methods self.__gamma() and self.__kappa(), preventing the overhead of function calls for each loop iteration since you would be moving the loop to the inner-most called methods.

There are some other tips that will give you extra performance:

compute the masks only once outside the loop, since they are related to array r
mask = x > 0.01 instead of mask = np.where(x > 0.01)[0] and similars
reuse the out array since it always has length=N

EDIT: After putting the ideas above in practice, I came up with the following solution, where the methods __kappa() and __gamma() are no longer necessary. It is not tested though:

cpdef tuple getSh(self, np.ndarray[double, ndim=2, mode='c'] gpos, np.ndarray[double, ndim=2, mode='c'] pdf_z):
    # Convert to numpy arrays for internal usage:
    cdef np.ndarray[double, ndim=1] r, ks, wg
    cdef np.ndarray[double, ndim=1] pos_x, pos_y
    cdef np.ndarray[double, ndim=2, mode='c'] gamma, kappa, wgtmp

    if not gpos[:,0].flags.c_contiguous:
       pos_x = gpos[:,0].copy(order='C')
    else:
       pos_x = gpos[:,0]
    if not gpos[:,1].flags.c_contiguous:
       pos_y = gpos[:,1].copy(order='C')
    else:
       pos_y = gpos[:,1]
    cdef Py_ssize_t i, mask, N

    r = ((pos_x - self.pos[0])**2 + (pos_y - self.pos[1])**2)**0.5

    m1 = r > 0.01
    m2 = ~m1
    r1 = r[m1]
    r2 = r[m2]

    ma = r < 0.999
    mb = (r >= 0.999) & (r <= 1.001)
    mc = r > 1.001
    ra = r[ma]
    rb = r[mb]
    rc = r[mc]

    ks  = np.ascontiguousarray(self.ks)
    one = np.ones_like(ks)
    N = len(ks)
    P = len(r)

    kappa = np.zeros((P, N), dtype=np.float64 , order='C')
    gamma = np.zeros((P, N), dtype=np.float64 , order='C')
    wgtmp = np.zeros((P, N), dtype=np.float64 , order='C')
    wg = np.zeros((P,), dtype=np.float64)

    kappa[ma] = (2*ks/(ra**2 - 1)[:, None] *
                 one*(1 - np.log((1 + ((1 - ra)/(ra + 1))**0.5)/(1 - ((1 -
                      ra)/(ra + 1))**0.5))/(1 - ra**2)**0.5)[:, None])

    kappa[mb] = ks*(22./15. - 0.8*rb)[:, None]

    kappa[mc] = (2*ks/(rc**2 - 1)[:, None] *
                 one*(1 - 2*np.arctan(((rc - 1)/(rc + 1))**0.5)/(rc**2 -
                     1)**0.5)[:, None])


    gamma[m1 & ma] = 4*ks*(np.log(r1/2) + 2*r1**(-2) - kappa[ma])[:, None]
    gamma[m1 & mb] = 4*ks*(np.log(r1/2) + 2*r1**(-2) - kappa[mb])[:, None]
    gamma[m1 & mc] = 4*ks*(np.log(r1/2) + 2*r1**(-2) - kappa[mc])[:, None]

    gamma[m2] = 4*ks*(0.25 + 0.125 * r2**2 * (3.25 + 3.0*np.log(r2/2)))[:, None]


    init = np.where(np.ascontiguousarray(self.zs)>(self.z+0.1))[0][0]

    wgtmp = gamma/(1-kappa) * pdf_z[:, init:init+N]
    wg = wgtmp.sum(axis=1)

    cdef np.ndarray[double, ndim=1, mode='c'] dx, dy, drsq, cos2phi, sin2phi, g1, g2
    dx = pos_x - self.halo_pos[0]
    dy = pos_y - self.halo_pos[1]
    drsq = dx*dx+dy*dy
    drsq[drsq==0.] = 1. # Avoid division by 0
    cos2phi = (dx*dx-dy*dy)/drsq
    sin2phi = 2*dx*dy/drsq
    g1 = -wg*cos2phi
    g2 = -wg*sin2phi

    return g1, g2

edited Apr 26, 2015 at 19:28

answered Apr 26, 2015 at 18:02

Saullo G. P. Castro

59.4k28 gold badges191 silver badges244 bronze badges

Sign up to request clarification or add additional context in comments.

5 Comments

Dalek Over a year ago

I concur. I am doing something fishy and I don't know how to fix it regarding to attributing ks instance. I am calling it each time my method is called and attribute this instance which is wrong.

Saullo G. P. Castro Over a year ago

@Dalek you are also computing the mask r > 0.01 all the N times without need...

Dalek Over a year ago

Do you think numpy.broadcasting can be used instead of what you have used to combine summing over ks elements in kappa and gamma.

Saullo G. P. Castro Over a year ago

@Dalek I hardly think so... There is some broadcasting going on already when I'm multiplying ks with x[:, None]

Dalek Over a year ago

what about writing kappa for instance like this:

kappa[ma] = ks*(2/(ra**2 - 1) *                    (1 - np.log((1 + ((1 - ra)/(ra + 1))**0.5)/(1 - ((1 -                         ra)/(ra + 1))**0.5))/(1 - ra**2)**0.5))[:,np.newaxis]

?

hpaulj · Accepted Answer · 2015-04-26 16:26:46Z

0

I don't think 'vectorize' applies in cython. In numpy you vectorize by using fast compiled code, operations like +,*, sum. There is a np.vectorize function, but it just wraps your code in a iterator that understands broadcasting and multidimensional arrays. It does not rewrite your function, and it doesn't speed it up.

Cython is used to speed up numpy code that cannot be expressed in the existing compiled vector operations. It gets its speed by compiling those as fast C iterations.

On the surface the i loop in getSh looks fast (c style), but it calls self.__kappa and self.__gamma. Both are loaded with np calls - np.array, np.where, np.log, etc. With those calls you don't get the kind of speedup that you would get with pure c code.

You need to focus on those two methods, expressing them as simple operations on numbers, and explicitly iterating c style as needed.

answered Apr 26, 2015 at 16:26

hpaulj

233k14 gold badges260 silver badges392 bronze badges

4 Comments

Dalek Over a year ago

Actually I was thinking the same and I was using log and sqrt functions from math library of C and loop over the array with if conditions. Then just as a test I substituted the loop with np.where and np.log. The code was taking 2.9 mins to run and with this change it reduces to 15 second.

Dalek Over a year ago

so I am not totally agreeing with you.

hpaulj Over a year ago

Is this cython giving much of an improvement over pure numpy?

Dalek Over a year ago

certainly! The pure python code which contains integration takes much longer!

Collectives™ on Stack Overflow

Vectorizing a function of a class with two arrays as inputs in cython

2 Answers 2

5 Comments

4 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

5 Comments

4 Comments

Your Answer

Sign up or log in

Post as a guest

Related