2

I am struggling with the problem of optimizing my cython code in order to improve its speed as much as possible. One of the challenges that I could not still figure out how it should be done in cython is mapping an array on a function like what is done in numpy.vectorize function.

The simplify version of my problem is

from __future__ import division
import numpy as np
cimport numpy as np
cimport cython
cdef class Test(object):
    cdef public double M, c, z
    cdef public double[::1] ks, zs, pos

    @cython.boundscheck(False)
    @cython.cdivision(True)
    @cython.wraparound(False)
    @cython.nonecheck(False)
    def __cinit__(self, M, c, z, pos, ks, zs=None):


        if path is None:
           raise ValueError("Could not find a path to the file which contains the table of angular diameter distances")

        self.M = M
        self.c = c
        self.z = z
        self.pos = pos

        if zs is None:
           raise ValueError("You must give an array which contains the steps where the redshift probability distribution are computed!")
        self.zs=zs
        self.ks=ks
    @cython.cdivision(True)    
    @cython.boundscheck(False)
    @cython.wraparound(False)
    cpdef np.ndarray[double, ndim=1, mode='c'] __kappa(self, np.ndarray[double, ndim=1, mode='c'] x, double ks):
        cdef Py_ssize_t N = x.shape[0]
        cdef np.ndarray[np.int64_t, ndim=1, mode='c'] mask


        cdef np.ndarray[double, ndim=1, mode='c'] out  = np.zeros(N, dtype=np.float64 , order='C')

        mask = np.where(x < 0.999)[0]
        out[mask] = 2*ks/(x[mask]**2 - 1) * \
                (1 - np.log((1 + ((1 - x[mask])/(x[mask] + 1))**0.5)/(1 - ((1 - x[mask])/(x[mask] + 1))**0.5))/(1 - x[mask]**2)**0.5)

        mask = np.where(x > 1.001)[0]
        out[mask] = 2*ks/(x[mask]**2 - 1) * \
                (1 - 2*np.arctan(((x[mask] - 1)/(x[mask] + 1))**0.5)/(x[mask]**2 - 1)**0.5)


        mask = np.where((x >= 0.999) & (x <= 1.001))[0]
        out[mask] = ks*(22./15. - 0.8*x[mask])           

        return out

    @cython.cdivision(True)    
    @cython.boundscheck(False)
    @cython.wraparound(False)
    cpdef np.ndarray[double, ndim=1, mode='c'] __gamma(self, np.ndarray[double, ndim=1, mode='c'] x, double ks):
        cdef Py_ssize_t N=len(x)
        cdef np.ndarray[np.int64_t, ndim=1, mode='c'] mask 
        cdef np.ndarray[double, ndim=1, mode='c'] out = np.zeros(N, dtype=np.float64 , order='C')


        mask = np.where(x > 0.01)[0]
        out[mask] = 4*ks*(np.log(x[mask]/2) + 2* \
                x[mask]**(-2) - self.__kappa(x[mask], ks)


        mask = np.where(x <= 0.01)[0]
        out[mask] = 4*ks*(0.25 + 0.125 * x[mask]**2 * (3.25 + 3.0*np.log(x[mask]/2)))

        return out

    cpdef tuple getSh(self, np.ndarray[double, ndim=2, mode='c'] gpos, np.ndarray[double, ndim=2, mode='c'] pdf_z):
        # Convert to numpy arrays for internal usage:
        cdef np.ndarray[double, ndim=1, mode='c'] g, kappa, r, ks, wg
        cdef np.ndarray[double, ndim=1, mode='c'] pos_x, pos_y 
        if not gpos[:,0].flags.c_contiguous:
           pos_x = gpos[:,0].copy(order='C')
        else:
           pos_x = gpos[:,0]
        if not gpos[:,1].flags.c_contiguous:
           pos_y = gpos[:,1].copy(order='C')
        else:
           pos_y = gpos[:,1]
        cdef Py_ssize_t i, mask, N

        r = ((pos_x - self.pos[0])**2 + (pos_y - self.pos[1])**2)**0.5

        ks  = np.ascontiguousarray(self.ks)
        N   = len(ks)
        mask= np.where(np.ascontiguousarray(self.zs)>(self.z+0.1))[0][0]

        wg  = np.zeros(len(r), dtype=np.float64 , order='C')

        for i from N > i >= 0:  
            g = self.__gamma(r, ks[i])

            kappa = self.__kappa(r, ks[i])
            g /= 1 - kappa
            wg+=g*pdf_z[:,mask+i]

        cdef np.ndarray[double, ndim=1, mode='c'] dx, dy, drsq, cos2phi, sin2phi, g1, g2
        dx = pos_x - self.halo_pos[0]
        dy = pos_y - self.halo_pos[1]
        drsq = dx*dx+dy*dy
        drsq[drsq==0.] = 1. # Avoid division by 0
        cos2phi = (dx*dx-dy*dy)/drsq
        sin2phi = 2*dx*dy/drsq
        g1 = -wg*cos2phi
        g2 = -wg*sin2phi

        return g1, g2

I am wondering whether there is a way that I can vectorize getSh method of Test class over ks array and avoid using the loop by using something that makes my code faster?

2 Answers 2

2

The vectorization of your code would be accomplished if you could pass the whole array ks to the methods self.__gamma() and self.__kappa(), preventing the overhead of function calls for each loop iteration since you would be moving the loop to the inner-most called methods.

There are some other tips that will give you extra performance:

  • compute the masks only once outside the loop, since they are related to array r
  • mask = x > 0.01 instead of mask = np.where(x > 0.01)[0] and similars
  • reuse the out array since it always has length=N

EDIT: After putting the ideas above in practice, I came up with the following solution, where the methods __kappa() and __gamma() are no longer necessary. It is not tested though:

cpdef tuple getSh(self, np.ndarray[double, ndim=2, mode='c'] gpos, np.ndarray[double, ndim=2, mode='c'] pdf_z):
    # Convert to numpy arrays for internal usage:
    cdef np.ndarray[double, ndim=1] r, ks, wg
    cdef np.ndarray[double, ndim=1] pos_x, pos_y
    cdef np.ndarray[double, ndim=2, mode='c'] gamma, kappa, wgtmp

    if not gpos[:,0].flags.c_contiguous:
       pos_x = gpos[:,0].copy(order='C')
    else:
       pos_x = gpos[:,0]
    if not gpos[:,1].flags.c_contiguous:
       pos_y = gpos[:,1].copy(order='C')
    else:
       pos_y = gpos[:,1]
    cdef Py_ssize_t i, mask, N

    r = ((pos_x - self.pos[0])**2 + (pos_y - self.pos[1])**2)**0.5

    m1 = r > 0.01
    m2 = ~m1
    r1 = r[m1]
    r2 = r[m2]

    ma = r < 0.999
    mb = (r >= 0.999) & (r <= 1.001)
    mc = r > 1.001
    ra = r[ma]
    rb = r[mb]
    rc = r[mc]

    ks  = np.ascontiguousarray(self.ks)
    one = np.ones_like(ks)
    N = len(ks)
    P = len(r)

    kappa = np.zeros((P, N), dtype=np.float64 , order='C')
    gamma = np.zeros((P, N), dtype=np.float64 , order='C')
    wgtmp = np.zeros((P, N), dtype=np.float64 , order='C')
    wg = np.zeros((P,), dtype=np.float64)

    kappa[ma] = (2*ks/(ra**2 - 1)[:, None] *
                 one*(1 - np.log((1 + ((1 - ra)/(ra + 1))**0.5)/(1 - ((1 -
                      ra)/(ra + 1))**0.5))/(1 - ra**2)**0.5)[:, None])

    kappa[mb] = ks*(22./15. - 0.8*rb)[:, None]

    kappa[mc] = (2*ks/(rc**2 - 1)[:, None] *
                 one*(1 - 2*np.arctan(((rc - 1)/(rc + 1))**0.5)/(rc**2 -
                     1)**0.5)[:, None])


    gamma[m1 & ma] = 4*ks*(np.log(r1/2) + 2*r1**(-2) - kappa[ma])[:, None]
    gamma[m1 & mb] = 4*ks*(np.log(r1/2) + 2*r1**(-2) - kappa[mb])[:, None]
    gamma[m1 & mc] = 4*ks*(np.log(r1/2) + 2*r1**(-2) - kappa[mc])[:, None]

    gamma[m2] = 4*ks*(0.25 + 0.125 * r2**2 * (3.25 + 3.0*np.log(r2/2)))[:, None]


    init = np.where(np.ascontiguousarray(self.zs)>(self.z+0.1))[0][0]

    wgtmp = gamma/(1-kappa) * pdf_z[:, init:init+N]
    wg = wgtmp.sum(axis=1)

    cdef np.ndarray[double, ndim=1, mode='c'] dx, dy, drsq, cos2phi, sin2phi, g1, g2
    dx = pos_x - self.halo_pos[0]
    dy = pos_y - self.halo_pos[1]
    drsq = dx*dx+dy*dy
    drsq[drsq==0.] = 1. # Avoid division by 0
    cos2phi = (dx*dx-dy*dy)/drsq
    sin2phi = 2*dx*dy/drsq
    g1 = -wg*cos2phi
    g2 = -wg*sin2phi

    return g1, g2
Sign up to request clarification or add additional context in comments.

5 Comments

I concur. I am doing something fishy and I don't know how to fix it regarding to attributing ks instance. I am calling it each time my method is called and attribute this instance which is wrong.
@Dalek you are also computing the mask r > 0.01 all the N times without need...
Do you think numpy.broadcasting can be used instead of what you have used to combine summing over ks elements in kappa and gamma.
@Dalek I hardly think so... There is some broadcasting going on already when I'm multiplying ks with x[:, None]
what about writing kappa for instance like this:kappa[ma] = ks*(2/(ra**2 - 1) * (1 - np.log((1 + ((1 - ra)/(ra + 1))**0.5)/(1 - ((1 - ra)/(ra + 1))**0.5))/(1 - ra**2)**0.5))[:,np.newaxis]?
0

I don't think 'vectorize' applies in cython. In numpy you vectorize by using fast compiled code, operations like +,*, sum. There is a np.vectorize function, but it just wraps your code in a iterator that understands broadcasting and multidimensional arrays. It does not rewrite your function, and it doesn't speed it up.

Cython is used to speed up numpy code that cannot be expressed in the existing compiled vector operations. It gets its speed by compiling those as fast C iterations.

On the surface the i loop in getSh looks fast (c style), but it calls self.__kappa and self.__gamma. Both are loaded with np calls - np.array, np.where, np.log, etc. With those calls you don't get the kind of speedup that you would get with pure c code.

You need to focus on those two methods, expressing them as simple operations on numbers, and explicitly iterating c style as needed.

4 Comments

Actually I was thinking the same and I was using log and sqrt functions from math library of C and loop over the array with if conditions. Then just as a test I substituted the loop with np.where and np.log. The code was taking 2.9 mins to run and with this change it reduces to 15 second.
so I am not totally agreeing with you.
Is this cython giving much of an improvement over pure numpy?
certainly! The pure python code which contains integration takes much longer!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.