I have a function that iterates through a one-dimensional array and compares each value against a threshold to create a mask. It is very fast. But how can I use it to iterate over multiple columns, with a different threshold for each column? My approach so far took 12 µs for a 1-D array of size 18531. For two columns I tried several different functions, but only with a single threshold. How can I do this with multiple thresholds? Furthermore, I found that with np.float16 or np.float32 data it is much slower. Why is that?

import numpy as np
import numba
import pandas as pd
#######Column approach
@numba.jit
def compute_expressionCol_Numba(col,threshold):
    n=len(col)
    result = np.empty(n,dtype='bool')
    for i in range(n):
        if col[i] < threshold:
            result[i]=1
        else:
            result[i]=0
    return result

def compute_expressionCol(col,threshold):
    result = compute_expressionCol_Numba(col.values,threshold)
    return result

##### Multiple column approach

def compute_expressionDF(df,threshold):
    for i in df:
        result = compute_expressionCol_Numba(df[i].values,threshold)
    return result

def make_mask(df, threshold):
    result = np.where(df < threshold, 1 , 0)
    return result

def lt(df, thresh):
    return (df.values<thresh).view('i1')

import numexpr as ne

def lt_numexpr(df, thresh):
    return ne.evaluate('a<thresh',{'a':df.values})

Some timeit tests:

for i in [np.float16,np.float32,np.float64]:
    print(i)
    randomDF = pd.DataFrame(np.random.rand(19000,2).astype(i),columns=['col1','col2'])
    thresh = 50
    %timeit compute_expressionCol(randomDF['col1'],50)
    %timeit compute_expressionCol(randomDF['col2'],50)
    %timeit for i in randomDF[['col1','col2']]: compute_expressionCol(randomDF[i],50)
    %timeit (randomDF[['col1','col2']].values < 50).astype(int)
    %timeit (randomDF.values < 50).astype(int)
    %timeit make_mask(randomDF[['col1','col2']],50)
    %timeit randomDF[['col1','col2']]<50
    %timeit randomDF['col1']<50
    %timeit pd.eval('randomDF[["col1","col2"]]<50')
    %timeit lt(randomDF, thresh=50)
    %timeit lt_numexpr(randomDF, thresh=50)

Results:

<class 'numpy.float16'>
40 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
40.6 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
82.8 ms ± 1.37 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.39 ms ± 40.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
416 µs ± 14.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.88 ms ± 31.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.95 ms ± 97.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
400 µs ± 8.08 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
31.2 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
413 µs ± 22.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
598 µs ± 22.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
<class 'numpy.float32'>
30.3 µs ± 2.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
35.2 µs ± 2.81 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1.01 ms ± 67 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
973 µs ± 93.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
37.5 µs ± 4.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
3.11 ms ± 544 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.56 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
224 µs ± 10.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
32.4 ms ± 3.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
70.8 µs ± 2.19 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
555 µs ± 18.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
<class 'numpy.float64'>
26.5 µs ± 2.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
27.2 µs ± 836 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
1.09 ms ± 62.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.01 ms ± 38.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
43.1 µs ± 2.19 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
2.78 ms ± 118 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.54 ms ± 35.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
249 µs ± 10.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
30.4 ms ± 1.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
71.8 µs ± 3.05 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
558 µs ± 21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  • Because your for loop is outside the numba function. Also, your second function will only return the result for df[-1]. Commented Apr 16, 2019 at 6:51
  • If you are looking for a mask, then randomDF.values<thresh would be good. Commented Apr 16, 2019 at 7:34
  • @Divakar Hey, thank you for your advice. I tested the functions, and I am interested in how things work. I read the enhancing performance article from pandas: pandas.pydata.org/pandas-docs/stable/user_guide/…. So I thought numba would be the right way. I also heard about numexpr, which is used in pd.eval. So converting the dataframe to values and comparing against a threshold is indeed the fastest way for multiple columns, except for your view approach (lt). Why is that faster, and what is the downside? And when should someone use numba, if not here for calculations? Commented Apr 16, 2019 at 7:56
  • And how could I use different thresholds on different columns without losing much performance? Furthermore, I found that with np.float16 or np.float32 it is slower. Why? Commented Apr 16, 2019 at 8:13
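
The first comment's point about keeping the loop inside the compiled function can be sketched roughly as follows. This is only an illustration, not code from the thread: the names `mask_below` and `thresholds` are hypothetical, the function takes a 2-D array such as `df.values` plus one threshold per column, and the `try`/`except` is there only so the sketch still runs (uncompiled and slowly) when numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption: fall back to plain Python so the sketch runs without numba.
    def njit(func):
        return func


@njit
def mask_below(values, thresholds):
    # values: 2-D float array (rows x columns), e.g. df.values
    # thresholds: 1-D array holding one threshold per column
    n_rows, n_cols = values.shape
    result = np.empty((n_rows, n_cols), dtype=np.bool_)
    for i in range(n_rows):
        for j in range(n_cols):
            # Both loops run inside the compiled function.
            result[i, j] = values[i, j] < thresholds[j]
    return result


values = np.array([[0.1, 0.9], [0.6, 0.2], [0.4, 0.8]])
thresholds = np.array([0.5, 0.85])
mask = mask_below(values, thresholds)
```

Whether this beats the plain `values < thresholds` comparison depends on the array size and the numba version, so it is worth timing both with `%timeit` on the real data.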

1 Answer

I think np.where might be what you are after. You can feed it a DataFrame or a Series.

import numpy as np

def make_mask(df, threshold):
    result = np.where(df < threshold, 1 , 0)
    return result
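
For different thresholds on different columns, the same np.where call can take a 1-D array of thresholds instead of a scalar, relying on NumPy broadcasting along the last axis. A minimal sketch, assuming a two-column frame and hypothetical threshold values (`thresholds` is not a name from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"col1": [0.1, 0.6, 0.4], "col2": [0.9, 0.2, 0.8]})

# Hypothetical per-column thresholds, in the same order as df.columns.
thresholds = np.array([0.5, 0.85])

# The (3, 2) array is compared against the (2,) array: each column is
# checked against its own threshold.
mask = np.where(df.values < thresholds, 1, 0)
```

Here `mask` is a plain integer ndarray with the same shape as `df.values`; dropping the np.where call and keeping `df.values < thresholds` gives the boolean version.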

2 Comments

Yeah, this isn't a job for numba; just use standard numpy broadcasting. Even (df.values < threshold).astype(int) would work better.
And how should I interpret the timeit tests then?
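
One possible reading of the float16 rows, offered as an assumption rather than something confirmed in this thread: most CPUs have no native half-precision arithmetic, so NumPy has to convert float16 values for every operation, and numba may not have been able to compile the float16 loop at all and fallen back to a much slower path. If that is the cause, upcasting once before comparing is a cheap thing to try. The sketch below only checks that the upcast leaves the mask unchanged; it uses a threshold of 0.5, which is exactly representable in both float16 and float32.

```python
import numpy as np

rng = np.random.default_rng(0)
half = rng.random((19000, 2)).astype(np.float16)

threshold = 0.5  # exactly representable in float16 and float32

# Compare directly on the float16 data.
mask_half = half < threshold

# Upcast once (float16 -> float32 is lossless), then compare.
mask_single = half.astype(np.float32) < threshold

same = np.array_equal(mask_half, mask_single)
```

For thresholds that float16 cannot represent exactly, the two masks can differ for values right at the boundary, and whether the upcast actually saves time should be measured with `%timeit` on the machine in question.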
