
Hello, I'd like to know the big-O complexity of this code. df1 has N rows and df2 has M rows, with M << N. Every x in var_ref will be searched in set(df2.var0). Does this equal N*N == O(n^2)?

df1['var1'] = df1['var_ref'].apply(lambda x: True if x in df2.var0.unique() else False) * 1
  • Searching in a set is constant time, so this should be O(MN). Since you mentioned M << N, it's effectively still O(N). If you save your unique set somewhere first and then use it directly, it will be O(N) for sure, since you won't need to recompute var.unique() every time. Commented Mar 9, 2020 at 14:31
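The comment's point about recomputing can be made concrete: in the question's one-liner, `df2.var0.unique()` sits inside the lambda, so it is re-evaluated once per row of df1. A small sketch with hypothetical toy frames and a counting wrapper (none of these names are from the original) shows this:

```python
import pandas as pd

# Hypothetical toy data standing in for the question's df1 / df2
df1 = pd.DataFrame({'var_ref': range(1000)})
df2 = pd.DataFrame({'var0': range(0, 100, 2)})

calls = 0

def uniques():
    # Stand-in for df2.var0.unique() that counts how often it runs
    global calls
    calls += 1
    return df2.var0.unique()

# Same shape as the question's expression: unique() inside the lambda
df1['var1'] = df1['var_ref'].apply(lambda x: x in uniques()) * 1
print(calls)  # one call per row of df1
```

So the uniques are extracted N times, once per row, which is what hoisting the computation out of the lambda avoids.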

1 Answer


Should be O(N * M), where M is the number of unique values in df2.

And you should save the unique list somewhere so it isn't recalculated each time.

u = df2.var0.unique()
df1['var1'] = df1['var_ref'].apply(lambda x: x in u) * 1

This took me from 159 ms down to 5 ms (600 rows).
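A further step, suggested by the comments below, is to turn the precomputed uniques into a set, which makes each membership test O(1) on average instead of O(M). Here is a runnable sketch with hypothetical toy data (the `isin` variant is a standard pandas alternative, not part of the original answer):

```python
import pandas as pd

# Hypothetical toy data standing in for df1 / df2
df1 = pd.DataFrame({'var_ref': range(10)})
df2 = pd.DataFrame({'var0': [1, 3, 5, 99]})

# Build the lookup structure once. A set gives O(1) average-case
# membership tests, so the whole column costs roughly O(N), not O(N * M).
u = set(df2.var0.unique())
df1['var1'] = df1['var_ref'].apply(lambda x: x in u) * 1

# pandas also has a vectorised equivalent that skips the Python lambda:
df1['var1_isin'] = df1['var_ref'].isin(df2.var0) * 1
print(df1['var1'].tolist())  # [0, 1, 0, 1, 0, 1, 0, 0, 0, 0]
```

In practice `Series.isin` is usually the idiomatic choice, since it does the hashing internally and avoids per-row Python calls.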


7 Comments

Why, when I save it as u = df2.var.unique(), does it take more time than when I save it as set(df2.var)?
Iteration over a set might be faster than over a numpy.ndarray... On what size of dataset did you make this comparison?
My N is about 1 million rows; M, as a set(), is 40K. It was very fast, about 5 s!!
When I use set() it's very fast, but with an ndarray it takes minutes. Thanks
pd.col.unique() is called on each apply, for each row, so it recomputes the unique values. If I recall correctly, a set uses a hashmap for lookup, so it's O(1); a numpy array, on the other hand, is iterated to find the value.
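The set-versus-ndarray gap described in these comments can be sketched with a minimal benchmark (the sizes and names here are illustrative assumptions, not the commenter's actual data): `x in arr` on a numpy array compares against every element, O(M), while `x in s` on a set is a hash lookup, O(1) on average.

```python
import timeit
import numpy as np

m = 40_000                  # roughly the commenter's M
arr = np.arange(m)          # ndarray: `x in arr` scans all M elements
s = set(arr.tolist())       # set: hash-based lookup

x = m - 1                   # worst case for the linear scan
t_arr = timeit.timeit(lambda: x in arr, number=1000)
t_set = timeit.timeit(lambda: x in s, number=1000)
print(f"ndarray: {t_arr:.4f}s  set: {t_set:.4f}s")
```

Multiplied by N = 1 million rows, that per-lookup difference is exactly what turns minutes into seconds.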
