I am not sure that using a nested numpy.where will improve performance. Here are some timings on 4M rows:
import numpy as np
import pandas as pd
ls = ['Abc', 'Bert', 'Colv', 'Dia']
df = pd.DataFrame({'Log': np.random.choice(ls, 4_000_000)})
df['Log_where'] = np.where(df['Log'].str.contains('A'), 'Situation A',
                  np.where(df['Log'].str.contains('B'), 'Situation B',
                  np.where(df['Log'].str.contains('C'), 'Situation C', 'Check')))
def classification(x):
    if 'A' in x:
        return 'Situation A'
    elif 'B' in x:
        return 'Situation B'
    elif 'C' in x:
        return 'Situation C'
    return 'Check'
df['Log_apply'] = df['Log'].apply(classification)
Nested np.where Performance
%timeit np.where(df['Log'].str.contains('A'), 'Situation A', np.where(df['Log'].str.contains('B'), 'Situation B',np.where(df['Log'].str.contains('C'), 'Situation C', 'check')))
8.59 s ± 1.71 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
apply Performance
%timeit df['Log'].apply(classification)
911 ms ± 146 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
At least on my machine, the nested np.where is almost 10x slower than apply.
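For anyone who wants to reproduce the comparison outside IPython (where %timeit is not available), here is a minimal sketch using the standard timeit module. The smaller 100,000-row frame, the seed, and the names df_small and nested_where are my own assumptions to keep the run short, and both versions use 'Check' as the fallback label. Absolute timings will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def classification(x):
    # Same priority order as above: 'A' first, then 'B', then 'C'.
    if 'A' in x:
        return 'Situation A'
    elif 'B' in x:
        return 'Situation B'
    elif 'C' in x:
        return 'Situation C'
    return 'Check'


def nested_where(s):
    # Hypothetical helper wrapping the nested np.where expression.
    return np.where(s.str.contains('A'), 'Situation A',
           np.where(s.str.contains('B'), 'Situation B',
           np.where(s.str.contains('C'), 'Situation C', 'Check')))


# Smaller sample than the 4M rows above (an assumption, to keep the run short).
rng = np.random.default_rng(0)
df_small = pd.DataFrame({'Log': rng.choice(['Abc', 'Bert', 'Colv', 'Dia'], 100_000)})

# Total seconds for 3 calls of each approach.
t_where = timeit.timeit(lambda: nested_where(df_small['Log']), number=3)
t_apply = timeit.timeit(lambda: df_small['Log'].apply(classification), number=3)

print(f"nested np.where: {t_where / 3:.3f} s per call")
print(f"apply:           {t_apply / 3:.3f} s per call")
```

Before comparing speed it is worth checking that both approaches return the same labels on the sample.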
A final remark: using the solution suggested in the comments, i.e. something like:
d = {'A': 'Situation A',
'B': 'Situation B',
'C': 'Situation C'}
df['Log_extract'] = df['Log'].str.extract('(A|B|C)')
df['Log_extract'] = df['Log_extract'].map(d).fillna('Check')
has the following problems:
It won't necessarily be faster. Testing on my machine:
%timeit df['Log_extract'] = df['Log'].str.extract('(A|B|C)')
3.74 s ± 70.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The .extract method follows string order: from the string 'AB' it extracts 'A', and from 'BA' it extracts 'B'. The OP's classification function, on the other hand, checks the letters in a fixed priority order, so it returns 'Situation A' in both cases.
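A small sketch of that ordering difference. The two-row sample is hypothetical, and I pass expand=False so that .str.extract returns a Series rather than a one-column DataFrame:

```python
import pandas as pd


def classification(x):
    # Fixed priority: 'A' wins over 'B', which wins over 'C'.
    if 'A' in x:
        return 'Situation A'
    elif 'B' in x:
        return 'Situation B'
    elif 'C' in x:
        return 'Situation C'
    return 'Check'


d = {'A': 'Situation A',
     'B': 'Situation B',
     'C': 'Situation C'}

# Hypothetical sample: the same letters in a different order.
sample = pd.Series(['AB', 'BA'])

by_apply = sample.apply(classification)
# The regex returns the leftmost match in each string, not the highest-priority letter.
by_extract = sample.str.extract('(A|B|C)', expand=False).map(d).fillna('Check')

print(by_apply.tolist())    # ['Situation A', 'Situation A']
print(by_extract.tolist())  # ['Situation A', 'Situation B']
```

If each log string can only ever contain one of the letters, the two approaches agree and this difference does not matter.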
df['Log'].str.extract('(A|B|C)').fillna('Check') is excellent for my case. Wall time is less than 3 sec.