Applying a vectorized function with several returns to pandas dataframe

Question

I have a dataframe that contains a column holding 'Log' strings. I'd like to create a new column based on the values I've parsed from the 'Log' column. Currently, I'm using .apply() with the following function:

def classification(row):
    if 'A' in row['Log']:
        return 'Situation A'
    elif 'B' in row['Log']:
        return 'Situation B'
    elif 'C' in row['Log']:
        return 'Situation C'
    return 'Check'

I call it like this:

df['Classification'] = df.apply(classification, axis=1)

The problem is that it is slow (about 3 minutes for a dataframe with 4M rows), and I'm looking for a faster way. I have seen examples of vectorized functions that run much faster, but none of them have if statements in the function. My question: is it possible to vectorize the function above, and what is the fastest way to perform this task?

Or you can do df['Log'].str.extract('(A|B|C)').fillna('Check').
– Quang Hoang, Commented Jan 13, 2020 at 18:35
@QuangHoang The problem with that solution is that it follows the order of characters in the string, i.e. it extracts A from AB but B from BA. The OP's classification function, on the other hand, applies a hierarchical ordering to the extraction.
– FBruzzesi, Commented Jan 13, 2020 at 19:23
@Moti laluom From your profile, it appears you are not aware of someone-answers.
– FBruzzesi, Commented Jan 14, 2020 at 21:55
@QuangHoang's suggestion, df['Log'].str.extract('(A|B|C)').fillna('Check'), is excellent for my case. Wall time is less than 3 sec.
– Moti laluom, Commented Jan 15, 2020 at 21:09

FBruzzesi · Accepted Answer · 2020-01-13 20:24:07Z

I am not sure that a nested numpy.where will improve performance. Here are some timing tests with 4M rows:

import numpy as np
import pandas as pd

ls = ['Abc', 'Bert', 'Colv', 'Dia']
df = pd.DataFrame({'Log': np.random.choice(ls, 4_000_000)})

# Nested np.where: the first matching condition wins, like the if/elif chain
df['Log_where'] = np.where(df['Log'].str.contains('A'), 'Situation A',
                      np.where(df['Log'].str.contains('B'), 'Situation B',
                          np.where(df['Log'].str.contains('C'), 'Situation C', 'Check')))


def classification(x):
    if 'A' in x:
        return 'Situation A'
    elif 'B' in x:
        return 'Situation B'
    elif 'C' in x:
        return 'Situation C'
    return 'Check'


df['Log_apply'] = df['Log'].apply(classification)

Nested np.where Performance

%timeit np.where(df['Log'].str.contains('A'), 'Situation A', np.where(df['Log'].str.contains('B'), 'Situation B', np.where(df['Log'].str.contains('C'), 'Situation C', 'Check')))
8.59 s ± 1.71 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

apply Performance

%timeit df['Log'].apply(classification)
911 ms ± 146 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

At least on my machine, the nested np.where is almost 10x slower than apply.
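For readability, the same first-match-wins logic can also be written with np.select instead of nesting np.where. The following is only a sketch on a tiny made-up frame (the 'Log_select' column name and the sample values are illustrative, not from the benchmark above). I have not timed it, and since it still makes one str.contains pass per pattern, there is no reason to assume it would beat apply.

```python
import numpy as np
import pandas as pd

# Small illustrative frame (made-up values, not the 4M-row benchmark data)
df = pd.DataFrame({'Log': ['Abc', 'Bert', 'Colv', 'Dia', 'BA']})

# Conditions are checked in order; for each row the first True one wins,
# which mirrors the if/elif chain in classification().
# regex=False asks for a plain substring test rather than a regex search.
conditions = [
    df['Log'].str.contains('A', regex=False),
    df['Log'].str.contains('B', regex=False),
    df['Log'].str.contains('C', regex=False),
]
choices = ['Situation A', 'Situation B', 'Situation C']

# Rows matching none of the conditions fall back to the default label
df['Log_select'] = np.select(conditions, choices, default='Check')
print(df)
```

Note that 'BA' is labelled 'Situation A' here, because the 'A' condition comes first in the list regardless of where the letter appears in the string.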

A final remark: using the solution suggested in the comments, i.e. something like:

d = {'A': 'Situation A',
     'B': 'Situation B',
     'C': 'Situation C'}
df['Log_extract'] = df['Log'].str.extract('(A|B|C)')
df['Log_extract'] = df['Log_extract'].map(d).fillna('Check')

has the following problems:

It won't necessarily be faster. Testing on my machine:

%timeit df['Log_extract'] = df['Log'].str.extract('(A|B|C)')
3.74 s ± 70.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The .extract method follows string order, i.e. it extracts 'A' from the string 'AB' and 'B' from 'BA'. The OP's classification function, on the other hand, applies a hierarchical ordering and therefore returns 'Situation A' in both cases.
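The ordering difference is easy to see on a few hand-picked strings. This is a minimal sketch: the sample Series is made up for illustration, and expand=False is used so that extract returns a Series rather than a one-column DataFrame.

```python
import pandas as pd

def classification(x):
    # Hierarchical check: 'A' has priority over 'B', which has priority over 'C'
    if 'A' in x:
        return 'Situation A'
    elif 'B' in x:
        return 'Situation B'
    elif 'C' in x:
        return 'Situation C'
    return 'Check'

d = {'A': 'Situation A',
     'B': 'Situation B',
     'C': 'Situation C'}

# Made-up sample: the first two strings contain both 'A' and 'B'
s = pd.Series(['AB', 'BA', 'xyz'])

# extract returns the first match scanning left to right in each string
extracted = s.str.extract('(A|B|C)', expand=False).map(d).fillna('Check')
applied = s.apply(classification)

print(pd.DataFrame({'Log': s, 'extract': extracted, 'apply': applied}))
```

The two columns disagree on 'BA': the extract route gives 'Situation B', while classification gives 'Situation A'. If a log string can never contain more than one of the letters, the two approaches agree.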
