I have a dataframe that contains a column holding 'Log' strings. I'd like to create a new column based on the values I've parsed from the 'Log' column. Currently, I'm using .apply() with the following function:

def classification(row):
    if 'A' in row['Log']:
        return 'Situation A'
    elif 'B' in row['Log']:
        return 'Situation B'
    elif 'C' in row['Log']:
        return 'Situation C'
    return 'Check'

I apply it like this:

df['Classification'] = df.apply(classification, axis=1)

The issue is that it is slow (about 3 minutes for a dataframe with 4M rows), and I'm looking for a faster way. I have seen examples of vectorized functions that run much faster, but those don't have if statements in the function. My question: is it possible to vectorize the function above, and what is the fastest way to perform this task?

  • check out np.select Commented Jan 13, 2020 at 18:35
  • Or you can do df['Log'].str.extract('(A|B|C)').fillna('Check'). Commented Jan 13, 2020 at 18:35
  • @QuangHoang The problem with that solution is that it follows the order of the string, i.e. from AB it will extract A while from BA it will extract B. The OP's classification function, on the other hand, has a hierarchical order of extraction. Commented Jan 13, 2020 at 19:23
  • @Moti laluom from your profile it appears you are not aware of someone-answers Commented Jan 14, 2020 at 21:55
  • @QuangHoang's suggestion df['Log'].str.extract('(A|B|C)').fillna('Check') is excellent for my case. Wall time is less than 3 sec. Commented Jan 15, 2020 at 21:09

1 Answer

I would not be sure that nested numpy.where calls improve performance. Here are some timings with 4M rows:

import numpy as np
import pandas as pd

# Build a 4M-row test frame by sampling from four example log strings
ls = ['Abc', 'Bert', 'Colv', 'Dia']
df = pd.DataFrame({'Log': np.random.choice(ls, 4_000_000)})

# Nested np.where: conditions are checked in order A, then B, then C
df['Log_where'] = np.where(df['Log'].str.contains('A'), 'Situation A',
                      np.where(df['Log'].str.contains('B'), 'Situation B',
                          np.where(df['Log'].str.contains('C'), 'Situation C', 'Check')))


def classification(x):
    if 'A' in x:
        return 'Situation A'
    elif 'B' in x:
        return 'Situation B'
    elif 'C' in x:
        return 'Situation C'
    return 'Check'


df['Log_apply'] = df['Log'].apply(classification)

Nested np.where performance

%timeit np.where(df['Log'].str.contains('A'), 'Situation A', np.where(df['Log'].str.contains('B'), 'Situation B', np.where(df['Log'].str.contains('C'), 'Situation C', 'Check')))
8.59 s ± 1.71 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Series.apply performance

%timeit df['Log'].apply(classification)
911 ms ± 146 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

At least on my machine, the nested np.where is almost 10x slower than apply.
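
For reference, the np.select approach mentioned in the comments expresses the same first-match-wins logic without nesting. The following is only a minimal sketch on a tiny hypothetical frame: the column name Log_select and the sample strings are assumptions for illustration, and I have not timed it, so it may or may not be faster than apply on your data. Passing regex=False asks for a plain substring test, and na=False treats missing values as "no match".

```python
import numpy as np
import pandas as pd

# Tiny illustrative frame (hypothetical sample values, not the 4M-row test)
df = pd.DataFrame({'Log': ['Abc', 'Bert', 'Colv', 'Dia', 'BA']})

log = df['Log']

# Conditions are evaluated in list order; np.select takes the first that is True,
# which mirrors the if/elif chain in classification()
conditions = [
    log.str.contains('A', regex=False, na=False),
    log.str.contains('B', regex=False, na=False),
    log.str.contains('C', regex=False, na=False),
]
choices = ['Situation A', 'Situation B', 'Situation C']

# Rows matching none of the conditions fall back to the default
df['Log_select'] = np.select(conditions, choices, default='Check')

print(df)
```

Because the conditions are checked in list order, 'BA' is classified as 'Situation A', the same as the original function.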

A final remark: using the solution suggested in the comments, i.e. something like:

d = {'A': 'Situation A',
     'B': 'Situation B',
     'C': 'Situation C'}
df['Log_extract'] = df['Log'].str.extract('(A|B|C)')
df['Log_extract'] = df['Log_extract'].map(d).fillna('Check')

has the following problems:

  1. It won't necessarily be faster. Testing on my machine:

    %timeit df['Log_extract'] = df['Log'].str.extract('(A|B|C)')
    3.74 s ± 70.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
  2. The .extract method follows string order: from the string 'AB' it extracts 'A', and from 'BA' it extracts 'B'. The OP's classification function, on the other hand, checks in a fixed priority order and therefore returns 'Situation A' in both cases.
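
The ordering difference in point 2 can be seen on a three-row example. This is a small sketch with made-up input strings ('AB', 'BA', 'xyz'); expand=False is used here only so that str.extract returns a Series rather than a one-column DataFrame.

```python
import pandas as pd

def classification(x):
    # Fixed priority: A is checked before B, B before C
    if 'A' in x:
        return 'Situation A'
    elif 'B' in x:
        return 'Situation B'
    elif 'C' in x:
        return 'Situation C'
    return 'Check'

# Hypothetical inputs chosen to expose the difference
s = pd.Series(['AB', 'BA', 'xyz'])

d = {'A': 'Situation A',
     'B': 'Situation B',
     'C': 'Situation C'}

# The regex returns whichever of A, B, C appears first in the string
extracted = s.str.extract('(A|B|C)', expand=False).map(d).fillna('Check')

# The function always prefers A over B over C
applied = s.apply(classification)

print(pd.DataFrame({'Log': s, 'extract': extracted, 'apply': applied}))
```

The two columns disagree on 'BA': the extract version gives 'Situation B', while the function gives 'Situation A'. If a log line can never contain more than one of the letters, the two approaches agree.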
