Pandas - replace for loop for efficiency

Question

I have a data frame (df)

df = pd.DataFrame({'No': [123,234,345,456,567,678], 'text': ['60 ABC','1nHG','KL HG','21ABC','K 200','1g HG'], 'reference':['ABC','HG','FL','','200',''], 'result':['','','','','','']}, columns=['No', 'text', 'reference', 'result'])

    No    text reference result
0  123  60 ABC       ABC       
1  234    1nHG        HG       
2  345   KL HG        FL       
3  456   21ABC                 
4  567   K 200       200       
5  678   1g HG

and a list with elements

list
['ABC','HG','FL','200','CP1']

Now I have the following coding:

for idx, row in df.iterrows(): 

    for item in list:

        if row['text'].strip().endswith(item):

            if pd.isnull(row['reference']):
                df.at[idx, 'result'] = item

            elif pd.notnull(row['reference']) and row['reference'] != item:                
                df.at[idx, 'result'] = 'wrong item'

            if pd.isnull(row['result']):
                break

I run through df and the list and check for matches.

Output:

    No    text reference      result
0  123  60 ABC       ABC            
1  234    1nHG        HG            
2  345   KL HG        FL  wrong item
3  456   21ABC                   ABC
4  567   K 200       200            
5  678   1g HG                    HG

The break instruction is important because otherwise a second element could be found within the list and then this second element would overwrite the content in result.

Now I need another solution because the data frame is huge and for loops are inefficient. Think using apply could work but how?

Thank you!

Please don't use ascii borders for DataFrames. Instead just copy the DataFrame as formatted in your shell. — Alex
– Alex, Commented Oct 2, 2018 at 16:52

jpp · Accepted Answer · 2018-10-02 17:19:34Z

2

Instead of iterating rows, you can iterate your suffixes, which is likely a much smaller iterable. This way, you can take advantage of series-based methods and Boolean indexing.

I've also created an extra series to identify when a row has been updated. The cost of this extra check should be small versus the expense of iterating by row.

L = ['ABC', 'HG', 'FL', '200', 'CP1']

df['text'] = df['text'].str.strip()
null = df['reference'].eq('')
df['updated'] = False

for item in L:
    ends = df['text'].str.endswith(item)
    diff = df['reference'].ne(item)

    m1 = ends & null & ~df['updated']
    m2 = ends & diff & ~null & ~df['updated']

    df.loc[m1, 'result'] = item
    df.loc[m2, 'result'] = 'wrong item'

    df.loc[m1 | m2, 'updated'] = True

Result:

    No    text reference      result updated
0  123  60 ABC       ABC               False
1  234    1nHG        HG               False
2  345   KL HG        FL  wrong item    True
3  456   21ABC                   ABC    True
4  567   K 200       200               False
5  678   1g HG                    HG    True

You can drop the final column, but you may find it useful for other purposes.

edited Oct 2, 2018 at 17:19

answered Oct 2, 2018 at 17:06

jpp

166k37 gold badges301 silver badges363 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

MaMo Over a year ago

Thank you! It is so much faster this way. You do not see a solution where I do not need any for loop right? Because at this point I can not say how many elements my list will carry. So far there are hundreds, but more will follow for sure.

jpp Over a year ago

@MaMo, I can't see anything obvious. Feel free to keep this question open for a while. Someone else might be able to think of a smarter way!

Collectives™ on Stack Overflow

Pandas - replace for loop for efficiency

1 Answer 1

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related