1

I have a data frame (df)

df = pd.DataFrame({'No': [123,234,345,456,567,678], 'text': ['60 ABC','1nHG','KL HG','21ABC','K 200','1g HG'], 'reference':['ABC','HG','FL','','200',''], 'result':['','','','','','']}, columns=['No', 'text', 'reference', 'result'])

    No    text reference result
0  123  60 ABC       ABC       
1  234    1nHG        HG       
2  345   KL HG        FL       
3  456   21ABC                 
4  567   K 200       200       
5  678   1g HG                 

and a list with elements

list
['ABC','HG','FL','200','CP1']

Now I have the following coding:

for idx, row in df.iterrows(): 

    for item in list:

        if row['text'].strip().endswith(item):

            if pd.isnull(row['reference']):
                df.at[idx, 'result'] = item

            elif pd.notnull(row['reference']) and row['reference'] != item:                
                df.at[idx, 'result'] = 'wrong item'

            if pd.isnull(row['result']):
                break

I run through df and the list and check for matches.

Output:

    No    text reference      result
0  123  60 ABC       ABC            
1  234    1nHG        HG            
2  345   KL HG        FL  wrong item
3  456   21ABC                   ABC
4  567   K 200       200            
5  678   1g HG                    HG

The break instruction is important because otherwise a second element could be found within the list and then this second element would overwrite the content in result.

Now I need another solution because the data frame is huge and for loops are inefficient. Think using apply could work but how?

Thank you!

2
  • Please don't use ascii borders for DataFrames. Instead just copy the DataFrame as formatted in your shell. Commented Oct 2, 2018 at 16:52
  • 1
    @Alex - hope that is what you requested Commented Oct 2, 2018 at 17:16

1 Answer 1

2

Instead of iterating rows, you can iterate your suffixes, which is likely a much smaller iterable. This way, you can take advantage of series-based methods and Boolean indexing.

I've also created an extra series to identify when a row has been updated. The cost of this extra check should be small versus the expense of iterating by row.

L = ['ABC', 'HG', 'FL', '200', 'CP1']

df['text'] = df['text'].str.strip()
null = df['reference'].eq('')
df['updated'] = False

for item in L:
    ends = df['text'].str.endswith(item)
    diff = df['reference'].ne(item)

    m1 = ends & null & ~df['updated']
    m2 = ends & diff & ~null & ~df['updated']

    df.loc[m1, 'result'] = item
    df.loc[m2, 'result'] = 'wrong item'

    df.loc[m1 | m2, 'updated'] = True

Result:

    No    text reference      result updated
0  123  60 ABC       ABC               False
1  234    1nHG        HG               False
2  345   KL HG        FL  wrong item    True
3  456   21ABC                   ABC    True
4  567   K 200       200               False
5  678   1g HG                    HG    True

You can drop the final column, but you may find it useful for other purposes.

Sign up to request clarification or add additional context in comments.

2 Comments

Thank you! It is so much faster this way. You do not see a solution where I do not need any for loop right? Because at this point I can not say how many elements my list will carry. So far there are hundreds, but more will follow for sure.
@MaMo, I can't see anything obvious. Feel free to keep this question open for a while. Someone else might be able to think of a smarter way!

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.