Pandas dataframe - Select rows where one column's values contain a string and another column's values start with specific strings

Question

I'm looking to select rows where state contains the word Traded and trading_book does not start with the letters 'E', 'L', or 'N'.

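import pandas as pd

Test_Data = [('originating_system_id', ['RBCL', 'RBCL', 'RBCL', 'RBCL']),
             ('rbc_security_type1', ['CORP', 'CORP', 'CORP', 'CORP']),
             ('state', ['Traded', 'Traded Away', 'Traded', 'Traded Away']),
             ('trading_book', ['LCAAAAA', 'NUBBBBB', 'EDFGSFG', 'PDFEFGR'])
             ]
# DataFrame.from_items was removed in pandas 1.0; a dict preserves the column order
dfTest_Data = pd.DataFrame(dict(Test_Data))
display(dfTest_Data)  # display() exists in Jupyter/IPython; use print() elsewhere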

originating_system_id   rbc_security_type1     state        trading_book
        RBCL                   CORP            Traded          LCAAAAA
        RBCL                   CORP            Traded Away     NUBBBBB
        RBCL                   CORP            Traded          EDFGSFG
        RBCL                   CORP            Traded Away     PDFEFGR

Desired output:

originating_system_id   rbc_security_type1     state        trading_book
        RBCL                   CORP            Traded Away     PDFEFGR

I thought this would do the trick:

prefixes = ['E','L','N']
df_Traded_Away_User = dfTest_Data[
                                    dfTest_Data[~dfTest_Data['trading_book'].str.startswith(tuple(prefixes))]  &
                                    (dfTest_Data['state'].str.contains('Traded')) 
                                ][['originating_system_id','rbc_security_type1','state','trading_book']]
display(df_Traded_Away_User)

but I'm getting:

ValueError: Must pass DataFrame with boolean values only

jezrael · Accepted Answer · 2018-07-17 05:38:57Z

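The error comes from the extra dfTest_Data[...] wrapped around the first condition, which turns it into a DataFrame instead of a boolean Series. I suggest creating each boolean mask separately for more readable code and then chaining them with &: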

prefixes = ['E','L','N']

m1 = ~dfTest_Data['trading_book'].str.startswith(tuple(prefixes))
m2 = dfTest_Data['state'].str.contains('Traded')

cols = ['originating_system_id','rbc_security_type1','state','trading_book']
df_Traded_Away_User = dfTest_Data.loc[m1 & m2, cols]
print(df_Traded_Away_User)
  originating_system_id rbc_security_type1        state trading_book
3                  RBCL               CORP  Traded Away      PDFEFGR
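
For anyone trying this on a recent pandas release, here is a self-contained sketch of the same approach. It assumes pandas is installed and a version where str.startswith accepts a tuple. The short names df, starts_with_prefix, rows and result are only for illustration and are not from the original post. It also shows why the attempt in the question failed: indexing with the negated mask gives back rows, not a mask.

```python
import pandas as pd

# Sample data from the question, built from a plain dict
# (an assumption made here so the snippet runs without from_items).
df = pd.DataFrame({
    'originating_system_id': ['RBCL', 'RBCL', 'RBCL', 'RBCL'],
    'rbc_security_type1': ['CORP', 'CORP', 'CORP', 'CORP'],
    'state': ['Traded', 'Traded Away', 'Traded', 'Traded Away'],
    'trading_book': ['LCAAAAA', 'NUBBBBB', 'EDFGSFG', 'PDFEFGR'],
})

prefixes = ('E', 'L', 'N')
starts_with_prefix = df['trading_book'].str.startswith(prefixes)

# What the question did: df[~mask] selects rows and returns a DataFrame,
# so it cannot be combined with another mask and used as an indexer.
rows = df[~starts_with_prefix]
print(type(rows).__name__)  # DataFrame

# What the answer does: keep both conditions as boolean Series.
m1 = ~starts_with_prefix
m2 = df['state'].str.contains('Traded')

result = df.loc[m1 & m2]
print(result)
```

If the columns may contain missing values, passing na=False to startswith and contains is one way to keep both masks strictly boolean.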

2 Comments

Peter Lucas Over a year ago

Working. Is using .loc preferable when filtering rows?

jezrael Over a year ago

@PeterLucas - It depends on what you need. If you want to keep all columns, then df_Traded_Away_User = dfTest_Data[m1 & m2] is better. If you want only some columns, e.g. two columns like cols = ['originating_system_id', 'trading_book'], then loc is needed to select rows and columns in one step: df_Traded_Away_User = dfTest_Data.loc[m1 & m2, cols].
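
The two forms from this comment can be compared side by side. This is a minimal sketch, assuming a recent pandas; the names df, mask, all_cols and some_cols are illustrative and not from the thread.

```python
import pandas as pd

# Same sample data as in the question.
df = pd.DataFrame({
    'originating_system_id': ['RBCL', 'RBCL', 'RBCL', 'RBCL'],
    'rbc_security_type1': ['CORP', 'CORP', 'CORP', 'CORP'],
    'state': ['Traded', 'Traded Away', 'Traded', 'Traded Away'],
    'trading_book': ['LCAAAAA', 'NUBBBBB', 'EDFGSFG', 'PDFEFGR'],
})

# Combined row mask: book does not start with E/L/N and state contains 'Traded'.
mask = (~df['trading_book'].str.startswith(('E', 'L', 'N'))
        & df['state'].str.contains('Traded'))

# Plain boolean indexing: filters rows, keeps every column.
all_cols = df[mask]

# .loc with a column list: filters rows and picks columns in one step.
cols = ['originating_system_id', 'trading_book']
some_cols = df.loc[mask, cols]

print(all_cols.shape)
print(some_cols.shape)
```

Both calls select the same rows; the difference is only in which columns come back.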
