filter based on partial string for multiple columns headers

Question

Consider the following Dataframe:

           AAA3 ABB3 DAT4 DEE3 ABB4 AAA4 DAA3 EAV3 DAC4 DEE4
01/01/2020  1    1.1  1.5  1.2  1.32 1.2  1.2  1    0.9  0.5
02/01/2020  1    1.1  1.5  1.2  1.32 1.2  1.2  1    0.9  0.5
03/01/2020  1    1.1  1.5  1.2  1.32 1.2  1.2  1    0.9  0.5
04/01/2020  1    1.1  1.5  1.2  1.32 1.2  1.2  1    0.9  0.5

The values are not important, so I am giving all columns the same value.

What I want to do is see if the alphabets characters part of my columns headers, has an match among the headers, and if it does, remove the header that has an 4, leaving only the name that has a 3.

For example:

There is an AAA3, as well as an AAA4. I want to drop the AAA4 column, leaving only AAA3.

Note that there is a column named DAC4, but there isn't a DAC3. So I want to keep my DAC4 column.

I couldn't solve my problem with the following question : Select by partial string from a pandas DataFrame

Andy L. · Accepted Answer · 2020-07-26 23:13:43Z

2

Create a mask on duplicated of alphabet part. Create another mask where the last char is 3. Finally slicing using these masks

m = df.columns.str.extract(r'(^[A-Za-z]+)').duplicated(keep=False)
m1 = df.columns.str.endswith('3')
df_final =  df.loc[:,(~m | m1).values]

Out[146]:
            AAA3  ABB3  DAT4  DEE3  DAA3  EAV3  DAC4
01/01/2020     1   1.1   1.5   1.2   1.2     1   0.9
02/01/2020     1   1.1   1.5   1.2   1.2     1   0.9
03/01/2020     1   1.1   1.5   1.2   1.2     1   0.9
04/01/2020     1   1.1   1.5   1.2   1.2     1   0.9

answered Jul 26, 2020 at 23:13

Andy L.

25.3k4 gold badges20 silver badges30 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

khouzam Over a year ago

Thank you! Could you explain what (r'(^[A-Za-z]+)') means ?

Andy L. Over a year ago

@khouzam: it's the regex pattern for getting all alphabetic chars both uppercase and lowercase ignore digits. The ^ to indicate starting from the front.

sammywemmy · Accepted Answer · 2020-07-26 23:00:35Z

2

Step 1: Get a dictionary of similar columns :

from collections import defaultdict
from itertools import chain

d = defaultdict(list)
for entry in df.columns:
    d[entry[:-1]].append(entry)

d

defaultdict(list,
            {'AAA': ['AAA3', 'AAA4'],
             'ABB': ['ABB3', 'ABB4'],
             'DAT': ['DAT4'],
             'DEE': ['DEE3', 'DEE4'],
             'DAA': ['DAA3'],
             'EAV': ['EAV3'],
             'DAC': ['DAC4']})

Step 2: Get columns that end with 4 :

from itertools import chain

cols_to_drop = list(chain.from_iterable([[ent for ent in value 
                                         if ent.endswith("4")]
                                         for key,value in d.items() 
                                         if len(value) > 1]))

cols_to_drop
['AAA4', 'ABB4', 'DEE4']

Step 3: Drop columns :

df.drop(columns=cols_to_drop)

    AAA3    ABB3    DAT4    DEE3    DAA3    EAV3    DAC4
0   01/01/2020  1   1.1 1.5 1.2 1   0.9
1   02/01/2020  1   1.1 1.5 1.2 1   0.9
2   03/01/2020  1   1.1 1.5 1.2 1   0.9
3   04/01/2020  1   1.1 1.5 1.2 1   0.9

answered Jul 26, 2020 at 23:00

sammywemmy

28.9k4 gold badges21 silver badges35 bronze badges

1 Comment

khouzam Over a year ago

This solution also worked perfectly for me! thank you for your time helping me

Collectives™ on Stack Overflow

filter based on partial string for multiple columns headers

2 Answers 2

2 Comments

1 Comment

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

2 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Linked

Related