2

Consider the following Dataframe:

           AAA3 ABB3 DAT4 DEE3 ABB4 AAA4 DAA3 EAV3 DAC4 DEE4
01/01/2020  1    1.1  1.5  1.2  1.32 1.2  1.2  1    0.9  0.5
02/01/2020  1    1.1  1.5  1.2  1.32 1.2  1.2  1    0.9  0.5
03/01/2020  1    1.1  1.5  1.2  1.32 1.2  1.2  1    0.9  0.5
04/01/2020  1    1.1  1.5  1.2  1.32 1.2  1.2  1    0.9  0.5

The values are not important, so I am giving all columns the same value.

What I want to do is see if the alphabets characters part of my columns headers, has an match among the headers, and if it does, remove the header that has an 4, leaving only the name that has a 3.

For example: 

There is an AAA3, as well as an AAA4. I want to drop the AAA4 column, leaving only AAA3.

Note that there is a column named DAC4, but there isn't a DAC3. So I want to keep my DAC4 column.

I couldn't solve my problem with the following question : Select by partial string from a pandas DataFrame

2 Answers 2

2

Create a mask on duplicated of alphabet part. Create another mask where the last char is 3. Finally slicing using these masks

m = df.columns.str.extract(r'(^[A-Za-z]+)').duplicated(keep=False)
m1 = df.columns.str.endswith('3')
df_final =  df.loc[:,(~m | m1).values]

Out[146]:
            AAA3  ABB3  DAT4  DEE3  DAA3  EAV3  DAC4
01/01/2020     1   1.1   1.5   1.2   1.2     1   0.9
02/01/2020     1   1.1   1.5   1.2   1.2     1   0.9
03/01/2020     1   1.1   1.5   1.2   1.2     1   0.9
04/01/2020     1   1.1   1.5   1.2   1.2     1   0.9
Sign up to request clarification or add additional context in comments.

2 Comments

Thank you! Could you explain what (r'(^[A-Za-z]+)') means ?
@khouzam: it's the regex pattern for getting all alphabetic chars both uppercase and lowercase ignore digits. The ^ to indicate starting from the front.
2

Step 1: Get a dictionary of similar columns :

from collections import defaultdict
from itertools import chain

d = defaultdict(list)
for entry in df.columns:
    d[entry[:-1]].append(entry)

d

defaultdict(list,
            {'AAA': ['AAA3', 'AAA4'],
             'ABB': ['ABB3', 'ABB4'],
             'DAT': ['DAT4'],
             'DEE': ['DEE3', 'DEE4'],
             'DAA': ['DAA3'],
             'EAV': ['EAV3'],
             'DAC': ['DAC4']})

Step 2: Get columns that end with 4 :

from itertools import chain

cols_to_drop = list(chain.from_iterable([[ent for ent in value 
                                         if ent.endswith("4")]
                                         for key,value in d.items() 
                                         if len(value) > 1]))

cols_to_drop
['AAA4', 'ABB4', 'DEE4']

Step 3: Drop columns :

df.drop(columns=cols_to_drop)

    AAA3    ABB3    DAT4    DEE3    DAA3    EAV3    DAC4
0   01/01/2020  1   1.1 1.5 1.2 1   0.9
1   02/01/2020  1   1.1 1.5 1.2 1   0.9
2   03/01/2020  1   1.1 1.5 1.2 1   0.9
3   04/01/2020  1   1.1 1.5 1.2 1   0.9

1 Comment

This solution also worked perfectly for me! thank you for your time helping me

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.