How to process column names and create new columns

Question

This is my pandas DataFrame with original column names.

old_dt_cm1_tt   old_dm_cm1   old_rr_cm2_epf   old_gt
1               3            0                0
2               1            1                5

First, I want to extract all unique variations of cm, in this case cm1 and cm2.
Then I want to create one new column per unique cm. In this example there should be 2 new columns.
Finally, each new column should store, for every row, the number of non-zero values in the original columns that share that cm, i.e.

old_dt_cm1_tt   old_dm_cm1   old_rr_cm2_epf   old_gt    cm1    cm2    
1               3            0                0         2      0        
2               1            1                5         2      1

I implemented the first step as follows:

cols = pd.DataFrame(list(df.columns))
ind = [c for c in df.columns if 'cm' in c]
df.loc[:, ind].columns  # .ix was removed in pandas 1.0; .loc is the label-based equivalent

How can I proceed with steps 2 and 3 so that the solution is automatic? I don't want to define the column names cm1 and cm2 manually, because the original data set may have many cm variations.

This seems more like a string operation than DataFrame work. Could you build the new name from cm and the character after it? – Tadhg McDonald-Jensen, Commented Mar 11, 2016 at 14:37
An answer from here could help get the character after cm for each column. – Tadhg McDonald-Jensen, Commented Mar 11, 2016 at 14:41
@JoeBlack Can a column name contain two strings starting with cm, e.g. cmd_cm1_aa? – jezrael, Commented Mar 11, 2016 at 15:59

jezrael · Accepted Answer · 2016-03-11 18:58:06Z

Starting from this DataFrame:

print(df)
   old_dt_cm1_tt  old_dm_cm1  old_rr_cm2_epf  old_gt
0              1           3               0       0
1              2           1               1       5

First, filter the columns whose names contain the string cm, so that columns without cm are dropped.

df1 = df.filter(regex='cm')

Now rename each column to the cm tag it contains, such as cm1 or cm2.

print([cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm'])
['cm1', 'cm1', 'cm2']

df1.columns = [cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm']
print(df1)
   cm1  cm1  cm2
0    1    3    0
1    2    1    1
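
If column names may contain other tokens that start with cm (for example the name cmd_cm1_aa raised in the comments), a regular expression can be stricter than the split-based comprehension. A minimal sketch, assuming every tag of interest is cm followed by one or more digits and that each column carries at most one such tag; the column cmd_cm1_aa is a hypothetical addition to the sample data:

```python
import pandas as pd

# Sample frame; cmd_cm1_aa is a hypothetical column added to show the edge case.
df = pd.DataFrame({
    'old_dt_cm1_tt': [1, 2],
    'old_dm_cm1': [3, 1],
    'old_rr_cm2_epf': [0, 1],
    'cmd_cm1_aa': [0, 4],
    'old_gt': [0, 5],
})

# Keep only columns whose name contains "cm" followed by at least one digit.
df1 = df.filter(regex=r'cm\d+')

# Pull the first "cm<digits>" token out of each name; expand=False returns an Index.
df1.columns = df1.columns.str.extract(r'(cm\d+)', expand=False)

print(list(df1.columns))
```

Because the pattern requires a digit after cm, the cmd prefix is skipped and each column yields exactly one name, so the assignment to df1.columns keeps its length.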

Now count the non-zero values: convert df1 to a boolean DataFrame and sum it, since True counts as 1 and False as 0. The count is needed per unique column name, so group by the column names and sum within each group.

df1 = df1.astype(bool)
print(df1)
    cm1   cm1    cm2
0  True  True  False
1  True  True   True

print(df1.groupby(df1.columns, axis=1).sum())
   cm1  cm2
0    2    0
1    2    1
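
Recent pandas releases deprecate the axis=1 argument of groupby, so the call above may emit a warning or fail depending on the installed version. One equivalent sketch transposes first, groups on the index labels, and transposes back. The frame below simply re-creates the boolean df1 from this step, and the name counts is introduced here for illustration:

```python
import pandas as pd

# Boolean frame with duplicate column labels, as produced by the previous step.
df1 = pd.DataFrame(
    [[True, True, False], [True, True, True]],
    columns=['cm1', 'cm1', 'cm2'],
)

# Transpose so the duplicate labels become index labels,
# sum the booleans per label, then transpose back to the original orientation.
counts = df1.T.groupby(level=0).sum().T

print(counts)
```

The result has one column per unique label, which can be assigned back to the original df in the same way as the axis=1 version.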

The unique column names are the ones to add to the original df:

print(df1.columns.unique())
['cm1' 'cm2']

Finally, assign the groupby result to the new columns. Here df[df1.columns.unique()] is equivalent to df[['cm1', 'cm2']]:

df[df1.columns.unique()] = df1.groupby(df1.columns, axis=1).sum()
print(df)
   old_dt_cm1_tt  old_dm_cm1  old_rr_cm2_epf  old_gt  cm1  cm2
0              1           3               0       0    2    0
1              2           1               1       5    2    1

edited Mar 11, 2016 at 18:58

answered Mar 11, 2016 at 15:25

jezrael

6 Comments

Tadhg McDonald-Jensen Over a year ago

This is clearly a great method, but it needs a bit more explanation. I don't know much about pandas and wouldn't be able to use this other than by copy-pasting, which is not all that helpful to the OP in the long run.

jezrael Over a year ago

OK, no problem. I'll try to explain more.

Tadhg McDonald-Jensen Over a year ago

I'd also like to point out that [cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm'] will raise an error if cm is present in a column name more than once (like cmd_cm1), whereas ['cm'+c[c.index('cm') + 2] for c in df.columns if 'cm' in c] will only use the first occurrence of the substring.

Tadhg McDonald-Jensen Over a year ago

Although the two methods work on different principles, maybe just avoid the list comprehension and add appropriate break statements so that a single column cannot contribute multiple names.

jezrael Over a year ago

Thank you for checking my code. I have to leave now. I'll ask the OP whether strings like cmd can appear in column names. If so, I'll update the answer; if not, no problem.

Community · Accepted Answer · 2017-05-23 11:45:20Z

Once you know which columns have cm in them you can map them (with a dict) to the desired new column with an adapted version of this answer:

# len('cm') is 2; if you are hard-coding the prefix you might as well write 2
col_map = {c: 'cm' + c[c.index('cm') + len('cm')] for c in ind}

so that each column maps to cm plus the single character directly following it, rather than to everything after cm. In this case the result is:

{'old_dm_cm1': 'cm1', 'old_dt_cm1_tt': 'cm1', 'old_rr_cm2_epf': 'cm2'}
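
The expression above keeps only the single character after cm, so a column with a two-digit tag would be truncated (cm10 would become cm1). A sketch of one way around that, assuming the tag is always cm followed by digits, builds the map with a regular expression instead. The column old_x_cm10 and the name tag_re are hypothetical, added only for illustration:

```python
import re

# Column names containing "cm"; old_x_cm10 is a hypothetical addition.
ind = ['old_dt_cm1_tt', 'old_dm_cm1', 'old_rr_cm2_epf', 'old_x_cm10']

# Compile once; matches "cm" followed by one or more digits.
tag_re = re.compile(r'cm\d+')

# Map each original column to the first tag found in its name.
col_map = {}
for c in ind:
    m = tag_re.search(c)
    if m:
        col_map[c] = m.group(0)

print(col_map)
```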

Then add the new columns to the DataFrame by iterating over the dict:

for col, new_col in col_map.items():
    nonzero = [int(a != 0) for a in df[col]]  # 1 where the value is non-zero, else 0
    if new_col not in df:
        df[new_col] = nonzero
    else:
        df[new_col] += nonzero

Note that int(a != 0) gives 0 if the value is 0 and 1 otherwise. The only issue is ordering: dicts did not guarantee any order before Python 3.7, so it may be preferable to add the new columns in order of their values (like the answer here):

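import operator

# Sort by the new column name so that cm1 is created before cm2, and so on.
for col, new_col in sorted(col_map.items(), key=operator.itemgetter(1)):
    nonzero = [int(a != 0) for a in df[col]]  # 1 where the value is non-zero, else 0
    if new_col in df:
        df[new_col] += nonzero
    else:
        df[new_col] = nonzero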

to ensure the new columns are inserted in order.
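
The per-element list comprehensions can also be replaced by pandas' own comparisons. A sketch under the same assumptions, with col_map written out by hand and the helper dict sources introduced only for illustration: invert the map so each new column lists its source columns, then count the non-zero entries row by row.

```python
import pandas as pd

df = pd.DataFrame({
    'old_dt_cm1_tt': [1, 2],
    'old_dm_cm1': [3, 1],
    'old_rr_cm2_epf': [0, 1],
    'old_gt': [0, 5],
})
col_map = {'old_dt_cm1_tt': 'cm1', 'old_dm_cm1': 'cm1', 'old_rr_cm2_epf': 'cm2'}

# Invert the map: new column name -> list of source columns.
sources = {}
for col, new_col in col_map.items():
    sources.setdefault(new_col, []).append(col)

# ne(0) marks non-zero cells; summing across axis=1 counts them per row.
# sorted() orders the new names as strings (so cm10 would come before cm2).
for new_col in sorted(sources):
    df[new_col] = df[sources[new_col]].ne(0).sum(axis=1)

print(df)
```

This avoids the if/else accumulation, since each new column is computed once from all of its source columns.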
