This is my pandas DataFrame with original column names.

old_dt_cm1_tt   old_dm_cm1   old_rr_cm2_epf   old_gt
1               3            0                0
2               1            1                5

  1. First, I want to extract all unique variations of cm, in this case cm1 and cm2.
  2. Then I want to create one new column per unique cm. In this example there should be 2 new columns.
  3. Finally, each new column should store, for each row, the count of non-zero values in the original columns that contain that cm, i.e.

old_dt_cm1_tt   old_dm_cm1   old_rr_cm2_epf   old_gt    cm1    cm2
1               3            0                0         2      0
2               1            1                5         2      1

I implemented the first step as follows:

cols = pd.DataFrame(list(df.columns))
ind = [c for c in df.columns if 'cm' in c]
df.loc[:, ind].columns  # .ix was removed in pandas 1.0; .loc selects by label

How do I proceed with steps 2 and 3 so that the solution is automatic? I don't want to define the column names cm1 and cm2 manually, because the original data set might have many cm variations.

  • This seems more like a string operation than DataFrame stuff. Could you set it up to build the new name from cm and the character after it? Commented Mar 11, 2016 at 14:37
  • An answer from here could help get the character after cm for each column. Commented Mar 11, 2016 at 14:41
  • @JoeBlack Can a column name contain two strings starting with cm, e.g. cmd_cm1_aa? Commented Mar 11, 2016 at 15:59
  • @jezrael: No, it's impossible. Commented Mar 11, 2016 at 16:26

2 Answers

You can use:

print(df)
   old_dt_cm1_tt  old_dm_cm1  old_rr_cm2_epf  old_gt
0              1           3               0       0
1              2           1               1       5

First, filter for the columns whose names contain the string cm, so that columns without cm are dropped.

df1 = df.filter(regex='cm')
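As a quick check of the filtering step, here is a minimal sketch that rebuilds the sample frame from the question (the literal values are copied from the printed table) and lists which columns survive. The variable name kept is only illustrative.

```python
import pandas as pd

# Sample frame copied from the question.
df = pd.DataFrame({
    'old_dt_cm1_tt': [1, 2],
    'old_dm_cm1': [3, 1],
    'old_rr_cm2_epf': [0, 1],
    'old_gt': [0, 5],
})

# regex= is matched with re.search, so any column whose name
# contains 'cm' anywhere is kept; old_gt is dropped.
df1 = df.filter(regex='cm')
kept = list(df1.columns)
print(kept)
```

For a plain substring like this, df.filter(like='cm') should select the same columns.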

Now you can rename the columns to their cm labels, such as cm1, cm2, cm3.

print([cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm'])
['cm1', 'cm1', 'cm2']

df1.columns = [cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm']
print(df1)
   cm1  cm1  cm2
0    1    3    0
1    2    1    1
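If column names could ever contain another token that starts with cm (the cmd_cm1_aa case raised in the comments), the split-based comprehension would yield more names than columns. A possible alternative, sketched below, uses a regular expression. It assumes that a label is always cm followed by digits and delimited by underscores or the ends of the name; cm_label and CM_TOKEN are hypothetical names, not part of the answer.

```python
import re

# Assumption: a label is 'cm' + digits, bounded by '_' or the ends of the name.
CM_TOKEN = re.compile(r'(?:^|_)(cm\d+)(?=_|$)')

def cm_label(column_name):
    """Return the first cm label in column_name, or None if there is none."""
    match = CM_TOKEN.search(column_name)
    return match.group(1) if match else None

names = ['old_dt_cm1_tt', 'old_dm_cm1', 'old_rr_cm2_epf', 'old_gt', 'cmd_cm1_aa']
labels = [cm_label(n) for n in names]
print(labels)
```

Because each column yields at most one label, the resulting list always lines up with the column list it was built from.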

Now you can count the non-zero values: convert df1 to a boolean DataFrame and sum it, since True counts as 1 and False as 0. Because the counts are needed per unique column name, group by the column names and sum within each group.

df1 = df1.astype(bool)
print(df1)
    cm1   cm1    cm2
0  True  True  False
1  True  True   True

print(df1.groupby(df1.columns, axis=1).sum())
   cm1  cm2
0    2    0
1    2    1
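As far as I know, groupby(..., axis=1) is deprecated in recent pandas releases. A sketch of the same per-label count without axis=1 is to transpose, group on the index, and transpose back. The small frame below reproduces df1 after the renaming step; the names nonzero and counts are only illustrative.

```python
import pandas as pd

# df1 as it looks after renaming: duplicate column labels are intentional.
df1 = pd.DataFrame([[1, 3, 0], [2, 1, 1]], columns=['cm1', 'cm1', 'cm2'])

# 1 where the value is non-zero, 0 otherwise.
nonzero = (df1 != 0).astype(int)

# Transpose so the labels become the index, sum rows that share a label,
# then transpose back so each label is a column again.
counts = nonzero.T.groupby(level=0).sum().T
print(counts)
```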

You need the unique column names, which become the names of the columns added to the original df:

print(df1.columns.unique())
['cm1' 'cm2']

Finally, add the new columns to df by assigning the groupby result to df[['cm1','cm2']]:

df[df1.columns.unique()] = df1.groupby(df1.columns, axis=1).sum()
print(df)
   old_dt_cm1_tt  old_dm_cm1  old_rr_cm2_epf  old_gt  cm1  cm2
0              1           3               0       0    2    0
1              2           1               1       5    2    1
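The steps above can be collected into one reusable function. This is only a sketch under the same assumptions as the answer (labels are '_'-separated tokens starting with cm, at most one per column); add_cm_counts is a hypothetical name, and the grouping uses a transpose instead of axis=1.

```python
import pandas as pd

def add_cm_counts(df):
    """Return a copy of df with one non-zero-count column per cm label."""
    # Map each source column to its first '_'-separated token starting with 'cm'.
    labels = {}
    for col in df.columns:
        tokens = [t for t in str(col).split('_') if t.startswith('cm')]
        if tokens:
            labels[col] = tokens[0]

    out = df.copy()
    if not labels:
        return out  # nothing to count

    # 1 where non-zero, 0 otherwise, with columns renamed to their labels.
    nonzero = (df[list(labels)] != 0).astype(int)
    nonzero.columns = [labels[c] for c in nonzero.columns]

    # Sum columns that share a label (transpose, group on index, transpose back).
    counts = nonzero.T.groupby(level=0).sum().T
    for name in counts.columns:
        out[name] = counts[name]
    return out

df = pd.DataFrame({
    'old_dt_cm1_tt': [1, 2],
    'old_dm_cm1': [3, 1],
    'old_rr_cm2_epf': [0, 1],
    'old_gt': [0, 5],
})
result = add_cm_counts(df)
print(result)
```

Returning a copy keeps the original frame untouched, so the function can be rerun without the counts accumulating.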

6 Comments

This is clearly a great method, but it needs a bit more explanation. I don't know much about pandas and couldn't use this other than by copy-pasting, which is not all that helpful to the OP in the long run.
OK, no problem. I'll try to explain more.
I'd also like to point out that [cm for c in df1.columns for cm in c.split('_') if cm[:2] == 'cm'] will produce too many names, so assigning it to df1.columns will raise an error, if cm is present in a column name more than once (like cmd_cm1), whereas ['cm'+c[c.index('cm') + 2] for c in df.columns if 'cm' in c] will only use the first occurrence of the substring.
Although the two methods work on different principles, you could also avoid the list comprehension and add appropriate break statements so that a single column never contributes more than one name.
Thank you for checking my code. I have to go now. I'll ask the OP whether strings like cmd can occur in column names. If so, I'll update the answer; if not, no problem.

Once you know which columns have cm in them you can map them (with a dict) to the desired new column with an adapted version of this answer:

# len('cm') is 2; if you are hard-coding the prefix you might as well write 2
col_map = {c: 'cm' + c[c.index('cm') + len('cm')] for c in ind}

so that instead of the string after cm, each value is cm plus the character directly following it. In this case the result would be:

{'old_dm_cm1': 'cm1', 'old_dt_cm1_tt': 'cm1', 'old_rr_cm2_epf': 'cm2'}
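One thing to keep in mind is that this mapping takes exactly one character after cm. The sketch below rebuilds col_map from the question's column names and then shows what happens with a two-digit label; the name old_x_cm10 is a made-up example, not from the question.

```python
columns = ['old_dt_cm1_tt', 'old_dm_cm1', 'old_rr_cm2_epf', 'old_gt']

# Columns containing 'cm', as in the question's first step.
ind = [c for c in columns if 'cm' in c]

# 'cm' plus the single character that follows its first occurrence.
col_map = {c: 'cm' + c[c.index('cm') + len('cm')] for c in ind}
print(col_map)

# Hypothetical two-digit label: only the first digit is kept.
name = 'old_x_cm10'
truncated = 'cm' + name[name.index('cm') + len('cm')]
print(truncated)
```

If labels can have more than one digit, cm10 would be merged into cm1, so a longer slice or a regular expression would be needed.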

Then add the new columns to the DataFrame by iterating over the dict:

for col, new_col in col_map.items():
    if new_col not in df:
        df[new_col] = [int(a != 0) for a in df[col]]
    else:
        df[new_col] += [int(a != 0) for a in df[col]]

Note that int(a != 0) simply gives 0 if the value is 0 and 1 otherwise. The only issue is ordering: dicts did not preserve order before Python 3.7, and even now they only keep insertion order, so it may be preferable to add the new columns sorted by the dict's values (like the answer here):

import operator

for col, new_col in sorted(col_map.items(), key=operator.itemgetter(1)):
    if new_col in df:
        df[new_col] += [int(a != 0) for a in df[col]]
    else:
        df[new_col] = [int(a != 0) for a in df[col]]

to ensure the new columns are inserted in order.
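The same result can probably be reached without the per-element list comprehensions by letting pandas compare and sum whole columns. This is a sketch, not the answer's method: it reuses col_map as built above (written out literally here) and loops over the sorted unique labels.

```python
import pandas as pd

df = pd.DataFrame({
    'old_dt_cm1_tt': [1, 2],
    'old_dm_cm1': [3, 1],
    'old_rr_cm2_epf': [0, 1],
    'old_gt': [0, 5],
})
col_map = {'old_dt_cm1_tt': 'cm1', 'old_dm_cm1': 'cm1', 'old_rr_cm2_epf': 'cm2'}

# Sorted unique labels give a deterministic column order.
for new_col in sorted(set(col_map.values())):
    sources = [c for c, label in col_map.items() if label == new_col]
    # Row-wise count of non-zero values across this label's source columns.
    df[new_col] = (df[sources] != 0).sum(axis=1)

print(df)
```

Because each new column is assigned rather than incremented, rerunning the loop does not double the counts.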
