I have a dataframe laid out as shown below, except that it does not yet contain "flag_common":

cat      flag_1   flag_2  flag_3   pop      state       year    flag_common
value1   1        0       0        1.5      Ohio        2000    1
value3   1        1       0        1.7      Ohio        2001    1
value2   1        1       0        3.6      Ohio        2002    1
value11  0        1       0        2.4      Nevada      2001    2
value5   0        0       0        2.9      Nevada      2002    0
value9   0        0       1        11.1     New York    2003    3
value13  0        0       0        23.4     New York    2004    0
value10  1        1       0        0.1      California  2009    1
value7   0        0       0        0.3      California  2010    0
value14  0        1       1        1.1      California  2009    2

The column "flag_common" should be set by looking at the binary flags and inserting a value from 1 to 3 according to which flag is 1, or 0 if no flag is set. When two or more flags are set to 1 in the same row, the lowest flag number is inserted into "flag_common". This has to be dynamic, able to handle flag_1 through "flag_n".

I have sort of solved it using a row-iteration method and a for-loop, but my data is very big and it becomes quite slow, so I hope there is a vectorized, "pythonic" way to write this.

Code for the data frame is below:

import numpy as np
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada', 'New York', 'New York', 'California', 'California', 'California'],
                   'year': [2000, 2001, 2002, 2001, 2002, 2003, 2004, 2009, 2010, 2009],
                   'pop': [1.5, 1.7, 3.6, 2.4, 2.9, 11.1, 23.4, 0.1, 0.3, 1.1],
                   'cat': ['value1', 'value3', 'value2', 'value11', 'value5', 'value9', 'value13', 'value10', 'value7', 'value14'],
                   'flag_1': [1, 1, 1, 0, 0, 0, 0, 1, 0, 0],
                   'flag_2': [0, 1, 1, 1, 0, 0, 0, 1, 0, 1],
                   'flag_3': [0, 0, 0, 0, 0, 1, 0, 0, 0, 1]
                   })

Thanks in advance for any thoughts and suggestions!

1 Answer

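You can apply idxmax to the subset of columns flag_1, flag_2 and flag_3 to get, for each row, the name of the first column holding the row maximum, and then convert those names to positions with a list comprehension and get_loc. Note that get_loc returns the column's position in the whole DataFrame, so it matches the flag number only because flag_1 sits at position 1 in the frame printed here.

For rows where all flags are 0, idxmax still returns flag_1 rather than 0, so use numpy.where to correct those rows.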
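A minimal sketch of why the all-zero rows need that correction, using a hypothetical two-column frame (`toy` is not part of the question's data): idxmax reports the first column holding the row maximum, so in a row of zeros, where every column ties at 0, it reports flag_1.

```python
import pandas as pd

# Hypothetical toy frame: row 0 has flag_2 set, row 1 has no flag set.
toy = pd.DataFrame({'flag_1': [0, 0], 'flag_2': [1, 0]})

# idxmax(axis=1) returns the label of the first column holding the row maximum.
first_max = toy.idxmax(axis=1).tolist()
print(first_max)  # ['flag_2', 'flag_1']
```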

#get name of first column holding the max value among 'flag_1','flag_2','flag_3'
print(df[['flag_1','flag_2','flag_3']].idxmax(axis=1))
0    flag_1
1    flag_1
2    flag_1
3    flag_2
4    flag_1
5    flag_3
6    flag_1
7    flag_1
8    flag_1
9    flag_2
dtype: object

#get position of column 'flag_1' in the DataFrame
print(df.columns.get_loc('flag_1'))
1

#get positions of all flags
flag = [df.columns.get_loc(k) for k in df[['flag_1','flag_2','flag_3']].idxmax(axis=1)]
print(flag)
[1, 1, 1, 2, 1, 3, 1, 1, 1, 2]

#alternative solution for positions of flags - parse the number after the underscore (also works for flag_10 and above)
print([int(x.split('_')[-1]) for x in df[['flag_1','flag_2','flag_3']].idxmax(axis=1)])
[1, 1, 1, 2, 1, 3, 1, 1, 1, 2]

#if all values in 'flag_1','flag_2','flag_3' are 0, get 0 else flag
df['new'] = np.where((df[['flag_1','flag_2','flag_3']].sum(axis=1)) == 0, 0, flag)
print(df)
       cat  flag_1  flag_2  flag_3   pop       state  year  flag_common  new
0   value1       1       0       0   1.5        Ohio  2000            1    1
1   value3       1       1       0   1.7        Ohio  2001            1    1
2   value2       1       1       0   3.6        Ohio  2002            1    1
3  value11       0       1       0   2.4      Nevada  2001            2    2
4   value5       0       0       0   2.9      Nevada  2002            0    0
5   value9       0       0       1  11.1    New York  2003            3    3
6  value13       0       0       0  23.4    New York  2004            0    0
7  value10       1       1       0   0.1  California  2009            1    1
8   value7       0       0       0   0.3  California  2010            0    0
9  value14       0       1       1   1.1  California  2009            2    2

EDIT:

You can also select the flag columns dynamically by checking for the text flag in the column names:

#get columns where the text before the first _ is 'flag'
cols = [x for x in df.columns if x.split('_')[0] == 'flag']
print(cols)
['flag_1', 'flag_2', 'flag_3']
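As a possible variation on this selection step, the sketch below uses `DataFrame.filter` with a regular expression and then orders the matches by their numeric suffix. The frame `toy` and the names `flag_cols`, `flag_10` and `flagged` are hypothetical, chosen only to show that look-alike columns are skipped and that flag_2 sorts before flag_10.

```python
import pandas as pd

# Hypothetical frame with an out-of-order flag_10 and a look-alike column.
toy = pd.DataFrame({'cat': ['value1', 'value3'],
                    'flag_1': [1, 0],
                    'flag_10': [0, 0],
                    'flag_2': [0, 1],
                    'flagged': [5, 6]})

# Keep only columns named flag_<digits>; 'cat' and 'flagged' do not match.
flag_cols = toy.filter(regex=r'^flag_\d+$').columns.tolist()

# Order by the numeric suffix so that flag_2 comes before flag_10.
flag_cols = sorted(flag_cols, key=lambda c: int(c.split('_')[1]))
print(flag_cols)  # ['flag_1', 'flag_2', 'flag_10']
```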

#get name of first column holding the max value among the flag columns
print(df[cols].idxmax(axis=1))
0    flag_1
1    flag_1
2    flag_1
3    flag_2
4    flag_1
5    flag_3
6    flag_1
7    flag_1
8    flag_1
9    flag_2
dtype: object

#get position of column 'flag_1' in the DataFrame
print(df.columns.get_loc('flag_1'))
1

#get positions of all flags
flag = [df.columns.get_loc(k) for k in df[cols].idxmax(axis=1)]
print(flag)
[1, 1, 1, 2, 1, 3, 1, 1, 1, 2]

#alternative solution for positions of flags - parse the number after the underscore (also works for flag_10 and above)
print([int(x.split('_')[-1]) for x in df[cols].idxmax(axis=1)])
[1, 1, 1, 2, 1, 3, 1, 1, 1, 2]

#if all values in the flag columns are 0, get 0 else flag
df['new'] = np.where((df[cols].sum(axis=1)) == 0, 0, flag)
print(df)
       cat  flag_1  flag_2  flag_3   pop       state  year  new
0   value1       1       0       0   1.5        Ohio  2000    1
1   value3       1       1       0   1.7        Ohio  2001    1
2   value2       1       1       0   3.6        Ohio  2002    1
3  value11       0       1       0   2.4      Nevada  2001    2
4   value5       0       0       0   2.9      Nevada  2002    0
5   value9       0       0       1  11.1    New York  2003    3
6  value13       0       0       0  23.4    New York  2004    0
7  value10       1       1       0   0.1  California  2009    1
8   value7       0       0       0   0.3  California  2010    0
9  value14       0       1       1   1.1  California  2009    2
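A variation that does not depend on where the flag columns sit in the frame might look like the sketch below. It is not the method above but a NumPy `argmax` alternative, assuming every flag column is named flag_<number> and holds only 0 or 1. The names `numbers`, `values` and `first_set` are illustrative, and 'state' is placed first on purpose so that column position and flag number differ.

```python
import re

import numpy as np
import pandas as pd

# Flag data from the question; 'state' comes first, so flag_1 is not at position 1.
df = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada',
                             'New York', 'New York', 'California',
                             'California', 'California'],
                   'flag_1': [1, 1, 1, 0, 0, 0, 0, 1, 0, 0],
                   'flag_2': [0, 1, 1, 1, 0, 0, 0, 1, 0, 1],
                   'flag_3': [0, 0, 0, 0, 0, 1, 0, 0, 0, 1]})

# Flag columns ordered by their numeric suffix (assumed naming: flag_<number>).
cols = sorted((c for c in df.columns if re.fullmatch(r'flag_\d+', c)),
              key=lambda c: int(c.split('_')[1]))
numbers = np.array([int(c.split('_')[1]) for c in cols])

values = df[cols].to_numpy()
# argmax gives the index of the first 1 in each row (0 when the row has no 1).
first_set = values.argmax(axis=1)
# Rows without any flag set get 0; the others get the lowest set flag number.
df['flag_common'] = np.where(values.any(axis=1), numbers[first_set], 0)
print(df['flag_common'].tolist())  # [1, 1, 1, 2, 0, 3, 0, 1, 0, 2]
```

Because the result is read from the numeric suffix rather than from `get_loc`, reordering or adding non-flag columns should not change it.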
2 Comments

Thank you, exactly what I wanted, and performance-wise it's really good!
Glad I could help! Good luck! I added one improvement to the solution; maybe it helps.
