
I have a dataframe that is similar to the example below:

import pandas as pd

sample = {'col2': [50.6, 30.67, 40.5, 0, 0, 0],
          'col3': [40.74, 30.33, 41.00, 0, 0, 0]}
df_sample = pd.DataFrame(sample)

However, each entry in col2 and col3 encodes two different values. For example, the entry 50.6 represents val1 = 5 and val2 = 0.6. Likewise, 41.00 represents val1 = 4 and val2 = 1.0.
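To make the encoding concrete, here is a minimal sketch of the split, assuming val1 is always the tens digit and val2 is whatever remains below 10. The names val1 and val2 come from the description above; using them as DataFrame variables is only for illustration.

```python
import pandas as pd

df_sample = pd.DataFrame({'col2': [50.6, 30.67, 40.5, 0, 0, 0],
                          'col3': [40.74, 30.33, 41.00, 0, 0, 0]})

# Floor division by 10 keeps the tens digit (val1)
val1 = df_sample // 10
# Modulo 10 keeps the remainder below 10 (val2)
val2 = df_sample % 10

print(val1)
# Round only for display: float remainders carry tiny representation errors
print(val2.round(2))
```

Both operations apply element-wise to the whole DataFrame, so no column names are needed for the split itself.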

Basically, what I want to get is a column which can be computed as follows:

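df_sample['res'] = df_sample.apply(lambda x: 
    ((x['col2']//10)*(x['col2']%10) + (x['col3']//10)*(x['col3']%10)) 
            / (x['col2']//10 + x['col3']//10), axis=1)
# Rows where every weight is 0 give 0/0 = NaN; fillna returns a copy, so assign it back
df_sample = df_sample.fillna(0)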

This computes the weighted average of the val2 entries across the columns, using the val1 entries as weights. What I want now is to scale this method to, say, twenty columns without hardcoding each column name in the DataFrame. Please advise.

2 Answers

1

Create a subset of the columns you want to use in the computation, then perform the operation on the subset DataFrame itself instead of calling a function on every row Series:

import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame(np.random.uniform(0, 100, (100, 25))).add_prefix('col')
df.shape
(100, 25)

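# Take the first 20 columns (for example)
df_sample = df.iloc[:, :20]
# Parenthesize both factors: //, * and % share the same precedence and evaluate left to right
df['res'] = ((df_sample // 10) * (df_sample % 10)).sum(1) / (df_sample // 10).sum(1)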
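One detail worth checking in expressions of this shape is operator precedence. In Python, `//`, `*` and `%` have the same precedence and group left to right, so `a // 10 * a % 10` is evaluated as `((a // 10) * a) % 10`. The two forms agree only while val1 * val2 stays below 10. The small sketch below (the values in `s` are made up for illustration) shows where they diverge:

```python
import pandas as pd

# 95.5 encodes val1 = 9, val2 = 5.5; 50.6 encodes val1 = 5, val2 = 0.6
s = pd.Series([95.5, 50.6])

# Without parentheses this is ((s // 10) * s) % 10
unparenthesized = s // 10 * s % 10

# Intended: tens digit times the remainder below 10
intended = (s // 10) * (s % 10)

print(unparenthesized.round(2).tolist())
print(intended.round(2).tolist())
```

For 50.6 both forms give the same product, but for 95.5 the unparenthesized form wraps the product around modulo 10, which is why explicit parentheses are the safer choice on data such as the random sample above.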

0

You can omit apply and instead work directly with Series (the columns of a DataFrame):

import pandas as pd

sample = {'col2': [50.6, 30.67, 40.5, 0, 0, 0],
          'col3': [40.74, 30.33, 41.00, 0, 0, 0],
          'col4': [70.6, 80.67, 70.5, 0, 0, 0],
          'col5': [10.74, 50.33, 51.00, 0, 0, 0]}
df_sample = pd.DataFrame(sample)
print(df_sample)
    col2   col3   col4   col5
0  50.60  40.74  70.60  10.74
1  30.67  30.33  80.67  50.33
2  40.50  41.00  70.50  51.00
3   0.00   0.00   0.00   0.00
4   0.00   0.00   0.00   0.00
5   0.00   0.00   0.00   0.00

I think you need:

# Parenthesize each factor so the product is (tens digit) * (remainder)
print((((df_sample['col2']//10) * (df_sample['col2']%10) +
        (df_sample['col3']//10) * (df_sample['col3']%10) +
        (df_sample['col4']//10) * (df_sample['col4']%10) +
        (df_sample['col5']//10) * (df_sample['col5']%10))
       / (df_sample['col2']//10 + df_sample['col3']//10 +
          df_sample['col4']//10 + df_sample['col5']//10)).fillna(0))

0    0.641176
1    0.526842
2    0.725000
3    0.000000
4    0.000000
5    0.000000
dtype: float64

# Same computation on the whole DataFrame, with no column names hardcoded
print(((df_sample//10) * (df_sample%10)).sum(axis=1).div((df_sample//10).sum(axis=1))
      .fillna(0))
0    0.641176
1    0.526842
2    0.725000
3    0.000000
4    0.000000
5    0.000000
dtype: float64
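If the DataFrame also holds columns that should not take part in the calculation, the whole-DataFrame form can be wrapped in a small function that takes the column list as an argument. This is only a sketch: `weighted_average` is a hypothetical helper name, and the `id` column and the `startswith('col')` selection rule are assumptions made for the example.

```python
import pandas as pd

def weighted_average(df, columns):
    """Hypothetical helper: average of val2 (value % 10) weighted by val1 (value // 10)."""
    sub = df[columns]
    weights = sub // 10          # val1, the tens digit
    values = sub % 10            # val2, the remainder below 10
    res = (weights * values).sum(axis=1) / weights.sum(axis=1)
    # Rows whose weights are all 0 produce 0/0 = NaN; report them as 0
    return res.fillna(0)

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6],   # extra column, excluded below
                   'col2': [50.6, 30.67, 40.5, 0, 0, 0],
                   'col3': [40.74, 30.33, 41.00, 0, 0, 0],
                   'col4': [70.6, 80.67, 70.5, 0, 0, 0],
                   'col5': [10.74, 50.33, 51.00, 0, 0, 0]})

# Pick the columns by a naming rule instead of listing them by hand
cols = [c for c in df.columns if c.startswith('col')]
df['res'] = weighted_average(df, cols)
print(df['res'])
```

The column list is built before `res` is added, so the result column never feeds back into the calculation, and the same call works unchanged for twenty columns.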

Timings:

In [114]: %timeit ((((df_sample['col2']//10 * df_sample['col2']%10) + (df_sample['col3']//10 * df_sample['col3']%10) + (df_sample['col4']//10 * df_sample['col4']%10) + (df_sample['col5']//10 * df_sample['col5']%10))  / (df_sample['col2']//10 + df_sample['col3']//10 + df_sample['col4']//10 + df_sample['col5']//10)).fillna(0))
100 loops, best of 3: 2.03 ms per loop

In [115]: %timeit (((df_sample//10 * df_sample%10).sum(axis=1).div((df_sample//10).sum(axis=1))).fillna(0))
1000 loops, best of 3: 897 µs per loop
