Efficient way to apply function with multiple operations on dataframe row

Question

I have a pandas dataframe that looks like this:

            X[m]      Y[m]      Z[m]  ...      beta  newx  newy
0       1.439485  0.087100  0.029771  ...  0.063807  1439    87
1       1.439485  0.089729  0.029121  ...  0.065871  1439    89
2       1.439485  0.091992  0.030059  ...  0.067653  1439    91
3       1.439485  0.082073  0.030721  ...  0.059883  1439    82
4       1.439485  0.084095  0.028952  ...  0.061458  1439    84
5       1.439485  0.085937  0.028019  ...  0.062897  1439    85

There are hundreds of thousands of such rows, and I have multiple dataframes like this. X and Y are coordinates on plane (Z is not important) that is moved 45 degrees by the middle to the right. I need to put all points back in their original place, -45 degrees from their current location. The columns newx and newy currently hold the coordinates before the change; I want to overwrite these two columns with the new coordinates. Since I know the coordinates of the middle point, the point itself, the middle-to-point angle (alpha) and the middle-to-fixed-point angle (beta), I can use the approach presented on Mathematics SO. I have translated the code to Python like this:

for i in range(len(df)):
    if df.iloc[i].alpha == math.pi/2 or df.iloc[i].alpha == 3*math.pi/2:
        df.newx[i] = mid
        df.newy[i] = int(math.tan(df.iloc[i].beta*(df.iloc[i].x-mid)+mid))
    elif df.iloc[i].beta == math.pi/2 or df.iloc[i].beta == 3*math.pi/2:
        #df.newx[i] = df.iloc[i].x -- this is already set
        df.newy[i] = int(math.tan(df.iloc[i].alpha*(mid-df.iloc[i].x)+mid))
    else:
        m0 = math.tan(df.iloc[i].alpha)
        m1 = math.tan(df.iloc[i].beta)
        x = ((m0 * df.iloc[i].x - m1 * mid) - (df.iloc[i].y - mid)) / (m0 - m1)
        df.newx[i] = int(x)
        df.newy[i] = int(m0 * (x - df.iloc[i].x) + df.iloc[i].y)

Although this does what I need and moves each point to the correct position, it is far too slow, and I have too many files to process this way. I know that there are much faster methods, such as vectorization, apply and list comprehensions. However, I can't figure out how to use them with this function.

Here are the first 10 rows as a dictionary:

{'X[m]': {0: 1.439484727008419, 1: 1.439484727008419, 2: 1.439484727008419, 3: 1.439484727008419, 4: 1.439484727008419, 5: 1.439484727008419, 6: 1.439484727008419, 7: 1.439484727008419, 8: 1.439484727008419, 9: 1.439484727008419}, 'Y[m]': {0: 0.08709958190841899, 1: 0.08972904270841897, 2: 0.091991981408419, 3: 0.08207325440841898, 4: 0.08409548540841899, 5: 0.08593746080841899, 6: 0.09416210370841899, 7: 0.08874029660841898, 8: 0.09168940400841899, 9: 0.09434491760841898}, 'Z[m]': {0: 0.029770726299999998, 1: 0.0291213803, 2: 0.030058834700000002, 3: 0.0307212565, 4: 0.028951926200000002, 5: 0.0280194897, 6: 0.030717188500000003, 7: 0.026446931099999998, 8: 0.0269318204, 9: 0.0273838975}, 'Velocity[ms^-1]': {0: ['-1.67570162e+00', '-2.59946979e-15', '-2.54510192e-15'], 1: ['-1.63915336e+00', '-2.54277343e-15', '-2.48959140e-15'], 2: ['-1.69191790e+00', '-2.62462561e-15', '-2.56973173e-15'], 3: ['-1.72920227e+00', '-2.68246377e-15', '-2.62636012e-15'], 4: ['-1.62961555e+00', '-2.52797767e-15', '-2.47510523e-15'], 5: ['-1.57713342e+00', '-2.44656340e-15', '-2.39539372e-15'], 6: ['-1.72897375e+00', '-2.68210929e-15', '-2.62601305e-15'], 7: ['-1.48862195e+00', '-2.30925809e-15', '-2.26096006e-15'], 8: ['-1.51591396e+00', '-2.35159534e-15', '-2.30241195e-15'], 9: ['-1.54135919e+00', '-2.39106792e-15', '-2.34105888e-15']}, 'L': {0: 0.9582306809661671, 1: 0.9564957485824027, 2: 0.9550059224371557, 3: 0.9615583774318917, 4: 0.9602177760259737, 5: 0.9589987519260235, 6: 0.9535800607266656, 7: 0.9571476500665267, 8: 0.9552049510914844, 9: 0.953460072490227}, 'x': {0: 1439, 1: 1439, 2: 1439, 3: 1439, 4: 1439, 5: 1439, 6: 1439, 7: 1439, 8: 1439, 9: 1439}, 'y': {0: 87, 1: 89, 2: 91, 3: 82, 4: 84, 5: 85, 6: 94, 7: 88, 8: 91, 9: 94}, 'alpha': {0: -0.7215912027987663, 1: -0.719527331916007, 2: -0.7177451479100487, 3: -0.7255156166536015, 4: -0.7239399868865558, 5: -0.7225009735356016, 6: -0.7160308360594005, 7: -0.7203042790640757, 8: -0.7179837655204843, 9: -0.7158861861473951}, 
'beta': {0: 0.06380696059868196, 1: 0.06587083148144124, 2: 0.06765301548739955, 3: 0.05988254674384674, 4: 0.06145817651089247, 5: 0.06289718986184667, 6: 0.06936732733804774, 7: 0.0650938843333726, 8: 0.06741439787696402, 9: 0.0695119772500532}, 'newx': {0: 1439, 1: 1439, 2: 1439, 3: 1439, 4: 1439, 5: 1439, 6: 1439, 7: 1439, 8: 1439, 9: 1439}, 'newy': {0: 87, 1: 89, 2: 91, 3: 82, 4: 84, 5: 85, 6: 94, 7: 88, 8: 91, 9: 94}}

@JoshuaVoskamp I will deal with that problem later. I am aware that it may not end up as I wish, but for now I have to make it run in a reasonable time.
– Ruli, Commented Nov 12, 2021 at 21:25
Can you provide a small proof-of-concept input df and expected output to test against, perhaps df.head(10).to_dict()?
– Joshua Voskamp, Commented Nov 12, 2021 at 21:26
Can you explain the problem more thoroughly? I do not understand "X and Y are coordinates on plane that is moved 45 degrees by the middle to the right." What is "middle"? What does moving a plane mean? (I would understand "rotate", "translate" or "scale".) Can you state the question using a transformation matrix?
– hpchavaz, Commented Nov 12, 2021 at 21:38

Joshua Voskamp · Accepted Answer · 2021-11-13 08:27:58Z

I suspect the way mid is used in your code may be causing you problems. Is mid a single number? Are the x- and y-coordinates of your middle point the same value?

@OP, please confirm your variable names as compared to your linked source are as I have translated them:

linked name	your name
`a0`	`beta`
`a1`	`alpha`
`(x0, y0)`	`(df.x, df.y)`
`(x1, y1)`	`(mid, mid)`

Update: this answer shares some ideas with @mitoRibo's answer, but I re-translated from OP's linked source and suspect OP made some transcription errors, which are noted in the code comments. Both of us used a strategy of "selectively calculate newx/newy using masking, where the masks are equivalent to the if/elif/else conditions provided".

#setup
import pandas as pd
import numpy as np
import math

df = pd.DataFrame({'X[m]': {0: 1.439484727008419, 1: 1.439484727008419, 2: 1.439484727008419, 3: 1.439484727008419, 4: 1.439484727008419, 5: 1.439484727008419, 6: 1.439484727008419, 7: 1.439484727008419, 8: 1.439484727008419, 9: 1.439484727008419}, 'Y[m]': {0: 0.08709958190841899, 1: 0.08972904270841897, 2: 0.091991981408419, 3: 0.08207325440841898, 4: 0.08409548540841899, 5: 0.08593746080841899, 6: 0.09416210370841899, 7: 0.08874029660841898, 8: 0.09168940400841899, 9: 0.09434491760841898}, 'Z[m]': {0: 0.029770726299999998, 1: 0.0291213803, 2: 0.030058834700000002, 3: 0.0307212565, 4: 0.028951926200000002, 5: 0.0280194897, 6: 0.030717188500000003, 7: 0.026446931099999998, 8: 0.0269318204, 9: 0.0273838975}, 'Velocity[ms^-1]': {0: ['-1.67570162e+00', '-2.59946979e-15', '-2.54510192e-15'], 1: ['-1.63915336e+00', '-2.54277343e-15', '-2.48959140e-15'], 2: ['-1.69191790e+00', '-2.62462561e-15', '-2.56973173e-15'], 3: ['-1.72920227e+00', '-2.68246377e-15', '-2.62636012e-15'], 4: ['-1.62961555e+00', '-2.52797767e-15', '-2.47510523e-15'], 5: ['-1.57713342e+00', '-2.44656340e-15', '-2.39539372e-15'], 6: ['-1.72897375e+00', '-2.68210929e-15', '-2.62601305e-15'], 7: ['-1.48862195e+00', '-2.30925809e-15', '-2.26096006e-15'], 8: ['-1.51591396e+00', '-2.35159534e-15', '-2.30241195e-15'], 9: ['-1.54135919e+00', '-2.39106792e-15', '-2.34105888e-15']}, 'L': {0: 0.9582306809661671, 1: 0.9564957485824027, 2: 0.9550059224371557, 3: 0.9615583774318917, 4: 0.9602177760259737, 5: 0.9589987519260235, 6: 0.9535800607266656, 7: 0.9571476500665267, 8: 0.9552049510914844, 9: 0.953460072490227}, 'x': {0: 1439, 1: 1439, 2: 1439, 3: 1439, 4: 1439, 5: 1439, 6: 1439, 7: 1439, 8: 1439, 9: 1439}, 'y': {0: 87, 1: 89, 2: 91, 3: 82, 4: 84, 5: 85, 6: 94, 7: 88, 8: 91, 9: 94}, 'alpha': {0: -0.7215912027987663, 1: -0.719527331916007, 2: -0.7177451479100487, 3: -0.7255156166536015, 4: -0.7239399868865558, 5: -0.7225009735356016, 6: -0.7160308360594005, 7: -0.7203042790640757, 8: -0.7179837655204843, 9: 
-0.7158861861473951}, 'beta': {0: 0.06380696059868196, 1: 0.06587083148144124, 2: 0.06765301548739955, 3: 0.05988254674384674, 4: 0.06145817651089247, 5: 0.06289718986184667, 6: 0.06936732733804774, 7: 0.0650938843333726, 8: 0.06741439787696402, 9: 0.0695119772500532}})

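# NOTE: `mid` is never defined in the question; 0 is only a placeholder.
# Set it to the coordinate of the middle point before running.
mid = 0

# make the new columns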
df['newx'] = np.nan
df['newy'] = np.nan
# if any of the values are np.nan when we're done, something went wrong

# Do the float `between` comparison but cleverly
EPSILON = 1e-6
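# windows = ((pi/2 - EPSILON, pi/2 + EPSILON), (3pi/2 - EPSILON, 3pi/2 + EPSILON))
# `between(left, right)` needs left <= right, so the lower bound comes first
windows = tuple(tuple(d*math.pi/2 + s*EPSILON for s in (-1, 1)) for d in (1, 3))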
# challenge: make this more DRY (don't repeat yourself)
alpha_cond = sum([df.alpha.between(*w) for w in windows]).astype(bool)
beta_cond  = sum([ df.beta.between(*w) for w in windows]).astype(bool)\
                 & ~alpha_cond
neither = (~alpha_cond & ~beta_cond)

# Handle `alpha near pi/2 or 3pi/2`:
c1 = df.loc[alpha_cond]
df.loc[alpha_cond,'newx'] = mid
                                         # changed `tan` parenthesis
                                         # |    changed `df.x - mid` to `mid - df.x`
                                         # |    |             changed to `df.y` from `mid`
df.loc[alpha_cond,'newy'] = (np.tan(c1.beta) * (mid - c1.x) + c1.y).astype(int)

# Handle `beta near pi/2 or 3pi/2`:
c2 = df.loc[beta_cond]
df.loc[beta_cond,'newx'] = c2.x
                                         # changed `tan` parenthesis
                                         # |    changed `mid - df.x` to `df.x - mid`
df.loc[beta_cond,'newy'] = (np.tan(c2.alpha) * (c2.x - mid) + mid).astype(int)

# Handle the remainder:
c3 = df.loc[neither]
m0 = np.tan(c3.alpha)
m1 = np.tan(c3.beta)
t = ((m0 * c3.x - m1 * mid) - (c3.y - mid)) / (m0 - m1)

df.loc[neither,'newx'] = t.astype(int)
df.loc[neither,'newy'] = (m0 * (t - c3.x) + c3.y).astype(int)
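
One possible take on the "make this more DRY" challenge in the code above is to build the near-pi/2 and near-3pi/2 masks with `np.isclose`. This is only a sketch: the helper name `near_any`, the constant `VERTICAL` and the small demo frame are made up for illustration, and the tolerance mirrors `EPSILON` from the answer.

```python
import numpy as np
import pandas as pd

# Angles at which tan() blows up, as in the if/elif conditions of the question.
VERTICAL = (np.pi / 2, 3 * np.pi / 2)


def near_any(angles: pd.Series, targets, tol: float = 1e-6) -> pd.Series:
    """Return True where an angle lies within `tol` of any value in `targets`."""
    values = angles.to_numpy(dtype=float)[:, None]             # shape (n, 1)
    hits = np.isclose(values, np.asarray(targets, dtype=float),
                      rtol=0.0, atol=tol)                       # shape (n, k)
    return pd.Series(hits.any(axis=1), index=angles.index)


# Hypothetical demo data, not taken from the question.
demo = pd.DataFrame({
    "alpha": [np.pi / 2, -0.72, 3 * np.pi / 2 + 1e-9],
    "beta": [np.pi / 2, np.pi / 2 - 1e-8, 0.06],
})

alpha_cond = near_any(demo["alpha"], VERTICAL)
beta_cond = near_any(demo["beta"], VERTICAL) & ~alpha_cond     # elif: only if the first test failed
neither = ~(alpha_cond | beta_cond)                            # else branch

print(pd.DataFrame({"alpha_cond": alpha_cond, "beta_cond": beta_cond, "neither": neither}))
```

Setting `rtol=0.0` makes the check a plain absolute window, the same kind of test that `between` performs with explicit bounds.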

Looks great, and I bet it's really fast. One suggestion: df.beta.between(dn_min, dn_max) saves some typing.
@Ruli I checked the source against your linked SO answer and made some changes, noted in comments. Short version: I think you may have made some transcription errors. Would you confirm my translation table?
alpha and beta are reversed; a0 is alpha. I am going to work on it shortly and will see where I made a mistake :)
The code works now. I also had some logical errors in unrelated code, which caused incorrect angles to be computed. Once I fixed those, your code works (keeping in mind that you have swapped alpha and beta) and is significantly faster than mine, which was the aim of the question.
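
For readers checking which angle belongs to which point, the general (else) branch can be read as the intersection of two straight lines. The pairing below is inferred from the code in the question, not from the linked source: the line through the point $(x_0, y_0)$ = (x, y) has slope $m_0 = \tan\alpha$, and the line through the middle point $(c, c)$ = (mid, mid) has slope $m_1 = \tan\beta$.

```latex
\begin{aligned}
y &= m_0\,(x - x_0) + y_0 \\
y &= m_1\,(x - c) + c \\
m_0\,(x - x_0) + y_0 &= m_1\,(x - c) + c \\
x\,(m_0 - m_1) &= (m_0 x_0 - m_1 c) - (y_0 - c) \\
x &= \frac{(m_0 x_0 - m_1 c) - (y_0 - c)}{m_0 - m_1} \\
y &= m_0\,(x - x_0) + y_0
\end{aligned}
```

The last two lines match the expressions for x and newy in the else branch. The formula is undefined when $m_0 = m_1$ (parallel lines), and a line at pi/2 or 3pi/2 has no finite slope, which is why those angles get their own branches.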

mitoRibo · Accepted Answer · 2021-11-12 21:54:20Z

Same approach as @Joshua Voskamp, but I still wanted to share

import pandas as pd
import numpy as np
import math

df = pd.DataFrame({'X[m]': {0: 1.439484727008419, 1: 1.439484727008419, 2: 1.439484727008419, 3: 1.439484727008419, 4: 1.439484727008419, 5: 1.439484727008419, 6: 1.439484727008419, 7: 1.439484727008419, 8: 1.439484727008419, 9: 1.439484727008419}, 'Y[m]': {0: 0.08709958190841899, 1: 0.08972904270841897, 2: 0.091991981408419, 3: 0.08207325440841898, 4: 0.08409548540841899, 5: 0.08593746080841899, 6: 0.09416210370841899, 7: 0.08874029660841898, 8: 0.09168940400841899, 9: 0.09434491760841898}, 'Z[m]': {0: 0.029770726299999998, 1: 0.0291213803, 2: 0.030058834700000002, 3: 0.0307212565, 4: 0.028951926200000002, 5: 0.0280194897, 6: 0.030717188500000003, 7: 0.026446931099999998, 8: 0.0269318204, 9: 0.0273838975}, 'Velocity[ms^-1]': {0: ['-1.67570162e+00', '-2.59946979e-15', '-2.54510192e-15'], 1: ['-1.63915336e+00', '-2.54277343e-15', '-2.48959140e-15'], 2: ['-1.69191790e+00', '-2.62462561e-15', '-2.56973173e-15'], 3: ['-1.72920227e+00', '-2.68246377e-15', '-2.62636012e-15'], 4: ['-1.62961555e+00', '-2.52797767e-15', '-2.47510523e-15'], 5: ['-1.57713342e+00', '-2.44656340e-15', '-2.39539372e-15'], 6: ['-1.72897375e+00', '-2.68210929e-15', '-2.62601305e-15'], 7: ['-1.48862195e+00', '-2.30925809e-15', '-2.26096006e-15'], 8: ['-1.51591396e+00', '-2.35159534e-15', '-2.30241195e-15'], 9: ['-1.54135919e+00', '-2.39106792e-15', '-2.34105888e-15']}, 'L': {0: 0.9582306809661671, 1: 0.9564957485824027, 2: 0.9550059224371557, 3: 0.9615583774318917, 4: 0.9602177760259737, 5: 0.9589987519260235, 6: 0.9535800607266656, 7: 0.9571476500665267, 8: 0.9552049510914844, 9: 0.953460072490227}, 'x': {0: 1439, 1: 1439, 2: 1439, 3: 1439, 4: 1439, 5: 1439, 6: 1439, 7: 1439, 8: 1439, 9: 1439}, 'y': {0: 87, 1: 89, 2: 91, 3: 82, 4: 84, 5: 85, 6: 94, 7: 88, 8: 91, 9: 94}, 'alpha': {0: -0.7215912027987663, 1: -0.719527331916007, 2: -0.7177451479100487, 3: -0.7255156166536015, 4: -0.7239399868865558, 5: -0.7225009735356016, 6: -0.7160308360594005, 7: -0.7203042790640757, 8: -0.7179837655204843, 9: 
-0.7158861861473951}, 'beta': {0: 0.06380696059868196, 1: 0.06587083148144124, 2: 0.06765301548739955, 3: 0.05988254674384674, 4: 0.06145817651089247, 5: 0.06289718986184667, 6: 0.06936732733804774, 7: 0.0650938843333726, 8: 0.06741439787696402, 9: 0.0695119772500532}, 'newx': {0: 1439, 1: 1439, 2: 1439, 3: 1439, 4: 1439, 5: 1439, 6: 1439, 7: 1439, 8: 1439, 9: 1439}, 'newy': {0: 87, 1: 89, 2: 91, 3: 82, 4: 84, 5: 85, 6: 94, 7: 88, 8: 91, 9: 94}})

mid = 0 #not sure what mid value should be

near_threshold = 0.001

alpha_near_half_pi = df.alpha.sub(math.pi/2).abs().le(near_threshold)
alpha_near_three_half_pi = df.alpha.sub(3*math.pi/2).abs().le(near_threshold)
beta_near_half_pi = df.beta.sub(math.pi/2).abs().le(near_threshold)
beta_near_three_half_pi = df.beta.sub(3*math.pi/2).abs().le(near_threshold)

cond1 = alpha_near_half_pi | alpha_near_three_half_pi
cond2 = beta_near_half_pi | beta_near_three_half_pi
cond2 = cond2 & (~cond1) #if cond1 is true, we don't want to do cond2
cond3 = ~(cond1 | cond2) #if neither cond1 nor cond2, then we are in cond3

#Process cond1 rows
c1 = df.loc[cond1]
df.loc[cond1,'newx'] = mid
df.loc[cond1,'newy'] = np.tan(c1.beta*(c1.x-mid)+mid)

#Process cond2 rows
c2 = df.loc[cond2]
df.loc[cond2,'newy'] = np.tan(c2.alpha*(mid-c2.x)+mid)

#Process cond3 rows
c3 = df.loc[cond3]
m0 = np.tan(c3.alpha)
m1 = np.tan(c3.beta)

#                       Is this a mistake? always 0?
#                                   |
#                             --------------
x = ((m0 * c3.x - m1 * mid) - (c3.y - c3.y)) / (m0 - m1)
df.loc[cond3,'newx'] = x.astype(int)
df.loc[cond3,'newy'] = (m0 * (x - c3.x) + c3.y).astype(int)

df

Yes, it should be - mid. I am checking out both solutions. Thanks to both of you; I would like to accept both answers, but I can only accept one :)
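
The masking strategy used in both answers can also be written as a single pass with `np.select`, which picks a value per row from the first condition that holds. The following is a sketch under assumptions, not a drop-in replacement: `rotate_back` is a hypothetical name, the branch formulas are copied from Joshua Voskamp's answer as written (swap alpha and beta if your data needs it, as noted in the comments above), and the demo values and `mid=1000` are invented.

```python
import numpy as np
import pandas as pd


def rotate_back(df: pd.DataFrame, mid: float, tol: float = 1e-6) -> pd.DataFrame:
    """Return a copy of `df` with newx/newy computed in one vectorized pass."""
    x = df["x"].to_numpy(dtype=float)
    y = df["y"].to_numpy(dtype=float)
    alpha = df["alpha"].to_numpy(dtype=float)
    beta = df["beta"].to_numpy(dtype=float)

    def vertical(angle):
        # True where the angle is within `tol` of pi/2 or 3pi/2
        return (np.isclose(angle, np.pi / 2, rtol=0.0, atol=tol)
                | np.isclose(angle, 3 * np.pi / 2, rtol=0.0, atol=tol))

    alpha_cond = vertical(alpha)              # "if" branch
    beta_cond = vertical(beta) & ~alpha_cond  # "elif" branch

    m0, m1 = np.tan(alpha), np.tan(beta)
    # np.select needs every branch evaluated for every row, so silence the
    # warnings raised by rows that will end up using a different branch.
    with np.errstate(divide="ignore", invalid="ignore", over="ignore"):
        t = ((m0 * x - m1 * mid) - (y - mid)) / (m0 - m1)
        general_y = m0 * (t - x) + y
        alpha_y = m1 * (mid - x) + y
        beta_y = m0 * (x - mid) + mid

    newx = np.select([alpha_cond, beta_cond], [np.full_like(x, mid), x], default=t)
    newy = np.select([alpha_cond, beta_cond], [alpha_y, beta_y], default=general_y)

    out = df.copy()
    out["newx"] = newx.astype(int)  # truncates toward zero, like int() in the loop
    out["newy"] = newy.astype(int)
    return out


# Hypothetical demo: two rows from the question plus one row per special case.
demo = pd.DataFrame({
    "x": [1439, 1439, 1200, 900],
    "y": [87, 89, 300, 400],
    "alpha": [-0.7215912027987663, -0.719527331916007, np.pi / 2, 0.3],
    "beta": [0.06380696059868196, 0.06587083148144124, 0.2, 3 * np.pi / 2],
})
out = rotate_back(demo, mid=1000)
print(out[["x", "y", "newx", "newy"]])
```

Rows where the two slopes are equal without being flagged as vertical still produce inf or NaN in the general branch, so those would need to be filtered out before the integer cast.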
