It makes no difference, as long as you use CPython (other implementations can, but realistically shouldn't, exhibit significantly different behavior in this specific case). If you take a look at the reduce implementation you'll see it is just a for-loop with minimal exception handling.
The core is exactly equivalent to the loop you use
for element in it:
    value = function(value, element)
and there is no evidence supporting claims of any special behavior.
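The equivalence is easy to verify with plain Python, using a hypothetical associative operation (`combine`, an assumption here) as a stand-in for `DataFrame.join`:

```python
from functools import reduce

# Hypothetical stand-in for an associative combine step such as join
def combine(x, y):
    return x + y

items = [1, 2, 3, 4, 5]

# Explicit for-loop, as in the snippet above
value = items[0]
for element in items[1:]:
    value = combine(value, element)

# reduce performs exactly the same left-to-right sequence of calls
assert value == reduce(combine, items)  # both evaluate to 15
```

Either way, `combine` is invoked with the same arguments in the same order, so in Spark the same sequence of joins is registered against the same lineage.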
Additionally, simple tests with a number of frames approaching the practical limitations of Spark joins (joins are among the most expensive operations in Spark)
dfs = [
    spark.range(10000).selectExpr(
        "rand({}) AS id".format(i), "id AS value", "{} AS loop".format(i)
    )
    for i in range(200)
]
show no significant difference in timing between a direct for-loop
def f(dfs):
    df1 = dfs[0]
    for df2 in dfs[1:]:
        df1 = df1.join(df2, ["id"])
    return df1

%timeit -n3 f(dfs)
## 6.25 s ± 257 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)
and reduce invocation
from functools import reduce

def g(dfs):
    return reduce(lambda x, y: x.join(y, ["id"]), dfs)

%timeit -n3 g(dfs)
## 6.47 s ± 455 ms per loop (mean ± std. dev. of 7 runs, 3 loops each)
Similarly, overall JVM behavior patterns are comparable between the for-loop
[For loop CPU and Memory Usage - VisualVM]
and reduce
[reduce CPU and Memory Usage - VisualVM]
Finally, both generate identical execution plans:
g(dfs)._jdf.queryExecution().optimizedPlan().equals(
    f(dfs)._jdf.queryExecution().optimizedPlan()
)
## True
which indicates there is no difference when the plans are evaluated, and OOMs are equally likely to occur in either case.
In other words, correlation doesn't imply causation here, and the performance problems you observe are unlikely to be related to the method you use to combine the DataFrames.