1

I have a Spark DataFrame with around 1 million rows. I am using PySpark and need to apply the Box-Cox transformation from the SciPy library to each column of the DataFrame. However, the boxcox function accepts only a 1-D NumPy array as input. How can I do this efficiently?

Is a NumPy array distributed across the Spark cluster, or are all of its elements collected onto the single node where the driver program runs?

Suppose df is my DataFrame with a column named C1. I want to perform an operation similar to this:

stats.boxcox(df.select("C1"))
1
  • There is pretty much no case where you can benefit from having a Spark DataFrame and also be able to process individual columns using NumPy. Either your data is small enough (cleaned, aggregated) that you can process it locally, for example by converting to Pandas, or you need a method that works on distributed data, which typically cannot be done with NumPy alone. Commented Jul 11, 2016 at 23:04

2 Answers

0

Spark DataFrames and RDDs abstract away how the processing is distributed.

To do what you require, I think a UDF can be very useful. Here is an example of its use:

Functions from Python packages for udf() of Spark dataframe
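
As a rough sketch of how a UDF could fit here: once lambda is known for a column, the Box-Cox transform is element-wise, so it can be written as a plain Python function of a single value. The function name boxcox_fixed and the value of lam below are hypothetical, and the sketch assumes lambda was estimated beforehand (for example with stats.boxcox on a sample collected to the driver). It deliberately uses only the standard library, so it shows the per-row logic rather than the Spark wiring.

```python
import math


def boxcox_fixed(x, lam):
    """Box-Cox transform of one strictly positive value for a known lambda."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive input")
    if lam == 0:
        # The lambda = 0 case is defined as the natural logarithm
        return math.log(x)
    return (x ** lam - 1.0) / lam


# Hypothetical lambda; in practice it would be estimated once per column.
lam = 0.5
values = [1.0, 4.0, 9.0]
transformed = [boxcox_fixed(v, lam) for v in values]
print(transformed)

# Assumed Spark wiring (not run here), using pyspark.sql.functions.udf:
# boxcox_udf = udf(lambda x: boxcox_fixed(x, lam), DoubleType())
# df = df.withColumn("C1_BCT", boxcox_udf(df["C1"]))
```

This only covers applying the transform. Estimating lambda still needs the column values (or a sample of them) in one place, which is the part a row-wise UDF cannot do.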

1 Comment

Thanks for the reply. I have to apply the following function from the SciPy library, which accepts only an ndarray as input, not a single element: stats.boxcox(x), where x is a 1-D NumPy array.
0

I have a workaround that solves the issue, but I am not sure it is the optimal solution in terms of performance, since you are switching between PySpark and pandas DataFrames (toPandas() collects all rows onto the driver):

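import pandas as pd
from scipy import stats

# Collect the Spark DataFrame to the driver as a pandas DataFrame
dfpd = df.toPandas()
colName = 'YOUR_COLUMN_NAME'
colBCT_Name = colName + '_BCT'
print(colBCT_Name)
maxVal = dfpd[colName].max()
minVal = dfpd[colName].min()
print(maxVal)
print(minVal)

# Shift the column so every value is >= 1 (boxcox requires positive input);
# l is the fitted lambda
col_bct, l = stats.boxcox(dfpd[colName] - minVal + 1)
# Rescale the transformed values
col_bct = col_bct * l / ((maxVal + 1) ** l - 1)
dfpd[colBCT_Name] = pd.Series(col_bct)
# sqlContext is assumed to be an existing SQLContext
df = sqlContext.createDataFrame(dfpd)
df.show(2)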
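
To make the arithmetic in the workaround easier to follow, here is what the shift, transform and rescale steps compute, written out for a single value x of the column. The symbols x_min, x_max and λ correspond to minVal, maxVal and l in the code; this is just my reading of the snippet, not a standard normalization.

```latex
x' = x - x_{\min} + 1, \qquad
y = \frac{x'^{\lambda} - 1}{\lambda} \quad (\lambda \neq 0), \qquad
y_{\text{scaled}} = \frac{y\,\lambda}{(x_{\max} + 1)^{\lambda} - 1}
                  = \frac{x'^{\lambda} - 1}{(x_{\max} + 1)^{\lambda} - 1}
```

Note that the largest shifted value is x_max - x_min + 1, so the scaled result appears to reach exactly 1 only when x_min = 0, and the rescale step is undefined if the fitted λ happens to be 0.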
