
I have thousands of columns in my Spark DataFrame. I have the function below to convert column types one by one, but I want to be able to convert all columns to type double at once. The code below works for one column at a time.

def convertcolumn(df, name, new_type):
    # rename the original column to a temporary name, re-create it under its
    # original name with the new type, then drop the temporary column
    df_1 = df.withColumnRenamed(name, "swap")
    return df_1.withColumn(name, df_1["swap"].cast(new_type)).drop("swap")
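
For example, calling it on a single (hypothetical) column:

df = convertcolumn(df, "price", "double")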

1 Answer


You can, for example, fold over the columns:

from functools import reduce

mapping = [("x", "double"), ("y", "integer")]
df = sc.parallelize([("1.0", "1", "foo")]).toDF(["x", "y", "z"])

# apply convertcolumn once per (name, type) pair, threading the DataFrame through
reduce(lambda df, kv: convertcolumn(*(df, ) + kv), mapping, df)
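
Since the goal is to cast every column to double, you can build the mapping from df.columns instead (a sketch; columns that cannot be parsed as numbers come back as null):

# cast every column to double with the same fold
mapping_all = [(c, "double") for c in df.columns]
df_double = reduce(lambda df, kv: convertcolumn(*(df, ) + kv), mapping_all, df)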

or simply build a list of expressions and select:

from pyspark.sql.functions import col

mapping_dict = dict(mapping)

# cast the mapped columns, keep the remaining columns unchanged
exprs = [col(c).cast(mapping_dict[c]) if c in mapping_dict else c for c in df.columns]
df.select(*exprs)
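
For the all-columns-to-double case, the same idea fits in a single select (a sketch; non-numeric strings become null):

df_double = df.select(*[col(c).cast("double").alias(c) for c in df.columns])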

2 Comments

@zero323 Thanks for the reply. I am already using Spark 1.6.3 (PySpark) and still facing the problem. My code ran for more than 3 hours and still did not finish. Please give some suggestions. Thanks
Also, ml.pipeline in PySpark 1.6 does not have a setCheckpointInterval parameter.
