
I have thousands of columns in my Spark DataFrame. I have the function below to convert column types one by one, but I want to be able to convert all columns to type double at once. The code below works for one column at a time.

def convertcolumn(df, name, new_type):
    # rename the original column to a temporary name, re-create it under its
    # original name with the new type, then drop the temporary column
    df_1 = df.withColumnRenamed(name, "swap")
    return df_1.withColumn(name, df_1["swap"].cast(new_type)).drop("swap")
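
For example, calling it on a single (hypothetical) column:

df = convertcolumn(df, "price", "double")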

1 Answer


You can, for example, fold over the columns:

from functools import reduce

mapping = [("x", "double"), ("y", "integer")]
df = sc.parallelize([("1.0", "1", "foo")]).toDF(["x", "y", "z"])

# apply convertcolumn once per (name, type) pair, threading the DataFrame through
reduce(lambda df, kv: convertcolumn(*(df, ) + kv), mapping, df)
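
Since the goal is to cast every column to double, you can build the mapping from df.columns instead (a sketch; columns that cannot be parsed as numbers come back as null):

# cast every column to double with the same fold
mapping_all = [(c, "double") for c in df.columns]
df_double = reduce(lambda df, kv: convertcolumn(*(df, ) + kv), mapping_all, df)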

or simply build a list of expressions and select:

from pyspark.sql.functions import col

mapping_dict = dict(mapping)

# cast the mapped columns, keep the remaining columns unchanged
exprs = [col(c).cast(mapping_dict[c]) if c in mapping_dict else c for c in df.columns]
df.select(*exprs)
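
For the all-columns-to-double case, the same idea fits in a single select (a sketch; non-numeric strings become null):

df_double = df.select(*[col(c).cast("double").alias(c) for c in df.columns])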

2 Comments

@zero323 Thanks for the reply. I am already using Spark 1.6.3 (PySpark) and still facing the problem. My code ran for more than 3 hours and still did not finish. Please give some suggestions. Thanks
Also, ml.pipeline in PySpark 1.6 does not have a setCheckpointInterval parameter.
