
I have a dataframe like this:

from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType, StringType

rdd1 = sc.parallelize([(100,2,1234.5678),(101,3,1234.5678)])
df = spark.createDataFrame(rdd1, ['id','dec','val'])

+---+---+---------+
| id|dec|      val|
+---+---+---------+
|100|  2|1234.5678|
|101|  3|1234.5678|
+---+---+---------+

Based on the value in the dec column, I want to cast the val column. For example, if dec = 2, then val should be cast to DecimalType(7,2).

I tried the following, but it is not working:

df.select(col('id'), col('dec'), col('val'),
          col('val').cast(DecimalType(7, col('dec'))).cast(StringType()).alias('modVal')).show()

Error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/column.py", line 419, in cast
    jdt = spark._jsparkSession.parseDataType(dataType.json())
  File "/usr/lib/spark/python/pyspark/sql/types.py", line 69, in json
    return json.dumps(self.jsonValue(),
  File "/usr/lib/spark/python/pyspark/sql/types.py", line 225, in jsonValue
    return "decimal(%d,%d)" % (self.precision, self.scale)
TypeError: %d format: a number is required, not Column

The same works if I hard-code the scale to a specific number (as the traceback shows, DecimalType expects plain Python integers, not Columns), which is straightforward:

df.select(col('id'), col('dec'), col('val'),
          col('val').cast(DecimalType(7,3)).cast(StringType()).alias('modVal')).show()

+---+---+---------+--------+
| id|dec|      val|  modVal|
+---+---+---------+--------+
|100|  2|1234.5678|1234.568|
|101|  3|1234.5678|1234.568|
+---+---+---------+--------+

Please help me with this.

2 Answers


Columns in Spark (or any relational system, for that matter) have to be homogeneous; an operation like this, where each row is cast to a different type, is not only unsupported but also doesn't make much sense.


3 Comments

Maybe I am missing something here, but please help me understand why casting each row of a column in a Spark data frame is unsupported/invalid. I get that casting a column based on another column's values may be an unpopular use case, but I'm not sure why you would say the entire idea makes no sense.
@vishnuram: The data type of all rows in the same column must be the same. However, if you are just after the formatting, you could use a string in this case, which would keep the data type uniform while allowing a different number of decimals per row (see the sketch after these comments).
@shaido & user10281832 Thanks. Now I understand the concern over the data type being heterogeneous, and I have updated my commands to cast to StringType() to add more clarity to my request.
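
Following up on the string suggestion above, here is a minimal sketch (not from either answer) that keeps the column homogeneous without a UDF, assuming dec only takes a small, known set of values (here 2 and 3):

from pyspark.sql.functions import col, when

# Branch on dec, cast to the matching fixed scale, then to string,
# so every row of modVal ends up with the same (string) data type.
df.withColumn(
    'modVal',
    when(col('dec') == 2, col('val').cast('decimal(7,2)').cast('string'))
    .when(col('dec') == 3, col('val').cast('decimal(7,3)').cast('string'))
).show()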

As mentioned by user10281832, you can't have different data types in the same column.

Since the formatting is the focus, you can convert the column to string type and then do the formatting. Because the number of decimals differs per row, you can't use the inbuilt fixed-scale casts here but need to define a custom UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Format num to prec decimal places, e.g. format_val(1234.5678, 2) -> '1234.57'
def format_val(num, prec):
    return "%0.*f" % (prec, num)

format_val_udf = udf(format_val, StringType())

df.withColumn('modVal', format_val_udf('val', 'dec'))
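
For reference, calling .show() on the result should produce the following for the sample dataframe above, with each row formatted to its own number of decimals:

df.withColumn('modVal', format_val_udf('val', 'dec')).show()

+---+---+---------+--------+
| id|dec|      val|  modVal|
+---+---+---------+--------+
|100|  2|1234.5678| 1234.57|
|101|  3|1234.5678|1234.568|
+---+---+---------+--------+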

