
I have a dataframe like this:

from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType, StringType

rdd1 = sc.parallelize([(100,2,1234.5678),(101,3,1234.5678)])
df = spark.createDataFrame(rdd1, ['id','dec','val'])

+---+---+---------+
| id|dec|      val|
+---+---+---------+
|100|  2|1234.5678|
|101|  3|1234.5678|
+---+---+---------+

Based on the value in the dec column, I want to cast the val column. For example, if dec = 2, then val should be cast to DecimalType(7,2).

I tried the following, but it is not working:

df.select(col('id'), col('dec'), col('val'),
          col('val').cast(DecimalType(7, col('dec'))).cast(StringType()).alias('modVal')).show()

Error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/column.py", line 419, in cast
    jdt = spark._jsparkSession.parseDataType(dataType.json())
  File "/usr/lib/spark/python/pyspark/sql/types.py", line 69, in json
    return json.dumps(self.jsonValue(),
  File "/usr/lib/spark/python/pyspark/sql/types.py", line 225, in jsonValue
    return "decimal(%d,%d)" % (self.precision, self.scale)
TypeError: %d format: a number is required, not Column

The same works if I hard-code the scale to a specific number (as the traceback shows, DecimalType expects plain Python integers, not Columns), which is straightforward:

df.select(col('id'), col('dec'), col('val'),
          col('val').cast(DecimalType(7,3)).cast(StringType()).alias('modVal')).show()

+---+---+---------+--------+
| id|dec|      val|  modVal|
+---+---+---------+--------+
|100|  2|1234.5678|1234.568|
|101|  3|1234.5678|1234.568|
+---+---+---------+--------+

Please help me with this.

2 Answers


Columns in Spark (or any relational system, for that matter) have to be homogeneous; an operation like this, where each row is cast to a different type, is not only unsupported but also doesn't make much sense.


3 Comments

Maybe I am missing something here, but please help me understand why casting each row of a column in a Spark data frame is unsupported/invalid. I get that casting a column based on another column's values may be an unpopular use case, but I'm not sure why you would say the entire idea makes no sense.
@vishnuram: The data type of all rows in the same column must be the same. However, if you are just after the formatting, you could use a string in this case, which would keep the data type uniform while allowing a different number of decimals per row (see the sketch after these comments).
@shaido & user10281832 Thanks. Now I understand the concern over the data type being heterogeneous, and I have updated my commands to cast to StringType() to add more clarity to my request.
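
Following up on the string suggestion above, here is a minimal sketch (not from either answer) that keeps the column homogeneous without a UDF, assuming dec only takes a small, known set of values (here 2 and 3):

from pyspark.sql.functions import col, when

# Branch on dec, cast to the matching fixed scale, then to string,
# so every row of modVal ends up with the same (string) data type.
df.withColumn(
    'modVal',
    when(col('dec') == 2, col('val').cast('decimal(7,2)').cast('string'))
    .when(col('dec') == 3, col('val').cast('decimal(7,3)').cast('string'))
).show()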

As mentioned by user10281832, you can't have different data types in the same column.

Since the formatting is the focus, you can convert the column to string type and then do the formatting. Because the number of decimals differs per row, you can't use the inbuilt fixed-scale casts here but need to define a custom UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Format num to prec decimal places, e.g. format_val(1234.5678, 2) -> '1234.57'
def format_val(num, prec):
    return "%0.*f" % (prec, num)

format_val_udf = udf(format_val, StringType())

df.withColumn('modVal', format_val_udf('val', 'dec'))
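
For reference, calling .show() on the result should produce the following for the sample dataframe above, with each row formatted to its own number of decimals:

df.withColumn('modVal', format_val_udf('val', 'dec')).show()

+---+---+---------+--------+
| id|dec|      val|  modVal|
+---+---+---------+--------+
|100|  2|1234.5678| 1234.57|
|101|  3|1234.5678|1234.568|
+---+---+---------+--------+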

