How do you set the display precision in PySpark when calling .show()?

Consider the following example:

from math import sqrt

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

# build rows of (col1, col2) from square roots of two ranges
data = list(
    zip(
        map(sqrt, range(100, 105)),
        map(sqrt, range(200, 205)),
    )
)
df = spark.createDataFrame(data, ["col1", "col2"])
df.select([f.avg(c).alias(c) for c in df.columns]).show()

Which outputs:

#+------------------+------------------+
#|              col1|              col2|
#+------------------+------------------+
#|10.099262230352151|14.212583322380274|
#+------------------+------------------+

How can I change it so that it only displays 3 digits after the decimal point?

Desired output:

#+------+------+
#|  col1|  col2|
#+------+------+
#|10.099|14.213|
#+------+------+

This is a PySpark version of this Scala question. I'm posting it here because I could not find an answer when searching for PySpark solutions, and I think it can be helpful to others in the future.

2 Answers

Round

The easiest option is to use pyspark.sql.functions.round():

from pyspark.sql.functions import avg, round
df.select([round(avg(c), 3).alias(c) for c in df.columns]).show()
#+------+------+
#|  col1|  col2|
#+------+------+
#|10.099|14.213|
#+------+------+

This will maintain the values as numeric types.
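Because round returns a numeric column, the schema is unchanged. A quick check (a sketch, assuming the df defined in the question):

from pyspark.sql.functions import avg, round
df.select([round(avg(c), 3).alias(c) for c in df.columns]).printSchema()
#root
# |-- col1: double (nullable = true)
# |-- col2: double (nullable = true)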

Format Number

The functions are the same for Scala and Python. The only difference is the import.

You can use format_number to format a number to the desired decimal places, as stated in the official API documentation:

Formats numeric column x to a format like '#,###,###.##', rounded to d decimal places, and returns the result as a string column.

from pyspark.sql.functions import avg, format_number 
df.select([format_number(avg(c), 3).alias(c) for c in df.columns]).show()
#+------+------+
#|  col1|  col2|
#+------+------+
#|10.099|14.213|
#+------+------+

The transformed columns will be of StringType, and a comma is used as a thousands separator for larger values:

#+-----------+--------------+
#|       col1|          col2|
#+-----------+--------------+
#|500,100.000|50,489,590.000|
#+-----------+--------------+
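You can confirm the string result with dtypes (a sketch, again assuming the df from the question):

from pyspark.sql.functions import avg, format_number
print(df.select([format_number(avg(c), 3).alias(c) for c in df.columns]).dtypes)
#[('col1', 'string'), ('col2', 'string')]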

As stated in the Scala version of this answer, you can use regexp_replace to replace the , with any string you want:

Replace all substrings of the specified string value that match regexp with rep.

from pyspark.sql.functions import avg, format_number, regexp_replace
df.select(
    [regexp_replace(format_number(avg(c), 3), ",", "").alias(c) for c in df.columns]
).show()
#+----------+------------+
#|      col1|        col2|
#+----------+------------+
#|500100.000|50489590.000|
#+----------+------------+
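If you need a numeric type back after stripping the separator, one variation (not part of the original answer, just a hedged sketch) is to cast the string column to double, though using round directly is simpler:

from pyspark.sql.functions import avg, format_number, regexp_replace
df.select(
    [regexp_replace(format_number(avg(c), 3), ",", "").cast("double").alias(c)
     for c in df.columns]
).printSchema()
#root
# |-- col1: double (nullable = true)
# |-- col2: double (nullable = true)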

Just wrap the answer above in a function that only deals with float and double columns:

import pyspark.sql.functions as F
from pyspark.sql import DataFrame

def dataframe_format_float(df: DataFrame, num_decimals: int = 4) -> DataFrame:
    # round every float/double column to num_decimals; leave other columns as-is
    cols = []
    for name, dtype in df.dtypes:
        if dtype in ("float", "double"):
            cols.append(F.round(name, num_decimals).alias(name))
        else:
            cols.append(name)
    return df.select(cols)
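For example, applied to the aggregated DataFrame from the question (assuming df and the F import above):

agg = df.select([F.avg(c).alias(c) for c in df.columns])
dataframe_format_float(agg, num_decimals=3).show()
#+------+------+
#|  col1|  col2|
#+------+------+
#|10.099|14.213|
#+------+------+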
