How do you set the display precision in PySpark when calling .show()?

Consider the following example:

from math import sqrt

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

# build rows of (col1, col2) from square roots of two ranges
data = list(
    zip(
        map(sqrt, range(100, 105)),
        map(sqrt, range(200, 205)),
    )
)
df = spark.createDataFrame(data, ["col1", "col2"])
df.select([f.avg(c).alias(c) for c in df.columns]).show()

Which outputs:

#+------------------+------------------+
#|              col1|              col2|
#+------------------+------------------+
#|10.099262230352151|14.212583322380274|
#+------------------+------------------+

How can I change it so that it only displays 3 digits after the decimal point?

Desired output:

#+------+------+
#|  col1|  col2|
#+------+------+
#|10.099|14.213|
#+------+------+

This is a PySpark version of this Scala question. I'm posting it here because I could not find an answer when searching for PySpark solutions, and I think it can be helpful to others in the future.

2 Answers

Round

The easiest option is to use pyspark.sql.functions.round():

from pyspark.sql.functions import avg, round
df.select([round(avg(c), 3).alias(c) for c in df.columns]).show()
#+------+------+
#|  col1|  col2|
#+------+------+
#|10.099|14.213|
#+------+------+

This will maintain the values as numeric types.
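Because round returns a numeric column, the schema is unchanged. A quick check (a sketch, assuming the df defined in the question):

from pyspark.sql.functions import avg, round
df.select([round(avg(c), 3).alias(c) for c in df.columns]).printSchema()
#root
# |-- col1: double (nullable = true)
# |-- col2: double (nullable = true)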

Format Number

The functions are the same for Scala and Python. The only difference is the import.

You can use format_number to format a number to the desired decimal places, as stated in the official API documentation:

Formats numeric column x to a format like '#,###,###.##', rounded to d decimal places, and returns the result as a string column.

from pyspark.sql.functions import avg, format_number 
df.select([format_number(avg(c), 3).alias(c) for c in df.columns]).show()
#+------+------+
#|  col1|  col2|
#+------+------+
#|10.099|14.213|
#+------+------+

The transformed columns will be of StringType, and a comma is used as a thousands separator for larger values:

#+-----------+--------------+
#|       col1|          col2|
#+-----------+--------------+
#|500,100.000|50,489,590.000|
#+-----------+--------------+
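You can confirm the string result with dtypes (a sketch, again assuming the df from the question):

from pyspark.sql.functions import avg, format_number
print(df.select([format_number(avg(c), 3).alias(c) for c in df.columns]).dtypes)
#[('col1', 'string'), ('col2', 'string')]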

As stated in the Scala version of this answer, you can use regexp_replace to replace the , with any string you want:

Replace all substrings of the specified string value that match regexp with rep.

from pyspark.sql.functions import avg, format_number, regexp_replace
df.select(
    [regexp_replace(format_number(avg(c), 3), ",", "").alias(c) for c in df.columns]
).show()
#+----------+------------+
#|      col1|        col2|
#+----------+------------+
#|500100.000|50489590.000|
#+----------+------------+
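If you need a numeric type back after stripping the separator, one variation (not part of the original answer, just a hedged sketch) is to cast the string column to double, though using round directly is simpler:

from pyspark.sql.functions import avg, format_number, regexp_replace
df.select(
    [regexp_replace(format_number(avg(c), 3), ",", "").cast("double").alias(c)
     for c in df.columns]
).printSchema()
#root
# |-- col1: double (nullable = true)
# |-- col2: double (nullable = true)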

Just wrap the answer above in a function that only deals with float and double columns:

import pyspark.sql.functions as F
from pyspark.sql import DataFrame

def dataframe_format_float(df: DataFrame, num_decimals: int = 4) -> DataFrame:
    # round every float/double column to num_decimals; leave other columns as-is
    cols = []
    for name, dtype in df.dtypes:
        if dtype in ("float", "double"):
            cols.append(F.round(name, num_decimals).alias(name))
        else:
            cols.append(name)
    return df.select(cols)
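For example, applied to the aggregated DataFrame from the question (assuming df and the F import above):

agg = df.select([F.avg(c).alias(c) for c in df.columns])
dataframe_format_float(agg, num_decimals=3).show()
#+------+------+
#|  col1|  col2|
#+------+------+
#|10.099|14.213|
#+------+------+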
