I have a Spark job written in Python that reads data from CSV files using the Databricks CSV reader.

I want to convert some columns from string to double by applying a UDF that also changes the decimal separator.

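from pyspark.sql import functions as F
from pyspark.sql.types import FloatType

def _to_float(decimal_separator, decimal_str):
    # Python 2 code: `unicode` is checked alongside `str`
    if isinstance(decimal_str, (str, unicode)):
        # Blank strings become null; otherwise normalize the separator and parse
        return (None if len(decimal_str.strip()) == 0
                else float(decimal_str.replace(decimal_separator, '.')))
    else:
        return decimal_str

# The helper is defined before the UDF that refers to it
convert_udf = F.udf(
    lambda decimal_str: _to_float(decimal_separator, decimal_str),
    returnType=FloatType())

for name in columns:
    df = df.withColumn(name, convert_udf(df[name]))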

The Spark job gets stuck when the UDF is called. I tried returning a fixed double value from _to_float, without success. It looks like something is wrong between the UDF and the data frame using the SQL context.

1 Answer

Long story short, don't use Python UDFs (or UDFs in general) unless it is necessary:

  • they are inefficient because every value makes a full round-trip through the Python interpreter
  • they cannot be optimized by Catalyst
  • they create long lineages if used iteratively

For simple operations like this one, just use built-in functions:

from pyspark.sql.functions import regexp_replace

decimal_separator = ","
exprs = [
    regexp_replace(c, decimal_separator, ".").cast("float").alias(c) 
    if c in columns else c 
    for c in df.columns
]

df.select(*exprs)
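
One caveat: regexp_replace treats its second argument as a regular expression, so a separator that happens to be a regex metacharacter (for example "." or "|") should be escaped first. Below is a minimal sketch of the idea using Python's re module as a stand-in. Spark itself uses Java regular expressions, and the helper names literal_pattern and normalize are hypothetical, not part of any library. For a single punctuation character the backslash-escaped form should be valid in both dialects.

```python
import re


def literal_pattern(separator):
    """Hypothetical helper: return a regex that matches `separator` literally."""
    return re.escape(separator)


def normalize(decimal_str, decimal_separator):
    """Pure-Python stand-in for regexp_replace(col, pattern, ".")."""
    return re.sub(literal_pattern(decimal_separator), ".", decimal_str)


# Unescaped, "." is a wildcard and rewrites every character.
unescaped = re.sub(".", ",", "1.5")
# Escaped, only the literal dot is replaced.
escaped = re.sub(literal_pattern("."), ",", "1.5")
```

Assuming the escaped string is also accepted by Spark's regex engine, it could be passed in place of decimal_separator in the regexp_replace call above.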

2 Comments

Thanks. After replacing the UDF, everything works as expected.
I am glad to hear this. I suspect that your UDF would have worked as well sooner or later, but calling UDFs is just an expensive operation.
