
I have the following PySpark dataframe:

df = spark.createDataFrame(
    [
        ('31,2', 'foo'),
        ('33,1', 'bar'),
    ],
    ['cost', 'label']
)

I need to cast the 'cost' column to float. I do it as follows:

df = df.withColumn('cost', df.cost.cast('float'))

However, as a result I get null values instead of numbers in the cost column.

How can I convert cost to float numbers?

3 Comments

  • "," is not a valid character for a float; you need to replace it with ".". Commented Nov 17, 2022 at 19:39
  • @Emma: Thanks, but how can I do it? Commented Nov 17, 2022 at 19:46
  • Take a look at the regexp_replace function. spark.apache.org/docs/3.1.1/api/python/reference/api/… Commented Nov 17, 2022 at 19:47

2 Answers

This should work for you.

from pyspark.sql import functions as F

df = (df
      .withColumn('cost', F.regexp_replace('cost', ',', '.'))
      .withColumn('cost', F.col('cost').cast('float')))

Note the use of F.col('cost') in the second withColumn: referencing df.cost there would point at the original, pre-replacement column.
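For reference, the per-value transformation that regexp_replace plus the float cast perform can be sketched in plain Python with the re module (an illustration only; Spark applies it across the whole column):

```python
import re

def fix_decimal(cost: str) -> float:
    # Replace the decimal comma with a dot, then parse as float.
    return float(re.sub(',', '.', cost))

print(fix_decimal('31,2'))  # 31.2
print(fix_decimal('33,1'))  # 33.1
```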


1 Comment

I'm not sure 'float' will work. Best to try DecimalType() from pyspark.sql.types
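On the precision point: a Python float stores the nearest binary value, while decimal.Decimal keeps the exact decimal digits, which is the same guarantee DecimalType gives on the Spark side. A small plain-Python illustration:

```python
from decimal import Decimal

value = '33,1'.replace(',', '.')
print(float(value))    # parsed as the nearest binary float
print(Decimal(value))  # exact decimal digits: Decimal('33.1')
```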

I think a simple lambda expression should take care of most things.

    df.loc[:, 'cost'] = df.cost.apply(lambda x: float(x.replace(',', '.')))
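Note this snippet uses the pandas API (.loc / .apply), not PySpark. Assuming pandas is installed, it behaves like this:

```python
import pandas as pd

# Same data as the question, but as a pandas DataFrame:
df = pd.DataFrame({'cost': ['31,2', '33,1'], 'label': ['foo', 'bar']})
df.loc[:, 'cost'] = df.cost.apply(lambda x: float(x.replace(',', '.')))
print(df.cost.tolist())  # [31.2, 33.1]
```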

3 Comments

Hmm, I use PySpark, not pandas. Is it also applicable to PySpark?
I haven't used PySpark in the past, but a quick glance at the description on Pypi.org shows that it makes use of numpy and Pandas, so I believe it's applicable. More importantly, have you tried the snippet I provided? Thanks!
Just for reference, I tend to avoid adding additional dependencies, so that's why my answer only uses what I know exists in the language itself.
