
I'm using Scala and Apache Spark 2.3.0 with a CSV file. When I try to use the CSV for k-means it tells me that I have null values, and the same issue keeps appearing even after I try to fill those nulls:

scala> val df = sqlContext.read.format("com.databricks.spark.csv")
    .option("header", "true")
    .option("delimiter", ";")
    .schema(schema).load("33.csv")

scala> df.na.fill(df.columns.zip(
  df.select(df.columns.map(mean(_)): _*).first.toSeq
).toMap)

scala> val featuresCols = Array("LONGITUD","LATITUD")
featuresCols: Array[String] = Array(LONGITUD, LATITUD)

scala> val featureCols = Array("LONGITUD","LATITUD")
featureCols: Array[String] = Array(LONGITUD, LATITUD)

scala> val assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
assembler: org.apache.spark.ml.feature.VectorAssembler = vecAssembler_440117601217

scala> val df2 = assembler.transform(df)
df2: org.apache.spark.sql.DataFrame = [ID_CALLE: int, TIPO: int ... 6 more fields]

scala> df2.show

Caused by: org.apache.spark.SparkException: Values to assemble cannot be null

1 Answer

Looks like you called na.fill() but didn't assign the result to a new DataFrame. DataFrames are immutable, so na.fill() returns a new, filled DataFrame and leaves df itself untouched.

Try val nonullDF = df.na.fill(...)
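
For instance, here is a minimal sketch of the whole pipeline with the fill assigned to a new DataFrame. The column names, the schema value and "33.csv" are taken from your question; I only fill the two feature columns with their means, which assumes both columns are numeric and contain at least one non-null value (otherwise the mean itself is null and na.fill cannot use it):

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.functions.mean

// Fill only the two feature columns with their means, then keep working
// with the returned DataFrame (na.fill never modifies df in place).
val featureCols = Array("LONGITUD", "LATITUD")
val featureMeans = featureCols.zip(
  df.select(featureCols.map(c => mean(c)): _*).first.toSeq
).toMap

val nonullDF = df.na.fill(featureMeans)

// Assemble from the filled DataFrame, not from the original df.
val assembler = new VectorAssembler()
  .setInputCols(featureCols)
  .setOutputCol("features")

val df2 = assembler.transform(nonullDF)
df2.show()

Every downstream step (including k-means) should then use nonullDF/df2; calling df.na.fill(...) on its own line, as in the question, discards the filled result.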


2 Comments

I already tried that, but when I run the VectorAssembler transform to get a new DataFrame I still hit the same issue: val nonullDF = df.na.fill(df.columns.zip(df.select(df.columns.map(mean(_)): _*).first.toSeq).toMap)
I am unable to replicate your issue. Can you provide runnable code and data that creates the issue so that I can investigate it?
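
If the error persists after assigning the result of na.fill, a quick sanity check is to count how many nulls remain in each column of the filled DataFrame (a sketch, assuming the nonullDF name from above). A non-zero count for LONGITUD or LATITUD would mean the fill did not cover that column, e.g. because its mean was itself null; also double-check that the assembler is applied to nonullDF and not to the original df.

import org.apache.spark.sql.functions.{col, count, when}

// when() without otherwise() yields null when the condition is false, and
// count() skips nulls, so each cell is the number of nulls in that column.
nonullDF.select(
  nonullDF.columns.map(c => count(when(col(c).isNull, c)).alias(c)): _*
).show()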
