
How can I convert a single column in Spark 2.0.1 into an array?

+---+-----+
| id| dist|
+---+-----+
|1.0|  2.0|
|2.0|  4.0|
|3.0|  6.0|
|4.0|  8.0|
+---+-----+

Selecting the id column should return Array(1.0, 2.0, 3.0, 4.0).

My attempt:

import scala.collection.JavaConverters._ 
df.select("id").collectAsList.asScala.toArray

fails with

java.lang.RuntimeException: Unsupported array type: [Lorg.apache.spark.sql.Row;

2 Answers


Why use JavaConverters if you then transform the Java list back into a Scala collection? You just need to collect the DataFrame and then map the resulting array of Rows to an array of doubles, like this:

df.select("id").collect.map(_.getDouble(0))
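For context, here's a minimal self-contained sketch of this approach (assuming a local SparkSession named spark and recreating the sample data from the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("column-to-array")
  .getOrCreate()
import spark.implicits._

// Recreate the sample DataFrame from the question
val df = Seq((1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)).toDF("id", "dist")

// collect() brings every selected Row to the driver;
// getDouble(0) extracts the single column from each Row
val ids: Array[Double] = df.select("id").collect.map(_.getDouble(0))
// ids == Array(1.0, 2.0, 3.0, 4.0)

Note that collect() materializes every selected row on the driver, so this only makes sense when the column comfortably fits in driver memory.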

2 Comments

collect() on a DataFrame is not a scalable approach.
Who talked about scalability here?

I'd try something like this with the DataFrame aggregate function collect_list() to avoid memory overhead on the driver JVM. With this approach, only the selected column's values are copied to the driver JVM, not whole Rows.

import org.apache.spark.sql.functions.collect_list

df.select(collect_list("id")).first().getList[Double](0)

This returns java.util.List[Double].
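If you need a Scala Array rather than a java.util.List, one way (a sketch, reusing the JavaConverters import from the question) is:

import scala.collection.JavaConverters._
import org.apache.spark.sql.functions.collect_list

// collect_list aggregates the whole column into a single array-typed row,
// so only the id values (not full Rows) are shipped to the driver
val javaList = df.select(collect_list("id")).first().getList[Double](0)
val ids: Array[Double] = javaList.asScala.toArray
// ids == Array(1.0, 2.0, 3.0, 4.0)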

