I can collect a column like this using the RDD API.

df.map(r => r.getAs[String]("column")).collect

However, since I start from a Dataset, I would rather not switch API levels. A simple df.select("column").collect returns an Array[Row], on which .flatten no longer works. How can I collect directly to an Array[T], e.g. Array[String]?

  • Have you tried df.select("column").as[String].collect? I'm writing from memory; if it works, I'll post a proper answer :) Commented Nov 22, 2016 at 21:07
  • Array of the type of the selected column e.g. string. Commented Nov 22, 2016 at 21:07

1 Answer

With Datasets (Spark version >= 2.0.0), you just need to convert the DataFrame to a typed Dataset and then collect it.

df.select("column").as[String].collect()

This returns an Array[String].
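For completeness, here is a minimal self-contained sketch of the call above. The object name, the helper collectStrings, the local master setting, and the sample data are assumptions made for illustration; only the select(...).as[String].collect() chain comes from the answer. Note that as[String] needs an implicit Encoder[String] in scope, which import spark.implicits._ provides.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object CollectColumnExample {

  // Hypothetical helper: collects one string column as Array[String]
  // without dropping down to the RDD API.
  def collectStrings(spark: SparkSession, df: DataFrame, name: String): Array[String] = {
    import spark.implicits._ // supplies the implicit Encoder[String] that as[String] requires
    df.select(name).as[String].collect()
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("collect-column-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // also supplies toDF on local collections

    // Made-up sample data with a single string column named "column".
    val df = Seq("a", "b", "c").toDF("column")

    val values: Array[String] = collectStrings(spark, df, "column")
    println(values.mkString(", "))

    spark.stop()
  }
}
```

If the selected column is not a string, the type argument has to match the column type instead (for example as[Int] or as[Long]).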


2 Comments

nice. That works fine. But why isn't the type inferred from the schema of the dataset automatically?
@GeorgHeiler yes and no :) The schema is inferred, but the result of select is still a Dataset[Row] (a DataFrame), because select accepts any number of column names. The .as[String] encoder then converts each single-column Row to a String.
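To illustrate the point about select accepting several columns, here is a rough sketch of what the typed view looks like with two columns, plus an untyped alternative that extracts the value from each Row by hand. The object and method names, the column names "name" and "count", and the sample data are all assumptions for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object SelectTypesExample {

  // With two selected columns the typed view is a tuple;
  // tuple fields are matched to the selected columns by position.
  def collectPairs(spark: SparkSession, df: DataFrame): Array[(String, Int)] = {
    import spark.implicits._ // supplies the tuple encoder
    df.select("name", "count").as[(String, Int)].collect()
  }

  // Untyped alternative: collect the Rows, then read field 0 of each as a String.
  def collectFromRows(df: DataFrame, name: String): Array[String] =
    df.select(name).collect().map((r: Row) => r.getString(0))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("select-types-example")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data.
    val df = Seq(("a", 1), ("b", 2)).toDF("name", "count")

    collectPairs(spark, df).foreach(println)
    println(collectFromRows(df, "name").mkString(", "))

    spark.stop()
  }
}
```

The Row-based variant works without an encoder, but the column type is only checked at runtime when getString is called.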
