
I have a Spark DataFrame df. Is there a way of sub-selecting a few columns using a list of those columns?

scala> df.columns
res0: Array[String] = Array("a", "b", "c", "d")

I know I can do something like df.select("b", "c"). But suppose I have a list containing a few column names, val cols = List("b", "c"). Is there a way to pass this to df.select? df.select(cols) throws an error. Is there something like df.select(*cols), as in Python?

8 Answers


Use df.select(cols.head, cols.tail: _*)

Let me know if it works :)

Explanation from @Ben:

The key is the method signature of select:

select(col: String, cols: String*)

The cols: String* parameter takes a variable number of arguments. : _* unpacks the collection so its elements can be handled by that parameter, very similar to unpacking in Python with *args. See here and here for other examples.
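For instance, a minimal, self-contained sketch of the whole flow (the sample data below is hypothetical, chosen to match the schema in the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Hypothetical sample row with the columns from the question.
val df = Seq((1, 2, 3, 4)).toDF("a", "b", "c", "d")

val cols = List("b", "c")
// head fills the mandatory first parameter; tail: _* expands the rest as varargs.
df.select(cols.head, cols.tail: _*).show()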


4 Comments

Thanks! Worked like a charm. Could you explain a bit more about the syntax? Specifically, what does cols.tail: _* do?
I think I understand now. The key is the method signature of select: select(col: String, cols: String*). The cols: String* parameter takes a variable number of arguments, and : _* unpacks the list so it can be handled by that parameter, very similar to unpacking in Python with *args.
Cool! You got it right :) Sorry I got both the notifications just now so couldn't reply earlier. :)
No problem. Thanks again!

You can map each String to a Spark Column like this:

import org.apache.spark.sql.functions._
df.select(cols.map(col): _*)
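As a short sketch of the full flow (reusing the cols list from the question): unlike the head/tail version, which targets the select(col: String, cols: String*) overload, this one targets select(cols: Column*).

import org.apache.spark.sql.functions.col

val cols = List("b", "c")
// col(name) builds a Column for each name; : _* expands the resulting
// List[Column] into the Column* varargs that select expects.
val selected = df.select(cols.map(col): _*)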

1 Comment

Can you elaborate, please?

Another option that I've just learnt:

import org.apache.spark.sql.functions.col
val columns = Seq[String]("col1", "col2", "col3")
val colNames = columns.map(name => col(name))
val selectedDf = df.select(colNames: _*)



First, convert the String array to a List of Spark Column objects, as below:

import org.apache.spark.sql.Column;
import java.util.ArrayList;
import java.util.List;

String[] strColNameArray = new String[]{"a", "b", "c", "d"};
List<Column> colNames = new ArrayList<>();

// Wrap each column name in a Column object.
for (String strColName : strColNameArray) {
    colNames.add(new Column(strColName));
}

Then convert the List using the JavaConversions functions within the select statement, as below. You need the following import statement:

import scala.collection.JavaConversions;

Dataset<Row> selectedDF = df.select(JavaConversions.asScalaBuffer(colNames));
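Note that scala.collection.JavaConversions is deprecated in newer Scala versions; on recent Scala/Spark builds you may need the scala.collection.JavaConverters (Scala 2.12) or scala.jdk.CollectionConverters (Scala 2.13+) equivalents instead.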



You can pass arguments of type Column* to select:

import org.apache.spark.sql.Column

val df = spark.read.json("example.json")
val cols: List[String] = List("a", "b")
// Convert each column name String into a Column.
val colList: List[Column] = cols.map(df(_))
df.select(colList: _*)

1 Comment

What about a bit shorter version: df.select(cols.map(df(_)): _*) ?

You can do it like this:

String[] originCols = ds.columns();
ds.selectExpr(originCols);

Spark selectExpr source code:

  /**
   * Selects a set of SQL expressions. This is a variant of `select` that accepts
   * SQL expressions.
   *
   * {{{
   *   // The following are equivalent:
   *   ds.selectExpr("colA", "colB as newName", "abs(colC)")
   *   ds.select(expr("colA"), expr("colB as newName"), expr("abs(colC)"))
   * }}}
   *
   * @group untypedrel
   * @since 2.0.0
   */
  @scala.annotation.varargs
  def selectExpr(exprs: String*): DataFrame = {
    select(exprs.map { expr =>
      Column(sparkSession.sessionState.sqlParser.parseExpression(expr))
    }: _*)
  }
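The same idea works in Scala, as a minimal sketch reusing the cols list from the question: since selectExpr is annotated @scala.annotation.varargs, a Scala collection can be expanded into it with : _*.

val cols = List("b", "c")
df.selectExpr(cols: _*)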



Yes, you can make use of .select in Scala.

Use .head and .tail to pass all of the values in the List to select.

Example

val cols = List("b", "c")
df.select(cols.head, cols.tail: _*)

Explanation: .head supplies the mandatory first argument of select(col: String, cols: String*), and .tail: _* expands the remaining names as varargs.

1 Comment

Can you please share how to do the same (pass the column names) in Java, while doing dataframeResult = inpDataframe.select("col1", "col2", ....)?

Prepare a list containing all the required columns, then use Spark's built-in select with * unpacking, as shown below.

lst = ["col1", "col2", "col3"]
result = df.select(*lst)

Sometimes you get an error like "AnalysisException: cannot resolve 'col1' given input columns". In that case, add any missing columns as null string-typed columns before selecting, as below:

from pyspark.sql.functions import lit
from pyspark.sql.types import StringType

for i in lst:
    if i not in df.columns:
        # Add any missing column as a null string-typed column.
        df = df.withColumn(i, lit(None).cast(StringType()))

And finally you will get the dataset with the required features.

