
How do I select the columns of a DataFrame that are at certain indexes in Scala?

For example, if a DataFrame has 100 columns and I want to extract only columns (10, 12, 13, 14, 15), how can I do that?

The following selects all columns from the DataFrame df whose names appear in the Array colNames:

df = df.select(colNames.head, colNames.tail: _*)

If there is a similar colNos array, e.g.

colNos = Array(10, 20, 25, 45)

how do I transform the above df.select to fetch only the columns at those specific indexes?

3 Answers


You can map over columns:

import org.apache.spark.sql.functions.col

df.select(colNos map df.columns map col: _*)

or:

df.select(colNos map (df.columns andThen col): _*)

or:

df.select(colNos map (col _ compose df.columns): _*)

All the methods shown above are equivalent and impose no performance penalty. The following mapping:

colNos map df.columns 

is just a local Array access (constant-time access for each index), and choosing between the String- and Column-based variants of select doesn't affect the execution plan:

val df = Seq((1, 2, 3, 4, 5, 6)).toDF

val colNos = Seq(0, 3, 5)

df.select(colNos map df.columns map col: _*).explain
// == Physical Plan ==
// LocalTableScan [_1#46, _4#49, _6#51]

df.select("_1", "_4", "_6").explain
// == Physical Plan ==
// LocalTableScan [_1#46, _4#49, _6#51]
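
If you prefer the String-based variant of select, the same index lookup plugs into the head/tail varargs form as well (a small sketch, assuming colNos is non-empty):

val names = colNos map df.columns
df.select(names.head, names.tail: _*)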



@user6910411's answer above works like a charm, and the number of tasks / the logical plan is similar to my approach below, BUT my approach is a bit faster.
So, I would suggest you go with column names rather than column numbers. Column names are much safer and much lighter than using numbers. You can use the following solution:

val colNames = Seq("col1", "col2" ...... "col99", "col100")

val selectColNames = Seq("col1", "col3", .... selected column names ... )

val selectCols = selectColNames.map(name => df.col(name))

df = df.select(selectCols:_*)

If you are hesitant to write out all 100 column names, there is a shortcut too:

val colNames = df.schema.fieldNames
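
To tie this back to the original index-based question, here is a hedged sketch that combines both ideas: look up the names at the desired positions via df.schema.fieldNames, then select by name (colNos is assumed to contain valid column indexes):

val colNos = Seq(10, 20, 25, 45)
// fieldNames(i) is a plain array lookup, so this stays cheap
val selectCols = colNos.map(i => df.col(df.schema.fieldNames(i)))
val selected = df.select(selectCols: _*)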



Example: grab the first 14 columns of a Spark DataFrame by index, using Scala.

import org.apache.spark.sql.functions.col

// Gives an array of names by index (first 14 cols, for example)
val sliceCols = df.columns.slice(0, 14)
// Maps the names to Columns & selects them from the dataframe
val subset_df = df.select(sliceCols.map(name => col(name)): _*)

You cannot simply do this (as I tried and failed):

// Gives array of names by index (first 14 cols for example)
val sliceCols = df.columns.slice(0, 14)
// Maps names & selects columns in dataframe
val subset_df = df.select(sliceCols)

The reason is that this overload of select expects Column arguments, so you have to convert the Array[String] into Array[org.apache.spark.sql.Column] for the call to work.
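
If you would rather keep working with the Array[String] directly, the String varargs overload of select sidesteps the conversion entirely (a sketch, assuming sliceCols is non-empty):

// head supplies the required first String argument, tail the varargs
val subset_df = df.select(sliceCols.head, sliceCols.tail: _*)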

OR wrap it in a function using currying (high five to my colleague for this):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Subsets the DataFrame to the columns between beg_val & end_val index
def subset_frame(beg_val: Int = 0, end_val: Int)(df: DataFrame): DataFrame = {
  val sliceCols = df.columns.slice(beg_val, end_val)
  df.select(sliceCols.map(name => col(name)): _*)
}

// Get the first 25 columns as a subsetted dataframe
val subset_df: DataFrame = df.transform(subset_frame(0, 25))
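
The currying is what makes this compose: partially applying the index bounds yields a DataFrame => DataFrame, which is exactly what transform expects, so slices can be chained (a sketch with made-up bounds):

// Keep the first 25 columns, then the first 10 of those
val narrowed = df.transform(subset_frame(0, 25)).transform(subset_frame(0, 10))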

