Select columns whose name contains a specific string from spark scala DataFrame

Question

I have a DataFrame like this.

Name   City  Name_index   City_index
Ali    lhr     2.0          0.0
abc    swl     0.0          2.0
xyz    khi     1.0          1.0

I want to keep only the columns whose names contain the string "index" and drop all the others.

The expected output is:

Name_index   City_index
 2.0           0.0
 0.0           2.0
 1.0           1.0

I have tried this.

val cols = newDF.columns
val regex = """^((?!_indexed).)*$""".r
val selection = cols.filter(s => regex.findFirstIn(s).isDefined)
cols.diff(selection)
val res = newDF.select(selection.head, selection.tail: _*)
res.show()

But I am getting this:

Name   City
Ali    lhr
abc    swl
xyz    khi

You can use "cols.filterNot" instead of "cols.filter".

– pasha701, Commented Apr 23, 2020 at 8:23
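A minimal sketch of what this comment suggests, run on a plain array of column names so it needs no Spark session. The object name `FilterNotSketch` is hypothetical, and the pattern uses `_index` (as in the sample columns) rather than the `_indexed` from the question, which is an assumption about the intended column names.

```scala
object FilterNotSketch {
  // Column names taken from the question's sample DataFrame.
  val cols: Array[String] = Array("Name", "City", "Name_index", "City_index")

  // Negative-lookahead pattern: matches only names that do NOT contain "_index".
  val regex = """^((?!_index).)*$""".r

  // filterNot keeps the names the pattern rejects, i.e. those containing "_index".
  val selection: Array[String] =
    cols.filterNot(s => regex.findFirstIn(s).isDefined)

  def main(args: Array[String]): Unit = {
    println(selection.mkString(", "))
    // With a DataFrame in scope, the names could then be passed to select:
    // newDF.select(selection.head, selection.tail: _*)
  }
}
```

Note that `selection.head` throws if no column name qualifies, so a real job may want to check `selection.nonEmpty` first.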

QuickSilver · Accepted Answer · 2020-04-23 08:46:35Z

1

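There is a typo in your regex: it looks for "_indexed", but your columns end in "_index". The code below fixes that and also selects the result of cols.diff(selection), which your code computed but never used.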

import org.apache.spark.sql.SparkSession

object FilterColumn {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._
    val newDF = List(PersonCity("Ali", "lhr", 2.0, 0.0)).toDF()
    newDF.show()
    val cols = newDF.columns
    // Matches only names that do NOT contain "_index".
    val regex = """^((?!_index).)*$""".r
    val selection = cols.filter(s => regex.findFirstIn(s).isDefined)
    // Remove the non-index names, leaving Name_index and City_index.
    val finalCols = cols.diff(selection)
    val res = newDF.select(finalCols.head, finalCols.tail: _*)
    res.show()
  }

}

// Row type for the sample data; field names become the DataFrame column names.
case class PersonCity(Name: String, City: String, Name_index: Double, City_index: Double)

answered Apr 23, 2020 at 8:46


Emiliano Martinez · Accepted Answer · 2020-04-23 11:11:01Z

0

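import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

// Assumes a SparkSession named `spark` is in scope (for example in spark-shell).
// Matches only names that do NOT contain "_indexed".
val regex = """^((?!_indexed).)*$""".r
val schema = StructType(
  Seq(StructField("Name", StringType, false),
      StructField("City", StringType, false),
      StructField("Name_indexed", IntegerType, false),
      StructField("City_indexed", LongType, false)))

val empty: DataFrame = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema = schema)
// Keep the column names that the regex matches in full, then select them.
val columns = schema.map(_.name).filter(el => regex.pattern.matcher(el).matches())
empty.select(columns.map(col): _*).show()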

It gives

+----+----+
|Name|City|
+----+----+
+----+----+

edited Apr 23, 2020 at 11:11

answered Apr 23, 2020 at 8:54


3 Comments

Ayeza Malik Over a year ago

But I want only the columns whose names contain the string "index".

Emiliano Martinez Over a year ago

You can change the regex in that case

Ayeza Malik Over a year ago

I made some slight changes to the regex and the problem is solved. Thanks.
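The changed regex was not posted in the thread. As a sketch of one way it could look (not necessarily what the asker used), the negative lookahead can be replaced by a positive test for "index". The object name `ContainsIndexSketch` and the sample column list are assumptions for illustration, and the filtering runs on plain strings so no Spark session is required.

```scala
object ContainsIndexSketch {
  // Hypothetical column names mirroring the question's DataFrame.
  val cols: Seq[String] = Seq("Name", "City", "Name_index", "City_index")

  // Option 1: plain substring test, no regex needed.
  val bySubstring: Seq[String] = cols.filter(_.contains("index"))

  // Option 2: a positive regex; findFirstIn succeeds when "index" occurs anywhere.
  val indexPattern = "index".r
  val byRegex: Seq[String] =
    cols.filter(c => indexPattern.findFirstIn(c).isDefined)

  def main(args: Array[String]): Unit = {
    println(bySubstring.mkString(", "))
    println(byRegex.mkString(", "))
    // With Spark, the names could be turned into columns and selected:
    // import org.apache.spark.sql.functions.col
    // df.select(bySubstring.map(col): _*)
  }
}
```

Mapping the names through `col` and passing them as varargs avoids the `head`/`tail` split, so the call does not throw when the filtered list is empty.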
