
I have a DataFrame with on the order of 1000 columns (the exact number varies).

I want to make all values upper case.

Here is the approach I have thought of; can you suggest whether this is the best way?

  • Take a row
  • Find the schema, store it in an array, and count how many fields it has
  • Map over each row in the DataFrame, up to the number of elements in the array
  • Apply a function to upper-case each field and return the row (a rough sketch follows this list)
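A minimal Spark Scala sketch of those steps, assuming all columns are strings and a SparkSession named spark (the column-wise select in the answers below is simpler and avoids the round trip through an RDD):

import org.apache.spark.sql.Row

val schema = df.schema                 // step 2: capture the schema
val upperRdd = df.rdd.map { row =>     // step 3: map over every row
  Row.fromSeq(row.toSeq.map {          // step 4: upper-case each field
    case s: String => s.toUpperCase
    case other     => other            // leave non-string values untouched
  })
}
val upperDf = spark.createDataFrame(upperRdd, schema)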

2 Answers


If you simply want to apply the same function to all columns, something like this should be enough:

import org.apache.spark.sql.functions.{col, upper}
import spark.implicits._  // for .toDF; imported automatically in spark-shell

val df = sc.parallelize(
  Seq(("a", "B", "c"), ("D", "e", "F"))).toDF("x", "y", "z")
// Upper-case every column, keeping the original column names
df.select(df.columns.map(c => upper(col(c)).alias(c)): _*).show

// +---+---+---+
// |  x|  y|  z|
// +---+---+---+
// |  A|  B|  C|
// |  D|  E|  F|
// +---+---+---+

or in Python

from pyspark.sql.functions import col, upper

df = sc.parallelize([("a", "B", "c"), ("D", "e", "F")]).toDF(("x", "y", "z"))
# Upper-case every column, keeping the original column names
df.select(*(upper(col(c)).alias(c) for c in df.columns)).show()

##  +---+---+---+
##  |  x|  y|  z|
##  +---+---+---+
##  |  A|  B|  C|
##  |  D|  E|  F|
##  +---+---+---+

See also: SparkSQL: apply aggregate functions to a list of columns


3 Comments

Thanks. What is this part doing, in English: .alias(c)): _* ?
alias sets a name for the column. : _* denotes varargs syntax in Scala. In other words, it passes each element of the sequence as a separate argument to select.
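For illustration, here is that expansion written out step by step (a minimal sketch reusing the df from the answer above):

val cols = df.columns.map(c => upper(col(c)).alias(c))  // an Array of Columns
df.select(cols: _*)  // : _* expands the array into select(col1, col2, ...)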
Getting this error: File "<ipython-input-10-bd092d3f0048>", line 1: pivoted.select(pivoted.columns.map(c => encodeUDF(col(c)).alias(c)): _*).show(2) ^ SyntaxError: invalid syntax
3

I needed to do something similar, but had to write my own function to convert empty strings within a DataFrame to null. This is what I did.

import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

// Maps null or blank strings to None (null in the DataFrame), anything else to Some
def emptyToNull(_str: String): Option[String] = {
  _str match {
    case _ if (_str == null || _str.trim.isEmpty) => None
    case _ => Some(_str)
  }
}
val emptyToNullUdf = udf(emptyToNull(_: String))

val df = Seq(("a", "B", "c"), ("D", "e ", ""), ("", "", null)).toDF("x", "y", "z")
df.select(df.columns.map(c => emptyToNullUdf(col(c)).alias(c)): _*).show

+----+----+----+
|   x|   y|   z|
+----+----+----+
|   a|   B|   c|
|   D|  e |null|
|null|null|null|
+----+----+----+

Here's a more refined version of emptyToNull that uses Option instead of checking for null directly.

def emptyToNull(_str: String): Option[String] = Option(_str) match {
  case ret @ Some(s) if (s.trim.nonEmpty) => ret
  case _ => None
}

