
I have a DataFrame with on the order of 1000 columns (the exact number varies).

I want to make all values upper case.

Here is the approach I have thought of; can you suggest whether this is the best way?

  • Take a row
  • Find the schema, store it in an array, and count how many fields it has
  • Map over each row in the DataFrame, up to the number of elements in the array
  • Apply a function to upper-case each field and return the row (a rough sketch follows this list)
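A minimal Spark Scala sketch of those steps, assuming all columns are strings and a SparkSession named spark (the column-wise select in the answers below is simpler and avoids the round trip through an RDD):

import org.apache.spark.sql.Row

val schema = df.schema                 // step 2: capture the schema
val upperRdd = df.rdd.map { row =>     // step 3: map over every row
  Row.fromSeq(row.toSeq.map {          // step 4: upper-case each field
    case s: String => s.toUpperCase
    case other     => other            // leave non-string values untouched
  })
}
val upperDf = spark.createDataFrame(upperRdd, schema)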

2 Answers


If you simply want to apply the same function to all columns, something like this should be enough:

import org.apache.spark.sql.functions.{col, upper}
import spark.implicits._  // for .toDF; imported automatically in spark-shell

val df = sc.parallelize(
  Seq(("a", "B", "c"), ("D", "e", "F"))).toDF("x", "y", "z")
// Upper-case every column, keeping the original column names
df.select(df.columns.map(c => upper(col(c)).alias(c)): _*).show

// +---+---+---+
// |  x|  y|  z|
// +---+---+---+
// |  A|  B|  C|
// |  D|  E|  F|
// +---+---+---+

or in Python

from pyspark.sql.functions import col, upper

df = sc.parallelize([("a", "B", "c"), ("D", "e", "F")]).toDF(("x", "y", "z"))
# Upper-case every column, keeping the original column names
df.select(*(upper(col(c)).alias(c) for c in df.columns)).show()

##  +---+---+---+
##  |  x|  y|  z|
##  +---+---+---+
##  |  A|  B|  C|
##  |  D|  E|  F|
##  +---+---+---+

See also: SparkSQL: apply aggregate functions to a list of columns


3 Comments

Thanks. What is this part doing, in English: .alias(c)): _* ?
alias sets a name for the column. : _* denotes varargs syntax in Scala. In other words, it passes each element of the sequence as a separate argument to select.
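For illustration, here is that expansion written out step by step (a minimal sketch reusing the df from the answer above):

val cols = df.columns.map(c => upper(col(c)).alias(c))  // an Array of Columns
df.select(cols: _*)  // : _* expands the array into select(col1, col2, ...)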
Getting this error: File "<ipython-input-10-bd092d3f0048>", line 1: pivoted.select(pivoted.columns.map(c => encodeUDF(col(c)).alias(c)): _*).show(2) ^ SyntaxError: invalid syntax
3

I needed to do something similar, but had to write my own function to convert empty strings within a DataFrame to null. This is what I did.

import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._

// Maps null or blank strings to None (null in the DataFrame), anything else to Some
def emptyToNull(_str: String): Option[String] = {
  _str match {
    case _ if (_str == null || _str.trim.isEmpty) => None
    case _ => Some(_str)
  }
}
val emptyToNullUdf = udf(emptyToNull(_: String))

val df = Seq(("a", "B", "c"), ("D", "e ", ""), ("", "", null)).toDF("x", "y", "z")
df.select(df.columns.map(c => emptyToNullUdf(col(c)).alias(c)): _*).show

+----+----+----+
|   x|   y|   z|
+----+----+----+
|   a|   B|   c|
|   D|  e |null|
|null|null|null|
+----+----+----+

Here's a more refined version of emptyToNull that uses Option instead of checking for null directly.

def emptyToNull(_str: String): Option[String] = Option(_str) match {
  case ret @ Some(s) if (s.trim.nonEmpty) => ret
  case _ => None
}

