I need to convert the following dataframe:

╔══════╦════════╦════════╦════════╗
║ Year ║  ColA  ║  ColB  ║  ColC  ║
╠══════╬════════╬════════╬════════╣
║ 2017 ║      1 ║      2 ║      3 ║
║ 2018 ║      4 ║      5 ║      6 ║
║ 2019 ║      7 ║      8 ║      9 ║
╚══════╩════════╩════════╩════════╝

Into this:

╔══════╦════════╦═══════╗
║ Year ║ColName ║ Value ║
╠══════╬════════╬═══════╣
║ 2017 ║  ColA  ║     1 ║
║ 2017 ║  ColB  ║     2 ║
║ 2017 ║  ColC  ║     3 ║
║ 2018 ║  ColA  ║     4 ║
║ 2018 ║  ColB  ║     5 ║
║ 2018 ║  ColC  ║     6 ║
║ 2019 ║  ColA  ║     7 ║
║ 2019 ║  ColB  ║     8 ║
║ 2019 ║  ColC  ║     9 ║
╚══════╩════════╩═══════╝

The solution needs to support any number of columns besides the first "Year" column (there could be one or many). It should also be generic: no hard-coded column names anywhere; the column names should be read directly from the original dataframe.

I'm using Databricks with a notebook written in Scala. Very new to both Spark and Scala.

UPDATE

I've found this solution in Python that works well, but I'm having a hard time converting it to Scala.

from pyspark.sql import functions as F

def columnsToRows(df, by):
  # Filter dtypes and split into column names and type descriptions.
  # Only keep columns not in "by".
  cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))

  # Create and explode an array of (column_name, column_value) structs.
  kvs = F.explode(F.array([
    F.struct(F.lit(c.strip()).alias("ColName"), F.col(c).alias("Value")) for c in cols
  ])).alias("kvs")

  return df.select(by + [kvs]).select(by + ["kvs.ColName", "kvs.Value"])

2 Answers

You can use stack to transpose the data. The column list is read from df.columns, so nothing is hard-coded; fixedColumns holds the columns to keep as-is:

import org.apache.spark.sql.functions.{col, expr}

val fixedColumns = Seq("Year") // add more names here to keep extra columns as-is

val cols = df.columns
  .filter(c => !fixedColumns.contains(c))
  .map(c => s"'$c', $c")

val exp = cols.mkString(s"stack(${cols.size}, ", ", ", ") as (ColName, Value)")

df.select(fixedColumns.map(col) :+ expr(exp): _*)

Output:

  +----+-------+-----+
  |Year|ColName|Value|
  +----+-------+-----+
  |2017|ColA   |1    |
  |2017|ColB   |2    |
  |2017|ColC   |3    |
  |2018|ColA   |4    |
  |2018|ColB   |5    |
  |2018|ColC   |6    |
  |2019|ColA   |7    |
  |2019|ColB   |8    |
  |2019|ColC   |9    |
  +----+-------+-----+
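The expression-building step is plain string manipulation, so it can be sanity-checked without a Spark session. A minimal sketch, assuming the sample column names from the question (the literal Seq below stands in for df.columns):

```scala
// Plain-Scala sketch: build the stack() expression from column names alone.
val fixedColumns = Seq("Year")
val allColumns = Seq("Year", "ColA", "ColB", "ColC") // stands in for df.columns

val cols = allColumns
  .filter(c => !fixedColumns.contains(c))
  .map(c => s"'$c', $c")

val exp = cols.mkString(s"stack(${cols.size}, ", ", ", ") as (ColName, Value)")

println(exp)
// stack(3, 'ColA', ColA, 'ColB', ColB, 'ColC', ColC) as (ColName, Value)
```

Each pair in the generated expression is a literal label followed by the column reference, which is exactly what stack consumes.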

4 Comments

Uhm, okay, but as mentioned I need a generic solution that works for any number of columns (the first one would always be "Year") and doesn't use hard-coded column names, as these could be anything.
Great, thanks. One more thing: suppose that, besides Year, my dataframe has some extra columns that don't need to be transformed into rows. Is there a way to specify a list of columns to keep as columns? Something like Seq("Year", "Another Fixed Column", "Another one") to avoid transforming these into rows?
I've edited your answer to match what I need. Thanks!
Thanks for updating, I hope the answer helped you!
Your Python code translates to Scala like this. Reading the non-key columns from df.columns keeps it generic, mirroring the df.dtypes filtering in the Python version:

import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

val by = Seq("Year")
val colsToKeep = by.map(col)
val colsToTransform = df.columns.filterNot(c => by.contains(c)).toSeq

df.select(colsToKeep :+
    explode(
      array(colsToTransform.map(c => struct(lit(c).alias("ColName"), col(c).alias("Value"))): _*)
    ).as("kvs"): _*)
  .select(colsToKeep :+ $"kvs.ColName" :+ $"kvs.Value": _*)
  .show()
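For intuition, explode(array(struct(...))) behaves like a flatMap: each input row fans out into one output row per non-key column. A plain-Scala model of that semantics on the sample data (no Spark needed; the Seqs below are assumptions standing in for the dataframe):

```scala
// Model of the explode-of-structs transformation on plain collections.
val header = Seq("ColA", "ColB", "ColC")
val rows = Seq(
  (2017, Seq(1, 2, 3)),
  (2018, Seq(4, 5, 6)),
  (2019, Seq(7, 8, 9))
)

// Each (year, values) row becomes one (year, colName, value) row per column.
val melted = rows.flatMap { case (year, values) =>
  header.zip(values).map { case (name, v) => (year, name, v) }
}

melted.foreach(println)
// prints (2017,ColA,1) through (2019,ColC,9) — nine rows in total
```

The struct in the Spark version plays the role of the (name, value) tuple here, and explode does the fan-out that flatMap does above.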
