I need to convert the following dataframe:

╔══════╦════════╦════════╦════════╗
║ Year ║  ColA  ║  ColB  ║  ColC  ║
╠══════╬════════╬════════╬════════╣
║ 2017 ║      1 ║      2 ║      3 ║
║ 2018 ║      4 ║      5 ║      6 ║
║ 2019 ║      7 ║      8 ║      9 ║
╚══════╩════════╩════════╩════════╝

Into this:

╔══════╦════════╦═══════╗
║ Year ║ColName ║ Value ║
╠══════╬════════╬═══════╣
║ 2017 ║  ColA  ║     1 ║
║ 2017 ║  ColB  ║     2 ║
║ 2017 ║  ColC  ║     3 ║
║ 2018 ║  ColA  ║     4 ║
║ 2018 ║  ColB  ║     5 ║
║ 2018 ║  ColC  ║     6 ║
║ 2019 ║  ColA  ║     7 ║
║ 2019 ║  ColB  ║     8 ║
║ 2019 ║  ColC  ║     9 ║
╚══════╩════════╩═══════╝

The solution needs to support any number of columns besides the first "Year" column (there could be one or many). It should also be generic: no hard-coded column names anywhere; the column names should be read directly from the original dataframe.

I'm using Databricks with a notebook written in Scala. Very new to both Spark and Scala.

UPDATE

I've found this solution in Python that works well, but I'm having a hard time converting it to Scala.

from pyspark.sql import functions as F

def columnsToRows(df, by):
  # Filter dtypes and split into column names and type descriptions.
  # Only keep columns not in "by".
  cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))

  # Create and explode an array of (column_name, column_value) structs.
  kvs = F.explode(F.array([
    F.struct(F.lit(c.strip()).alias("ColName"), F.col(c).alias("Value")) for c in cols
  ])).alias("kvs")

  return df.select(by + [kvs]).select(by + ["kvs.ColName", "kvs.Value"])

2 Answers

You can use stack to transpose the data. The column list is read from df.columns, so nothing is hard-coded; fixedColumns holds the columns to keep as-is:

import org.apache.spark.sql.functions.{col, expr}

val fixedColumns = Seq("Year") // add more names here to keep extra columns as-is

val cols = df.columns
  .filter(c => !fixedColumns.contains(c))
  .map(c => s"'$c', $c")

val exp = cols.mkString(s"stack(${cols.size}, ", ", ", ") as (ColName, Value)")

df.select(fixedColumns.map(col) :+ expr(exp): _*)

Output:

  +----+-------+-----+
  |Year|ColName|Value|
  +----+-------+-----+
  |2017|ColA   |1    |
  |2017|ColB   |2    |
  |2017|ColC   |3    |
  |2018|ColA   |4    |
  |2018|ColB   |5    |
  |2018|ColC   |6    |
  |2019|ColA   |7    |
  |2019|ColB   |8    |
  |2019|ColC   |9    |
  +----+-------+-----+
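The expression-building step is plain string manipulation, so it can be sanity-checked without a Spark session. A minimal sketch, assuming the sample column names from the question (the literal Seq below stands in for df.columns):

```scala
// Plain-Scala sketch: build the stack() expression from column names alone.
val fixedColumns = Seq("Year")
val allColumns = Seq("Year", "ColA", "ColB", "ColC") // stands in for df.columns

val cols = allColumns
  .filter(c => !fixedColumns.contains(c))
  .map(c => s"'$c', $c")

val exp = cols.mkString(s"stack(${cols.size}, ", ", ", ") as (ColName, Value)")

println(exp)
// stack(3, 'ColA', ColA, 'ColB', ColB, 'ColC', ColC) as (ColName, Value)
```

Each pair in the generated expression is a literal label followed by the column reference, which is exactly what stack consumes.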

4 Comments

Uhm, okay, but as mentioned I need a generic solution that works for any number of columns (the first one would always be "Year") and doesn't use hard-coded column names, as these could be anything.
Great, thanks. One more thing: suppose that, besides Year, my dataframe has some extra columns that don't need to be transformed into rows. Is there a way to specify a list of columns to keep as columns? Something like Seq("Year", "Another Fixed Column", "Another one") to avoid transforming these into rows?
I've edited your answer to match what I need. Thanks!
Thanks for updating, I hope the answer helped you!
Your Python code translates to Scala like this. Reading the non-key columns from df.columns keeps it generic, mirroring the df.dtypes filtering in the Python version:

import org.apache.spark.sql.functions.{array, col, explode, lit, struct}

val by = Seq("Year")
val colsToKeep = by.map(col)
val colsToTransform = df.columns.filterNot(c => by.contains(c)).toSeq

df.select(colsToKeep :+
    explode(
      array(colsToTransform.map(c => struct(lit(c).alias("ColName"), col(c).alias("Value"))): _*)
    ).as("kvs"): _*)
  .select(colsToKeep :+ $"kvs.ColName" :+ $"kvs.Value": _*)
  .show()
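For intuition, explode(array(struct(...))) behaves like a flatMap: each input row fans out into one output row per non-key column. A plain-Scala model of that semantics on the sample data (no Spark needed; the Seqs below are assumptions standing in for the dataframe):

```scala
// Model of the explode-of-structs transformation on plain collections.
val header = Seq("ColA", "ColB", "ColC")
val rows = Seq(
  (2017, Seq(1, 2, 3)),
  (2018, Seq(4, 5, 6)),
  (2019, Seq(7, 8, 9))
)

// Each (year, values) row becomes one (year, colName, value) row per column.
val melted = rows.flatMap { case (year, values) =>
  header.zip(values).map { case (name, v) => (year, name, v) }
}

melted.foreach(println)
// prints (2017,ColA,1) through (2019,ColC,9) — nine rows in total
```

The struct in the Spark version plays the role of the (name, value) tuple here, and explode does the fan-out that flatMap does above.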
