0

I'm looking a way to append column names to data frame row's data . Number of columns could be different from time to time

I've Spark 1.4.1

I've a dataframe :

Edit: : all data is String type only

+---+----------+
|key|     value|
+---+----------+
|foo|       bar|
|bar|  one, two|
+---+----------+

I'd like to get :

  +-------+---------------------+
  |key    |                value|
  +-------+---------------------+
  |key_foo|            value_bar|
  |key_bar| value_one, value_two|
  +---+-------------------------+ 

I tried

 import org.apache.spark.sql._
 import org.apache.spark.sql.functions._

 val concatColNamesWithElems = udf { seq: Seq[Row] =>
     seq.map { case Row(y: String) => (col +"_"+y)}}
2
  • what type is your value column? Commented Jan 30, 2017 at 14:52
  • @mtoto value has String values only Commented Jan 30, 2017 at 16:03

1 Answer 1

1

Save DataFrame as Table (Ex: dfTable), So that you write SQL on it.

df.registerTempTable("dfTable")

Create UDF and Register: I'd assume your value column type is String

sqlContext.udf.register("prefix", (columnVal: String, prefix: String) =>
  columnVal.split(",").map(x => prefix + "_" + x.trim).mkString(", ")
)

Use UDF in Query

//prepare columns which have UDF and all column names with AS 
//Ex: prefix(key, "key") AS key // you can this representation 
val columns = df.columns.map(col => s"""prefix($col, "$col") AS $col """).mkString(",")


println(columns) //for testing how columns framed

val resultDf = sqlContext.sql("SELECT " + columns + " FROM dfTable")
Sign up to request clarification or add additional context in comments.

10 Comments

Thanks for fast answer ! As I understand only col names "key" ,"value" will work . Unfortunately, column names and number of columns could be variant each time I execute the code . I look for more general solution
You can use udf like prefixCol(column_name_in_table, "prefix_string") in select. With this you can add any prefix to the values in column.
@Toren: this udf provides maximum generic solution. Let me know if you think this won't work in generic way. As I understand only col names "key" ,"value" will work It works with any column(string type) and any prefix.
I'm still thinking how to loop over whole df(all columns) with this solution
Depends , could be various
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.