Convert String Datatype Column to MapType in Spark Dataframe

Question

I have a dataframe as shown below. I want to convert the last column Trandata from String Type to MapType. The output should look something similar I have shown in 2nd table.

I have written udf but it takes string and converts to Maptype, I'm struggle to get similar output with sql.row as input. :(

def stringToMap(value: String): Map[String, String] = {
  var valMap = collection.mutable.Map[String, String]()
  val values = value.split(",")
  for (i <- values) {
    valMap = valMap + (i.split("=")(0) -> i.split("=")(1))
  }
  return valMap
}


+--------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|MESSAGEID     |CATEGORY|TRANDATA                                                                                                                                                                                                                                                                                       |
+--------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|03010         |A       |threadID=123sada,ProcType=InfraLogging,TxnID=4mjx8wfogf
|03011         |A       |threadID=xmjxe2j0jz,ProcType=InfraLogging,TxnID=4mjxe2j0tf
|09941         |D       |compTxnID=xmawdew0tf,to=ABCD,threadID=4mjxe2j0jz,ProcType=InfraLogging
|00994         |D       |compTxnID=xmjxe2j0tf,to=XYZA,threadID=34jxasde0jz,ProcType=InfraLogging
+--------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Table2: Expceted output - 3rd col to be MapType

+--------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|MESSAGEID     |CATEGORY|TRANDATA                                                                                                                                                                                                                                                                                       |
+--------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|03010         |A       |Map(threadID -> 123sada,ProcType -> InfraLogging,TxnID -> 4mjx8wfogf)

werner · Accepted Answer · 2020-10-04 11:38:00Z

3

For Spark 2.4+ you can split the string into key value pairs, then use transform to separate keys and values into two array columns and then use map_from_arrays to create the final map.

df.withColumn("entry", split('TRANDATA, ","))
  .withColumn("key", expr("transform(entry, x -> split(x, '=')[0])"))
  .withColumn("value", expr("transform(entry, x -> split(x, '=')[1])"))
  .withColumn("map", map_from_arrays('key, 'value))
  .drop("entry", "key", "value", "TRANDATA")
  .show(false)

Output:

+---------+--------+----------------------------------------------------------------------------------------+
|MESSAGEID|CATEGORY|map                                                                                     |
+---------+--------+----------------------------------------------------------------------------------------+
|03010    |A       |[threadID -> 123sada, ProcType -> InfraLogging, TxnID -> 4mjx8wfogf]                    |
|03011    |A       |[threadID -> xmjxe2j0jz, ProcType -> InfraLogging, TxnID -> 4mjxe2j0tf]                 |
|09941    |D       |[compTxnID -> xmawdew0tf, to -> ABCD, threadID -> 4mjxe2j0jz, ProcType -> InfraLogging] |
|00994    |D       |[compTxnID -> xmjxe2j0tf, to -> XYZA, threadID -> 34jxasde0jz, ProcType -> InfraLogging]|
+---------+--------+----------------------------------------------------------------------------------------+

answered Oct 4, 2020 at 11:38

werner

15k6 gold badges36 silver badges56 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

Shankar Over a year ago

Thank you so much Werner. Spark 2.4+ contains lot more options, it's good my production servers are upgraded to 2.4.x version, So above code should work fine.

Collectives™ on Stack Overflow

Convert String Datatype Column to MapType in Spark Dataframe

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related