How can I extract data from col("label") in df_raw, which is a map-type column?
I'm using Spark 1.6. I loaded data from Hive with a Hive SQL query in Spark and got a DataFrame, but one of its columns is a map type. I tried to extract the data from it and failed. Any help would be appreciated, thank you very much.
After loading the data from Hive I have a DataFrame named df_raw with the following schema:
root
|-- subscriberid: string (nullable = true)
|-- time: string (nullable = true)
|-- itemid: string (nullable = true)
|-- label: map (nullable = true)
|    |-- key: string
|    |-- value: string (valueContainsNull = true)
|-- partitiondate: string (nullable = true)
and df_raw.show(3) prints:
+------------+-------------------+------+--------------------+-------------+
|subscriberid|               time|itemid|               label|partitiondate|
+------------+-------------------+------+--------------------+-------------+
|     1569960|2019-09-08 08:00:01| 46611|Map(license -> yo...|     20190908|
|     1104555|2019-09-08 08:00:29| 46445|Map(license -> wa...|     20190908|
|     1309036|2019-09-08 08:00:55| 45219|Map(license -> yo...|     20190908|
+------------+-------------------+------+--------------------+-------------+
To see the contents more clearly, I converted df_raw to an RDD and took 2 rows from it:
val rawRDD: RDD[String] = df_raw.rdd.map(pojo => pojo.mkString("\t"))
println("——————————" + "\n")
rawRDD.take(2).foreach(println)
The output is:
1545807 2019-09-10 07:29:41 4706 Map(license -> wa, videoid -> 4706, mediapaytype -> 1, duration -> 131) 20190908
1496840 2019-09-10 07:30:43 4535 Map(license -> you, videoid -> 4535, mediapaytype -> 1, duration -> 137) 20190908
How can I extract the individual values stored in col("label") of df_raw?
I tried to build a new DataFrame like this:
val df_userBehaviorsRow_1 = rawUserBehaviorsData.map(line => {
  val splits = line.split("\t")
  val subscriberid = splits(0)
  val time = splits(1)
  val itemid = splits(2)
  val label = splits(3)
  val resultant = label.map { m =>
    val seq = m.values.toSeq
    (seq(0), seq(1), seq(2))
  }
  val license = resultant._1
  val duration = resultant._3
  (subscriberid, time, itemid, label, license, duration)
}).toDF
This fails, and IntelliJ IDEA does not even recognize the "val resultant = label.map { m => ... }" block.
Any help would be appreciated, thank you very much.
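For reference, here is a minimal sketch of the kind of extraction I am after, assuming label really is a map of string to string as in the schema above. In the attempt above, label is a plain String after split("\t"), so label.map iterates over characters and m.values cannot compile. Reading the map through the DataFrame API avoids that: Column.getItem looks up a map value by key. The ExtractLabel object, the extract helper, the local[*] master and the single hard-coded row are assumptions for illustration only; in the real job df_raw comes from Hive.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.col

// Hypothetical helper object, not part of the original job.
object ExtractLabel {

  // Pulls two keys out of the map column into ordinary string columns.
  // getItem(key) returns null for rows whose map lacks that key.
  def extract(df_raw: DataFrame): DataFrame =
    df_raw.select(
      col("subscriberid"),
      col("time"),
      col("itemid"),
      col("label").getItem("license").as("license"),
      col("label").getItem("duration").as("duration")
    )

  def main(args: Array[String]): Unit = {
    // Local context for illustration only (assumption).
    val conf = new SparkConf().setAppName("extract-label").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Stand-in for df_raw, built from one row of the sample output above.
    val df_raw = Seq(
      ("1545807", "2019-09-10 07:29:41", "4706",
        Map("license" -> "wa", "videoid" -> "4706",
            "mediapaytype" -> "1", "duration" -> "131"),
        "20190908")
    ).toDF("subscriberid", "time", "itemid", "label", "partitiondate")

    extract(df_raw).show()
    sc.stop()
  }
}
```

With this shape, license and duration become regular string columns next to subscriberid, time and itemid, and no round trip through a tab-joined string is needed.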