I have a Hive table that is ingested from system logs. The data is encoded in an awkward format: a JSON array of maps, in which each element holds a field name and its value. The column type is STRING, as in the example below:
select 1 as user_id, '[{"field":"name", "value":"Bob"}, {"field":"gender", "value":"M"}]' as user_info
union all
select 2 as user_id, '[{"field":"gender", "value":"F"}, {"field":"age", "value":22}, {"field":"name", "value":"Ana"}]' as user_info;
This produces something like:
| user_id | user_info |
|---|---|
| 1 | [{"field":"name", "value":"Bob"}, {"field":"gender", "value":"M"}] |
| 2 | [{"field":"gender", "value":"F"}, {"field":"age", "value":22}, {"field":"name", "value":"Ana"}] |
Note that the array size varies from row to row. I'm trying to convert the array of maps into a single map per user, so this is the result I expect:
| user_id | user_info |
|---|---|
| 1 | {"name":"Bob", "gender":"M"} |
| 2 | {"name":"Ana", "gender":"F", "age":22} |
I was planning to get there in 3 steps: (1) parse the string column into an array of maps, (2) explode the array (using lateral view), (3) collect the field/value pairs into one map, grouped by user_id.
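For steps 2 and 3, this is roughly what I have in mind. It is only a sketch: the `parsed` CTE and the `info_arr` column are hypothetical stand-ins for whatever step 1 ends up producing, and I'm assuming that output is an `array<map<string,string>>`. Since I don't know of a built-in aggregate that collects rows into a map, the sketch glues the pairs together and rebuilds the map with `str_to_map`.

```sql
-- Hypothetical stand-in for the output of step 1:
-- info_arr is assumed to be array<map<string,string>>
with parsed as (
  select 1 as user_id,
         array(map('field', 'name',   'value', 'Bob'),
               map('field', 'gender', 'value', 'M')) as info_arr
  union all
  select 2 as user_id,
         array(map('field', 'gender', 'value', 'F'),
               map('field', 'age',    'value', '22'),
               map('field', 'name',   'value', 'Ana')) as info_arr
)
select p.user_id,
       -- step 3: build "field:value" strings, join them with ',',
       -- and let str_to_map turn the result into a single map
       str_to_map(
         concat_ws(',', collect_list(concat(e.item['field'], ':', e.item['value']))),
         ',', ':'
       ) as user_info
from parsed p
-- step 2: one row per array element
lateral view explode(p.info_arr) e as item
group by p.user_id;
```

Two caveats with this approach: every value comes back as a string (so age would be '22', not 22), and it breaks if a field name or value contains ',' or ':'.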
I'm struggling to complete the first step: parse the string column to create an array of maps. Any help would be much appreciated :D