I have a Hive table that is ingested from system logs. The data is encoded in an unusual format: an array of maps, in which each element of the array contains a field name and its value. The column type is STRING, as in the example below:

select 1 as user_id, '[{"field":"name", "value":"Bob"}, {"field":"gender", "value":"M"}]' as user_info
union all
select 2 as user_id, '[{"field":"gender", "value":"F"}, {"field":"age", "value":22}, {"field":"name", "value":"Ana"}]' as user_info;

This produces:

user_id user_info
1 [{"field":"name", "value":"Bob"}, {"field":"gender", "value":"M"}]
2 [{"field":"gender", "value":"F"}, {"field":"age", "value":22}, {"field":"name", "value":"Ana"}]

Notice that the array size is not always the same. I'm trying to convert the array of maps to a single map per user. This is the result I expect:

user_id user_info
1 {"name":"Bob", "gender":"M"}
2 {"name":"Ana", "gender":"F", "age":22}

I was planning to get there in 3 steps: (1) parse the string column to create an array of maps, (2) explode the array (using lateral view), (3) collect the list of fields and group them by user_id.

I'm struggling to complete the first step: parse the string column to create an array of maps. Any help would be much appreciated :D

1 Answer

See the comments in the code. After the enclosing [ and ] are removed, split(user_info, '(?<=\\}) *, *(?=\\{)') produces an array of strings, one per {...} element. That array is then exploded, and each element is converted to a map with str_to_map.
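Before the full query, the split step can be checked in isolation. This is a minimal sketch that reuses the first sample row as a literal; the column alias elements is only illustrative. The lookbehind and lookahead make the regex match only the comma that sits between } and {, so the commas inside each element are left alone.

```sql
-- strip the enclosing [ and ], then split on the comma between } and {
select split(
         regexp_replace('[{"field":"name", "value":"Bob"}, {"field":"gender", "value":"M"}]',
                        '^\\[|\\]$', ''),
         '(?<=\\}) *, *(?=\\{)'
       ) as elements;
```

For this input the array should hold two strings, one per {...} element, which is what the lateral view in the full query explodes.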

with mydata as (
  select 1 as user_id, '[{"field":"name", "value":"Bob"}, {"field":"gender", "value":"M"}]' as user_info
  union all
  select 2 as user_id, '[{"field":"gender", "value":"F"}, {"field":"age", "value":22}, {"field":"name", "value":"Ana"}]' as user_info
)

select user_id,
       -- build the new map; gender and age are appended only when present
       str_to_map(concat('name:', name,
                         nvl(concat(',', 'gender:', gender), ''),
                         nvl(concat(',', 'age:', age), ''))) as user_info
from (
  select user_id,
         -- get name, gender, age; aggregate to one row per user_id
         max(case when user_info['field'] = 'name' then user_info['value'] end) as name,
         max(case when user_info['field'] = 'gender' then user_info['value'] end) as gender,
         max(case when user_info['field'] = 'age' then user_info['value'] end) as age
  from (
    select s.user_id,
           -- remove {, } and ", then convert to a map
           str_to_map(regexp_replace(e.element, '^\\{| *"|\\}$', '')) as user_info
    from (
      select user_id,
             regexp_replace(user_info, '^\\[|\\]$', '') as user_info -- remove [ and ]
      from mydata
    ) s
    -- split on the comma between } and {, with optional spaces around it
    lateral view outer explode(split(user_info, '(?<=\\}) *, *(?=\\{)')) e as element
  ) m
  group by user_id
) u;

Result (str_to_map returns map<string,string>, so age comes back as the string "22"):

user_id   user_info 
1        {"name":"Bob","gender":"M"}
2        {"name":"Ana","gender":"F","age":"22"}
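The query above hardcodes the three field names. If the set of fields is not known in advance, one possible variation is to rebuild the key:value pairs generically and aggregate them per user. This is a sketch, not a tested replacement: it assumes a Hive version that provides collect_list, that no field name or value contains ',' or ':' (the default str_to_map delimiters), and that every row has at least one element. The aliases kv and t are illustrative.

```sql
with mydata as (
  select 1 as user_id, '[{"field":"name", "value":"Bob"}, {"field":"gender", "value":"M"}]' as user_info
  union all
  select 2 as user_id, '[{"field":"gender", "value":"F"}, {"field":"age", "value":22}, {"field":"name", "value":"Ana"}]' as user_info
)

select user_id,
       -- join all 'field:value' pairs of a user with commas, then parse them into one map
       str_to_map(concat_ws(',', collect_list(concat(kv['field'], ':', kv['value'])))) as user_info
from (
  select s.user_id,
         -- same per-element parsing as in the query above
         str_to_map(regexp_replace(e.element, '^\\{| *"|\\}$', '')) as kv
  from (
    select user_id,
           regexp_replace(user_info, '^\\[|\\]$', '') as user_info -- remove [ and ]
    from mydata
  ) s
  lateral view outer explode(split(user_info, '(?<=\\}) *, *(?=\\{)')) e as element
) t
group by user_id;
```

The order of keys in the resulting map is not guaranteed, and all values are still strings.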