I have a field called geo_data_display which contains country, region and DMA. Each of the 3 values sits between an "=" and the "&" that follows it: country between the first "=" and the first "&", region between the second "=" and the second "&", and DMA between the third "=" and the third "&". Here's a reproducible version of the table. Country is always a character value, but region and DMA can be either numeric or character, and DMA doesn't exist for all countries.

A few sample values are:

country=us&region=tx&dma=625&domain=abc.net&zipcodes=76549
country=us&region=ca&dma=803&domain=abc.com&zipcodes=90404 
country=tw&region=hsz&domain=hinet.net&zipcodes=300
country=jp&region=1&dma=a&domain=hinet.net&zipcodes=300  

I have some sample SQL, but the geo_dma line isn't working at all and the geo_region line only works for character values:

SELECT 

UPPER(REGEXP_REPLACE(split(geo_data_display, '\\&')[0], 'country=', '')) AS geo_country
,UPPER(split(split(geo_data_display, '\\&')[1],'\\=')[1]) AS geo_region
,split(split(cast(geo_data_display as int), '\\&')[2],'\\=')[2] AS geo_dma
FROM mytable

3 Answers

2

You can use str_to_map like so:

select  geo_map['country']  as geo_country
       ,geo_map['region']   as geo_region
       ,geo_map['dma']      as geo_dma

from   (select  str_to_map(geo_data_display,'&','=')    as geo_map
        from    mytable
        ) t
;

+--------------+-------------+----------+
| geo_country  | geo_region  | geo_dma  |
+--------------+-------------+----------+
| us           | tx          | 625      |
| us           | ca          | 803      |
| tw           | hsz         | NULL     |
| jp           | 1           | a        |
+--------------+-------------+----------+
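For anyone who wants to try this without the real table, here is a minimal sketch of the same query against inline rows. The CTE name geo_sample is made up for illustration and stands in for mytable, and the upper() calls are only there because the question upper-cases country and region. It assumes a Hive version that supports CTEs and a SELECT without a FROM clause.

```sql
-- Sketch only: geo_sample is a hypothetical stand-in for mytable.
with geo_sample as (
    select 'country=us&region=tx&dma=625&domain=abc.net&zipcodes=76549' as geo_data_display
    union all
    select 'country=tw&region=hsz&domain=hinet.net&zipcodes=300' as geo_data_display
)
select  upper(geo_map['country'])  as geo_country
       ,upper(geo_map['region'])   as geo_region
       ,geo_map['dma']             as geo_dma   -- a key that is absent from the map comes back as NULL

from   (select  str_to_map(geo_data_display,'&','=')    as geo_map  -- '&' separates pairs, '=' separates key from value
        from    geo_sample
        ) t
;
```

Because the lookup is by key name rather than by position, a missing dma does not shift the other columns.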

1

Source

regexp_extract(string subject, string pattern, int index)

Returns the string extracted using the pattern. For example, regexp_extract('foothebar', 'foo(.*?)(bar)', 1) returns 'the'

select 
      regexp_extract(geo_data_display, 'country=(.*?)(&region)', 1) as geo_country,
      regexp_extract(geo_data_display, 'region=(.*?)(&dma)', 1)     as geo_region,
      regexp_extract(geo_data_display, 'dma=(.*?)(&domain)', 1)     as geo_dma
from mytable;

1 Comment

Overcomplicated and returns wrong results when DMA doesn't exist.
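If you want to stay with regexp_extract, one possible way around the missing-DMA problem is to match each key on its own instead of anchoring it to the key that is expected to follow. This is only a sketch: as far as I know regexp_extract returns an empty string rather than NULL when nothing matches, so the nullif wrapper is there to turn that into NULL, and it assumes a Hive version that provides nullif (a case expression would do the same job otherwise).

```sql
-- Sketch: each key is matched independently, so a row without dma still yields its region.
-- (^|&) requires the key to start the string or follow an '&'; [^&]* takes everything up to the next '&'.
select  nullif(regexp_extract(geo_data_display, '(^|&)country=([^&]*)', 2), '')  as geo_country
       ,nullif(regexp_extract(geo_data_display, '(^|&)region=([^&]*)', 2), '')   as geo_region
       ,nullif(regexp_extract(geo_data_display, '(^|&)dma=([^&]*)', 2), '')      as geo_dma
from    mytable
;
```

Note that nullif also maps a key that is present but empty (for example dma= followed directly by &) to NULL, which may or may not be what you want.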
0

Please try the following:

create table ch8(details map<string,string>)
row format delimited
collection items terminated by '&'
map keys terminated by '=';

Load the data into the table.
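For example, if the sample values are saved one per line in a local text file, the load step could look like the sketch below. The path /tmp/geo_data.txt is a placeholder I made up; substitute wherever your file actually lives.

```sql
-- Sketch: /tmp/geo_data.txt is a hypothetical local file with one geo_data_display value per line.
load data local inpath '/tmp/geo_data.txt' into table ch8;
```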

Then create another table using CTAS:

create table ch9 as
select details["country"]  as country,
       details["region"]   as region,
       details["dma"]      as dma,
       details["domain"]   as domain,
       details["zipcodes"] as zipcode
from ch8;

Select * from ch9;
