
I would like to transform a column of type

   array(map(varchar, varchar))

into strings, as rows of a table, on Presto DB via PySpark Hive SQL, run programmatically from a Jupyter notebook (Python 3).

example

user_id     sport_ids
 'aca'       [ {'sport_id': '5818'}, {'sport_id': '6712'}, {'sport_id': '1065'} ]

expected results

  user_id     sport_ids
  'aca'       '5818'
  'aca'       '6712'
  'aca'       '1065'

I have tried

     sql_q = """
            select distinct user_id, transform(sport_ids, x -> element_at(x, 'sport_id'))
            from tab """

     spark.sql(sql_q)

but got error:

   '->' cannot be resolved  

I have also tried

     sql_q = """
            select distinct user_id, sport_ids
            from tab"""

     spark.sql(sql_q)

but got error:

    org.apache.spark.sql.AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column request_features[0] is map<string,string>;;

Did I miss something?

I have also looked at the related questions "hive convert array<map<string, string>> to string" and "Extract map(varchar, array(varchar)) - Hive SQL", but they were not helpful.

thanks

2 Answers


Let's use higher-order functions to extract the map values and explode them into individual rows:

from pyspark.sql.functions import explode, expr

df.withColumn('sport_ids', explode(expr("transform(sport_ids, x -> map_values(x)[0])"))).show()


+-------+---------+
|user_id|sport_ids|
+-------+---------+
|    aca|     5818|
|    aca|     6712|
|    aca|     1065|
+-------+---------+
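To see what the transform(sport_ids, x -> map_values(x)[0]) step followed by explode does, here is a plain-Python sketch of the same logic on one row (an illustration only, not Spark itself):

```python
# Plain-Python illustration of transform(sport_ids, x -> map_values(x)[0])
# followed by explode, applied to a single row (not actual Spark code).
row = ("aca", [{"sport_id": "5818"}, {"sport_id": "6712"}, {"sport_id": "1065"}])

user_id, sport_ids = row
# transform(..., x -> map_values(x)[0]): take the first value of each map
values = [list(m.values())[0] for m in sport_ids]
# explode: one output row per extracted value
exploded = [(user_id, v) for v in values]
print(exploded)  # [('aca', '5818'), ('aca', '6712'), ('aca', '1065')]
```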

5 Comments

thanks for your reply. But I cannot generate a 'df' with the sql query due to the error "Cannot have map type columns in DataFrame".
Not sure I understand what you mean. An alternative would be df.selectExpr('user_id', "explode(transform(sport_ids, x -> map_values(x)[0]))").show()
in your code df.selectExpr('user_id', "explode(transform(sport_ids, x -> map_values(x)[0]))").show(), how do I get 'df'? thanks
I created the df as follows: df = spark.createDataFrame([('aca', [{'sport_id': '5818'}, {'sport_id': '6712'}, {'sport_id': '1065'}])], ('user_id', 'sport_ids')). For you, save your example into a df. For instance, you can say df = spark.sql(""" select user_id, sport_ids from tab """)
thanks, but I have to get the original data from the table by running the query "select distinct user_id, sport_ids from tab" in a Jupyter notebook. It accesses the Presto DB and returns the "df". But I got the error "Cannot have map type columns in DataFrame" as in the OP.
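One way around the "Cannot have map type columns" restriction (a sketch of the idea, not something stated in the thread) is to drop distinct from the SQL, explode the map column first, and only then deduplicate the resulting scalar rows, e.g. with dropDuplicates() in Spark. The idea in plain Python:

```python
# Sketch of the workaround: skip SELECT DISTINCT on the map column,
# flatten/explode first, then deduplicate the scalar rows (plain Python, not Spark).
rows = [
    ("aca", [{"sport_id": "5818"}, {"sport_id": "6712"}, {"sport_id": "1065"}]),
    ("aca", [{"sport_id": "5818"}, {"sport_id": "6712"}, {"sport_id": "1065"}]),  # duplicate row
]

# explode: one (user_id, sport_id) tuple per map entry
flat = [(u, m["sport_id"]) for u, maps in rows for m in maps]
# deduplicate only after exploding, once the column is a plain string
deduped = sorted(set(flat))
print(deduped)  # [('aca', '1065'), ('aca', '5818'), ('aca', '6712')]
```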

You can process the JSON data on the Presto side: json_parse, cast to array(json), and json_extract_scalar (see here for more JSON functions), then flatten with unnest:

-- sample data
WITH dataset(user_id, sport_ids) AS (
    VALUES 
        ('aca', '[ {"sport_id": "5818"}, {"sport_id": "6712"}, {"sport_id": "1065"} ]')
) 

-- query
select user_id,
    json_extract_scalar(record, '$.sport_id') sport_id
from dataset,
    unnest(cast(json_parse(sport_ids) as array(json))) as t(record)

Output:

user_id sport_id
aca 5818
aca 6712
aca 1065
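The Presto pipeline above (json_parse, cast to array(json), unnest, json_extract_scalar) can be mimicked in plain Python with the json module to check the expected output; this is a sketch over the sample data, not Presto itself:

```python
import json

# Mimic the Presto steps in plain Python (illustration, not Presto):
#   json_parse + cast to array(json)   ->  json.loads
#   unnest                             ->  iterate over the list
#   json_extract_scalar('$.sport_id')  ->  r["sport_id"]
user_id = "aca"
sport_ids = '[ {"sport_id": "5818"}, {"sport_id": "6712"}, {"sport_id": "1065"} ]'

records = json.loads(sport_ids)
result = [(user_id, r["sport_id"]) for r in records]
print(result)  # [('aca', '5818'), ('aca', '6712'), ('aca', '1065')]
```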

