
I would like to transform a column of type

   array(map(varchar, varchar))

into strings, as rows of a table, on Presto DB via PySpark Hive SQL, run programmatically from a Jupyter notebook (Python 3).

example

user_id     sport_ids
 'aca'       [ {'sport_id': '5818'}, {'sport_id': '6712'}, {'sport_id': '1065'} ]

expected results

  user_id     sport_ids
  'aca'       '5818'
  'aca'       '6712'
  'aca'       '1065'

I have tried

     sql_q = """
            select distinct user_id, transform(sport_ids, x -> element_at(x, 'sport_id'))
            from tab """

     spark.sql(sql_q)

but got error:

   '->' cannot be resolved  

I have also tried

     sql_q = """
            select distinct user_id, sport_ids
            from tab"""

     spark.sql(sql_q)

but got error:

    org.apache.spark.sql.AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column request_features[0] is map<string,string>;;

Did I miss something?

I have also looked at the related questions "hive convert array<map<string, string>> to string" and "Extract map(varchar, array(varchar)) - Hive SQL", but they were not helpful.

thanks

2 Answers


Let's use higher-order functions to extract the map values and explode them into individual rows:

from pyspark.sql.functions import explode, expr

df.withColumn('sport_ids', explode(expr("transform(sport_ids, x -> map_values(x)[0])"))).show()


+-------+---------+
|user_id|sport_ids|
+-------+---------+
|    aca|     5818|
|    aca|     6712|
|    aca|     1065|
+-------+---------+
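To see what the transform(sport_ids, x -> map_values(x)[0]) step followed by explode does, here is a plain-Python sketch of the same logic on one row (an illustration only, not Spark itself):

```python
# Plain-Python illustration of transform(sport_ids, x -> map_values(x)[0])
# followed by explode, applied to a single row (not actual Spark code).
row = ("aca", [{"sport_id": "5818"}, {"sport_id": "6712"}, {"sport_id": "1065"}])

user_id, sport_ids = row
# transform(..., x -> map_values(x)[0]): take the first value of each map
values = [list(m.values())[0] for m in sport_ids]
# explode: one output row per extracted value
exploded = [(user_id, v) for v in values]
print(exploded)  # [('aca', '5818'), ('aca', '6712'), ('aca', '1065')]
```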

5 Comments

thanks for your reply. But I cannot generate a 'df' with the sql query due to the error "Cannot have map type columns in DataFrame".
Not sure I understand what you mean. An alternative would be df.selectExpr('user_id', "explode(transform(sport_ids, x -> map_values(x)[0]))").show()
in your code df.selectExpr('user_id', "explode(transform(sport_ids, x -> map_values(x)[0]))").show(), how do I get 'df'? thanks
I created the df as follows: df = spark.createDataFrame([('aca', [{'sport_id': '5818'}, {'sport_id': '6712'}, {'sport_id': '1065'}])], ('user_id', 'sport_ids')). For you, save your example into a df. For instance, you can say df = spark.sql(""" select user_id, sport_ids from tab """)
thanks, but I have to get the original data from the table by running the query "select distinct user_id, sport_ids from tab" in a Jupyter notebook. It accesses the Presto DB and returns the "df". But I got the error "Cannot have map type columns in DataFrame" as in the OP.
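One way around the "Cannot have map type columns" restriction (a sketch of the idea, not something stated in the thread) is to drop distinct from the SQL, explode the map column first, and only then deduplicate the resulting scalar rows, e.g. with dropDuplicates() in Spark. The idea in plain Python:

```python
# Sketch of the workaround: skip SELECT DISTINCT on the map column,
# flatten/explode first, then deduplicate the scalar rows (plain Python, not Spark).
rows = [
    ("aca", [{"sport_id": "5818"}, {"sport_id": "6712"}, {"sport_id": "1065"}]),
    ("aca", [{"sport_id": "5818"}, {"sport_id": "6712"}, {"sport_id": "1065"}]),  # duplicate row
]

# explode: one (user_id, sport_id) tuple per map entry
flat = [(u, m["sport_id"]) for u, maps in rows for m in maps]
# deduplicate only after exploding, once the column is a plain string
deduped = sorted(set(flat))
print(deduped)  # [('aca', '1065'), ('aca', '5818'), ('aca', '6712')]
```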

You can process the JSON data on the Presto side: json_parse, cast to array(json), and json_extract_scalar (see here for more JSON functions), then flatten with unnest:

-- sample data
WITH dataset(user_id, sport_ids) AS (
    VALUES 
        ('aca', '[ {"sport_id": "5818"}, {"sport_id": "6712"}, {"sport_id": "1065"} ]')
) 

-- query
select user_id,
    json_extract_scalar(record, '$.sport_id') sport_id
from dataset,
    unnest(cast(json_parse(sport_ids) as array(json))) as t(record)

Output:

user_id sport_id
aca 5818
aca 6712
aca 1065
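The Presto pipeline above (json_parse, cast to array(json), unnest, json_extract_scalar) can be mimicked in plain Python with the json module to check the expected output; this is a sketch over the sample data, not Presto itself:

```python
import json

# Mimic the Presto steps in plain Python (illustration, not Presto):
#   json_parse + cast to array(json)   ->  json.loads
#   unnest                             ->  iterate over the list
#   json_extract_scalar('$.sport_id')  ->  r["sport_id"]
user_id = "aca"
sport_ids = '[ {"sport_id": "5818"}, {"sport_id": "6712"}, {"sport_id": "1065"} ]'

records = json.loads(sport_ids)
result = [(user_id, r["sport_id"]) for r in records]
print(result)  # [('aca', '5818'), ('aca', '6712'), ('aca', '1065')]
```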

