Extract and explode inner nested element as rows from string nested structure

Question

I would like to explode a column to rows in a dataframe on pyspark hive. There are two columns in the dataframe. The column "business_id" is a string. The column "sports_info" is a struct type, each element value is an array of string.

Data:

business_id     sports_info
 "abc-123"       {"sports_type":
                  ["{sport_name:most_recent,
                    sport_events:[{sport_id:568, val:10.827},{id:171,score:8.61}]}"
                   ]
                 }

I need to get a dataframe like:

 business_id.    sport_id
  "abc-123"       568
  "abc-123"       171

I defined:

schema = StructType([ \
    StructField("sports_type",ArrayType(),True)
])
 
df = spark.createDataFrame(data=data, schema=schema) # I am not sure how to create the df
df.printSchema()
df.show(truncate=False)

def get_ids(val):
    sports_type = 'sports_type'
    sport_events = 'sport_events'
    sport_id = 'sport_id'
    sport_ids_vals = eval(val.sports_type[0])['sport_events']
    ids = [s['sport_id'] for s in sport_ids_scores]
    return ids

df2 = df.withColumn('sport_new', F.udf(lambda x: get_ids(x), 
                                         ArrayType(ArrayType(StringType())))('sports_info'))

How could I create the df and extract/explode the inner nested elements?

Are you sure you need id=171? It is id, not sport_id. Your column will be named sport_id — ZygD
– ZygD, Commented Jun 27, 2022 at 23:28

Matt Andruff · Accepted Answer · 2022-06-28 13:45:04Z

1

df2 = df.withColumn('sport_new', expr("transform (sports_type, x -> regexp_extract( x, 'sport_id:([0-9]+)',1))")).show()

Explained:

expr( #use a SQL expression, only way to access transform (pre spark 3)
 "transform ( # run a SQL function on an array
  sports_type, # declare column to use
  x # declare the name of the variable to use for each element in the array
  -> # Start writing SQL code to run on each element in the array
  regexp_extract( # user SQL regex functions to pull out from the string
  x, #string to run regex on
  'sport_id:([0-9]+)',1))" # find sport_id and capture the number following it.
  )

THis will likely run faster than a UDF as it can be vectorized.

answered Jun 28, 2022 at 13:45

Matt Andruff

5,1901 gold badge7 silver badges25 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

Extract and explode inner nested element as rows from string nested structure

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related