1

I would like to explode a column to rows in a dataframe on pyspark hive. There are two columns in the dataframe. The column "business_id" is a string. The column "sports_info" is a struct type, each element value is an array of string.

Data:

business_id     sports_info
 "abc-123"       {"sports_type":
                  ["{sport_name:most_recent,
                    sport_events:[{sport_id:568, val:10.827},{id:171,score:8.61}]}"
                   ]
                 }

I need to get a dataframe like:

 business_id.    sport_id
  "abc-123"       568
  "abc-123"       171

I defined:

schema = StructType([ \
    StructField("sports_type",ArrayType(),True)
])
 
df = spark.createDataFrame(data=data, schema=schema) # I am not sure how to create the df
df.printSchema()
df.show(truncate=False)

def get_ids(val):
    sports_type = 'sports_type'
    sport_events = 'sport_events'
    sport_id = 'sport_id'
    sport_ids_vals = eval(val.sports_type[0])['sport_events']
    ids = [s['sport_id'] for s in sport_ids_scores]
    return ids

df2 = df.withColumn('sport_new', F.udf(lambda x: get_ids(x), 
                                         ArrayType(ArrayType(StringType())))('sports_info'))

How could I create the df and extract/explode the inner nested elements?

1
  • Are you sure you need id=171? It is id, not sport_id. Your column will be named sport_id Commented Jun 27, 2022 at 23:28

1 Answer 1

1
df2 = df.withColumn('sport_new', expr("transform (sports_type, x -> regexp_extract( x, 'sport_id:([0-9]+)',1))")).show()

Explained:

expr( #use a SQL expression, only way to access transform (pre spark 3)
 "transform ( # run a SQL function on an array
  sports_type, # declare column to use
  x # declare the name of the variable to use for each element in the array
  -> # Start writing SQL code to run on each element in the array
  regexp_extract( # user SQL regex functions to pull out from the string
  x, #string to run regex on
  'sport_id:([0-9]+)',1))" # find sport_id and capture the number following it.
  )

THis will likely run faster than a UDF as it can be vectorized.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.