0

i've got a databricks dataframe with a load of questionnaire results in them, the questionnaires on the df vary in length and do not always the same questions.

How can I get the questions and answers out of the string field Responses so i have a 3 column list of "CustomerID, Questions, Answers"

So from this

CustomerID     Responses
1              [{"question1":"answer 1"},{"question 2":"answer2"}]
2              [{"question1":"answer 1a"},{"question 2":"answer2b"}]
3              [{"question1":"answer 1b"},{"question 3":"answer3"}]

to resolve to

CustomerID   Questions  Answers
1            question1  answer1
1            question2  answer2
2            question1  answer1a
2            question2  answer2b 
3            question1  answer1b 
3            question3  answer3 

2 Answers 2

1

Since your data is String-based, not JSON-based, you'd have to parse it first with a schema, then you will be able to explode it

from pyspark.sql import functions as F
from pyspark.sql import types as T

(df
    .withColumn('Responses', F.from_json('Responses', T.ArrayType(T.MapType(T.StringType(), T.StringType()))))
    .withColumn('Response', F.explode('Responses'))
    .withColumn('Question', F.map_keys('Response')[0])
    .withColumn('Answer', F.map_values('Response')[0])
    .drop('Responses', 'Response')
    .show(10, False)
)

# Output
# +----------+----------+---------+
# |CustomerID|Question  |Answer   |
# +----------+----------+---------+
# |1         |question1 |answer 1 |
# |1         |question 2|answer2  |
# |2         |question1 |answer 1a|
# |2         |question 2|answer2b |
# |3         |question1 |answer 1b|
# |3         |question 3|answer3  |
# +----------+----------+---------+
Sign up to request clarification or add additional context in comments.

1 Comment

Thank you, yes that's exactly what I needed
0
from pyspark.sql.functions import explode

df2 = df.withColumn("firstExplode", explode(df.responses))
final_df = df2.select(df2.customerId, explode(df.firstExplode))

1 Comment

Thanks for response the data held in the responses is currently a string, although it is in the format as above, is there an easy way to cast as an array? I get this error "cannot resolve 'explode(`responses`)' due to data type mismatch: input to function explode should be array or map type, not string;;"

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.