
My list of (tuples of) JSON values looks as follows:

testJson = [('{"drivernumber":1, "speed" : ["30.5", "40", "50", "25.25"]}',),
            ('{"drivernumber":2, "speed" : ["25.25", "10.11", "11", "50"]}',),
            ('{"drivernumber":3, "speed" : ["40", "50", "80", "42"]}',)
           ]

I defined the following schema:

from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, FloatType
# "speed" is a JSON array, so it is declared as an ArrayType (StringType takes no arguments)
readSchema = StructType([
                   StructField("drivernumber", IntegerType(), True),
                   StructField("speed", ArrayType(FloatType(), True), True)])

Then I created a DataFrame:

df = (spark.read.schema(readSchema).json(sc.parallelize(testJson)))
display(df)

Ultimately, I need the output below, but at the moment my DataFrame (after the step above) contains only NULLs, and I don't know why. Any leads or tips would be much appreciated. Thank you :)

speed  drivercount
50          3
40          2
25.25       2
11          1
....        ....
  • Is there a reason you need a list of one-element tuples? Commented Mar 5, 2020 at 19:50
  • Hi @JohnGordon - Just that data is passed by the broker in this format. Commented Mar 5, 2020 at 19:55

1 Answer

You don't have to define a schema for this; simply let Spark infer it:

df = spark.read.json(sc.parallelize(testJson))
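
To sanity-check the input and the target output independently of Spark, here is a minimal plain-Python sketch. It is not the Spark solution itself: it only uses the standard library, it assumes every tuple holds exactly one JSON string, and the helper name driver_count_by_speed is made up for illustration. It also makes visible that the speeds arrive as JSON strings rather than numbers, which is worth keeping in mind if you do declare a schema.

```python
import json
from collections import defaultdict

testJson = [('{"drivernumber":1, "speed" : ["30.5", "40", "50", "25.25"]}',),
            ('{"drivernumber":2, "speed" : ["25.25", "10.11", "11", "50"]}',),
            ('{"drivernumber":3, "speed" : ["40", "50", "80", "42"]}',)
           ]


def driver_count_by_speed(rows):
    """Hypothetical helper: count the distinct drivers seen at each speed."""
    drivers = defaultdict(set)
    for (raw,) in rows:  # unwrap the one-element tuple to get the JSON string
        record = json.loads(raw)
        for speed in record["speed"]:  # speeds are strings such as "30.5"
            drivers[speed].add(record["drivernumber"])
    # Sort by driver count (descending), then by numeric speed (descending).
    return sorted(
        ((speed, len(ids)) for speed, ids in drivers.items()),
        key=lambda pair: (-pair[1], -float(pair[0])),
    )


result = driver_count_by_speed(testJson)
for speed, drivercount in result:
    print(speed, drivercount)
```

The counts agree with the rows listed in the question (50 with 3 drivers, 40 and 25.25 with 2 each, 11 with 1). The sort order within equal counts is just a choice made here, since the question does not specify one.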