How to convert a JSON string to a JSON object in PySpark

Question

One column of my DataFrame has type string, but it actually contains JSON objects that follow four different schemas with a few fields in common. I need to convert it into a JSON object.

Here is the schema of the DataFrame:

query.printSchema()

root
 |-- test: string (nullable = true)

The values in the DataFrame look like this:

query.show(10)

+--------------------+
|                test|
+--------------------+
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"Interaction":{"...|
|{"PurchaseActivit...|
|{"Interaction":{"...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
+--------------------+
only showing top 10 rows

The solution I applied:

Write the DataFrame to a text file:

query.write.format("text").mode('overwrite').save("s3://bucketname/temp/")

Read it back as JSON:

df = spark.read.json("s3a://bucketname/temp/")

Now print the schema. The JSON string in each row has been converted into a JSON object:

df.printSchema()

root
 |-- EventDate: string (nullable = true)
 |-- EventId: string (nullable = true)
 |-- EventNotificationType: long (nullable = true)
 |-- Interaction: struct (nullable = true)
 |    |-- ContextId: string (nullable = true)
 |    |-- Created: string (nullable = true)
 |    |-- Description: string (nullable = true)
 |    |-- Id: string (nullable = true)
 |    |-- ModelContextId: string (nullable = true)
 |-- PurchaseActivity: struct (nullable = true)
 |    |-- BillingCity: string (nullable = true)
 |    |-- BillingCountry: string (nullable = true)
 |    |-- ShippingAndHandlingAmount: double (nullable = true)
 |    |-- ShippingDiscountAmount: double (nullable = true)
 |    |-- SubscriberId: long (nullable = true)
 |    |-- SubscriptionOriginalEndDate: string (nullable = true)
 |-- SubscriptionChurn: struct (nullable = true)
 |    |-- PaymentTypeCode: long (nullable = true)
 |    |-- PaymentTypeName: string (nullable = true)
 |    |-- PreviousPaidAmount: double (nullable = true)
 |    |-- SubscriptionRemoved: string (nullable = true)
 |    |-- SubscriptionStartDate: string (nullable = true)
 |-- TransactionDetail: struct (nullable = true)
 |    |-- Amount: double (nullable = true)
 |    |-- OrderShipToCountry: string (nullable = true)
 |    |-- PayPalUserName: string (nullable = true)
 |    |-- PaymentSubTypeCode: long (nullable = true)
 |    |-- PaymentSubTypeName: string (nullable = true)

Is there a better way, so that I don't need to write the DataFrame as a text file and read it back as a JSON file to get the expected output?

Have you tried the solution in this answer or perhaps this answer? – pault, commented Apr 12, 2018 at 1:09

Ibnu Akbar · Accepted Answer · 2018-12-31 05:00:31Z

You can use from_json() before you write to the text file, but you need to define the schema first.
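As a rough sketch of what "define the schema first" could look like, the snippet below writes the schema as a DDL-formatted string, which from_json() accepts in Spark 2.3 and later in place of a StructType. The field names and types are transcribed from the printSchema() output in the question; whether that output covers every field in the real data is an assumption, so treat this as a starting point rather than the definitive schema.

```python
# Hypothetical schema, transcribed from the printSchema() output in the question.
# Spark's "long" is written as BIGINT in DDL, and "double" as DOUBLE.
schema = ", ".join([
    "EventDate STRING",
    "EventId STRING",
    "EventNotificationType BIGINT",
    "Interaction STRUCT<ContextId: STRING, Created: STRING, Description: STRING, "
    "Id: STRING, ModelContextId: STRING>",
    "PurchaseActivity STRUCT<BillingCity: STRING, BillingCountry: STRING, "
    "ShippingAndHandlingAmount: DOUBLE, ShippingDiscountAmount: DOUBLE, "
    "SubscriberId: BIGINT, SubscriptionOriginalEndDate: STRING>",
    "SubscriptionChurn STRUCT<PaymentTypeCode: BIGINT, PaymentTypeName: STRING, "
    "PreviousPaidAmount: DOUBLE, SubscriptionRemoved: STRING, "
    "SubscriptionStartDate: STRING>",
    "TransactionDetail STRUCT<Amount: DOUBLE, OrderShipToCountry: STRING, "
    "PayPalUserName: STRING, PaymentSubTypeCode: BIGINT, PaymentSubTypeName: STRING>",
])
```

Because every event type is described by one combined schema, a row that holds, say, an Interaction event simply gets null for the PurchaseActivity, SubscriptionChurn and TransactionDetail structs.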

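The code looks like this:

from pyspark.sql.functions import from_json

# Parse the JSON string in "test" with the predefined schema, then expand the struct into top-level columns.
data = query.select(from_json("test", schema=schema).alias("value")).selectExpr("value.*")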

data.write.format("text").mode('overwrite').save("s3://bucketname/temp/")
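If writing the schema by hand is impractical, one alternative (not part of the answer above) is to let Spark infer it from the strings directly, which avoids the round trip through a text file. The sketch below assumes a classic SparkSession where the DataFrame exposes .rdd; the helper name parse_json_column is made up for illustration. It relies on spark.read.json() accepting an RDD of JSON strings as well as a path.

```python
def parse_json_column(spark, df, column="test"):
    """Parse a string column of JSON documents into a DataFrame with an inferred schema.

    `spark` is an active SparkSession and `df` is the DataFrame holding the
    JSON strings (`query` in the question). Hypothetical helper, for illustration.
    """
    # Pull the raw JSON string out of each row.
    json_strings = df.rdd.map(lambda row: row[column])
    # spark.read.json() also accepts an RDD of strings, so no temporary file is needed.
    return spark.read.json(json_strings)
```

Usage would be along the lines of df = parse_json_column(spark, query) followed by df.printSchema(). Schema inference makes an extra pass over the data, so for large inputs an explicit schema with from_json() may still be the cheaper option.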
