2

I have one of column type of data frame is string but actually it is containing json object of 4 schema where few fields are common. I need to convert that into jason object.

Here is schema of data frame :

query.printSchema()

root
 |-- test: string (nullable = true)

value of DF looks like

query.show(10)

+--------------------+
|                test|
+--------------------+
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"Interaction":{"...|
|{"PurchaseActivit...|
|{"Interaction":{"...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
|{"PurchaseActivit...|
+--------------------+
only showing top 10 rows

What solution i applied ::

  1. write into text file

query.write.format("text").mode('overwrite').save("s3://bucketname/temp/")

  1. read as json

df = spark.read.json("s3a://bucketname/temp/")

  1. now print Schema, It is json string for each row already converted into json object

df.printSchema()

root
 |-- EventDate: string (nullable = true)
 |-- EventId: string (nullable = true)
 |-- EventNotificationType: long (nullable = true)
 |-- Interaction: struct (nullable = true)
 |    |-- ContextId: string (nullable = true)
 |    |-- Created: string (nullable = true)
 |    |-- Description: string (nullable = true)
 |    |-- Id: string (nullable = true)
 |    |-- ModelContextId: string (nullable = true)
 |-- PurchaseActivity: struct (nullable = true)
 |    |-- BillingCity: string (nullable = true)
 |    |-- BillingCountry: string (nullable = true)
 |    |-- ShippingAndHandlingAmount: double (nullable = true)
 |    |-- ShippingDiscountAmount: double (nullable = true)
 |    |-- SubscriberId: long (nullable = true)
 |    |-- SubscriptionOriginalEndDate: string (nullable = true)
 |-- SubscriptionChurn: struct (nullable = true)
 |    |-- PaymentTypeCode: long (nullable = true)
 |    |-- PaymentTypeName: string (nullable = true)
 |    |-- PreviousPaidAmount: double (nullable = true)
 |    |-- SubscriptionRemoved: string (nullable = true)
 |    |-- SubscriptionStartDate: string (nullable = true)
 |-- TransactionDetail: struct (nullable = true)
 |    |-- Amount: double (nullable = true)
 |    |-- OrderShipToCountry: string (nullable = true)
 |    |-- PayPalUserName: string (nullable = true)
 |    |-- PaymentSubTypeCode: long (nullable = true)
 |    |-- PaymentSubTypeName: string (nullable = true)

Is there any best way where i don't need to write dataframe as text file and read it again as json file to get expected output

1
  • Have you tried the solution in this answer or perhaps this answer? Commented Apr 12, 2018 at 1:09

1 Answer 1

-1

You can use from_json() before you write into text file, but you need to define the schema first.

the code look like this :

data = query.select(from_json("test",schema=schema).alias("value")).selectExpr("value.*")

data.write.format("text").mode('overwrite').save("s3://bucketname/temp/")

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.