I have converted a DataFrame to JSON using toJSON in PySpark, which gives me each row as a JSON string, but I want to reformat the result a bit.

My code is given below:

from pyspark.sql import SparkSession

spark=SparkSession.builder.config("spark.sql.warehouse.dir", r"C:\spark\spark-warehouse").appName("TestApp").enableHiveSupport().getOrCreate()
sqlstring="SELECT lflow1.LeaseType as LeaseType, lflow1.Status as Status, lflow1.Property as property, lflow1.City as City, lesflow2.DealType as DealType, lesflow2.Area as Area, lflow1.Did as DID, lesflow2.MID as MID from lflow1, lesflow2  WHERE lflow1.Did = lesflow2.MID"

def queryBuilder(sqlval):
    df=spark.sql(sqlval)
    df.show()
    return df

result=queryBuilder(sqlstring)
resultlist=result.toJSON().collect()
print(resultlist)
print("Type of",type(resultlist))

After this, the output is:

[
    '{"LeaseType":"Offer to Lease","Status":"Fully Executed","property":"10230104","City":"Edmonton","DealType":"Renewal","Area":"2312","DID":"79cc3959ffc8403f943ff0e7e93584f8","MID":"79cc3959ffc8403f943ff0e7e93584f8"}',
    '{"LeaseType":"Offer to Renew","Status":"Fully Executed","property":"1040HAMI","City":"Vancouver","DealType":"Renewal","Area":"784","DID":"ecf922d0583247c0a4cb591bd4b3843e","MID":"ecf922d0583247c0a4cb591bd4b3843e"}', 
    '{"LeaseType":"Offer to Renew","Status":"Fully Executed","property":"1040HAMI","City":"Vancouver","DealType":"Renewal","Area":"2223","DID":"ecf922d0583247c0a4cb591bd4b3843e","MID":"ecf922d0583247c0a4cb591bd4b3843e"}', 
    '{"LeaseType":"Offer to Lease","Status":"Conditional","property":"106PORTW","City":"Toronto","DealType":"Renewal","Area":"2212","DID":"69c3af0527014fd99d1ccf156c0bce93","MID":"69c3af0527014fd99d1ccf156c0bce93"}', 
    '{"LeaseType":"Offer to Lease","Status":"Fully Executed","property":"106PORTW","City":"Toronto","DealType":"0","Area":"","DID":"04aedb01da5d44fead7e1315115c2530","MID":"04aedb01da5d44fead7e1315115c2530"}'
]

But I want to format this as a JSON array like the following (showing the first two rows as an example):

[
    {
        "LeaseType": "Offer to Lease",
        "Status": "Fully Executed",
        "property": "10230104",
        "City": "Edmonton",
        "DealType": "Renewal",
        "Area": "2312",
        "DID": "79cc3959ffc8403f943ff0e7e93584f8",
        "MID": "79cc3959ffc8403f943ff0e7e93584f8"
    },
    {
        "LeaseType": "Offer to Renew",
        "Status": "Fully Executed",
        "property": "1040HAMI",
        "City": "Vancouver",
        "DealType": "Renewal",
        "Area": "784",
        "DID": "ecf922d0583247c0a4cb591bd4b3843e",
        "MID": "ecf922d0583247c0a4cb591bd4b3843e"
    }
]

In other words, I want to omit the single quotes (') around each row.

2 Answers

import re
import json

resultlist = [
    '{"LeaseType":"Offer to Lease","Status":"Fully Executed","property":"10230104","City":"Edmonton","DealType":"Renewal","Area":"2312","DID":"79cc3959ffc8403f943ff0e7e93584f8","MID":"79cc3959ffc8403f943ff0e7e93584f8"}',
    '{"LeaseType":"Offer to Renew","Status":"Fully Executed","property":"1040HAMI","City":"Vancouver","DealType":"Renewal","Area":"784","DID":"ecf922d0583247c0a4cb591bd4b3843e","MID":"ecf922d0583247c0a4cb591bd4b3843e"}',
    '{"LeaseType":"Offer to Renew","Status":"Fully Executed","property":"1040HAMI","City":"Vancouver","DealType":"Renewal","Area":"2223","DID":"ecf922d0583247c0a4cb591bd4b3843e","MID":"ecf922d0583247c0a4cb591bd4b3843e"}',
    '{"LeaseType":"Offer to Lease","Status":"Conditional","property":"106PORTW","City":"Toronto","DealType":"Renewal","Area":"2212","DID":"69c3af0527014fd99d1ccf156c0bce93","MID":"69c3af0527014fd99d1ccf156c0bce93"}',
    '{"LeaseType":"Offer to Lease","Status":"Fully Executed","property":"106PORTW","City":"Toronto","DealType":"0","Area":"","DID":"04aedb01da5d44fead7e1315115c2530","MID":"04aedb01da5d44fead7e1315115c2530"}'
]

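# Strip the single quotes from the list's repr so it reads as a JSON array.
# Note: this breaks if any value itself contains an apostrophe.
data_to_dump = re.sub(r"'", "", str(resultlist))
# Parse that text, then serialize it again with indentation.
json_data = json.dumps(json.loads(data_to_dump), indent=4)
print(json_data)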

1 Comment

Don't use re module. Properly json.loads the json strings
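
To illustrate the point of this comment, here is a small sketch of why stripping quotes from str(resultlist) is fragile. The sample row with an apostrophe in its value is made up for the demonstration and does not come from the question's data.

```python
import json
import re

# Hypothetical row whose value contains an apostrophe (not from the original data).
rows = ['{"City":"St. John\'s","Area":"784"}']

# Quote-stripping approach: the apostrophe inside the value is removed too,
# and the backslash that repr() added in front of it is left behind.
stripped = re.sub(r"'", "", str(rows))
try:
    json.loads(stripped)
    regex_ok = True
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    regex_ok = False

# Parsing each row with json.loads keeps the value intact.
parsed = [json.loads(row) for row in rows]

print("quote stripping produced valid JSON:", regex_ok)
print("json.loads result:", parsed[0]["City"])
```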

You have a list of JSON strings, so if you want the entire list as one JSON block, you can load each string back into a Python dictionary and then serialize the whole list:

import json

resultlist_json = [json.loads(x) for x in resultlist] 
print(json.dumps(resultlist_json, sort_keys=True, indent=4))
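
If pretty-printing is not needed and the goal is only a single JSON array string, one possible alternative is to join the row strings directly, since each element returned by toJSON() is already a JSON object. This is a minimal sketch; the two shortened rows below are stand-ins for the real resultlist, not the actual data.

```python
import json

# Shortened stand-ins for the strings returned by result.toJSON().collect().
resultlist = [
    '{"LeaseType":"Offer to Lease","City":"Edmonton"}',
    '{"LeaseType":"Offer to Renew","City":"Vancouver"}',
]

# Each element is already a JSON object, so wrapping the comma-joined rows
# in brackets yields a JSON array without parsing anything.
json_array = "[" + ",".join(resultlist) + "]"

# Parsing it back confirms it matches the json.loads-per-row approach.
assert json.loads(json_array) == [json.loads(x) for x in resultlist]
print(json_array)
```

This skips the parse step, but it gives no control over indentation or key order; for formatted output, json.dumps with indent is still the way to go.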
