
I am trying to load the following data.json file into a Spark DataFrame:

{"positionmessage":{"callsign": "PPH1", "name": 0.0, "mmsi": 100}}
{"positionmessage":{"callsign": "PPH2", "name": 0.0, "mmsi": 200}}
{"positionmessage":{"callsign": "PPH3", "name": 0.0, "mmsi": 300}}

using the following code:

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StructField, StructType, StringType, IntegerType

appName = "PySpark Example - JSON file to Spark Data Frame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Create a schema for the dataframe
schema = StructType([
    StructField('callsign', StringType(), True),
    StructField('name', StringType(), True),
    StructField('mmsi', IntegerType(), True)
])

# Create data frame
json_file_path = "data.json"
df = spark.read.json(json_file_path, schema, multiLine=True)
print(df.schema)
print(df.head(3))

It prints [Row(callsign=None, name=None, mmsi=None)]. What am I doing wrong? I have set my environment variables in the system settings.

    Yes, thanks. I just started learning pyspark and exploring similarities/differences with Pandas. Commented May 19, 2020 at 6:29

1 Answer


Your JSON has a top-level positionmessage struct field that is missing from your schema.

Change the schema to include the struct field as shown below:

schema = StructType([
    StructField("positionmessage", StructType([
        StructField('callsign', StringType(), True),
        StructField('name', StringType(), True),
        StructField('mmsi', IntegerType(), True)
    ]))
])

spark.read.schema(schema).json("<path>") \
    .select("positionmessage.*") \
    .show()
#+--------+----+----+
#|callsign|name|mmsi|
#+--------+----+----+
#|    PPH1| 0.0| 100|
#|    PPH2| 0.0| 200|
#|    PPH3| 0.0| 300|
#+--------+----+----+
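
To see why the original flat schema produced all-null rows: Spark looked for callsign, name, and mmsi at the top level of each JSON object, but those keys live one level down, inside positionmessage. A quick check with Python's standard json module (no Spark needed) shows the shape of each line:

```python
import json

# One line from data.json
line = '{"positionmessage":{"callsign": "PPH1", "name": 0.0, "mmsi": 100}}'
record = json.loads(line)

# The top level has a single key, "positionmessage" ...
print(list(record.keys()))                    # ['positionmessage']

# ... and the fields the flat schema expected live underneath it.
print(record["positionmessage"]["callsign"])  # PPH1
```

Since none of the expected fields exist at the top level, Spark fills every column with null; nesting the schema inside a positionmessage StructField (or selecting "positionmessage.*") lines the schema up with the data.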

3 Comments

How do I load a timestamp (yyyy-MM-dd HH:mm:ss) when each row has an extra field, "timestamplast": "2019-08-01T00:00:01Z"?
Not sure what you mean by extra field. You can try adding the field to the schema, reading it as a string type, then using the to_timestamp function to get the desired format. If this doesn't work, please open a new question and tag me in it :) ..!
Just made a new question, "pyspark clean data within dataframe", where it is clearer what I am trying to do. Thanks. stackoverflow.com/questions/61899539/…
