
I am trying to load the following data.json file into a Spark DataFrame:

{"positionmessage":{"callsign": "PPH1", "name": 0.0, "mmsi": 100}}
{"positionmessage":{"callsign": "PPH2", "name": 0.0, "mmsi": 200}}
{"positionmessage":{"callsign": "PPH3", "name": 0.0, "mmsi": 300}}

using the following code:

from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StructField, StructType, StringType, IntegerType

appName = "PySpark Example - JSON file to Spark Data Frame"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

# Create a schema for the dataframe
schema = StructType([
    StructField('callsign', StringType(), True),
    StructField('name', StringType(), True),
    StructField('mmsi', IntegerType(), True)
])

# Create data frame
json_file_path = "data.json"
df = spark.read.json(json_file_path, schema, multiLine=True)
print(df.schema)
print(df.head(3))

It prints [Row(callsign=None, name=None, mmsi=None)]. What am I doing wrong? I have set my environment variables in the system settings.

    Yes, thanks. I just started learning pyspark and exploring similarities/differences with Pandas. Commented May 19, 2020 at 6:29

1 Answer


Your JSON has a top-level positionmessage struct field that is missing from your schema.

Change the schema to include the struct field as shown below:

schema = StructType([
    StructField("positionmessage", StructType([
        StructField('callsign', StringType(), True),
        StructField('name', StringType(), True),
        StructField('mmsi', IntegerType(), True)
    ]))
])

spark.read.schema(schema).json("<path>") \
    .select("positionmessage.*") \
    .show()
#+--------+----+----+
#|callsign|name|mmsi|
#+--------+----+----+
#|    PPH1| 0.0| 100|
#|    PPH2| 0.0| 200|
#|    PPH3| 0.0| 300|
#+--------+----+----+
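
To see why the original flat schema produced all-null rows: Spark looked for callsign, name, and mmsi at the top level of each JSON object, but those keys live one level down, inside positionmessage. A quick check with Python's standard json module (no Spark needed) shows the shape of each line:

```python
import json

# One line from data.json
line = '{"positionmessage":{"callsign": "PPH1", "name": 0.0, "mmsi": 100}}'
record = json.loads(line)

# The top level has a single key, "positionmessage" ...
print(list(record.keys()))                    # ['positionmessage']

# ... and the fields the flat schema expected live underneath it.
print(record["positionmessage"]["callsign"])  # PPH1
```

Since none of the expected fields exist at the top level, Spark fills every column with null; nesting the schema inside a positionmessage StructField (or selecting "positionmessage.*") lines the schema up with the data.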

3 Comments

How do I load a timestamp (yyyy-MM-dd HH:mm:ss) when each row has an extra field, "timestamplast": "2019-08-01T00:00:01Z"?
Not sure what you mean by extra field. You can try adding the field to the schema, reading it as a string type, then using the to_timestamp function to get the desired format. If this doesn't work, please open a new question and tag me in it :) ..!
Just made a new question, "pyspark clean data within dataframe", where it is clearer what I am trying to do. Thanks. stackoverflow.com/questions/61899539/…
