The DataFrameReader.json method takes an optional schema argument you can use here. If your schema is complex, the simplest solution is to reuse one inferred from a file that contains all the fields:
df_complete = spark.read.json("complete_file")
schema = df_complete.schema
df_with_missing = spark.read.json("df_with_missing", schema)
# or
# spark.read.schema(schema).json("df_with_missing")
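If re-reading the complete file on every run is too expensive, you can persist the inferred schema and rebuild it later. A minimal sketch (the schema.json file path is illustrative):

import json
from pyspark.sql.types import StructType

# Serialize the inferred schema to disk once...
with open("schema.json", "w") as f:
    f.write(df_complete.schema.json())

# ...and restore it later without touching the complete file
with open("schema.json") as f:
    schema = StructType.fromJson(json.load(f))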
If you know the schema but for some reason cannot use the approach above, you have to create it from scratch:
from pyspark.sql.types import StructType, StructField, LongType

schema = StructType([
    StructField("A", LongType(), True), ..., StructField("C", LongType(), True)])
As always, be sure to perform some quality checks after loading your data; a minimal check is sketched after the example below.
Example (note that all fields are nullable):
from pyspark.sql.types import *
schema = StructType([
    StructField("x1", FloatType()),
    StructField("x2", StructType([
        StructField("y1", DoubleType()),
        StructField("y2", StructType([
            StructField("z1", StringType()),
            StructField("z2", StringType())
        ]))
    ])),
    StructField("x3", StringType()),
    StructField("x4", IntegerType())
])
spark.read.json(sc.parallelize(["""{"x4": 1}"""]), schema).printSchema()
## root
## |-- x1: float (nullable = true)
## |-- x2: struct (nullable = true)
## | |-- y1: double (nullable = true)
## | |-- y2: struct (nullable = true)
## | | |-- z1: string (nullable = true)
## | | |-- z2: string (nullable = true)
## |-- x3: string (nullable = true)
## |-- x4: integer (nullable = true)
spark.read.json(sc.parallelize(["""{"x4": 1}"""]), schema).first()
## Row(x1=None, x2=None, x3=None, x4=1)
spark.read.json(sc.parallelize(["""{"x3": "foo", "x1": 1.0}"""]), schema).first()
## Row(x1=1.0, x2=None, x3='foo', x4=None)
spark.read.json(sc.parallelize(["""{"x2": {"y2": {"z2": "bar"}}}"""]), schema).first()
## Row(x1=None, x2=Row(y1=None, y2=Row(z1=None, z2='bar')), x3=None, x4=None)
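As a quick sanity check after loading, you can verify that the columns match the supplied schema and count records missing a field you expect to be populated. A minimal sketch reusing the schema and data above:

from pyspark.sql.functions import col

df = spark.read.json(sc.parallelize(["""{"x4": 1}"""]), schema)

# The loaded schema should be exactly the one we supplied
assert df.schema == schema

# Count rows where a supposedly required field came back null
print(df.filter(col("x1").isNull()).count())
## 1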
Important:
This method is applicable only to the JSON source and depends on the details of its implementation. Don't use it for sources like Parquet.