
I wrote the following code in both Scala and Python, but the returned DataFrame doesn't apply the non-nullable fields of the schema I supply. italianVotes.csv is a CSV file with '~' as the separator and four fields. I'm using Spark 2.1.0.

italianVotes.csv

2657~135~2~2013-11-22 00:00:00.0
2658~142~2~2013-11-22 00:00:00.0
2659~142~1~2013-11-22 00:00:00.0
2660~140~2~2013-11-22 00:00:00.0
2661~140~1~2013-11-22 00:00:00.0
2662~1354~2~2013-11-22 00:00:00.0
2663~1356~2~2013-11-22 00:00:00.0
2664~1353~2~2013-11-22 00:00:00.0
2665~1351~2~2013-11-22 00:00:00.0
2667~1357~2~2013-11-22 00:00:00.0

Scala

import org.apache.spark.sql.types._
val schema = StructType(
  StructField("id", IntegerType, false) ::
  StructField("postId", IntegerType, false) ::
  StructField("voteType", IntegerType, true) ::
  StructField("time", TimestampType, true) :: Nil)

val fileName = "italianVotes.csv"

val italianDF = spark.read.schema(schema).option("sep", "~").csv(fileName)

italianDF.printSchema()

// output
root
 |-- id: integer (nullable = true)
 |-- postId: integer (nullable = true)
 |-- voteType: integer (nullable = true)
 |-- time: timestamp (nullable = true)

Python

from pyspark.sql.types import *

schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("postId", IntegerType(), False),
    StructField("voteType", IntegerType(), True),
    StructField("time", TimestampType(), True),
])

file_name = "italianVotes.csv"

italian_df = spark.read.csv(file_name, schema=schema, sep="~")

# print schema
italian_df.printSchema()
root
 |-- id: integer (nullable = true)
 |-- postId: integer (nullable = true)
 |-- voteType: integer (nullable = true)
 |-- time: timestamp (nullable = true)

My main question is why are the first two fields nullable when I have set them to non-nullable in my schema?

1 Answer

In general, Spark Datasets either inherit the nullable property from their parents or infer it from the external data types.

You can argue whether this is a good approach or not, but ultimately it is sensible. If the semantics of a data source don't support nullability constraints, then applying a schema cannot enforce them either. At the end of the day it is always better to assume that things can be null than to fail at runtime when the opposite assumption turns out to be incorrect.
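If you still need the resulting schema to report nullable = false, one workaround that is sometimes suggested (a sketch only, reusing italianDF and schema from the question) is to re-apply the schema on top of the already-parsed rows. Note that this only rewrites the metadata; it does not validate the data, so a null in id or postId would still only surface as an error later at runtime.

// Scala sketch: re-apply the strict schema to the parsed rows.
// createDataFrame keeps the schema exactly as given, including nullable = false.
val strictDF = spark.createDataFrame(italianDF.rdd, schema)

strictDF.printSchema()
// root
//  |-- id: integer (nullable = false)
//  |-- postId: integer (nullable = false)
//  |-- voteType: integer (nullable = true)
//  |-- time: timestamp (nullable = true)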

5 Comments

Hi, how can we know whether the semantics of a data source don't support nullability constraints?
@rajNishKuMar As a rule of thumb: if something is a plain-text format that doesn't provide a schema, it doesn't enforce any constraints.
@zero323 Does that mean that if I read JSON and then call printSchema() I'll always get nullable = true for all fields, even if a field is never null in the data? My question: stackoverflow.com/questions/61425977/…
Not sure this argument holds water, since CSV also doesn't have types, yet Spark allows coercion of types by specifying a schema, but not nullability by specifying a schema. IMHO it ought to just throw an exception if the data contains nulls where the schema mandates otherwise.
I strongly disagree :-/ A sensible way to handle this would be to immediately raise an exception, before even touching the underlying file, telling the user that nullable=False is not supported for the respective data source. The alternative would be to validate the data on read, which is probably what the user intended in the first place, but which has the downside that it may be quite expensive. However, if as a user I provide nullable=False, I probably want to make that check at some point either way, so I might as well establish all the guarantees on my data when I load it.
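For illustration, here is a minimal sketch of that "validate on read" idea, reusing italianDF and the column names from the question; the extra full scan is exactly the cost mentioned above.

// Scala sketch: explicitly check the columns declared non-nullable and fail fast.
import org.apache.spark.sql.functions.col

val badRows = italianDF.filter(col("id").isNull || col("postId").isNull).count()
require(badRows == 0, s"Found $badRows rows with a null id or postId")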
