
I would like to specify a schema for Spark DataFrames in Python. After loading the data once, I can print the schema and see something like:

df = spark.read.json(datapath)
df.schema

StructType(List(StructField(fldname,StringType,true)))

Having obtained df.schema by reading the data, I can reuse it for further reads. However, I expect to wait less if I don't have to read the data first just to get the schema. I'd like to persist the schema, even if that just means typing it into my script. For typing it in, I've tried:

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([ StructField('fldname', StringType, True)])

but I get the message

AssertionError: dataType should be DataType

This is Spark 2.0.2.

  • Instead of StringType, use StringType(). Commented Jun 19, 2017 at 17:06

1 Answer

When creating the schema, you left out the parentheses after StringType:

schema = StructType([ StructField('fldname', StringType(), True)])

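In PySpark, StructField expects an instance of a DataType subclass, so you have to call the constructor, StringType(), rather than pass the class StringType itself. In Scala, by contrast, StringType is a singleton object and is used without parentheses.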

Hope this solved the issue.
