
I'm reading from a source that has descriptions longer than 256 characters, and I want to write them to Redshift.

According to https://github.com/databricks/spark-redshift#configuring-the-maximum-size-of-string-columns, this is only possible in Scala.

According to https://github.com/databricks/spark-redshift/issues/137#issuecomment-165904691, a workaround is to specify the schema when creating the DataFrame. However, I'm not able to get it to work.

How can I specify the schema with varchar(max)?

from pyspark.sql.types import StructType, StructField, StringType

df = ...from source

schema = StructType([
    StructField('field1', StringType(), True),
    StructField('description', StringType(), True)
])

df = sqlContext.createDataFrame(df.rdd, schema)

1 Answer


Redshift maxlength annotations are passed in the format

{"maxlength":2048}

so this is the structure you should pass to the StructField constructor:

from pyspark.sql.types import StructField, StringType

StructField("description", StringType(), metadata={"maxlength":2048})

or use the alias method:

from pyspark.sql.functions import col

col("description").alias("description", metadata={"maxlength":2048})

If you use PySpark 2.2 or earlier, please check How to change column metadata in pyspark? for a workaround.


1 Comment

Setting this as the correct answer; even though I haven't gotten it to work yet, it answers my question. It should also work in Python now, according to docs.databricks.com/spark/latest/data-sources/aws/… (Databricks has recently closed-sourced the spark-redshift project).
