
In Python, I have an existing Spark DataFrame called sc_df1 with ~135 columns. I also have a Pandas DataFrame with exactly the same columns that I want to convert to a Spark DataFrame and then combine with the first via unionByName, i.e., sc_df1.unionByName(sc_df2).

Does anyone know how to use the schema of sc_df1 when converting the Pandas DataFrame to a Spark DataFrame, so that the two Spark DataFrames will have the same schema when unioning?

I know this doesn't work, but below is essentially what I'm trying to do:

sc_df2 = sc.createDataFrame(df2, schema=sc_df1.dtypes)
Comment: Does using sc_df1.schema work? (Apr 27, 2020)

1 Answer


Use spark.createDataFrame() and pass the Pandas DataFrame together with the schema of the sc_df1 DataFrame.
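As the comment above suggests, the reason the original attempt fails is that df.dtypes returns a plain list of (column name, type string) tuples, which createDataFrame() does not accept as a schema, whereas df.schema returns a StructType, which it does. A minimal sketch of the difference, assuming a live SparkSession named spark (tmp is just an illustrative name):

#sketch: .dtypes vs .schema on any Spark DataFrame
tmp = spark.createDataFrame([("a", 1)], ["id", "name"])

print(tmp.dtypes)    #[('id', 'string'), ('name', 'bigint')] -- plain tuples, not a valid schema
print(tmp.schema)    #StructType of StructField entries -- a valid schema argument

spark.createDataFrame([("b", 2)], schema=tmp.schema)   #works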

Example:

df = spark.createDataFrame([("a", 1), ("b", 2)], ["id", "name"])

#convert to a Pandas DataFrame
pandas_df = df.toPandas()

print(type(pandas_df))
#<class 'pandas.core.frame.DataFrame'>

#convert back to a PySpark DataFrame, passing the original schema
spark_df = spark.createDataFrame(pandas_df, schema=df.schema)

print(type(spark_df))
#<class 'pyspark.sql.dataframe.DataFrame'>

spark_df.show()
#+---+----+
#| id|name|
#+---+----+
#|  a|   1|
#|  b|   2|
#+---+----+

#union both DataFrames (schemas match, so union and unionByName give the same result here)
df.union(spark_df).show()
#+---+----+
#| id|name|
#+---+----+
#|  a|   1|
#|  b|   2|
#|  a|   1|
#|  b|   2|
#+---+----+
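Applied to the names from the question, the same pattern would look like the sketch below (result is just an illustrative variable name; df2 and sc_df1 are the question's DataFrames):

#convert the Pandas DataFrame using sc_df1's schema, then union by column name
sc_df2 = spark.createDataFrame(df2, schema=sc_df1.schema)
result = sc_df1.unionByName(sc_df2)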
