
In Python, I have an existing Spark DataFrame called sc_df1 with ~135 columns. I also have a Pandas DataFrame with exactly the same columns that I want to convert to a Spark DataFrame and then combine with the first via unionByName, i.e., sc_df1.unionByName(sc_df2).

Does anyone know how to use the schema of sc_df1 when converting the Pandas DataFrame to a Spark DataFrame, so that the two Spark DataFrames will have the same schema when unioning?

I know this doesn't work, but below is essentially what I'm trying to do:

sc_df2 = sc.createDataFrame(df2, schema=sc_df1.dtypes)
Comment: Does using sc_df1.schema work? (Apr 27, 2020)

1 Answer


Use spark.createDataFrame() and pass the Pandas DataFrame together with the schema of the sc_df1 DataFrame.
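As the comment above suggests, the reason the original attempt fails is that df.dtypes returns a plain list of (column name, type string) tuples, which createDataFrame() does not accept as a schema, whereas df.schema returns a StructType, which it does. A minimal sketch of the difference, assuming a live SparkSession named spark (tmp is just an illustrative name):

#sketch: .dtypes vs .schema on any Spark DataFrame
tmp = spark.createDataFrame([("a", 1)], ["id", "name"])

print(tmp.dtypes)    #[('id', 'string'), ('name', 'bigint')] -- plain tuples, not a valid schema
print(tmp.schema)    #StructType of StructField entries -- a valid schema argument

spark.createDataFrame([("b", 2)], schema=tmp.schema)   #works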

Example:

df = spark.createDataFrame([("a", 1), ("b", 2)], ["id", "name"])

#convert to a Pandas DataFrame
pandas_df = df.toPandas()

print(type(pandas_df))
#<class 'pandas.core.frame.DataFrame'>

#convert back to a PySpark DataFrame, passing the original schema
spark_df = spark.createDataFrame(pandas_df, schema=df.schema)

print(type(spark_df))
#<class 'pyspark.sql.dataframe.DataFrame'>

spark_df.show()
#+---+----+
#| id|name|
#+---+----+
#|  a|   1|
#|  b|   2|
#+---+----+

#union both DataFrames (schemas match, so union and unionByName give the same result here)
df.union(spark_df).show()
#+---+----+
#| id|name|
#+---+----+
#|  a|   1|
#|  b|   2|
#|  a|   1|
#|  b|   2|
#+---+----+
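Applied to the names from the question, the same pattern would look like the sketch below (result is just an illustrative variable name; df2 and sc_df1 are the question's DataFrames):

#convert the Pandas DataFrame using sc_df1's schema, then union by column name
sc_df2 = spark.createDataFrame(df2, schema=sc_df1.schema)
result = sc_df1.unionByName(sc_df2)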
