
The DataFrame is created using the Scala API for Spark:

val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(someData),
  StructType(someSchema)
)

I want to convert this to a pandas DataFrame.

PySpark provides .toPandas() to convert a Spark DataFrame to pandas, but there is no equivalent for Scala (that I can find).

Please help me in this regard.

2 Comments

  • Pandas is a Python library. How are you going to work with a Python library in Scala, and why? PySpark is just a Python wrapper over the Spark API. Commented Apr 5, 2020 at 10:34
  • @BorisAzanov My issue is that I create DataFrames using the Spark Scala API (for data-processing jobs) and then want to convert the resulting Spark DataFrames to pandas DataFrames for modelling and further data-science analysis. I am using Scala for the Spark jobs to get its native-support advantages over PySpark. Commented Apr 5, 2020 at 13:07

2 Answers


To convert a Spark DataFrame into a pandas DataFrame, you can set spark.sql.execution.arrow.enabled to true, read/create a DataFrame using Spark, and then convert it to a pandas DataFrame using Arrow:

  1. Enable Arrow: spark.conf.set("spark.sql.execution.arrow.enabled", "true")
  2. Create the DataFrame using Spark as you did:
    val someDF = spark.createDataFrame(spark.sparkContext.parallelize(someData), StructType(someSchema))
  3. Convert it to a pandas DataFrame (in PySpark):
    result_pdf = someDF.select("*").toPandas()

The commands above run using Arrow because the config spark.sql.execution.arrow.enabled is set to true.

Hope this helps!




In Spark, a DataFrame is just an abstraction over data; the most common data sources are files on a file system. When PySpark converts a DataFrame to pandas format, it simply translates one Python abstraction over the data into another from a different Python framework. You can't do that conversion between Spark and pandas in Scala, because pandas is a Python library for working with data, Spark is not, and you would run into difficulties integrating Python and Scala. The simplest thing you can do here is:

  1. Write the DataFrame to the file system from Scala Spark.
  2. Read the data from the file system using pandas.
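The two steps above can be sketched as follows. The Spark write is shown as a comment, and the output path, file contents, and column names are hypothetical stand-ins for whatever your Scala job produces:

```python
import glob
import os
import tempfile

import pandas as pd

# Step 1 (Scala side, shown as a comment -- run this in your Spark job):
#   someDF.write.mode("overwrite").option("header", "true").csv("/tmp/someDF")
#
# Spark writes a directory of part files. Below we fake one such part file
# so the pandas side of the sketch is runnable on its own.
out_dir = tempfile.mkdtemp()
with open(os.path.join(out_dir, "part-00000.csv"), "w") as f:
    f.write("id,name\n1,alice\n2,bob\n")

# Step 2 (Python side): read every part file and concatenate with pandas.
parts = sorted(glob.glob(os.path.join(out_dir, "part-*.csv")))
pdf = pd.concat((pd.read_csv(p) for p in parts), ignore_index=True)
print(pdf.shape)  # (2, 2)
```

A columnar format such as Parquet (df.write.parquet(...) on the Scala side, pd.read_parquet(...) on the Python side) would preserve the schema and usually be faster than CSV, at the cost of requiring pyarrow on the pandas side.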

2 Comments

Thanks @BorisAzanov, so writing a Scala DataFrame to a file system and then reading it back would incur a significant I/O hit.
@manishdev Yes, you're right, but I think it's a common pattern in workflows that combine data processing and modelling.
