
The DataFrame is created using the Scala API for Spark:

val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(someData),
  StructType(someSchema)
)

I want to convert this to a pandas DataFrame.

PySpark provides .toPandas() to convert a Spark DataFrame to pandas, but there is no equivalent for Scala (that I can find).

Please help me in this regard.

2 Comments

  • Pandas is a Python library. How are you going to work with a Python library in Scala, and why? PySpark is just a Python wrapper over the Spark API. Commented Apr 5, 2020 at 10:34
  • @BorisAzanov My issue is that I create DataFrames using the Spark Scala API (for data-processing jobs) and then want to convert the resulting Spark DataFrames to pandas DataFrames for modelling and further data-science analysis. I am using Scala for the Spark jobs to get its native-support advantages over PySpark. Commented Apr 5, 2020 at 13:07

2 Answers


To convert a Spark DataFrame into a pandas DataFrame, you can set spark.sql.execution.arrow.enabled to true, read/create a DataFrame using Spark, and then convert it to a pandas DataFrame using Arrow:

  1. Enable Arrow: spark.conf.set("spark.sql.execution.arrow.enabled", "true")
  2. Create the DataFrame using Spark as you did:
    val someDF = spark.createDataFrame(spark.sparkContext.parallelize(someData), StructType(someSchema))
  3. Convert it to a pandas DataFrame (in PySpark):
    result_pdf = someDF.select("*").toPandas()

The commands above run using Arrow because the config spark.sql.execution.arrow.enabled is set to true.

Hope this helps!




In Spark, a DataFrame is just an abstraction over data; the most common data sources are files on a file system. When PySpark converts a DataFrame to pandas format, it simply translates one Python abstraction over the data into another from a different Python framework. You can't do that conversion between Spark and pandas in Scala, because pandas is a Python library for working with data, Spark is not, and you would run into difficulties integrating Python and Scala. The simplest thing you can do here is:

  1. Write the DataFrame to the file system from Scala Spark.
  2. Read the data from the file system using pandas.
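The two steps above can be sketched as follows. The Spark write is shown as a comment, and the output path, file contents, and column names are hypothetical stand-ins for whatever your Scala job produces:

```python
import glob
import os
import tempfile

import pandas as pd

# Step 1 (Scala side, shown as a comment -- run this in your Spark job):
#   someDF.write.mode("overwrite").option("header", "true").csv("/tmp/someDF")
#
# Spark writes a directory of part files. Below we fake one such part file
# so the pandas side of the sketch is runnable on its own.
out_dir = tempfile.mkdtemp()
with open(os.path.join(out_dir, "part-00000.csv"), "w") as f:
    f.write("id,name\n1,alice\n2,bob\n")

# Step 2 (Python side): read every part file and concatenate with pandas.
parts = sorted(glob.glob(os.path.join(out_dir, "part-*.csv")))
pdf = pd.concat((pd.read_csv(p) for p in parts), ignore_index=True)
print(pdf.shape)  # (2, 2)
```

A columnar format such as Parquet (df.write.parquet(...) on the Scala side, pd.read_parquet(...) on the Python side) would preserve the schema and usually be faster than CSV, at the cost of requiring pyarrow on the pandas side.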

2 Comments

Thanks @BorisAzanov, so writing a Scala DataFrame to a file system and then reading it back would incur a significant I/O hit.
@manishdev Yes, you're right, but I think it's a common pattern in workflows that combine data processing and modelling.
