
I am new to Spark and trying out various things to understand it. Currently, I have a CSV that I am trying to parse and reshape into my required format. I don't understand how to do a pivot (or use any other means) to get the output I need. My CSV looks like this:

AHeader AValue, BHeader BValue, CHeader CValue

Now the CSV output I am trying to build is something like this:

AHeader, AValue
BHeader, BValue
CHeader, CValue

This is my current code:

import os
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

datafile_csv = "test.csv"

def process_csv(abspath, sparkcontext):
    sqlContext = SQLContext(sparkcontext)
    df = sqlContext.read.load(os.path.join(abspath, datafile_csv),
                              format='com.databricks.spark.csv',
                              inferSchema='true')

    df.registerTempTable("currency")
    print "Dataframe:"
    display(df)
    # Don't know what to do here ????
    reshaped_df = df.groupBy('_c0')
    display(reshaped_df)

if __name__ == "__main__":

    abspath = os.path.abspath(os.path.dirname(__file__))
    conf = (SparkConf()
            .setMaster("local[20]")
            .setAppName("Currency Parser")
            .set("spark.executor.memory", "2g"))
    sc = SparkContext(conf=conf)
    process_csv(abspath, sc)

I am not sure how I can convert this dataframe to the expected output. Do I need to transpose all the columns to rows and then do SparkSQL on them? What is the correct solution for this?

  • You'll need a custom line separator to parse this, which I don't think the Databricks CSV reader supports. Try pandas read_csv, defining your lineterminator as the comma and a space as the column separator. Commented Nov 18, 2019 at 21:06
  • @Andrew, can you help me with how this can be done with pandas? I'm not sure, as I am new to the Spark ecosystem. Commented Nov 19, 2019 at 2:26

1 Answer


You're asking two questions here. The first is the ETL question of loading your CSV properly, which might be better done in pandas (given your narrowly specific data structure), for example:

import pandas as pd
from pyspark.sql import SparkSession
from io import StringIO

spark = SparkSession.builder.getOrCreate()
TESTDATA = StringIO("""AHeader AValue, BHeader BValue, CHeader CValue""")

pandas_df = pd.read_csv(TESTDATA,  # replace with path to your csv
                        delim_whitespace=True,
                        lineterminator=",",
                        header=None,
                        names=['col1', 'col2'])
spark_df = spark.createDataFrame(pandas_df)
spark_df.show()

+-------+------+
|   col1|  col2|
+-------+------+
|AHeader|AValue|
|BHeader|BValue|
|CHeader|CValue|
+-------+------+

Your second question is about pivoting in Spark. While pandas.read_csv() already puts the data into the shape you asked for, if you need further reshaping, have a look at pyspark.sql.GroupedData.pivot: http://spark.apache.org/docs/2.4.3/api/python/pyspark.sql.html?highlight=pivot#pyspark.sql.GroupedData.pivot
