
I am new to Spark and trying out various things to understand it. Currently, I have a CSV that I am trying to parse and reshape into my required format. I don't understand how to do a pivot (or use any other means) to get the output I need. My CSV looks like this:

AHeader AValue, BHeader BValue, CHeader CValue

Now the CSV output I am trying to build is something like this:

AHeader, AValue
BHeader, BValue
CHeader, CValue

This is my current code:

import os
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

datafile_csv = "test.csv"

def process_csv(abspath, sparkcontext):
    sqlContext = SQLContext(sparkcontext)
    df = sqlContext.read.load(os.path.join(abspath, datafile_csv),
                              format='com.databricks.spark.csv',
                              inferSchema='true')

    df.registerTempTable("currency")
    print "Dataframe:"
    display(df)
    # Don't know what to do here ????
    reshaped_df = df.groupBy('_c0')
    display(reshaped_df)

if __name__ == "__main__":

    abspath = os.path.abspath(os.path.dirname(__file__))
    conf = (SparkConf()
            .setMaster("local[20]")
            .setAppName("Currency Parser")
            .set("spark.executor.memory", "2g"))
    sc = SparkContext(conf=conf)
    process_csv(abspath, sc)

I am not sure how I can convert this dataframe to the expected output. Do I need to transpose all the columns to rows and then do SparkSQL on them? What is the correct solution for this?

  • You'll need a custom line separator to parse this, which I don't think the Databricks CSV reader supports. Try pandas read_csv, defining your lineterminator as the comma and a space as the column separator. Commented Nov 18, 2019 at 21:06
  • @Andrew, can you help me with how this can be done with pandas? I'm not sure, as I am new to the Spark ecosystem. Commented Nov 19, 2019 at 2:26

1 Answer


You're asking two questions here. The first is the ETL question of loading your CSV properly, which might be better done in pandas (given your narrowly specific data structure), for example:

import pandas as pd
from pyspark.sql import SparkSession
from io import StringIO

spark = SparkSession.builder.getOrCreate()
TESTDATA = StringIO("""AHeader AValue, BHeader BValue, CHeader CValue""")

pandas_df = pd.read_csv(TESTDATA,  # replace with path to your csv
                        delim_whitespace=True,
                        lineterminator=",",
                        header=None,
                        names=['col1', 'col2'])
spark_df = spark.createDataFrame(pandas_df)
spark_df.show()

+-------+------+
|   col1|  col2|
+-------+------+
|AHeader|AValue|
|BHeader|BValue|
|CHeader|CValue|
+-------+------+

Your second question is about pivoting in Spark. While pandas.read_csv() already puts the data into the shape you asked for, if you need further reshaping, have a look at pyspark.sql.GroupedData.pivot: http://spark.apache.org/docs/2.4.3/api/python/pyspark.sql.html?highlight=pivot#pyspark.sql.GroupedData.pivot
