I am new to Spark and am trying out various things to understand it. Currently, I have a CSV that I am trying to parse and reshape into the format I need. I can't figure out how to get that output with a pivot, or by any other means. My CSV looks like this:
AHeader AValue, BHeader BValue, CHeader CValue
Now the CSV output I am trying to build is something like this:
AHeader, AValue
BHeader, BValue
CHeader, CValue
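Just to make the intent concrete, in plain Python (ignoring Spark for a moment) the reshaping I am after would be roughly this:

line = "AHeader AValue, BHeader BValue, CHeader CValue"
rows = [field.strip().split(" ", 1) for field in line.split(",")]
# rows == [['AHeader', 'AValue'], ['BHeader', 'BValue'], ['CHeader', 'CValue']]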
This is my current code:
import os

from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

datafile_csv = "test.csv"

def process_csv(abspath, sparkcontext):
    sqlContext = SQLContext(sparkcontext)
    # Read the CSV with the spark-csv package; the file has no header row.
    df = sqlContext.read.load(os.path.join(abspath, datafile_csv),
                              format='com.databricks.spark.csv',
                              inferSchema='true')
    df.registerTempTable("currency")
    print "Dataframe:"
    df.show()
    # Don't know what to do here ????
    reshaped_df = df.groupBy('_c0')
    print reshaped_df

if __name__ == "__main__":
    abspath = os.path.abspath(os.path.dirname(__file__))
    conf = (SparkConf()
            .setMaster("local[20]")
            .setAppName("Currency Parser")
            .set("spark.executor.memory", "2g"))
    sc = SparkContext(conf=conf)
    process_csv(abspath, sc)
I am not sure how to convert this dataframe into the expected output. Do I need to transpose all the columns into rows and then run Spark SQL on them? What is the correct way to solve this?
Another idea I had was to treat the comma as the line terminator and the space as the column separator, but I don't know how to read the file that way in Spark (a rough sketch of what I mean is below).
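This is the kind of thing I have in mind, emulating that idea with plain RDD operations instead of spark-csv; it is an untested sketch, and the function name reshape_csv is just for illustration:

import os

from pyspark import SparkContext
from pyspark.sql import SQLContext

datafile_csv = "test.csv"

def reshape_csv(abspath, sparkcontext):
    sqlContext = SQLContext(sparkcontext)
    # Read the file as raw text and make every comma-separated field its
    # own record, i.e. treat the comma as a "line terminator".
    fields = (sparkcontext.textFile(os.path.join(abspath, datafile_csv))
                          .flatMap(lambda line: line.split(",")))
    # Split each "Header Value" field on the first space into two columns.
    pairs = fields.map(lambda field: field.strip().split(" ", 1))
    return sqlContext.createDataFrame(pairs, ["header", "value"])

Is something along these lines the right direction, or is there a proper DataFrame/pivot way to do this?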