I am using PySpark and want to store my data as a CSV. I converted the NumPy array I had into a DataFrame, and it came out formatted like so:
label | 0 1 2 3 ... 768
---------------------------------------
1 | 0.12 0.23 0.31 0.72 ... 0.91
and so on, splitting each value of a row vector in the array into its own column. That format is not compatible with Spark's ML routines, which expect all the features together in a single column. Is there a way to load my array into a DataFrame in that format? For example:
label | Features
------------------------------------------
1 | [0.12,0.23,0.31,0.72,...,0.91]
I tried following the advice in this thread, which details merging the columns using the Spark API, but when I load my labels in, I get an error because the labels end up inside the vector rather than remaining a string or int value.
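For context, this is roughly the array-to-rows reshaping I am describing, done in plain NumPy before handing anything to Spark (assuming, as in my data, that the label sits in column 0 and the feature values follow it; the variable names here are just for illustration):

```python
import numpy as np

# Toy stand-in for my array: label in column 0, feature values after it.
data = np.array([
    [1, 0.12, 0.23, 0.31, 0.72],
    [0, 0.45, 0.10, 0.05, 0.88],
])

# Build (label, [features]) rows: the label stays a plain int,
# while the remaining values are grouped into one list per row.
rows = [(int(r[0]), [float(v) for v in r[1:]]) for r in data]
# rows[0] -> (1, [0.12, 0.23, 0.31, 0.72])

# Rows shaped like this could then be handed to something like
# spark.createDataFrame(rows, ["label", "features"]),
# keeping the label out of the feature vector entirely.
```

This keeps the label separate from the features by construction, which is the part that breaks for me when I merge columns after loading.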