I have rows of key/value pairs with an unknown number of keys -- some overlapping and some not -- from which I would like to create a Spark DataFrame. My ultimate goal is to write CSV from that DataFrame.
I have some flexibility with the input format: most readily the rows are JSON strings, but they could be converted to something else. Each row has a potentially different, partially overlapping set of keys:
{"color":"red", "animal":"fish"}
{"color":"green", "animal":"panda"}
{"color":"red", "animal":"panda", "fruit":"watermelon"}
{"animal":"aardvark"}
{"color":"blue", "fruit":"apple"}
Ideally, from this data I would like to create a DataFrame that looks like this:
-----------------------------
color | animal | fruit
-----------------------------
red | fish | null
green | panda | null
red | panda | watermelon
null | aardvark | null
blue | null | apple
-----------------------------
Note that rows without a particular key get null for that column, and every key that appears in any row becomes a column.
I feel relatively comfortable with many of the Spark basics, but I'm having trouble envisioning a process for efficiently taking an RDD/DataFrame of key/value pairs -- with an unknown number of keys and columns -- and turning it into a DataFrame with those keys as columns.
By efficiently, I mean avoiding, if possible, any step that holds all input rows in memory at once (e.g. collecting everything into a single dictionary).
Again, the final goal is to write CSV, and I'm assuming a DataFrame is the logical intermediate step; something like the sketch below is roughly what I'm picturing.
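To make the goal concrete, here is a minimal sketch of the kind of pipeline I have in mind (the specifics -- such as whether read.json will happily take an RDD of strings, and names like json_rows and output_dir -- are my own assumptions):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The rows as JSON strings (in practice these would already live in an RDD).
json_rows = spark.sparkContext.parallelize([
    '{"color":"red", "animal":"fish"}',
    '{"color":"green", "animal":"panda"}',
    '{"color":"red", "animal":"panda", "fruit":"watermelon"}',
    '{"animal":"aardvark"}',
    '{"color":"blue", "fruit":"apple"}',
])

# read.json infers the union of all keys as columns and fills missing
# keys with null, which is the table shape I'm after.
df = spark.read.json(json_rows)
df.show()

# Final goal: write the result out as CSV.
df.write.csv("output_dir", header=True)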
Another wrinkle:
Some of the data will be multivalued, something like:
{"color":"pink", "animal":["fish","mustang"]}
{"color":["orange","purple"], "animal":"panda"}
Given a provided delimiter, e.g. / (to avoid collision with the , that delimits columns), I would like these values joined within the output column, e.g.:
------------------------------------
color | animal | fruit
------------------------------------
pink | fish/mustang | null
orange/purple | panda | null
------------------------------------
Once there is an approach for the primary question, I'm confident I can work this part out, but I'm including it here since it will be a dimension of the problem; a rough idea of the delimiting is sketched below.
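For illustration, a rough sketch of the delimiting I have in mind, assuming the multivalued field comes through as an array column (the data and column names here are just placeholders):

from pyspark.sql import functions as F

df_multi = spark.createDataFrame(
    [("pink", ["fish", "mustang"]), ("orange", ["panda"])],
    ["color", "animal"],
)

# concat_ws joins the array elements with "/" so the value is safe to
# place in a comma-delimited CSV column.
df_out = df_multi.withColumn("animal", F.concat_ws("/", "animal"))
df_out.show()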
Comments:
- df = spark.read.json("myfile.json") seems to work for me on your first example. Update: it also works for your second example, but it treats all records as strings, so you'll have to do some regex to convert the string representation of the list into your desired format.
- Is there a .json() method that works from an RDD, rather than reading from an external location?
- read.json() might accept an RDD as well: spark.apache.org/docs/latest/api/python/… -- giving that a go...
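Regarding the "string representation of the list" case mentioned in the comments, a rough guess at what that regex cleanup might look like (the column name and the exact patterns are assumptions on my part):

from pyspark.sql import functions as F

# Assuming the array arrives as its JSON text, e.g. '["fish","mustang"]':
# strip brackets and quotes, then swap the commas for the "/" delimiter.
cleaned = df.withColumn(
    "animal",
    F.regexp_replace(
        F.regexp_replace("animal", r'[\[\]"]', ""),
        r",\s*",
        "/",
    ),
)
cleaned.show()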