I'm trying to use Spark to turn each row of my data into many rows. My goal is something like a SQL UNPIVOT.
I have a pipe-delimited text file that is 360 GB compressed (gzip). It has over 1,620 columns. Here's the basic layout:
primary_key|property1_name|property1_value|...|property800_name|property800_value
12345|is_male|1|...|is_college_educated|1
There are over 800 of these property name/value pairs and roughly 280 million rows. The file is in an S3 bucket.
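Here's as far as I've gotten, which is just reading the file (a minimal sketch; the bucket path is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unpivot").getOrCreate()

# Read the pipe-delimited file; Spark decompresses gzip automatically
# based on the .gz extension. A single gzip file is not splittable,
# so the initial read happens in one task.
df = spark.read.csv(
    "s3a://my-bucket/my-file.txt.gz",  # placeholder path
    sep="|",
    header=False,  # set to True if the first line is actually a header row
)
```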
The users want me to unpivot the data. For example:
primary_key|key|value
12345|is_male|1
12345|is_college_educated|1
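The only approach I've come up with so far is to pair up the alternating name/value columns, wrap each pair in a struct, and explode an array of those structs. This is an untested sketch, and it assumes the name/value columns strictly alternate after the primary key:

```python
from pyspark.sql import functions as F

# With header=False the columns are _c0, _c1, ..., so work positionally:
# _c0 is the primary key, then name/value columns alternate after it.
name_cols = df.columns[1::2]
value_cols = df.columns[2::2]

# One struct per (name, value) column pair.
pairs = [
    F.struct(F.col(n).alias("key"), F.col(v).alias("value"))
    for n, v in zip(name_cols, value_cols)
]

# Explode the array of structs so each pair becomes its own row.
unpivoted = (
    df.select(
        F.col(df.columns[0]).alias("primary_key"),
        F.explode(F.array(*pairs)).alias("kv"),
    )
    .select("primary_key", "kv.key", "kv.value")
)
```

I have no idea whether building an array of 800+ structs per row and exploding it is reasonable at this scale, or whether there's a more idiomatic way.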
This is my first time using Spark, so I'm not sure whether that's the right approach. What is a good way to do this kind of unpivot in Spark, given the file size and the number of columns?
Thanks.