
I'm trying to use Spark to turn one row into many rows. My goal is something like a SQL UNPIVOT.

I have a pipe-delimited text file that is 360 GB compressed (gzip). It has over 1,620 columns. Here's the basic layout:

primary_key|property1_name|property1_value|property800_name|property800_value
12345|is_male|1|is_college_educated|1

There are over 800 of these property name/value fields. There are roughly 280 million rows. The file is in an S3 bucket.

The users want me to unpivot the data. For example:

primary_key|key|value
12345|is_male|1
12345|is_college_educated|1

This is my first time using Spark, and I'm struggling to figure out a good approach.
What is a good way to do this in Spark?

Thanks.

1 Answer

The idea is to generate a list of output lines from each input line, as you have shown. Mapping such a function over the RDD would give an RDD of lists of lines, so use flatMap instead, which flattens the results into an RDD of individual lines:

If your file is loaded in as rdd1, then the following should give you what you want:

rdd1.flatMap(break_out)

where the function for processing lines is defined as:

def break_out(line):
    # split the line into its pipe-delimited fields
    line_split = line.split("|")
    # even-indexed fields: the primary key followed by the property values
    vals = line_split[::2]
    # odd-indexed fields: the property names
    keys = line_split[1::2]
    # the first even-indexed field is the primary key
    primary_key = vals[0]
    # emit one pipe-delimited output line per name/value pair
    return ["|".join((primary_key, keys[i], vals[i + 1])) for i in range(len(keys))]

You may need some additional code to deal with header lines, etc., but this should work.
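
For context, here is a minimal end-to-end sketch of how this might be wired up; the S3 paths and the header check are assumptions for illustration, not part of the original question:

from pyspark import SparkContext

sc = SparkContext(appName="unpivot")

# hypothetical S3 path; adjust to your bucket and prefix
rdd1 = sc.textFile("s3a://your-bucket/path/to/data/*.gz")

# drop the header line if one is present (assumes it starts with "primary_key")
data = rdd1.filter(lambda line: not line.startswith("primary_key"))

# one output line per primary_key|key|value triple
unpivoted = data.flatMap(break_out)

unpivoted.saveAsTextFile("s3a://your-bucket/path/to/output/")

Note that gzip files are not splittable, so each .gz file is read by a single task; if the input is a small number of large files, a repartition() after reading can help spread the flatMap work across the cluster.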
