
I'm trying to use Spark to turn one row into many rows. My goal is something like a SQL UNPIVOT.

I have a pipe-delimited text file that is 360 GB compressed (gzip). It has over 1,620 columns. Here's the basic layout:

primary_key|property1_name|property1_value|property800_name|property800_value
12345|is_male|1|is_college_educated|1

There are over 800 of these property name/value fields. There are roughly 280 million rows. The file is in an S3 bucket.

The users want me to unpivot the data. For example:

primary_key|key|value
12345|is_male|1
12345|is_college_educated|1

This is my first time using Spark, and I'm struggling to figure out a good approach.
What is a good way to do this in Spark?

Thanks.

1 Answer

The idea is to generate a list of output lines from each input line, as you have shown. Mapping such a function over the RDD would give an RDD of lists of lines, so use flatMap instead, which flattens the results into an RDD of individual lines:

If your file is loaded in as rdd1, then the following should give you what you want:

rdd1.flatMap(break_out)

where the function for processing lines is defined as:

def break_out(line):
    # split the line into its pipe-delimited fields
    line_split = line.split("|")
    # even-indexed fields: the primary key followed by the property values
    vals = line_split[::2]
    # odd-indexed fields: the property names
    keys = line_split[1::2]
    # the first even-indexed field is the primary key
    primary_key = vals[0]
    # emit one pipe-delimited output line per name/value pair
    return ["|".join((primary_key, keys[i], vals[i + 1])) for i in range(len(keys))]

You may need some additional code to deal with header lines, etc., but this should work.
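
For context, here is a minimal end-to-end sketch of how this might be wired up; the S3 paths and the header check are assumptions for illustration, not part of the original question:

from pyspark import SparkContext

sc = SparkContext(appName="unpivot")

# hypothetical S3 path; adjust to your bucket and prefix
rdd1 = sc.textFile("s3a://your-bucket/path/to/data/*.gz")

# drop the header line if one is present (assumes it starts with "primary_key")
data = rdd1.filter(lambda line: not line.startswith("primary_key"))

# one output line per primary_key|key|value triple
unpivoted = data.flatMap(break_out)

unpivoted.saveAsTextFile("s3a://your-bucket/path/to/output/")

Note that gzip files are not splittable, so each .gz file is read by a single task; if the input is a small number of large files, a repartition() after reading can help spread the flatMap work across the cluster.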
