Spark binary column split into multiple columns

Question

I'm writing a Java application. I have a Spark Dataset<MyObject> whose schema consists of a single binary column:

Dataset<MyObject> dataset = sparkSession.createDataset(someRDD, Encoders.javaSerialization(MyObject.class));
dataset.printSchema();

//root
//|-- value: binary (nullable = true)

MyObject has several (nested) fields, and I want to "explode" them into multiple columns of my Dataset. Some of the new columns also need to be computed from several attributes of MyObject. One option is to use .withColumn() with a UDF, but I don't know how to accept a binary value in the UDF and convert it back to MyObject. Any suggestions on how to do that?

Binary is represented as an array of bytes, so you can try byte[] as the input type for the UDF. And see this post for how to return a complex type from the UDF.
– blackbishop, Commented Jan 28, 2021 at 18:18

phcaze · Accepted Answer · 2021-01-28 19:20:13Z


Thanks to blackbishop's suggestion I solved it. Here is the complete solution:

You need to register the UDF:

UDFRegistration udfRegistration = sparkSession.sqlContext().udf();
udfRegistration.register("extractSomeLong", extractSomeLong(), DataTypes.LongType);

Declare and implement the UDF. Its input type must be byte[], and the byte array has to be deserialized back into your object as shown:

private static UDF1<byte[], Long> extractSomeLong() {
    return (byteArray) -> {
        if (byteArray == null) {
            return -1L; // sentinel for a missing value
        }
        // try-with-resources closes the stream even if readObject() throws
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(byteArray))) {
            MyObject myObject = (MyObject) in.readObject();
            return myObject.getSomeLong();
        }
    };
}
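The byte-to-object conversion inside the UDF can be checked without starting Spark. The sketch below is only an illustration: it assumes the binary column holds the output of standard Java serialization (which is what the ObjectInputStream code above relies on), and it uses a hypothetical stand-in for MyObject with a single someLong field, since the real class is not shown. The names BinaryRoundTrip and toBytes are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

class BinaryRoundTrip {
    // Hypothetical stand-in for MyObject; the real class is not shown in the post.
    static class MyObject implements Serializable {
        private static final long serialVersionUID = 1L;
        private final long someLong;

        MyObject(long someLong) { this.someLong = someLong; }

        long getSomeLong() { return someLong; }
    }

    // Serialize with plain Java serialization (assumed to match the column contents).
    static byte[] toBytes(MyObject obj) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Same logic as the UDF body: null-safe deserialization with -1 as the fallback.
    static long extractSomeLong(byte[] bytes) {
        if (bytes == null) {
            return -1L;
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ((MyObject) in.readObject()).getSomeLong();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = toBytes(new MyObject(42L));
        System.out.println(extractSomeLong(bytes)); // value read back from the bytes
        System.out.println(extractSomeLong(null));  // fallback for a null column value
    }
}
```

Keeping the conversion in a plain static method like this also makes it easy to unit test before wiring it into a UDF.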

And finally it can be used with:

Dataset<MyObject> data = sparkSession.createDataset(someRDD, Encoders.javaSerialization(MyObject.class));
// "value" is the name of the binary column shown by printSchema()
Dataset<Row> processedData = data.withColumn("ID", functions.callUDF("extractSomeLong", new Column("value")));


1 Comment

blackbishop Over a year ago

I think you can return map or struct type if you need to extract more than one field. Better than extracting each one with a specific UDF :)
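A rough sketch of the "deserialize once, extract several fields" part of that idea, kept free of Spark so it runs on its own. Everything here is an assumption for illustration: MyObject is a hypothetical stand-in with two fields (someLong, someName), and MultiFieldExtraction, toBytes and extractFields are made-up names. In Spark, the returned array would typically be wrapped with RowFactory.create(...) inside a UDF1<byte[], Row> that is registered with a matching StructType built via DataTypes.createStructType.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Arrays;

class MultiFieldExtraction {
    // Hypothetical stand-in for MyObject with two fields of interest.
    static class MyObject implements Serializable {
        private static final long serialVersionUID = 1L;
        final long someLong;
        final String someName;

        MyObject(long someLong, String someName) {
            this.someLong = someLong;
            this.someName = someName;
        }
    }

    // Plain Java serialization, assumed to match what the binary column holds.
    static byte[] toBytes(MyObject obj) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Deserialize once and return all values in the order of the struct fields.
    // A null input yields null, which would become a null struct in the column.
    static Object[] extractFields(byte[] bytes) {
        if (bytes == null) {
            return null;
        }
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            MyObject obj = (MyObject) in.readObject();
            return new Object[] { obj.someLong, obj.someName };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = toBytes(new MyObject(7L, "trace-7"));
        System.out.println(Arrays.toString(extractFields(bytes)));
    }
}
```

With a struct-typed result column, the individual fields can then be selected by name, so the object is deserialized once per row instead of once per extracted column.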
