I'm writing a Java application. I have a Spark Dataset<MyObject> whose schema is a single binary column:

Dataset<MyObject> dataset = sparkSession.createDataset(someRDD, Encoders.javaSerialization(MyObject.class));
dataset.printSchema();

//root
//|-- value: binary (nullable = true)

MyObject has several (nested) fields, and I want to "explode" them into multiple columns of my Dataset. Some of the new columns also need to be computed from multiple attributes of MyObject. One solution would be to use .withColumn() and apply a UDF. Unfortunately, I don't know how to accept a binary type in the UDF and then convert it to MyObject. Any suggestions on how to do that?

Binary is represented as an array of bytes, so you can try byte[] as the input type for the UDF. See this post for how to return a complex type from the UDF. Commented Jan 28, 2021 at 18:18

1 Answer

Thanks to blackbishop's suggestion I solved it. Here is the complete solution:

You need to register the UDF:

UDFRegistration udfRegistration = sparkSession.sqlContext().udf();
udfRegistration.register("extractSomeLong", extractSomeLong(), DataTypes.LongType);

Declare and implement the UDF. The first argument must be byte[] and you need to convert the byte array to your object as indicated:

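// Needs: java.io.ByteArrayInputStream, java.io.ObjectInputStream,
//        org.apache.spark.sql.api.java.UDF1
// UDF1.call is declared to throw Exception, so the checked exceptions
// raised during deserialization may propagate out of the lambda.
private static UDF1<byte[], Long> extractSomeLong() {
    return (byteArray) -> {
        if (byteArray == null) {
            return -1L; // sentinel for a null binary value
        }
        // try-with-resources closes the stream once the object has been read
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(byteArray))) {
            MyObject myObject = (MyObject) in.readObject();
            return myObject.getSomeLong();
        }
    };
}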
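
The deserialization step inside the UDF can be tried without Spark. Below is a minimal sketch, assuming the binary column holds bytes produced by standard Java serialization (which is what reading them back with ObjectInputStream implies). The MyObject class here is a hypothetical stand-in with a single field; only getSomeLong() is taken from the answer, and the helper names toBytes and fromBytes are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SerializationRoundTrip {

    // Hypothetical stand-in for the question's MyObject (assumption: it is Serializable).
    static class MyObject implements Serializable {
        private static final long serialVersionUID = 1L;
        private final long someLong;

        MyObject(long someLong) {
            this.someLong = someLong;
        }

        long getSomeLong() {
            return someLong;
        }
    }

    // Serialize an object to a byte array with standard Java serialization.
    static byte[] toBytes(Serializable obj) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(obj);
        }
        return buffer.toByteArray();
    }

    // Rebuild the object from its bytes, as the UDF body does.
    static MyObject fromBytes(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (MyObject) in.readObject();
        }
    }

    // Same null handling as the UDF: -1 when there is no value to decode.
    static long extractSomeLong(byte[] bytes) throws IOException, ClassNotFoundException {
        return bytes == null ? -1L : fromBytes(bytes).getSomeLong();
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = toBytes(new MyObject(42L));
        System.out.println(extractSomeLong(bytes)); // prints 42
        System.out.println(extractSomeLong(null));  // prints -1
    }
}
```

Keeping the decoding in a plain static method like this also makes it easy to unit test separately from the Spark job.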

And finally it can be used with:

Dataset<MyObject> data = sparkSession.createDataset(someRDD, Encoders.javaSerialization(MyObject.class));
// "value" is the binary column shown by printSchema() in the question
Dataset<Row> processedData = data.withColumn("ID", functions.callUDF("extractSomeLong", functions.col("value")));

1 Comment

I think you can return a map or struct type if you need to extract more than one field. That is better than extracting each one with its own UDF :)
