1

I need a UDF2 that takes two arguments as input corresponding to two Dataframe columns of types String and mllib.linalg.Vector and return a Tuple2. IS this doable? if yes, how do I register this udf()?

hiveContext.udf().register("getItemData", get_item_data, WHAT GOES HERE FOR RETURN TYPE?);

the udf is defined as follows:

UDF2<String, org.apache.spark.mllib.linalg.Vector, Tuple2<String, org.apache.spark.mllib.linalg.Vector>> get_item_data =
            (String id, org.apache.spark.mllib.linalg.Vector features) -> {
        return new Tuple2<>(id, features);
    };

1 Answer 1

2

There goes a schema which can be defined as follows:

import org.apache.spark.sql.types.DataType;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.mllib.linalg.VectorUDT;

List<StructField> fields = new ArrayList<>();
fields.add(DataTypes.createStructField("id", DataTypes.StringType, false));
fields.add(DataTypes.createStructField("features", new VectorUDT(), false));
DataType schema = DataTypes.createStructType(fields);

but if all you need is just a struct without any additional processing org.apache.spark.sql.functions.struct should do the trick:

df.select(struct(col("id"), col("features"));
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.