
I'm migrating the following PostgreSQL function to a Spark SQL UDF.

DROP FUNCTION IF EXISTS anyarray_enumerate(anyarray);
CREATE FUNCTION anyarray_enumerate(anyarray)
RETURNS TABLE (index bigint, value anyelement) AS
$$
SELECT
    row_number() OVER (),
    value
FROM (
    SELECT unnest($1) AS value
) AS unnested
$$
LANGUAGE sql IMMUTABLE;

The output I get from Spark SQL is not similar to the one obtained in PostgreSQL. Any help or idea?

demo=# select anyarray_enumerate(array[599,322,119,537]);
anyarray_enumerate
--------------------
 (1,599)
 (2,322)
 (3,119)
 (4,537)
 (4 rows)

My current code is:

import scala.collection.mutable.WrappedArray

def anyarray_enumerate[T](anyarray: WrappedArray[T]) = anyarray.zipWithIndex
// Register the function as a UDF so it can be used in SQL statements.
sqlContext.udf.register("anyarray_enumerate", anyarray_enumerate(_: WrappedArray[Int]))

Thank you

1 Answer

Your UDF returns the whole array of tuples in a single row:

spark.sql("select anyarray_enumerate(array(599, 322, 119, 537)) as foo").show()
+--------------------+
|                 foo|
+--------------------+
|[[599,0], [322,1]...|
+--------------------+

but you can use the explode() function to split that into multiple rows:

spark.sql("select explode(anyarray_enumerate(array(599, 322, 119, 537))) as foo").show()
+-------+
|    foo|
+-------+
|[599,0]|
|[322,1]|
|[119,2]|
|[537,3]|
+-------+

Also, the zipWithIndex method returns the value first and index second, unlike your SQL command, but that's easily fixed in the UDF.
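For example, a minimal sketch of that fix (plain Scala collections here, so the shape is visible without a Spark session; in the actual UDF the parameter would stay `WrappedArray[T]`):

```scala
// Sketch of the fix: zipWithIndex yields (value, index) with a 0-based index,
// so swap the pair and shift the index to start at 1, matching SQL's row_number().
def anyarray_enumerate[T](anyarray: Seq[T]): Seq[(Long, T)] =
  anyarray.zipWithIndex.map { case (value, i) => ((i + 1).toLong, value) }

println(anyarray_enumerate(Seq(599, 322, 119, 537)))
// List((1,599), (2,322), (3,119), (4,537))
```

Registered in place of the original, `explode(anyarray_enumerate(...))` then produces rows of `(index, value)` in the same order as the PostgreSQL function.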
