
I have a column full of arrays containing split http requests. I have them filtered down to one of two possibilities:

|[, courses, 27381...|
|[, courses, 27547...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, courses, 33287...|
|[, courses, 24024...|

In both array types, everything from 'courses' onward has the same data and structure.

I want to take a slice of the array using a case statement: if the first element of the array is 'api', take elements 3 -> end of the array; if it's not 'api', just keep the given value. I've tried Python slice syntax [3:] and normal PostgreSQL syntax [3, n], where n is the length of the array.

My ideal end-result would be an array where every row shares the same structure, with courses in the first index for easier parsing from that point onwards.

3 Comments
  • the given value == whole array? Commented Jun 9, 2016 at 21:50
  • given value would just be the row. So for example, if the array is ['courses', 'etc...', ...] then leave the value as that, but if it were ['api', 'v1', 'courses', 'something...'] then set the value to ['courses', 'something', ...] Commented Jun 9, 2016 at 21:53
  • I think a UDF would come in handy here. See ragrawal.wordpress.com/2015/10/02/spark-custom-udf-example Commented Jun 9, 2016 at 22:08

2 Answers


It's very easy: just define a UDF. You asked a very similar question previously, so I won't post the exact answer, to let you think and learn (for your own good).

from pyspark.sql.functions import col, lit, udf

# Toy DataFrame with a single array column named _1.
df = sc.parallelize([(["ab", "bs", "xd"],), (["bc", "cd", ":x"],)]).toDF()

# Drop the first element when the element at index y equals "ab";
# otherwise return the array unchanged.
getUDF = udf(lambda x, y: x[1:] if x[y] == "ab" else x)

df.select(getUDF(col("_1"), lit(0))).show()

+------------------------+
|PythonUDF#<lambda>(_1,0)|
+------------------------+
|                [bs, xd]|
|            [bc, cd, :x]|
+------------------------+
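
For reference, here is a sketch of how the same pattern might map onto the arrays from the question, with an explicit return type. The column name http_col, the leading empty string, and 0-based indexing are assumptions here:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, StringType

# Assumed row shapes: ['', 'api', 'v1', 'courses', ...] or ['', 'courses', ...].
# When the 'api' prefix is present, keep elements 3 -> end so the slice starts
# at 'courses'; otherwise keep the array as given.
strip_api_prefix = udf(
    lambda parts: parts[3:] if len(parts) > 1 and parts[1] == "api" else parts,
    ArrayType(StringType()),
)

df = df.withColumn("cleaned_http_col", strip_api_prefix(col("http_col")))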

3 Comments

So if I pass in a literal lower and upper bound into the UDF, then I can just return array[lbound:ubound] and set my UDF return type to ArrayType(StringType())?
Type is inferred by the UDF, but yes for the rest.
Sometimes you want to avoid a UDF to optimize your code. Here is an example in Scala. But the logic should be very similar: define a slice function (Int, Int) => Column and use when and otherwise from the Column API (see the sketch just below).
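
Along the lines of that last comment, a UDF-free sketch in PySpark (assuming Spark >= 2.4 for slice()/size(), a column named http_col, and the leading empty string seen in the sample rows):

from pyspark.sql import functions as F

# If the second element (0-based index 1) is 'api', drop the ['', 'api', 'v1']
# prefix so the array starts at 'courses'; otherwise keep the row unchanged.
cleaned = F.when(
    F.col("http_col")[1] == "api",
    F.expr("slice(http_col, 4, size(http_col) - 3)"),
).otherwise(F.col("http_col"))

df = df.withColumn("cleaned_http_col", cleaned)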

Assuming that the column in your DataFrame is called http_col and the first item in the array is an empty string, a possible solution is:

df.selectExpr(
  """if(array_contains(http_col, 'api'),
        slice(http_col, 4, size(http_col) - 3),
        http_col) as cleaned_http_col
  """
)

If you have Spark >= 2.4.0 another option could be:

df.selectExpr(
  "array_remove(array_remove(http_col, 'api'), 'v1') as cleaned_http_col"
)
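
As a small usage sketch (toy rows mirroring the question's two shapes; the values are made up), the same idea via the DataFrame API. Note that array_remove drops 'api' and 'v1' wherever they occur in the array, not only at the front:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows in the two shapes from the question.
df = spark.createDataFrame(
    [(["", "api", "v1", "courses", "27381"],), (["", "courses", "33287"],)],
    ["http_col"],
)

# Spark >= 2.4: remove every 'api' and 'v1' token from the array.
df.select(
    F.array_remove(F.array_remove("http_col", "api"), "v1").alias("cleaned_http_col")
).show(truncate=False)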

