
I have a column full of arrays containing split http requests. I have them filtered down to one of two possibilities:

|[, courses, 27381...|
|[, courses, 27547...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, api, v1, cours...|
|[, courses, 33287...|
|[, courses, 24024...|

In both array types, everything from 'courses' onward has the same data and structure.

I want to take a slice of the array using a case statement: if the first element of the array is 'api', take elements 3 -> end of the array; if it's not 'api', just keep the given value. I've tried Python slice syntax [3:] and normal PostgreSQL syntax [3, n], where n is the length of the array.

My ideal end-result would be an array where every row shares the same structure, with courses in the first index for easier parsing from that point onwards.

3 Comments
  • the given value == whole array? Commented Jun 9, 2016 at 21:50
  • given value would just be the row. So for example, if the array is ['courses', 'etc...', ...] then leave the value as that, but if it were ['api', 'v1', 'courses', 'something...'] then set the value to ['courses', 'something', ...] Commented Jun 9, 2016 at 21:53
  • I think a UDF would come in handy here. See ragrawal.wordpress.com/2015/10/02/spark-custom-udf-example Commented Jun 9, 2016 at 22:08

2 Answers


It's very easy: just define a UDF. You asked a very similar question previously, so I won't post the exact answer, to let you think and learn (for your own good).

from pyspark.sql.functions import col, lit, udf

# Toy DataFrame with a single array column named _1.
df = sc.parallelize([(["ab", "bs", "xd"],), (["bc", "cd", ":x"],)]).toDF()

# Drop the first element when the element at index y equals "ab";
# otherwise return the array unchanged.
getUDF = udf(lambda x, y: x[1:] if x[y] == "ab" else x)

df.select(getUDF(col("_1"), lit(0))).show()

+------------------------+
|PythonUDF#<lambda>(_1,0)|
+------------------------+
|                [bs, xd]|
|            [bc, cd, :x]|
+------------------------+
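
For reference, here is a sketch of how the same pattern might map onto the arrays from the question, with an explicit return type. The column name http_col, the leading empty string, and 0-based indexing are assumptions here:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import ArrayType, StringType

# Assumed row shapes: ['', 'api', 'v1', 'courses', ...] or ['', 'courses', ...].
# When the 'api' prefix is present, keep elements 3 -> end so the slice starts
# at 'courses'; otherwise keep the array as given.
strip_api_prefix = udf(
    lambda parts: parts[3:] if len(parts) > 1 and parts[1] == "api" else parts,
    ArrayType(StringType()),
)

df = df.withColumn("cleaned_http_col", strip_api_prefix(col("http_col")))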

3 Comments

So if I pass in a literal lower and upper bound into the UDF, then I can just return array[lbound:ubound] and set my UDF return type to ArrayType(StringType())?
Type is inferred by the UDF, but yes for the rest.
Sometimes you want to avoid a UDF to optimize your code. Here is an example in Scala. But the logic should be very similar: define a slice function (Int, Int) => Column and use when and otherwise from the Column API (see the sketch just below).
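
Along the lines of that last comment, a UDF-free sketch in PySpark (assuming Spark >= 2.4 for slice()/size(), a column named http_col, and the leading empty string seen in the sample rows):

from pyspark.sql import functions as F

# If the second element (0-based index 1) is 'api', drop the ['', 'api', 'v1']
# prefix so the array starts at 'courses'; otherwise keep the row unchanged.
cleaned = F.when(
    F.col("http_col")[1] == "api",
    F.expr("slice(http_col, 4, size(http_col) - 3)"),
).otherwise(F.col("http_col"))

df = df.withColumn("cleaned_http_col", cleaned)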

Assuming that the column in your DataFrame is called http_col and the first item in the array is an empty string, a possible solution is:

df.selectExpr(
  """if(array_contains(http_col, 'api'),
        slice(http_col, 4, size(http_col) - 3),
        http_col) as cleaned_http_col
  """
)

If you have Spark >= 2.4.0 another option could be:

df.selectExpr(
  "array_remove(array_remove(http_col, 'api'), 'v1') as cleaned_http_col"
)
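
As a small usage sketch (toy rows mirroring the question's two shapes; the values are made up), the same idea via the DataFrame API. Note that array_remove drops 'api' and 'v1' wherever they occur in the array, not only at the front:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows in the two shapes from the question.
df = spark.createDataFrame(
    [(["", "api", "v1", "courses", "27381"],), (["", "courses", "33287"],)],
    ["http_col"],
)

# Spark >= 2.4: remove every 'api' and 'v1' token from the array.
df.select(
    F.array_remove(F.array_remove("http_col", "api"), "v1").alias("cleaned_http_col")
).show(truncate=False)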

