I have the following column in a PySpark DataFrame, of type Array[Int].

+--------------------+
|     feature_indices|
+--------------------+
|                 [0]|
|[0, 1, 4, 10, 11,...|
|           [0, 1, 2]|
|                 [1]|
|                 [0]|
+--------------------+

I am trying to pad the array with zeros, and then limit the list length, so that the length of each row's array would be the same. For example, for n = 5, I expect:

+--------------------+
|     feature_indices|
+--------------------+
|     [0, 0, 0, 0, 0]|
|   [0, 1, 4, 10, 11]|
|     [0, 1, 2, 0, 0]|
|     [1, 0, 0, 0, 0]|
|     [0, 0, 0, 0, 0]|
+--------------------+

Any suggestions? I looked at PySpark's rpad function, but it only operates on string columns.

2 Answers

You can write a udf to do this:

from pyspark.sql.types import ArrayType, IntegerType
import pyspark.sql.functions as F

# Keep at most the first 5 elements, then right-pad with zeros up to length 5.
pad_fix_length = F.udf(
    lambda arr: arr[:5] + [0] * (5 - len(arr[:5])),
    ArrayType(IntegerType())
)

df.withColumn('feature_indices', pad_fix_length(df.feature_indices)).show()
+-----------------+
|  feature_indices|
+-----------------+
|  [0, 0, 0, 0, 0]|
|[0, 1, 4, 10, 11]|
|  [0, 1, 2, 0, 0]|
|  [1, 0, 0, 0, 0]|
|  [0, 0, 0, 0, 0]|
+-----------------+
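
The slicing-and-padding logic inside the lambda can be checked in plain Python before it is wrapped in a udf. The following is a minimal sketch, not part of the original answer: `pad_or_truncate` is a hypothetical helper name, the target length `n` and the `fill` value are parameters instead of the hard-coded 5 and 0, and treating a null array as empty is an assumption. The sample rows mirror the question's data, except that the sixth value in the second row (12) is made up, since the question's output truncates that row.

```python
def pad_or_truncate(arr, n, fill=0):
    """Return a list of exactly n items: arr cut to n, then right-padded with fill."""
    # Assumption: a null (None) array is treated as an empty list.
    head = list(arr or [])[:n]
    return head + [fill] * (n - len(head))


# Illustrative rows; the trailing 12 in the second row is invented sample data.
rows = [[0], [0, 1, 4, 10, 11, 12], [0, 1, 2], [1], [0]]
for row in rows:
    print(pad_or_truncate(row, 5))
```

A helper like this could then be wrapped the same way as above, for example `F.udf(lambda arr: pad_or_truncate(arr, 5), ArrayType(IntegerType()))`.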

3 Comments

Excellent, thank you! I was struggling with composing the udf properly.
What happens if we don't pass ArrayType(IntegerType()) to udf?
Is there a way to do this without a pandas UDF? It's a costly computation, and with the size of the data I have, it's tough to use a pandas UDF.
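One possible way to avoid a Python UDF altogether, assuming Spark 2.4 or later (where the SQL functions `slice`, `concat` on arrays, and `array_repeat` are available), is to build the padding as a Spark SQL expression string and pass it to `F.expr`. This is only a sketch: `pad_expr` is a hypothetical helper name, and the expression has not been checked against the asker's data. Note that `concat` returns null when the array itself is null, so null rows would stay null.

```python
def pad_expr(col_name: str, n: int, fill: int = 0) -> str:
    """Build a Spark SQL expression that right-pads an array column with
    `fill` and keeps exactly the first n elements (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Append n fill values, then keep the first n elements (slice is 1-based).
    return f"slice(concat({col_name}, array_repeat({fill}, {n})), 1, {n})"


print(pad_expr("feature_indices", 5))
# prints: slice(concat(feature_indices, array_repeat(0, 5)), 1, 5)
```

With a DataFrame `df` and `import pyspark.sql.functions as F`, the expression could be applied as `df.withColumn('feature_indices', F.expr(pad_expr('feature_indices', 5)))`.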

I recently used the pad_sequences function from Keras to do something similar. I'm not sure of your use case, so this might be an unnecessarily large dependency to add.

Anyway, here's the link to the documentation for the function: https://keras.io/preprocessing/sequence/#pad_sequences

from keras.preprocessing.sequence import pad_sequences

input_sequence = [[1, 2, 3], [1, 2], [1, 4]]

# Pad and truncate at the end of each sequence so every row has length 3.
padded_sequence = pad_sequences(input_sequence, maxlen=3, padding='post', truncating='post', value=0.0)

print(padded_sequence)

The output:

[[1 2 3]
 [1 2 0]
 [1 4 0]]
