
I have a Spark dataframe as follows:

target_id   other_ids
3733345     [3731634, 3729995, 3728014, 3708332, 3720...
3725312     [3711541, 3726052, 3733763, 900056057, 371...
3717114     [3701718, 3713481, 3715433, 3714825, 3731...
3408996     [3405896, 3250400, 3237054, 3242492, 3256...
3354970     [3354969, 3347893, 3348168, 3353273, 3356...

I want to first shuffle the elements of the arrays in the other_ids column, and then create a new column new_id where I sample an id from the other_ids array, excluding target_id from the candidates.
Final result:

target_id   other_ids                                      new_id
3733345     [3731634, 3729995, 3728014, 3708332, 3720...   3708332
3725312     [3711541, 3726052, 3733763, 900056057, 371...  900056057
3717114     [3701718, 3713481, 3715433, 3714825, 3731...   3250400
3408996     [3405896, 3250400, 3237054, 3242492, 3256...   3237054
3354970     [3354969, 3347893, 3348168, 3353273, 3356...   3353273

Any suggestions? Thanks.

  • Can you share your output dataset? I'm kind of lost here! Commented Mar 24, 2022 at 11:57

1 Answer


You can try this:

from pyspark.sql import functions as F

df = df.withColumn('new_id', F.element_at(
    F.shuffle(
        F.array_except(F.col('other_ids'), F.array(F.col('target_id')))
    ),
    1
))
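For reference, the same logic expressed in plain Python (a sketch of what the Spark expression does per row, not Spark code itself — array_except drops target_id, shuffle randomizes the order, element_at(…, 1) takes the first element):

```python
import random

def sample_new_id(target_id, other_ids):
    # array_except: drop target_id from the candidate ids.
    candidates = [i for i in other_ids if i != target_id]
    if not candidates:
        # element_at on an empty array yields null in Spark.
        return None
    # shuffle + element_at(…, 1): randomize, then take the first element.
    random.shuffle(candidates)
    return candidates[0]

# One row from the example dataframe (truncated array from the question):
picked = sample_new_id(3733345, [3731634, 3729995, 3728014, 3708332])
```

The picked value is always one of the other ids and never the target_id itself.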

2 Comments

That is actually a good and simple solution. I was wondering, do you know how to make the shuffle deterministic?
shuffle doesn't take a seed. If you want it to be deterministic, do you need to pick from a random position in the array, or can it always be element_at 1 without shuffle? The other option I can think of is to use a udf with randint.
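One way to get the deterministic behavior the comments discuss is the udf route: seed a per-row random generator from target_id so repeated runs pick the same element. This is only a sketch under that assumption — the function name, the seed offset, and the seeding scheme are illustrative, not part of the original answer:

```python
import random
from typing import List, Optional

def deterministic_pick(target_id: int, other_ids: List[int],
                       seed: int = 42) -> Optional[int]:
    # Seed the RNG from the row's target_id so the choice is
    # reproducible across runs (unlike F.shuffle, which takes no seed).
    rng = random.Random(seed + target_id)
    candidates = [i for i in other_ids if i != target_id]
    return rng.choice(candidates) if candidates else None
```

In Spark you could then register it with something like F.udf(deterministic_pick, LongType()) and pass the two columns; the trade-off is that a Python udf is slower than the built-in array functions.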
