
I have a Spark dataframe as follows:

target_id   other_ids
3733345     [3731634, 3729995, 3728014, 3708332, 3720...
3725312     [3711541, 3726052, 3733763, 900056057, 371...
3717114     [3701718, 3713481, 3715433, 3714825, 3731...
3408996     [3405896, 3250400, 3237054, 3242492, 3256...
3354970     [3354969, 3347893, 3348168, 3353273, 3356...

I want to first shuffle the elements of the arrays in the other_ids column, and then create a new column new_id where I sample an id from the other_ids array, excluding target_id from the candidates.
Final result:

target_id   other_ids                                      new_id
3733345     [3731634, 3729995, 3728014, 3708332, 3720...   3708332
3725312     [3711541, 3726052, 3733763, 900056057, 371...  900056057
3717114     [3701718, 3713481, 3715433, 3714825, 3731...   3250400
3408996     [3405896, 3250400, 3237054, 3242492, 3256...   3237054
3354970     [3354969, 3347893, 3348168, 3353273, 3356...   3353273

Any suggestions? Thanks.

  • Can you share your output dataset? I'm kind of lost here! Commented Mar 24, 2022 at 11:57

1 Answer


You can try this:

from pyspark.sql import functions as F

df = df.withColumn('new_id', F.element_at(
    F.shuffle(
        F.array_except(F.col('other_ids'), F.array(F.col('target_id')))
    ),
    1
))
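For reference, the same logic expressed in plain Python (a sketch of what the Spark expression does per row, not Spark code itself — array_except drops target_id, shuffle randomizes the order, element_at(…, 1) takes the first element):

```python
import random

def sample_new_id(target_id, other_ids):
    # array_except: drop target_id from the candidate ids.
    candidates = [i for i in other_ids if i != target_id]
    if not candidates:
        # element_at on an empty array yields null in Spark.
        return None
    # shuffle + element_at(…, 1): randomize, then take the first element.
    random.shuffle(candidates)
    return candidates[0]

# One row from the example dataframe (truncated array from the question):
picked = sample_new_id(3733345, [3731634, 3729995, 3728014, 3708332])
```

The picked value is always one of the other ids and never the target_id itself.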

2 Comments

That is actually a good and simple solution. I was wondering, do you know how to make the shuffle deterministic?
shuffle doesn't take a seed. If you want it to be deterministic, do you need to pick from a random position in the array, or can it always be element_at 1 without shuffle? The other option I can think of is to use a udf with randint.
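One way to get the deterministic behavior the comments discuss is the udf route: seed a per-row random generator from target_id so repeated runs pick the same element. This is only a sketch under that assumption — the function name, the seed offset, and the seeding scheme are illustrative, not part of the original answer:

```python
import random
from typing import List, Optional

def deterministic_pick(target_id: int, other_ids: List[int],
                       seed: int = 42) -> Optional[int]:
    # Seed the RNG from the row's target_id so the choice is
    # reproducible across runs (unlike F.shuffle, which takes no seed).
    rng = random.Random(seed + target_id)
    candidates = [i for i in other_ids if i != target_id]
    return rng.choice(candidates) if candidates else None
```

In Spark you could then register it with something like F.udf(deterministic_pick, LongType()) and pass the two columns; the trade-off is that a Python udf is slower than the built-in array functions.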
