
I have the following PySpark DataFrame:

root
 |-- tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- posTags: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- dependencies: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- labelledDependencies: array (nullable = true)
 |    |-- element: string (containsNull = true)

with an example of the following data

+------------------------------+---------------------------+-----------------------------------+--------------------------------------------+
|tokens                        |posTags                    |dependencies                       |labelledDependencies                        |
+------------------------------+---------------------------+-----------------------------------+--------------------------------------------+
|[i, try, to, get, my, balance]|[NNP, VB, TO, VB, PRP$, NN]|[try, ROOT, get, try, balance, get]|[nsubj, root, mark, parataxis, appos, nsubj]|
+------------------------------+---------------------------+-----------------------------------+--------------------------------------------+

I want to change the labelled dependency of the token balance from nsubj to dobj.

My logic is as follows: if a token has the labelled dependency nsubj, its POS tag is NN, and it depends on a token whose POS tag is VB (here, get), then change nsubj to dobj.

I can do this with the following function:

def change_things(tokens, posTags, dependencies, labelledDependencies):
    # For each token labelled nsubj whose POS tag is NN and whose head
    # token is tagged VB, relabel the dependency as dobj.
    for i in range(len(labelledDependencies)):
        if labelledDependencies[i] == 'nsubj':
            if posTags[i] == 'NN':
                if posTags[tokens.index(dependencies[i])] == 'VB':
                    labelledDependencies[i] = 'dobj'
    return tokens, posTags, dependencies, labelledDependencies

and maybe even register it as a UDF.

However, my question is how I can do this without using a UDF, relying only on PySpark built-in functions.

1 Answer

You can use the Spark built-in transform higher-order function (available since Spark 2.4):

import pyspark.sql.functions as F

df2 = df.withColumn(
    "labelledDependencies",
    F.expr("""transform(
            labelledDependencies, 
            (x, i) -> CASE WHEN x = 'nsubj' 
                                AND posTags[i] = 'NN' 
                                AND posTags[array_position(tokens, dependencies[i]) - 1] = 'VB' 
                           THEN 'dobj'
                           ELSE x
                      END
        )
    """)
)



df2.show(1, False)
#+------------------------------+---------------------------+-----------------------------------+-------------------------------------------+
#|tokens                        |posTags                    |dependencies                       |labelledDependencies                       |
#+------------------------------+---------------------------+-----------------------------------+-------------------------------------------+
#|[i, try, to, get, my, balance]|[NNP, VB, TO, VB, PRP$, NN]|[try, ROOT, get, try, balance, get]|[nsubj, root, mark, parataxis, appos, dobj]|
#+------------------------------+---------------------------+-----------------------------------+-------------------------------------------+
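To see why the expression reproduces the original loop, here is a plain-Python sketch of the same index arithmetic; the array_position helper below is only an illustration mimicking Spark's 1-based array_position (which is why the expression subtracts 1 before indexing posTags):

```python
tokens = ["i", "try", "to", "get", "my", "balance"]
posTags = ["NNP", "VB", "TO", "VB", "PRP$", "NN"]
dependencies = ["try", "ROOT", "get", "try", "balance", "get"]
labelled = ["nsubj", "root", "mark", "parataxis", "appos", "nsubj"]

def array_position(arr, value):
    # Mimics Spark's array_position: 1-based index of the first match,
    # 0 if the value is absent.
    return arr.index(value) + 1 if value in arr else 0

# Same shape as the SQL transform: (x, i) -> CASE WHEN ... THEN 'dobj' ELSE x END
result = [
    "dobj"
    if x == "nsubj"
    and posTags[i] == "NN"
    and posTags[array_position(tokens, dependencies[i]) - 1] == "VB"
    else x
    for i, x in enumerate(labelled)
]
print(result)  # ['nsubj', 'root', 'mark', 'parataxis', 'appos', 'dobj']
```

As in the Python function, the head lookup is only evaluated for nsubj/NN tokens, so the out-of-vocabulary head ROOT is never looked up.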

4 Comments

Thanks! It worked. What does AND posTags[array_position(tokens, dependencies[i]) - 1] = 'VB' do exactly?
@romborimba it is equivalent to the condition if posTags[tokens.index(dependencies[i])] == 'VB' in your code. i is the index of the current element in the labelledDependencies array, and array_position returns a 1-based position, hence the - 1.
Thanks, it's clear. Final question, how would I go if I wanted to create a new column and not transform an existing one?
@romborimba just rename it ;) df.withColumn("labelledDependencies_v2", ...)
