
I have the following PySpark DataFrame:

root
 |-- tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- posTags: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- dependencies: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- labelledDependencies: array (nullable = true)
 |    |-- element: string (containsNull = true)

with an example of the following data

+------------------------------+---------------------------+-----------------------------------+--------------------------------------------+
|tokens                        |posTags                    |dependencies                       |labelledDependencies                        |
+------------------------------+---------------------------+-----------------------------------+--------------------------------------------+
|[i, try, to, get, my, balance]|[NNP, VB, TO, VB, PRP$, NN]|[try, ROOT, get, try, balance, get]|[nsubj, root, mark, parataxis, appos, nsubj]|
+------------------------------+---------------------------+-----------------------------------+--------------------------------------------+

I want to change the labelled dependency of the token balance from nsubj to dobj.

My logic is as follows: if a token has the labelled dependency nsubj, its POS tag is NN, and it depends on a token whose POS tag is VB (here, get), then change nsubj to dobj.

I can do this with the following function:

def change_things(tokens, posTags, dependencies, labelledDependencies):
    # For each token labelled nsubj whose POS tag is NN and whose head
    # token is tagged VB, relabel the dependency as dobj.
    for i in range(len(labelledDependencies)):
        if labelledDependencies[i] == 'nsubj':
            if posTags[i] == 'NN':
                if posTags[tokens.index(dependencies[i])] == 'VB':
                    labelledDependencies[i] = 'dobj'
    return tokens, posTags, dependencies, labelledDependencies

and maybe even register it as a UDF.

However, my question is how I can do this without using a UDF, relying only on PySpark built-in functions.

1 Answer

You can use the Spark built-in transform higher-order function (available since Spark 2.4):

import pyspark.sql.functions as F

df2 = df.withColumn(
    "labelledDependencies",
    F.expr("""transform(
            labelledDependencies, 
            (x, i) -> CASE WHEN x = 'nsubj' 
                                AND posTags[i] = 'NN' 
                                AND posTags[array_position(tokens, dependencies[i]) - 1] = 'VB' 
                           THEN 'dobj'
                           ELSE x
                      END
        )
    """)
)



df2.show(1, False)
#+------------------------------+---------------------------+-----------------------------------+-------------------------------------------+
#|tokens                        |posTags                    |dependencies                       |labelledDependencies                       |
#+------------------------------+---------------------------+-----------------------------------+-------------------------------------------+
#|[i, try, to, get, my, balance]|[NNP, VB, TO, VB, PRP$, NN]|[try, ROOT, get, try, balance, get]|[nsubj, root, mark, parataxis, appos, dobj]|
#+------------------------------+---------------------------+-----------------------------------+-------------------------------------------+
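To see why the expression reproduces the original loop, here is a plain-Python sketch of the same index arithmetic; the array_position helper below is only an illustration mimicking Spark's 1-based array_position (which is why the expression subtracts 1 before indexing posTags):

```python
tokens = ["i", "try", "to", "get", "my", "balance"]
posTags = ["NNP", "VB", "TO", "VB", "PRP$", "NN"]
dependencies = ["try", "ROOT", "get", "try", "balance", "get"]
labelled = ["nsubj", "root", "mark", "parataxis", "appos", "nsubj"]

def array_position(arr, value):
    # Mimics Spark's array_position: 1-based index of the first match,
    # 0 if the value is absent.
    return arr.index(value) + 1 if value in arr else 0

# Same shape as the SQL transform: (x, i) -> CASE WHEN ... THEN 'dobj' ELSE x END
result = [
    "dobj"
    if x == "nsubj"
    and posTags[i] == "NN"
    and posTags[array_position(tokens, dependencies[i]) - 1] == "VB"
    else x
    for i, x in enumerate(labelled)
]
print(result)  # ['nsubj', 'root', 'mark', 'parataxis', 'appos', 'dobj']
```

As in the Python function, the head lookup is only evaluated for nsubj/NN tokens, so the out-of-vocabulary head ROOT is never looked up.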

4 Comments

Thanks! It worked. What does AND posTags[array_position(tokens, dependencies[i]) - 1] = 'VB' do exactly?
@romborimba it is equivalent to the condition if posTags[tokens.index(dependencies[i])] == 'VB' in your code. i is the index of the current element in the labelledDependencies array, and array_position returns a 1-based position, hence the - 1.
Thanks, it's clear. Final question, how would I go if I wanted to create a new column and not transform an existing one?
@romborimba just rename it ;) df.withColumn("labelledDependencies_v2", ...)
