I have a PySpark DataFrame with the following schema:
root
 |-- tokens: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- posTags: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- dependencies: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- labelledDependencies: array (nullable = true)
 |    |-- element: string (containsNull = true)
and here is an example row of data:
+------------------------------+---------------------------+-----------------------------------+--------------------------------------------+
|tokens |posTags |dependencies |labelledDependencies |
+------------------------------+---------------------------+-----------------------------------+--------------------------------------------+
|[i, try, to, get, my, balance]|[NNP, VB, TO, VB, PRP$, NN]|[try, ROOT, get, try, balance, get]|[nsubj, root, mark, parataxis, appos, nsubj]|
+------------------------------+---------------------------+-----------------------------------+--------------------------------------------+
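For reference, a minimal reproducible version of this DataFrame can be built along these lines (values copied from the example row; an active SparkSession is assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build the single-row example DataFrame with four array<string> columns.
df = spark.createDataFrame(
    [(
        ["i", "try", "to", "get", "my", "balance"],
        ["NNP", "VB", "TO", "VB", "PRP$", "NN"],
        ["try", "ROOT", "get", "try", "balance", "get"],
        ["nsubj", "root", "mark", "parataxis", "appos", "nsubj"],
    )],
    ["tokens", "posTags", "dependencies", "labelledDependencies"],
)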
I want to change the labelled dependency of the token balance from nsubj to dobj.
My logic is as follows:
If a labelled dependency is nsubj, the token's POS tag is NN, and the token depends on a token whose POS tag is VB (here, get), then change nsubj to dobj.
I can do this with the following function:
def change_things(tokens, posTags, dependencies, labelledDependencies):
    for i in range(len(labelledDependencies)):
        # relabel nsubj -> dobj when the token is an NN whose head token is a VB
        if (labelledDependencies[i] == 'nsubj'
                and posTags[i] == 'NN'
                and dependencies[i] in tokens  # guard: heads like 'ROOT' are not tokens
                and posTags[tokens.index(dependencies[i])] == 'VB'):
            labelledDependencies[i] = 'dobj'
    return tokens, posTags, dependencies, labelledDependencies
and maybe even register it as a UDF.
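The registration I have in mind looks roughly like this (a sketch; I return only the modified labelledDependencies array, since the other columns are unchanged anyway):

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# Wrap only the relabelling step in a UDF returning the new label array.
@F.udf(returnType=ArrayType(StringType()))
def relabel(tokens, posTags, dependencies, labelledDependencies):
    labels = list(labelledDependencies)  # copy so the input is not mutated
    for i in range(len(labels)):
        if (labels[i] == 'nsubj'
                and posTags[i] == 'NN'
                and dependencies[i] in tokens
                and posTags[tokens.index(dependencies[i])] == 'VB'):
            labels[i] = 'dobj'
    return labels

df = df.withColumn(
    "labelledDependencies",
    relabel("tokens", "posTags", "dependencies", "labelledDependencies"),
)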
However, my question is: how can I do this without a UDF, using only PySpark's built-in functions?
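Something like the following is the direction I am imagining, but it is a rough, untested sketch (it assumes Spark 2.4+ higher-order functions: transform exposes the element index, and array_position finds the head token's position, 1-based, returning 0 when the head, e.g. ROOT, is not among the tokens):

from pyspark.sql import functions as F

# Rewrite the label array element-wise; the IF falls back to the original
# label whenever any part of the condition does not hold.
df = df.withColumn(
    "labelledDependencies",
    F.expr("""
        transform(labelledDependencies, (lbl, i) ->
            IF(lbl = 'nsubj'
               AND posTags[i] = 'NN'
               AND array_position(tokens, dependencies[i]) > 0
               AND posTags[array_position(tokens, dependencies[i]) - 1] = 'VB',
               'dobj', lbl))
    """),
)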