
I am working with Databricks DataFrames (PySpark).

I have a DataFrame that contains an array of string values.

I need to combine each value from the DataFrame with a value from a Python list that I have.

What I want is to put the DataFrame values into a Python list, like this:

listArray = []

listArray.append(dataframeArrayValue)

print(listArray)
Output:
     [value1, value2, value3]

The problem is that this kind of works, but for some reason I cannot work with the string values that get added to the new list (listArray).

My idea is that I am going to build a URL, where I need to use SQL to get the beginning of that URL. That first part is what I put in the DataFrame array. The last part of the URL is stored in a Python list.

I want to loop through both lists and put the results in a new, empty list.

Something like this:

display(dfList)
Output:
      [dfValue1, dfValue2, dfValue3]

print(pyList)
      [pyValue1, pyValue2, pyValue3]

I want to put them together like this:

dfValue1 + pyValue1, etc.

And getting a list like this:

newArrayContainingBoth = []

# loop with append

result:

print(newArrayContainingBoth)
Output:
[dfValue1+pyValue1, dfValue2+pyValue2, dfValue3+pyValue3]
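A minimal sketch of that loop, assuming both are already plain Python lists of equal length (all values here are illustrative):

dfList = ['dfValue1', 'dfValue2', 'dfValue3']
pyList = ['pyValue1', 'pyValue2', 'pyValue3']

newArrayContainingBoth = []
# zip() pairs the elements of both lists by position
for first, last in zip(dfList, pyList):
    newArrayContainingBoth.append(first + last)

print(newArrayContainingBoth)
# ['dfValue1pyValue1', 'dfValue2pyValue2', 'dfValue3pyValue3']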

I hope my question was clear enough.

4 Comments
  • How did you loop? Commented Oct 26, 2018 at 14:09
  • Did you try? newArrayContainingBoth = dfList + pyList Commented Oct 26, 2018 at 15:43
  • I have not made a loop yet. A problem is that the df looks something like this: [value1, value2], but when I try to get the first element with dfList[0], I get [value1, value2]. I don't know why, because theoretically it is supposed to give me value1 only. Sorry for the bad English. Commented Oct 28, 2018 at 21:40
  • On a note, are you sure df = [value1, value2]? Can you show some sample values of df? Also, if you do python_list = df.collect(), all you have in python_list is a list. Commented Oct 28, 2018 at 21:50

1 Answer


Try these steps:

  • Use explode() to get strings out of that array. Then,
  • collect() the result as a list,
  • extract the string part from each Row,
  • split() by a comma (","),
  • and finally, use it.

First, import explode()

from pyspark.sql.functions import explode 

Assuming your data is in a DataFrame "df"

columns = ['nameOffjdbc', 'some_column']
rows = [
        (['/file/path.something1'], 'value1'),
        (['/file/path.something2'], 'value2')
        ]

df = spark.createDataFrame(rows, columns)
df.show(2, False)
+-----------------------+-----------+
|nameOffjdbc            |some_column|
+-----------------------+-----------+
|[/file/path.something1]|value1     |
|[/file/path.something2]|value2     |
+-----------------------+-----------+

Select the column nameOffjdbc from DataFrame 'df'

dfArray = df.select('nameOffjdbc')
print(dfArray)
DataFrame[nameOffjdbc: array<string>]

Explode the column nameOffjdbc

dfArray = dfArray.withColumn('nameOffjdbc', explode('nameOffjdbc'))
dfArray.show(2, False)
+---------------------+
|nameOffjdbc          |
+---------------------+
|/file/path.something1| 
|/file/path.something2|
+---------------------+

Now collect it into newDfArray (this is the Python list you need).

newDfArray = dfArray.collect()
print(newDfArray)
[Row(nameOffjdbc=u'/file/path.something1'), 
     Row(nameOffjdbc=u'/file/path.something2')]

Since it is (and will be) in the format [Row(column=u'value')], we need to get the value (string) part of it. Hence:

pyList = ",".join(str('{0}'.format(value.nameOffjdbc)) for value in newDfArray)
print(pyList, type(pyList))
('/file/path.something1,/file/path.something2', <type 'str'>)
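As a side note, this join-then-split round trip can be skipped entirely: a list comprehension over the collected Rows yields the final Python list directly. A sketch, using the same newDfArray as above (pyListAlt is an illustrative name):

# Pull the string field straight out of each Row.
pyListAlt = [str(row.nameOffjdbc) for row in newDfArray]
print(pyListAlt)
# ['/file/path.something1', '/file/path.something2']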

Split the value by a comma ",", which will create a list out of a string.

pyList = pyList.split(',')
print(pyList, type(pyList))
(['/file/path.something1', '/file/path.something2'], <type 'list'>)

Use it

print(pyList[0])
/file/path.something1

print(pyList[1])
/file/path.something2

If you want to loop

for items in pyList:
    print(items)
/file/path.something1
/file/path.something2

In a nutshell, the following code is all you need.

from pyspark.sql.functions import explode

columns = ['nameOffjdbc', 'some_column']
rows = [
    (['/file/path.something1'], 'value1'),
    (['/file/path.something2'], 'value2')
    ]
df = spark.createDataFrame(rows, columns)

dfArray = df.select('nameOffjdbc')

dfArray = dfArray.withColumn('nameOffjdbc', explode('nameOffjdbc')).collect()
pyList = ",".join(str('{0}'.format(value.nameOffjdbc)) for value in dfArray).split(',')

NOTE: collect() always collects a DataFrame's values into a list of Rows.
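To tie this back to the original question: once pyList is a plain Python list, the URL assembly is a single pass over both lists with zip(). A sketch, where urlSuffixes is a hypothetical stand-in for the asker's second, pure-Python list:

# Hypothetical suffixes standing in for the second list from the question.
urlSuffixes = ['?id=1', '?id=2']

# Pair each collected URL prefix with its suffix and concatenate.
newArrayContainingBoth = [prefix + suffix for prefix, suffix in zip(pyList, urlSuffixes)]
print(newArrayContainingBoth)
# ['/file/path.something1?id=1', '/file/path.something2?id=2']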


4 Comments

Thank you! It works nicely! The only problem is that when I get to your "Use it" step, my print looks like this: print(pyList[0]) gives [u'/file/path.something1'. Do you know why? Or does it not matter that it looks like that?
I don't want the [, u and ' to be part of my string.
Update: I fixed that problem with .replace()
You need to follow every step and then use it.
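A note on that [u' artifact: it is the repr of a Row (or a list of Rows) being printed, not part of the string itself. Following every step above, indexing the Row and reading its field yields the bare string without any .replace() workaround. A sketch using the earlier newDfArray:

# Read the field off the Row directly to get the bare string.
first = newDfArray[0].nameOffjdbc
print(first)
# /file/path.something1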
