Pyspark simultaneously remove from 2 lists in dataframe

Question

I have a huge dataset and I'm looking for a way to remove the same-indexed items from two lists.

Let me describe an example.

Imagine a Google search: a list of 12 URLs from the results page. The first and the last are advertisements, and the second and the 7th are picture links. I want only the organic links.

The types can be positioned anywhere in the list. I looked at array_remove, which is pretty nice, but it can only remove a specific item from one list, and I'm not advanced enough to figure out how to do it simultaneously for two lists. Sadly the dataset is really big, and I'm afraid that posexplode is not an option for me here.

Keep in mind it's a column of lists, not a column of individual items.

I'm looking for something like

if "adlink" or "picture" in typelist:
   remove it from typelist and remove same indexed item from urls list

  urls  |  type 
-----------------
[url1,  | [adlink, 
 url2,  |  picture,
 url3,  |  link,
 url4,  |  link,
 url5,  |  link, 
 url6,  |  link,
 url7,  |  picture,
 url8,  |  link,
 url9,  |  link,
 url10, |  link,
 url11, |  link,
 url12] |  adlink]

Desired output:

  urls  |  type 
-----------------
[url3,  | [link,
 url4,  |  link,
 url5,  |  link, 
 url6,  |  link,
 url8,  |  link,
 url9,  |  link,
 url10, |  link,
 url11] |  link]

murtihash · Accepted Answer · 2020-04-09 22:57:44Z

df.show(truncate=False)  # your dataframe
+---------------------------------------------------------------------------+----------------------------------------------------------------------------------+
|urls                                                                       |type                                                                              |
+---------------------------------------------------------------------------+----------------------------------------------------------------------------------+
|[url1, url2, url3, url4, url5, url6, url7, url8, url9, url10, url11, url12]|[adlink, picture, link, link, link, link, picture, link, link, link, link, adlink]|
+---------------------------------------------------------------------------+----------------------------------------------------------------------------------+

You can use higher-order functions, since you have Spark 2.4 (I could tell because you used array_remove). First, zip the two arrays together using arrays_zip. Then use filter on the zipped array (type_urls) to drop every element whose type is 'adlink' or 'picture'. Finally, select your desired fields from the filtered column using columnname.fieldname.

filter (a higher-order function) lets you apply a predicate to each element of an array column without having to explode it first (you mentioned posexplode).

arrays_zip returns a merged array of structs in which the N-th struct contains the N-th values of all input arrays (see arrays_zip in the PySpark API docs).

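from pyspark.sql import functions as F

# Zip the two arrays into one array of structs, keep only the structs whose type is
# neither 'adlink' nor 'picture', then pull each struct field back out as an array.
df.withColumn("type_urls", F.arrays_zip(F.col("urls"), F.col("type"))).select("type_urls")\
  .withColumn("urls1", F.expr("""filter(type_urls, x -> x.type != 'adlink' and x.type != 'picture')"""))\
  .select(F.col("urls1.urls"), F.col("urls1.type")).show(truncate=False)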

+--------------------------------------------------+------------------------------------------------+
|urls                                              |type                                            |
+--------------------------------------------------+------------------------------------------------+
|[url3, url4, url5, url6, url8, url9, url10, url11]|[link, link, link, link, link, link, link, link]|
+--------------------------------------------------+------------------------------------------------+
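
To make the zip, filter, select sequence easier to follow, here is a plain-Python sketch of what happens to a single row. It is only an analogy and not Spark code: the function name drop_types_from_row and its excluded parameter are made up for illustration, and Python's zip truncates to the shorter list, whereas arrays_zip pads the shorter array with nulls.

```python
def drop_types_from_row(urls, types, excluded=("adlink", "picture")):
    """Plain-Python analogy of arrays_zip + filter + field selection for one row."""
    # Analogue of arrays_zip: pair up the same-indexed items.
    # (Python's zip truncates to the shorter list; arrays_zip pads with nulls.)
    pairs = list(zip(urls, types))
    # Analogue of filter(type_urls, x -> ...): keep pairs whose type is not excluded.
    kept = [(url, kind) for url, kind in pairs if kind not in excluded]
    # Analogue of selecting urls1.urls and urls1.type: split the pairs back into two lists.
    return [url for url, _ in kept], [kind for _, kind in kept]


urls = ["url%d" % i for i in range(1, 13)]
types = ["adlink", "picture", "link", "link", "link", "link",
         "picture", "link", "link", "link", "link", "adlink"]

kept_urls, kept_types = drop_types_from_row(urls, types)
print(kept_urls)   # the eight organic urls: url3..url6 and url8..url11
print(kept_types)  # eight 'link' entries
```

Because the two lists are filtered as pairs, the indexes cannot drift apart, which is the property the question asks for.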

edited Apr 9, 2020 at 22:57

answered Apr 9, 2020 at 22:28


3 Comments

Leemosh Over a year ago

Hello, thanks for your reply. I was playing with your code a little and wasn't able to solve one problem (probably lack of SQL knowledge): if I have an empty list in urls, how can I handle it? It gives me java.lang.NullPointerException. I understand that it's connected with reading from the list when the list is empty, most likely in the expr(filter..) step.

murtihash Over a year ago

@Leemosh if you have an empty list in urls, do you want to leave that row as is?

Leemosh Over a year ago

I think I solved my problem with .filter(size('type_urls') > 0), but thanks anyway!
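
For readers who hit the same error, the sketch below shows where the size(...) > 0 guard from the comment above could sit relative to the zip and filter steps. It is an untested sketch under assumptions: the helper names build_filter_expr and remove_types are hypothetical, the column names urls and type come from the question, and whether the guard removes the NullPointerException in a given Spark version is only what the commenter reported. Note that the guard drops rows whose zipped array is empty or null instead of keeping them as empty lists.

```python
def build_filter_expr(zipped_col, type_field, excluded):
    """Build the Spark SQL filter(...) expression as a string (hypothetical helper)."""
    conditions = " and ".join(
        "x.{} != '{}'".format(type_field, value) for value in excluded
    )
    return "filter({}, x -> {})".format(zipped_col, conditions)


def remove_types(df, excluded=("adlink", "picture")):
    """Drop excluded types and their same-indexed urls (hypothetical helper)."""
    # Imported here so build_filter_expr can be used without Spark installed.
    from pyspark.sql import functions as F

    zipped = df.withColumn("type_urls", F.arrays_zip(F.col("urls"), F.col("type")))
    # Guard from the comment thread: skip rows whose zipped array is empty or null.
    non_empty = zipped.filter(F.size("type_urls") > 0)
    kept = non_empty.withColumn(
        "kept", F.expr(build_filter_expr("type_urls", "type", excluded))
    )
    return kept.select(F.col("kept.urls").alias("urls"), F.col("kept.type").alias("type"))


# The generated expression matches the one used in the answer:
print(build_filter_expr("type_urls", "type", ["adlink", "picture"]))
```

Calling remove_types(df).show(truncate=False) on the example dataframe should then give the same result as the answer's one-liner, assuming the guard behaves as the commenter describes.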
