
I have a huge dataset and need to remove the items at the same index from two lists.

Let me describe an example.

Imagine a Google search: a list of 12 URLs from the results page. The first and last are advertisements, and the second and seventh are picture links. I want only the organic links.

The types can appear at any position in the list. I looked at array_remove, which is nice, but it can only remove a specific item from one list, and I'm not advanced enough to figure out how to do it simultaneously for two lists. Sadly, the dataset is really big, and I'm afraid that posexplode is not an option for me here.

Keep in mind it's a column of lists, not a column of individual items.

I'm looking for something like

if "adlink" or "picture" in typelist:
   remove it from typelist and remove same indexed item from urls list

  urls  |  type 
-----------------
[url1,  | [adlink, 
 url2,  |  picture,
 url3,  |  link,
 url4,  |  link,
 url5,  |  link, 
 url6,  |  link,
 url7,  |  picture,
 url8,  |  link,
 url9,  |  link,
 url10, |  link,
 url11, |  link,
 url12] |  adlink]

Desired output:

  urls  |  type 
-----------------
[url3,  | [link,
 url4,  |  link,
 url5,  |  link, 
 url6,  |  link,
 url8,  |  link,
 url9,  |  link,
 url10, |  link,
 url11] |  link]

1 Answer

df.show(truncate=False)  # your dataframe
+---------------------------------------------------------------------------+----------------------------------------------------------------------------------+
|urls                                                                       |type                                                                              |
+---------------------------------------------------------------------------+----------------------------------------------------------------------------------+
|[url1, url2, url3, url4, url5, url6, url7, url8, url9, url10, url11, url12]|[adlink, picture, link, link, link, link, picture, link, link, link, link, adlink]|
+---------------------------------------------------------------------------+----------------------------------------------------------------------------------+ 

You can use higher-order functions since you are on Spark 2.4 (I could tell because you used array_remove). First, zip the arrays together using arrays_zip, then use filter on the zipped array (type_urls) to drop every element whose type is 'adlink' or 'picture', and finally select the fields you want from the filtered column using columnname.fieldname.

filter (a higher-order function) lets you filter the elements of an array column without having to explode it (you mentioned posexplode).

arrays_zip returns a merged array of structs in which the N-th struct contains all N-th values of the input arrays (see the arrays_zip entry in the PySpark API docs).
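
As a rough plain-Python analogy of the zip, filter, and select steps described above, here is what happens to a single row. This is only a sketch to build intuition, not Spark code: Spark applies the same idea to every row of the column without exploding it. The variable names (unwanted, zipped, kept) are made up for illustration.

```python
# Plain-Python analogy for ONE row of the DataFrame (not Spark code).
urls = ["url1", "url2", "url3", "url4", "url5", "url6",
        "url7", "url8", "url9", "url10", "url11", "url12"]
types = ["adlink", "picture", "link", "link", "link", "link",
         "picture", "link", "link", "link", "link", "adlink"]

unwanted = {"adlink", "picture"}  # hypothetical name, for illustration only

# Step 1 (like arrays_zip): pair the N-th url with the N-th type.
zipped = list(zip(urls, types))

# Step 2 (like filter): keep only the pairs whose type is not unwanted.
kept = [(u, t) for u, t in zipped if t not in unwanted]

# Step 3 (like selecting urls1.urls and urls1.type): split the pairs back into two lists.
kept_urls = [u for u, _ in kept]
kept_types = [t for _, t in kept]

print(kept_urls)
print(kept_types)
```

Because each url travels together with its type through the filter, the two lists cannot get out of sync, which is the reason for zipping first instead of filtering each array separately.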

from pyspark.sql import functions as F

# Zip both arrays into one array of structs, drop the unwanted types, then pull each field back out.
df.withColumn("type_urls", F.arrays_zip(F.col("urls"), F.col("type"))).select("type_urls")\
  .withColumn("urls1", F.expr("""filter(type_urls, x -> x.type != 'adlink' and x.type != 'picture')"""))\
  .select(F.col("urls1.urls"), F.col("urls1.type")).show(truncate=False)

+--------------------------------------------------+------------------------------------------------+
|urls                                              |type                                            |
+--------------------------------------------------+------------------------------------------------+
|[url3, url4, url5, url6, url8, url9, url10, url11]|[link, link, link, link, link, link, link, link]|
+--------------------------------------------------+------------------------------------------------+

3 Comments

Hello, thanks for your reply. I was playing with your code a little and couldn't solve one problem (probably a lack of SQL knowledge): if I have an empty list in urls, how can I handle it? It gives me java.lang.NullPointerException. I understand it's connected with reading from the list, so an empty list causes this error, most likely in the expr(filter..) step.
@Leemosh if you have an empty list in urls, do you want to leave that row as is?
I think I solved my problem with .filter(size('type_urls') > 0), but thanks anyway!
