I have a PySpark DataFrame with a column url that contains URLs, and a second PySpark DataFrame with columns url and id, where the URLs are inlinks of the first ones (e.g. abc.com in the first DataFrame and abc.com/contact in the second). I want to collect all the ids of the inlinks belonging to a particular domain into a new column of the first DataFrame. I am currently doing this:
# pull both frames onto the driver
url_list = df1.select('url').collect()
all_rows = df2.collect()

ids = list()
urls = list()
for row in all_rows:
    ids.append(row.id)
    urls.append(row.url)

# one entry per base url, matching ids get appended to its string
dict_ids = dict([(i.url, "") for i in url_list])
for url, id in zip(urls, ids):
    res = [ele.url for ele in url_list if ele.url in url]
    if len(res) > 0:
        print(res)
        dict_ids[res[0]] += ('\n\n\n' + id + '\n\n\n')
This is taking a lot of time, so I wanted to push the work into Spark itself and also tried this:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# try to fill the driver-side dict from inside a UDF
def add_id(url, id):
    for i in url_list:
        if i.url in url:
            dict_ids[i.url] += id

add_id_udf = udf(add_id, StringType())
test = df2.withColumn("Test", add_id_udf(df2['url'], df2['id']))
display(test)
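When I force the evaluation (e.g. with a collect) and then look at the dictionary on the driver, it still only contains the empty strings it was initialised with:

test.collect()     # make sure the UDF actually runs
print(dict_ids)    # every value is still ""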
input:
df1::
url
http://example.com
http://example2.com/index.html
df2::
url,id
http://example.com/contact, 12
http://example2.com/index.html/pif, 45
http://example.com/about, 68
http://example2.com/index.html/juk/er, 96
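For reference, the toy frames above can be built like this (assuming an active SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# toy versions of the two frames shown above
df1 = spark.createDataFrame(
    [("http://example.com",), ("http://example2.com/index.html",)],
    ["url"],
)
df2 = spark.createDataFrame(
    [("http://example.com/contact", 12),
     ("http://example2.com/index.html/pif", 45),
     ("http://example.com/about", 68),
     ("http://example2.com/index.html/juk/er", 96)],
    ["url", "id"],
)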
expected output:
df1::
url,id
http://example.com, [12,68]
http://example2.com/index.html, [45,96]
Even a dictionary with the urls as keys and the lists of ids as values would be fine.
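From what I have read so far, I suspect something like a non-equi join on a contains condition followed by collect_list might be the direction to go (the "links"/"base" aliases are just for readability), but I am not sure whether this is correct or whether it performs well on large frames:

from pyspark.sql import functions as F

# tentative idea: match every inlink to the base url it contains,
# then collect the matching ids per base url
joined = df2.alias("links").join(
    df1.alias("base"),
    F.col("links.url").contains(F.col("base.url")),
    "inner",
)
result = joined.groupBy(F.col("base.url").alias("url")) \
               .agg(F.collect_list(F.col("links.id")).alias("id"))
result.show(truncate=False)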
So why does dict_ids remain empty in the UDF approach, and what is the proper Spark way to get the expected output? Can somebody please help me out here?