7

I have two lists:

The first one is a regular list which contains links of Sitemaps:

ur = ['https://www.hi.de/hu/sitemap.xml', 
      'https://www.hi.de/ma/sitemap.xml', 
      'https://www.hi.de/au/sitemap.xml', 
      ]

The second list is nested and contains links which were indexed on the sitemaps and a date for every link:

wh = [['No-Date', 'https://www.hi.de/hu/artikel/xxx', ''],        
      ['2019-11-13', 'https://www.hi.de/ma/artikel/xxx'], 
      ['2019-11-12', 'https://www.hi.de/ma/artikel/xxx'], 
      ['2019-11-11', 'https://www.hi.de/au/artikel/xxx']]

Now I want to merge the list with the nested list based on the sitemap each link came from, like this:

ui = [['https://www.hi.de/hu/sitemap.xml', 'No-Date', 'https://www.hi.de/hu/artikel/xxx', ''],        
      ['https://www.hi.de/ma/sitemap.xml', '2019-11-13', 'https://www.hi.de/ma/artikel/xxx'], 
      ['https://www.hi.de/ma/sitemap.xml', '2019-11-12', 'https://www.hi.de/ma/artikel/xxx'], 
      ['https://www.hi.de/au/sitemap.xml', '2019-11-11', 'https://www.hi.de/au/artikel/xxx']]

But with my code:

ui = [[(url2, x) for url2 in ur for x in y if url2.rsplit('/', 1)[0] in x] for y in wh]

The date in every sublist gets deleted and additionally the entries are stored in a tuple like this:

...
[[('https://www.hi.de/hu/sitemap.xml', 'https://www.hi.de/hu/artikel/xxx', '')],
...

How can I change the code to get the desired result in the variable ui?

6 Answers

5

You can use a list comprehension that checks for the matching sitemap between two lists to get your desired result:

ur = ['https://www.hi.de/hu/sitemap.xml', 
      'https://www.hi.de/ma/sitemap.xml', 
      'https://www.hi.de/au/sitemap.xml', 
      ]

wh = [['No-Date', 'https://www.hi.de/hu/artikel/xxx', ''],        
      ['2019-11-13', 'https://www.hi.de/ma/artikel/xxx'], 
      ['2019-11-12', 'https://www.hi.de/ma/artikel/xxx'], 
      ['2019-11-11', 'https://www.hi.de/au/artikel/xxx']]

print([[u] + x for x in wh for u in ur if x[1].split('/')[3] == u.split('/')[3]])

which outputs:

[['https://www.hi.de/hu/sitemap.xml', 'No-Date', 'https://www.hi.de/hu/artikel/xxx', ''],
 ['https://www.hi.de/ma/sitemap.xml', '2019-11-13', 'https://www.hi.de/ma/artikel/xxx'],
 ['https://www.hi.de/ma/sitemap.xml', '2019-11-12', 'https://www.hi.de/ma/artikel/xxx'],
 ['https://www.hi.de/au/sitemap.xml', '2019-11-11', 'https://www.hi.de/au/artikel/xxx']]

12 Comments

Hi @Austin great solution! However it does not work if my list of sitemaps and urls is longer.
@gython, I assume sitemap always is at 3rd split of /. Please give an example of wrong case.
Hi @Austin, for example I am using a list with 11 sitemaps and a nested list with several thousand links. When I use your code for this case I get the error: IndexError: list index out of range. You are right, sitemap is always at 3rd split of /, but my problem is that my lists have more entries. Can you help me out?
Can you make sure that every sublist in wh has at least 2 elements, that the second element of each sublist has a 3rd split, and that every entry in ur has a 3rd split too? It seems one of these conditions does not hold.
For easy debugging, convert this list comprehension to normal for loops and print each element from wh and ur inside. Somewhere inside you will get the IndexError; the element after the one just printed is the culprit.
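
The debugging suggestion above can be sketched as a plain loop that collects the rows that cannot be processed instead of stopping at the first one. This is only an illustration: the two malformed rows at the end of wh and the name bad_rows are made up for the example, and it assumes every entry in ur is well-formed.

```python
ur = ['https://www.hi.de/hu/sitemap.xml',
      'https://www.hi.de/ma/sitemap.xml']

# Hypothetical data: the last two rows are malformed on purpose.
wh = [['2019-11-13', 'https://www.hi.de/ma/artikel/xxx'],
      ['No-Date', 'https://www.hi.de/hu/artikel/xxx', ''],
      ['2019-11-10'],                        # no URL at all -> row[1] fails
      ['2019-11-09', 'https://www.hi.de']]   # URL without a path -> [3] fails

ui = []
bad_rows = []
for row in wh:
    try:
        # Same lookup as in the comprehension: 4th piece of the URL split on '/'.
        section = row[1].split('/')[3]
    except IndexError:
        # Remember the offending row and keep going.
        bad_rows.append(row)
        continue
    for u in ur:
        if u.split('/')[3] == section:
            ui.append([u] + row)

print(bad_rows)
print(ui)
```

Inspecting bad_rows on the real data should show which entries lack a URL or have a URL that is too short to contain the section.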
4

You can transform ur to a dictionary for easier lookup:

import re
ur = ['https://www.hi.de/hu/sitemap.xml', 'https://www.hi.de/ma/sitemap.xml', 'https://www.hi.de/au/sitemap.xml']
data = [['No-Date', 'https://www.hi.de/hu/artikel/xxx'], ['2019-11-13', 'https://www.hi.de/ma/artikel/xxx'], ['2019-11-12', 'https://www.hi.de/ma/artikel/xxx'], ['2019-11-11', 'https://www.hi.de/au/artikel/xxx']]
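# Map each sitemap's base URL (everything before '/sitemap.') to the sitemap itself.
d = dict((re.split(r'/(?=sitemap\.)', i)[0], i) for i in ur)
# Cut each article URL at the first '/' followed by a path segment of 3+ word characters.
result = [[d[re.split(r'/(?=\w{3,}/)', b)[0]], a, b] for a, b in data]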

Output:

[['https://www.hi.de/hu/sitemap.xml', 'No-Date', 'https://www.hi.de/hu/artikel/xxx'], 
['https://www.hi.de/ma/sitemap.xml', '2019-11-13', 'https://www.hi.de/ma/artikel/xxx'], 
['https://www.hi.de/ma/sitemap.xml', '2019-11-12', 'https://www.hi.de/ma/artikel/xxx'], 
['https://www.hi.de/au/sitemap.xml', '2019-11-11', 'https://www.hi.de/au/artikel/xxx']]
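
As a variation on the dictionary lookup, the key could also be derived with urllib.parse from the standard library instead of a regular expression. This is a sketch, not the answer's method: the helper section_key and the name sitemap_for are hypothetical, and it assumes the sitemap and its articles share the host and the first path segment. Rows without a URL or without a matching sitemap are skipped, and sublists of any length are kept intact.

```python
from urllib.parse import urlsplit

ur = ['https://www.hi.de/hu/sitemap.xml',
      'https://www.hi.de/ma/sitemap.xml',
      'https://www.hi.de/au/sitemap.xml']

wh = [['No-Date', 'https://www.hi.de/hu/artikel/xxx', ''],
      ['2019-11-13', 'https://www.hi.de/ma/artikel/xxx'],
      ['2019-11-12', 'https://www.hi.de/ma/artikel/xxx'],
      ['2019-11-11', 'https://www.hi.de/au/artikel/xxx']]


def section_key(url):
    """Return (host, first path segment), e.g. ('www.hi.de', 'hu')."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split('/') if s]
    return (parts.netloc, segments[0] if segments else '')


# One dictionary entry per sitemap, keyed by host and section.
sitemap_for = {section_key(u): u for u in ur}

ui = []
for row in wh:
    if len(row) < 2:
        continue  # no URL in this row
    sitemap = sitemap_for.get(section_key(row[1]))
    if sitemap is not None:
        ui.append([sitemap, *row])

print(ui)
```

Because each row needs only one dictionary lookup, this should scale better than comparing every row against every sitemap when wh has several thousand entries.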


2

You could combine the elements of both lists with a double for loop, unpack each sublist of the second list with the * operator, and collect the results in a list comprehension.

ui = [
    [i, *j] 
    for i in ur for j in wh 
    if i.split('/')[3] == j[1].split('/')[3]
]

print(ui)

Output:

[
    ['https://www.hi.de/hu/sitemap.xml', 'No-Date', 'https://www.hi.de/hu/artikel/xxx', ''],
    ['https://www.hi.de/ma/sitemap.xml', '2019-11-13', 'https://www.hi.de/ma/artikel/xxx'],
    ['https://www.hi.de/ma/sitemap.xml', '2019-11-12', 'https://www.hi.de/ma/artikel/xxx'],
    ['https://www.hi.de/au/sitemap.xml', '2019-11-11', 'https://www.hi.de/au/artikel/xxx']
]

1 Comment

What happened to this row ['2019-11-11', 'https://www.hi.de/au/artikel/xxx']? I think OP wants a lookup to sitemap, not just merge the lists.
1

You could do a simple list comprehension like this:

>>> ur
['https://www.hi.de/hu/sitemap.xml', 'https://www.hi.de/ma/sitemap.xml', 'https://www.hi.de/au/sitemap.xml']
>>> wh
[['No-Date', 'https://www.hi.de/hu/artikel/xxx', ''], ['2019-11-13', 'https://www.hi.de/ma/artikel/xxx'], ['2019-11-12', 'https://www.hi.de/ma/artikel/xxx'], ['2019-11-11', 'https://www.hi.de/au/artikel/xxx']]
>>> [[u] + w for u,w in zip(ur, wh)]
[['https://www.hi.de/hu/sitemap.xml', 'No-Date', 'https://www.hi.de/hu/artikel/xxx', ''], ['https://www.hi.de/ma/sitemap.xml', '2019-11-13', 'https://www.hi.de/ma/artikel/xxx'], ['https://www.hi.de/au/sitemap.xml', '2019-11-12', 'https://www.hi.de/ma/artikel/xxx']]
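
Note that zip pairs items by position and stops at the shorter input, which is why the result above has only three rows. A small check, using the data from the question, can make this visible; the names paired and mismatched are just for illustration.

```python
ur = ['https://www.hi.de/hu/sitemap.xml',
      'https://www.hi.de/ma/sitemap.xml',
      'https://www.hi.de/au/sitemap.xml']

wh = [['No-Date', 'https://www.hi.de/hu/artikel/xxx', ''],
      ['2019-11-13', 'https://www.hi.de/ma/artikel/xxx'],
      ['2019-11-12', 'https://www.hi.de/ma/artikel/xxx'],
      ['2019-11-11', 'https://www.hi.de/au/artikel/xxx']]

# zip stops after len(ur) == 3 pairs, so the fourth row of wh is dropped.
paired = [[u] + w for u, w in zip(ur, wh)]
print(len(wh), len(paired))

# Rows where the sitemap section differs from the article section.
mismatched = [p for p in paired
              if p[0].split('/')[3] != p[2].split('/')[3]]
print(mismatched)
```

With this data, the third pair combines the au sitemap with a ma article, so positional pairing only gives the desired result when both lists happen to line up one to one.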


1

You can also use enumerate.

ui = [[x] + wh[i] for i,x in enumerate(ur)]
print(ui)

Output:

[
    ['https://www.hi.de/hu/sitemap.xml','No-Date','https://www.hi.de/hu/artikel/xxx',''],
    ['https://www.hi.de/ma/sitemap.xml', '2019-11-13','https://www.hi.de/ma/artikel/xxx'],
    ['https://www.hi.de/au/sitemap.xml','2019-11-12','https://www.hi.de/ma/artikel/xxx']
]

1 Comment

enumerate is useless there, you're not using x, you should use range if you just want the index, but simply iterating over our list is best option in this example.
0

Try using a zip:

[(x[0],x[1][0],x[1][1]) for x in zip(ur, wh)]
