Extract urls from a string of html data

Question

I tried extracting this HTML data with BeautifulSoup, but it only works on tags. What I need is the trailing something.html or some/something.html that follows the prefix www.example.com/products/, with query parameters such as ?search=1 removed. I would prefer to use a regex for this, but I don't know the exact pattern.

input:

System","urlKey":"ppath","value":[],"hidden":false,"locked":false}],"bizData":"Related+Categories=Mobiles","pos":0},"listItems":[{"name":"Sam-Sung B309i High Precision Smooth Keypad Mobile Phone ","nid":"250505808","icons":[],"productUrl":"//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1", "image": ["//www.example.com/products/site/ammaxxllx.html], "https://www.example.com/site/kakzja.html

prefix = "www.example.com/products/"
# do something
# expected output: ['sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html', 'site/ammaxxllx.html']

It'd be super helpful to show a very clear input and then your desired output of the result.
– Capn Jack, Commented Sep 28, 2018 at 15:06
Do you have known conditions for the "prefix"? Can it be either www.example.com/products/ or www.example.com/, for example?
– HolyDanna, Commented Sep 28, 2018 at 15:13
@thelogicalkoan It's a combination of HTML and JSON, but for this case it doesn't matter.
– JustInTime, Commented Sep 28, 2018 at 15:20

Colonel Beauvel · Accepted Answer · 2018-09-28 15:20:27Z

1

I guess you want to use re here, with a trick: since a "?" may follow the "html" in a URI, match everything up to "html" and leave the rest out:

import re

L = ["//www.example.com/products/ammaxxllx.html", "https://www.example.com/site/kakzja.html", "//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1"]
prefix = "www.example.com/products/"

# re.escape makes the dots in the prefix literal; the lazy (.*?) stops at the first ".html"
pattern = re.compile(re.escape(prefix) + r'(.*?\.html)')

>>> [pattern.search(el).group(1) for el in L if pattern.search(el)]
['ammaxxllx.html', 'sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html']
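
The comments on this answer point out that the real input is a single raw string rather than a list. As a rough sketch of how the same idea might carry over, re.findall can scan the whole string in one pass. The variable name raw and the trimmed sample string are assumptions for illustration, and the character class assumes the wanted path never contains a double quote, a "?" or whitespace.

```python
import re

# Trimmed stand-in for the raw HTML/JSON string from the question (assumption).
raw = (
    '"productUrl":"//www.example.com/products/sam-sung-b309i-high-precision-'
    'smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1", '
    '"image": ["//www.example.com/products/site/ammaxxllx.html], '
    '"https://www.example.com/site/kakzja.html'
)
prefix = "www.example.com/products/"

# re.escape makes the dots in the prefix literal. The group captures the
# shortest run of characters that are not a quote, "?" or whitespace and that
# ends in ".html", so a query string such as "?search=1" is never included.
pattern = re.compile(re.escape(prefix) + r'([^"?\s]*?\.html)')

result = pattern.findall(raw)
print(result)
```

Because the pattern has exactly one capture group, findall returns only the captured tails, which matches the list shape the question asks for.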

edited Sep 28, 2018 at 15:20

answered Sep 28, 2018 at 15:14


2 Comments

JustInTime Over a year ago

The input should be

System","urlKey":"ppath","value":[],"hidden":false,"locked":false}],"bizData":"Related+Categories=Mobiles","pos":0},"listItems":[{"name":"Sam-Sung B309i High Precision Smooth Keypad Mobile Phone ","nid":"250505808","icons":[],"productUrl":"//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1", "image": ["//www.example.com/products/site/ammaxxllx.html], "https://www.example.com/site/kakzja.html

and not a list, while the output should be a list.

Colonel Beauvel Over a year ago

The input you gave is not clean. Do you have a dict or text as input?

thelogicalkoan · Accepted Answer · 2018-09-28 15:30:24Z

0

The answer above using the re module works well, but you could also do this without the module, like this:

prefix = 'www.example.com/products/'
L = ['//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1', '//www.example.com/products/site/ammaxxllx.html', 'https://www.example.com/site/kakzja.html']
ans = []
for url in L:
    parts = url.rsplit(prefix, 1)
    try:
        tail = parts[1]  # IndexError if the prefix is absent
        ans.append(tail[:tail.index('.html')] + '.html')  # ValueError if there is no '.html'
    except (IndexError, ValueError):
        continue  # skip URLs that do not match
print(ans)
['sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html', 'site/ammaxxllx.html']
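
If the input is one raw string rather than a list, the same regex-free idea could be sketched with str.find in a loop. The helper name extract_after_prefix and the trimmed sample string are hypothetical, and the sketch assumes every occurrence of the prefix is followed by its own ".html" before the next URL starts; otherwise it would capture too much.

```python
def extract_after_prefix(text, prefix, suffix=".html"):
    """Collect every tail that follows prefix and ends at suffix, without re."""
    found = []
    pos = text.find(prefix)
    while pos != -1:
        start = pos + len(prefix)
        end = text.find(suffix, start)
        if end == -1:
            break  # no suffix after this prefix, so nothing more to collect
        found.append(text[start:end + len(suffix)])
        pos = text.find(prefix, end)  # continue searching after this match
    return found


# Trimmed stand-in for the raw string from the question (assumption).
raw = (
    '"productUrl":"//www.example.com/products/sam-sung-b309i-high-precision-'
    'smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1", '
    '"image": ["//www.example.com/products/site/ammaxxllx.html], '
    '"https://www.example.com/site/kakzja.html'
)
tails = extract_after_prefix(raw, "www.example.com/products/")
print(tails)
```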

answered Sep 28, 2018 at 15:30


2 Comments

JustInTime Over a year ago

The input should be

System","urlKey":"ppath","value":[],"hidden":false,"locked":false}],"bizData":"Related+Categories=Mobiles","pos":0},"listItems":[{"name":"Sam-Sung B309i High Precision Smooth Keypad Mobile Phone ","nid":"250505808","icons":[],"productUrl":"//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1", "image": ["//www.example.com/products/site/ammaxxllx.html], "https://www.example.com/site/kakzja.html

and not a list, while the output should be a list.

thelogicalkoan Over a year ago

You asked how to parse the given URL with a specific pattern and not the input. If you also want answers for that then please update your question with proper input data.

bvidalar · Accepted Answer · 2018-09-28 16:20:10Z

0

Another option is to use urlparse (urllib.parse in Python 3) instead of, or along with, re.

It will allow you to split a URL like this:

from urllib.parse import urlsplit  # Python 3; in Python 2: from urlparse import urlsplit

my_url = "http://www.example.com/products/ammaxxllx.html?spam=eggs#sometag"
url_obj = urlsplit(my_url)

url_obj.scheme    # 'http'
url_obj.netloc    # 'www.example.com'
url_obj.path      # '/products/ammaxxllx.html'
url_obj.query     # 'spam=eggs'
url_obj.fragment  # 'sometag'

# Now you're able to work with every chunk as wanted!
prefix = '/products/'
if url_obj.path.startswith(prefix):
    # Do whatever you need, replacing the initial characters. You can use re here
    print(url_obj.path[len(prefix):])
# prints: ammaxxllx.html
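
As a sketch of how this could apply to the URLs from the question, the same split works on scheme-relative URLs (those starting with "//"), since urlsplit still fills in netloc and path for them. This assumes Python 3 and that the URLs have already been pulled out of the raw string into a list; the names host, path_prefix and tails are illustrative.

```python
from urllib.parse import urlsplit

urls = [
    "//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1",
    "//www.example.com/products/site/ammaxxllx.html",
    "https://www.example.com/site/kakzja.html",
]
host = "www.example.com"
path_prefix = "/products/"

tails = []
for url in urls:
    parts = urlsplit(url)
    # parts.path never includes the query string, so "?search=1" drops out
    # without any extra handling.
    if parts.netloc == host and parts.path.startswith(path_prefix):
        tails.append(parts.path[len(path_prefix):])

print(tails)
```

Checking netloc as well as the path keeps URLs from other hosts out of the result, which a plain substring test on the prefix would not guarantee.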

answered Sep 28, 2018 at 16:20


1 Comment

JustInTime Over a year ago

The input should be

System","urlKey":"ppath","value":[],"hidden":false,"locked":false}],"bizData":"Related+Categories=Mobiles","pos":0},"listItems":[{"name":"Sam-Sung B309i High Precision Smooth Keypad Mobile Phone ","nid":"250505808","icons":[],"productUrl":"//www.example.com/products/sam-sung-b309i-high-precision-smooth-keypad-mobile-phone-i250505808-s341878516.html?search=1", "image": ["//www.example.com/products/site/ammaxxllx.html], "https://www.example.com/site/kakzja.html

and not a list, while the output should be a list.
