How can I extract redirected url with Python without using requests module and via xpath?

Question

The HTML page includes the following script:

<script>
const url = 'REQUIRED LINK';
window.location.href = url + window.location.search;
</script>

This is the only place on the page where the link appears. I don't know Java at all.
I tried to extract it this way:

import requests
from lxml import html

page_2 = requests.get(link).content.decode('UTF-8')
html_tree = html.fromstring(page_2)

inside_scripts = html_tree.xpath("//script[contains(@text, 'url')]")

But it returns an empty list.

Java =/= Javascript

– rdas, Commented May 15, 2020 at 15:12

Constantin · Accepted Answer · 2020-05-15 15:18:13Z

2

Let's suppose const url = 'REQUIRED LINK'; always uses the same formatting, including spaces.

You could run the following code, which uses a regex, to extract 'REQUIRED LINK':

JavaScript:

const regex = /(?<=const url = ').+(?=';)/gm;

var required_link = YOUR_HTML_STRING.match(regex);

Python:

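import re

# Lookarounds keep only the text between "const url = '" and the closing "';"
regex = r"(?<=const url = ').+(?=';)"

# findall returns a list; [0] raises IndexError if the pattern is absent
required_link = re.findall(regex, HTML_STRING)[0]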
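
As a rough, self-contained sketch of the Python variant above: `SAMPLE_HTML`, the `https://example.com/target` link, and the helper name `extract_redirect_url` are all made up for illustration. It also swaps in a lazy `.+?` and `re.search`, which is a small deviation from the answer's pattern, so that a missing match yields `None` instead of an exception.

```python
import re

# Hypothetical sample page that mirrors the script from the question.
SAMPLE_HTML = """<html><head><script>
const url = 'https://example.com/target';
window.location.href = url + window.location.search;
</script></head><body></body></html>"""

# Fixed-width lookbehind and a lookahead keep only the quoted value.
# The lazy quantifier stops at the first closing "';" on the line.
URL_PATTERN = re.compile(r"(?<=const url = ').+?(?=';)")


def extract_redirect_url(html_text):
    """Return the value assigned to `const url`, or None if it is absent."""
    match = URL_PATTERN.search(html_text)
    return match.group(0) if match else None


print(extract_redirect_url(SAMPLE_HTML))
```

Like the original pattern, this still assumes the exact `const url = '...';` formatting, including the spaces and single quotes.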


Pato Navarro · Accepted Answer · 2020-05-15 15:26:57Z

1

You should use:

inside_scripts = html_tree.xpath("//script[contains(., 'url')]")


2 Comments

Pato Navarro Over a year ago

I am not an expert in HTML, but I believe this is related to how XPath is defined.

E.Wiest Over a year ago

@text means you're looking for an attribute named text. The script element here has no such attribute. I suppose you were looking for this: //script[contains(text(),"const url")]
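
To make the difference between the two predicates concrete, here is a minimal sketch that assumes lxml is installed. `SAMPLE_HTML` and the example link are invented for illustration and are not from the question.

```python
from lxml import html

# Hypothetical sample page that mirrors the script from the question.
SAMPLE_HTML = """<html><head><script>
const url = 'https://example.com/target';
window.location.href = url + window.location.search;
</script></head><body></body></html>"""

tree = html.fromstring(SAMPLE_HTML)

# @text selects an attribute called "text"; the script tag has none,
# so the predicate is never true and nothing is selected.
by_attribute = tree.xpath("//script[contains(@text, 'url')]")

# "." is the string value of the element, i.e. the script source itself.
by_string_value = tree.xpath("//script[contains(., 'url')]")

print(len(by_attribute), len(by_string_value))
```

With this sample, the first query selects no elements and the second selects the one script element, whose `.text` holds the JavaScript source.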

E.Wiest · Accepted Answer · 2020-05-15 15:48:03Z

0

One-liner to extract it with XPath 1.0:

print(html_tree.xpath('substring-after(substring-before(//script[contains(.,"const url")],"\';"),"= \'")'))

Output : REQUIRED LINK
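
For reference, a runnable sketch of the same XPath 1.0 expression, assuming lxml is installed. `SAMPLE_HTML` and the example link are placeholders for illustration; the triple-quoted string is used only to avoid escaping the quotes inside the expression.

```python
from lxml import html

# Hypothetical sample page that mirrors the script from the question.
SAMPLE_HTML = """<html><head><script>
const url = 'https://example.com/target';
window.location.href = url + window.location.search;
</script></head><body></body></html>"""

tree = html.fromstring(SAMPLE_HTML)

# substring-before cuts everything from the closing "';" onward,
# substring-after then drops everything up to and including "= '".
XPATH_EXPR = '''substring-after(substring-before(//script[contains(., "const url")], "';"), "= '")'''

redirect_url = tree.xpath(XPATH_EXPR)
print(redirect_url)
```

Because the expression evaluates to a string rather than a node-set, `xpath()` returns a string here; it is empty if no matching script is found.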
