The HTML page includes the following script:

<script>
const url = 'REQUIRED LINK';
window.location.href = url + window.location.search;
</script>

This is the only place on the page where the link appears. I don't know Java at all.
I tried to extract it this way:

import requests
from lxml import html

page_2 = requests.get(link).content.decode('UTF-8')
html_tree = html.fromstring(page_2)

inside_scripts = html_tree.xpath("//script[contains(@text, 'url')]")

But it returns an empty list.

Java =/= JavaScript. Commented May 15, 2020 at 15:12

3 Answers

2

Let's suppose const url = 'REQUIRED LINK'; always uses the same formatting, including spaces.

You could then use a regular expression to extract 'REQUIRED LINK':

JavaScript:

const regex = /(?<=const url = ').+?(?=';)/;

// match() returns an array (or null if nothing matches); [0] is the matched link
const required_link = YOUR_HTML_STRING.match(regex)[0];

Python:

import re

regex = r"(?<=const url = ').+?(?=';)"

# findall() returns a list of matches; [0] is the first one
required_link = re.findall(regex, HTML_STRING)[0]
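
If the formatting is not guaranteed to be identical every time, a slightly more tolerant pattern may help. The following is only a sketch: SAMPLE_HTML, the example.com address, and the function name extract_redirect_url are made up for illustration, and it assumes the link is a plain quoted string assigned to const url.

```python
import re

# Hypothetical sample standing in for the downloaded page source.
SAMPLE_HTML = """
<html><body>
<script>
const url = 'https://example.com/target';
window.location.href = url + window.location.search;
</script>
</body></html>
"""

# Tolerates variable whitespace and either quote style around the value.
URL_ASSIGNMENT = re.compile(r"""const\s+url\s*=\s*(['"])(?P<link>.*?)\1\s*;""")


def extract_redirect_url(page_source):
    """Return the string assigned to `const url`, or None if it is absent."""
    match = URL_ASSIGNMENT.search(page_source)
    return match.group("link") if match else None


print(extract_redirect_url(SAMPLE_HTML))
```

Returning None instead of indexing with [0] avoids an exception when the page does not contain the script at all.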
1

You should use:

inside_scripts = html_tree.xpath("//script[contains(., 'url')]")

2 Comments

I am not an expert in HTML, but I believe this comes down to how XPath is defined.
@text means you're looking for an attribute named text, and the script element has no such attribute. I suppose you were looking for this: //script[contains(text(),"const url")]
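
To make the difference concrete, here is a small sketch comparing the three predicates with lxml. SAMPLE_HTML and the example.com address are invented stand-ins for the real page, not taken from the question.

```python
from lxml import html

# Hypothetical sample standing in for the downloaded page source.
SAMPLE_HTML = """
<html><body>
<script>
const url = 'https://example.com/target';
window.location.href = url + window.location.search;
</script>
</body></html>
"""

tree = html.fromstring(SAMPLE_HTML)

# @text looks for an attribute named "text"; <script> has none, so nothing matches.
by_attribute = tree.xpath("//script[contains(@text, 'url')]")

# text() selects the element's text node; "." is the element's string value.
by_text_node = tree.xpath("//script[contains(text(), 'const url')]")
by_string_value = tree.xpath("//script[contains(., 'const url')]")

print(len(by_attribute), len(by_text_node), len(by_string_value))
```

On this sample the attribute form finds nothing, while both the text() and "." forms find the one script element.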
0

A one-liner to extract it with XPath 1.0:

# The single quotes inside the XPath expression are escaped so the Python string stays valid
print(html_tree.xpath('substring-after(substring-before(//script[contains(.,"const url")],"\';"),"= \'")'))

Output: REQUIRED LINK
