How to extract links from an html page

Question

I have an html page that has data like so:

<td><a href="test-2025-03-24_17-05.log">test-2025-03-24_17-05.log</a></td>
<td><a href="PASS_report_test_2025-03-24_17h07m10.html">PASS_report_test_2025-03-24_17h07m10.html</a></td>
<td><a href="TESTS-test_01.xml">TESTS-test_01.xml</a></td>
<td><a href="TESTS-test_02.xml">TESTS-test_02.xml</a></td>

I would like to extract the link 'PASS_report_test_2025-03-24_17h07m10.html'. The date and timestamp in the link change depending on the day the tests are run. However, the prefix substring 'PASS_report_' does not.

Expected output: PASS_report_test_2025-03-24_17h07m10.html

I tried the solution suggested here:

sed -n 's/.*href="\([^"]*\).*/\1/p' file

But it didn't work, i.e. printing the variable that held the parsed links showed a null (empty) value.

Any suggestions on how to extract the link?

Thank you in advance.

"But it didn't work" doesn't tell us why/how it didn't work: did it generate an error message? Did it generate no output? Did it generate the wrong output? Something else? Please update the question with details on what you mean by "But it didn't work".
– markp-fuso, Commented Mar 26 at 20:37

markp-fuso · Accepted Answer · 2025-03-26 20:53:56Z

4

OP has cut-n-pasted a sed solution from another Q&A but states that it didn't work, which I take to mean that it generated all of the links, i.e.:

$ sed -n 's/.*href="\([^"]*\).*/\1/p' test.html
test-2025-03-24_17-05.log
PASS_report_test_2025-03-24_17h07m10.html
TESTS-test_01.xml
TESTS-test_02.xml

One idea for updating this sed solution to look for just the one link OP is interested in:

$ sed -n 's/.*href="\(PASS_report[^"]*\).*/\1/p' test.html
PASS_report_test_2025-03-24_17h07m10.html
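
Since the question mentions storing the result in a variable, here is a minimal sketch of capturing the sed output with command substitution and checking whether anything was found. The temporary file and the names html_file and report_link are illustrative assumptions, not part of OP's setup.

```shell
# Recreate the sample page from the question in a temporary file
# (the path and variable names are assumptions for illustration).
html_file=$(mktemp)
cat > "$html_file" <<'EOF'
<td><a href="test-2025-03-24_17-05.log">test-2025-03-24_17-05.log</a></td>
<td><a href="PASS_report_test_2025-03-24_17h07m10.html">PASS_report_test_2025-03-24_17h07m10.html</a></td>
<td><a href="TESTS-test_01.xml">TESTS-test_01.xml</a></td>
<td><a href="TESTS-test_02.xml">TESTS-test_02.xml</a></td>
EOF

# Capture the matching href value; head keeps only the first hit
# in case several report links are present.
report_link=$(sed -n 's/.*href="\(PASS_report[^"]*\).*/\1/p' "$html_file" | head -n 1)

# An empty variable means the pattern matched nothing.
if [ -z "$report_link" ]; then
    echo "no PASS_report link found in $html_file" >&2
else
    echo "$report_link"
fi
rm -f "$html_file"
```

If the variable still comes back empty on the real page, comparing the raw file against the pattern (quoting style, line breaks inside the tag) is probably the first thing to check.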

If OP's HTML file is guaranteed to be nicely formatted as in the example, then there are a slew of approaches that will also work, e.g.:

$ grep '"PASS_report' test.html | cut -d'"' -f2
PASS_report_test_2025-03-24_17h07m10.html

$ cut -d'"' -f2 test.html | grep '^PASS_report'
PASS_report_test_2025-03-24_17h07m10.html

$ awk -F'"' '$2~/^PASS_report/ {print $2}' test.html
PASS_report_test_2025-03-24_17h07m10.html

$ while IFS='"' read -r _ link _; do [[ "${link}" =~ ^PASS_report ]] && { echo "${link}"; break; }; done < test.html
PASS_report_test_2025-03-24_17h07m10.html


3 Comments

Léa Gris Mar 26 at 21:51

Wrong tools. Neither Bash, nor awk, nor sed is appropriate for parsing HTML. HTML is not meant to be parsed either. HTML is meant to be rendered by browsers, which try to guess the intention behind its poor and permissive SGML-inspired syntax.

markp-fuso Mar 26 at 22:36

@LéaGris if you're referring to a general purpose HTML parser then yes, I'd agree that bash / awk / sed / cut / grep are not the 'right' tools; on the other hand, if dealing with the occasional, predefined, static formatted HTML file, and you don't have access to an HTML parser, then any tool that gets the job done is the 'right' tool

Léa Gris Mar 27 at 12:29

I agree—when dealing with a predefined, static HTML file and lacking access to a proper parser, using sed, awk, or similar tools can be pragmatic. However, it's important to mention their limitations when writing an answer on Stack Overflow, especially for beginners. They often need to learn both these tools and their constraints, as well as be aware that structured formats like HTML, XML, and JSON have dedicated parsers better suited for the job.

pmf · Accepted Answer · 2025-03-26 22:19:19Z

You can't parse [X]HTML with regex. At least not reliably. Instead, you should use an HTML parser in order to operate on the logical structure of the input (like tags and attributes in the case of HTML), rather than on its textual representation (looking for a string enclosed within a pair of single or double quotes while not containing this type of quotes itself, preceded by "href" and an equals sign, maybe surrounded by some whitespace characters, …). Going through all valid representations is exactly what a parser would do for you.
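
To illustrate the point about textual representations, here is a small sketch that feeds three anchors, which a browser would treat the same way, to the double-quote-only sed pattern from the question. The file names in the sample and the variable names are made up for the demonstration.

```shell
# Three equivalent ways of writing the same kind of anchor
# (hypothetical sample data, not taken from OP's page).
variants=$(mktemp)
cat > "$variants" <<'EOF'
<a href="PASS_report_a.html">double quotes</a>
<a href='PASS_report_b.html'>single quotes</a>
<a href = "PASS_report_c.html">spaces around the equals sign</a>
EOF

# The pattern only knows about href immediately followed by ="...".
matched=$(sed -n 's/.*href="\(PASS_report[^"]*\).*/\1/p' "$variants")
echo "$matched"

# Count the non-empty lines that were extracted.
match_count=$(printf '%s\n' "$matched" | grep -c .)
echo "matched $match_count of 3 equivalent anchors"
rm -f "$variants"
```

Only the first anchor is extracted here; the single-quoted and space-padded forms slip through, which is the kind of case a real parser handles without extra effort.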

One option for such a parser could be htmlq, which can extract parts of the input based on CSS selectors. To target the links in href attributes, use a to select anchor tags, and -a href to print their href attribute values. Also, add -f file.html to read from a file; otherwise the input is expected on STDIN. Then, pipe the output through grep to filter it by any distinctive criteria (which largely depend on what NOT to match, i.e. what to ignore, rather than just on what to match, i.e. what to find). For your sample input, this would do:

htmlq -f file.html -a href 'a' | grep '^PASS_report_'

PASS_report_test_2025-03-24_17h07m10.html

Cyrus · Accepted Answer · 2025-03-29 08:23:04Z

3

With an XML/HTML parser and valid HTML.

Get only the link whose URL starts with the string PASS_report_:

xmlstarlet select --html --template --value-of '//a[starts-with(@href, "PASS_report_")]' -n file.html

Output:

PASS_report_test_2025-03-24_17h07m10.html

edited Mar 29 at 8:23

answered Mar 27 at 20:09


Kent · Accepted Answer · 2025-03-26 20:40:26Z

2

First of all, regex isn't the right tool for parsing XML/HTML.

If you are sure the input is always formatted like the example, this one-liner might help you:

$ grep -o 'PASS_report_[^"]*' file|grep -i 'html$'

$ cat f
<td><a href="test-2025-03-24_17-05.log">test-2025-03-24_17-05.log</a></td>
<td><a href="PASS_report_test_2025-03-24_17h07m10.html">PASS_report_test_2025-03-24_17h07m10.html</a></td>
<td><a href="TESTS-test_01.xml">TESTS-test_01.xml</a></td>
<td><a href="TESTS-test_02.xml">TESTS-test_02.xml</a></td>
 
$ grep -o 'PASS_report_[^"]*' f|grep -i 'html$'
PASS_report_test_2025-03-24_17h07m10.html
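
A possible variant of the one-liner above, assuming a grep that supports -o (GNU and BSD grep do): anchoring the pattern on href=" means only attribute values can match, not the link text, and because -o prints every match on its own line, it also copes with several anchors sharing one line. The sample line and the names page and links are illustrative assumptions.

```shell
# Hypothetical page where two table cells sit on the same line.
page=$(mktemp)
cat > "$page" <<'EOF'
<td><a href="test-2025-03-24_17-05.log">log</a></td><td><a href="PASS_report_test_2025-03-24_17h07m10.html">report</a></td>
EOF

# grep -o emits just the matched part: href="PASS_report_...
# cut then keeps the text after the first double quote.
links=$(grep -o 'href="PASS_report_[^"]*' "$page" | cut -d'"' -f2)
echo "$links"
rm -f "$page"
```

Like the other text-based approaches, this still assumes double-quoted attributes with no whitespace around the equals sign.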


Alexey Melezhik · Accepted Answer · 2025-04-01 13:10:37Z

1

You may use Raku/Sparrow for that:

within:  "<td><a href=" \" (.*?) ".html" \" ">"
regexp: ^^ "PASS_report_test_" (\S+)
end:

code: <<RAKU
!raku
for captures()<> -> $c {
    say "PASS_report_test_", $c[0], ".html";
}
RAKU

The first statement, within:, selects all href lines; the second, regexp:, keeps only the lines with pass report links, capturing the href data.

The code: block iterates over the captured data and prints it line by line.

edited Apr 1 at 13:10

answered Mar 31 at 15:26

