4

I have an html page that has data like so:

<td><a href="test-2025-03-24_17-05.log">test-2025-03-24_17-05.log</a></td>
<td><a href="PASS_report_test_2025-03-24_17h07m10.html">PASS_report_test_2025-03-24_17h07m10.html</a></td>
<td><a href="TESTS-test_01.xml">TESTS-test_01.xml</a></td>
<td><a href="TESTS-test_02.xml">TESTS-test_02.xml</a></td>

I would like to extract the link 'PASS_report_test_2025-03-24_17h07m10.html'. The date and timestamp in the link change depending on the day the tests are run. However, the prefix substring 'PASS_report_' does not.

Expected output: PASS_report_test_2025-03-24_17h07m10.html

I tried the solution suggested here:

sed -n 's/.*href="\([^"]*\).*/\1/p' file

But it didn't work, i.e. the variable that should have held the parsed links was null when I printed it.

Any suggestions on how to extract the link?

Thank you in advance.

2
  • "But it didn't work" doesn't tell us why or how it didn't work: did it generate an error message? Did it generate no output? Did it generate the wrong output? Something else? Please update the question with details on what you mean by "But it didn't work". Commented Mar 26 at 20:37
  • lynx -dump -listonly -nonumbers file.html | sed 's|.*/||' Commented Mar 26 at 22:44

5 Answers

4

OP has cut-n-pasted a sed solution from another Q&A but states that it didn't work, which I take to mean that it generated all of the links, i.e.:

$ sed -n 's/.*href="\([^"]*\).*/\1/p' test.html
test-2025-03-24_17-05.log
PASS_report_test_2025-03-24_17h07m10.html
TESTS-test_01.xml
TESTS-test_02.xml

One idea for updating this sed solution to look for just the one link OP is interested in:

$ sed -n 's/.*href="\(PASS_report[^"]*\).*/\1/p' test.html
PASS_report_test_2025-03-24_17h07m10.html

If OP's html file is guaranteed to be nicely formatted as in the example then there are a slew of approaches that will also work, eg:

$ grep '"PASS_report' test.html | cut -d'"' -f2
PASS_report_test_2025-03-24_17h07m10.html

$ cut -d'"' -f2 test.html | grep '^PASS_report'
PASS_report_test_2025-03-24_17h07m10.html

$ awk -F'"' '$2~/^PASS_report/ {print $2}' test.html
PASS_report_test_2025-03-24_17h07m10.html

$ while IFS='"' read -r _ link _; do [[ "${link}" == PASS_report* ]] && { echo "${link}"; break; }; done < test.html
PASS_report_test_2025-03-24_17h07m10.html
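
Since OP mentions that the variable holding the result came back empty, here is a minimal sketch of capturing the link in a variable and failing loudly when nothing matches. The function names extract_pass_report and get_pass_report_or_fail are made up for illustration, and the sketch assumes one double-quoted href per line, as in the sample:

```shell
# Hypothetical helper: print the first href that starts with PASS_report_.
# Assumes one double-quoted href per line, as in the sample page.
extract_pass_report() {
    sed -n 's/.*href="\(PASS_report_[^"]*\)".*/\1/p' "$1" | head -n 1
}

# Hypothetical wrapper: print the link, or complain on stderr and
# return non-zero when no PASS_report link was found.
get_pass_report_or_fail() {
    local link
    link=$(extract_pass_report "$1")
    if [ -z "$link" ]; then
        echo "no PASS_report link found in $1" >&2
        return 1
    fi
    printf '%s\n' "$link"
}

# Usage (test.html is the sample file name used above):
#   link=$(get_pass_report_or_fail test.html) || exit 1
#   echo "report: $link"
```

Checking for an empty result right where the link is captured makes it easier to tell "the pattern matched nothing" apart from "the variable got lost later in the script".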

3 Comments

Wrong tools. Neither Bash, nor awk, nor sed is appropriate for parsing HTML. HTML is not meant to be parsed either. HTML is meant to be rendered by browsers, which try to guess the intention behind its poor and permissive SGML-inspired syntax.
@LéaGris if you're referring to a general purpose HTML parser then yes, I'd agree that bash / awk / sed / cut / grep are not the 'right' tools; on the other hand, if dealing with the occasional, predefined, static formatted HTML file, and you don't have access to an HTML parser, then any tool that gets the job done is the 'right' tool
I agree: when dealing with a predefined, static HTML file and lacking access to a proper parser, using sed, awk, or similar tools can be pragmatic. However, it's important to mention their limitations when writing an answer on Stack Overflow, especially for beginners. They often need to learn both these tools and their constraints, as well as be aware that structured formats like HTML, XML, and JSON have dedicated parsers better suited for the job.
4

You can't parse [X]HTML with regex. At least not reliably. Instead, you should use an HTML parser in order to operate on the logical structure of the input (like tags and attributes in the case of HTML), rather than on its textual representation (looking for a string enclosed within a pair of single or double quotes while not containing this type of quotes itself, preceded by "href" and an equals sign, maybe surrounded by some whitespace characters, …). Going through all valid representations is exactly what a parser would do for you.
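
To make that concrete, here is a small sketch that feeds the sed command from the question two inputs that are still valid HTML but differ textually from the sample. The file names PASS_report_x.html and b.xml and the function name naive_hrefs are made up for illustration:

```shell
# The sed command from the question, reading HTML from stdin.
naive_hrefs() {
    sed -n 's/.*href="\([^"]*\).*/\1/p'
}

# Single-quoted attribute: valid HTML, but the pattern expects a
# double quote, so nothing is printed.
printf '%s\n' "<a href='PASS_report_x.html'>x</a>" | naive_hrefs

# Two anchors on one line: the leading greedy ".*" swallows the first
# anchor, so only the last href (b.xml) is printed.
printf '%s\n' '<a href="PASS_report_x.html">x</a><a href="b.xml">b</a>' | naive_hrefs
```

In both cases the report link is silently dropped, which is the kind of failure a parser working on tags and attributes avoids.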

One option for such a parser is htmlq, which can extract parts of the input based on CSS selectors. To get the links in href attributes, use a to select anchor tags and -a href to print their href attribute values. Also add -f file.html to read from a file; otherwise the input is expected on STDIN. Then pipe the output through grep to filter it by any distinctive criteria (which largely depend on what NOT to match, i.e. what to ignore, rather than just on what to match, i.e. what to find). For your sample input, this would do:

htmlq -f file.html -a href 'a' | grep '^PASS_report_'
PASS_report_test_2025-03-24_17h07m10.html
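
If the prefix alone turns out to be too loose, the grep stage can be tightened. The following is only a sketch: the function name filter_pass_reports is made up, and the date/time layout in the pattern is inferred from the single sample link, so it may need adjusting for real data:

```shell
# Hypothetical filter: keep only hrefs shaped like the sample report name.
# The layout YYYY-MM-DD_HHhMMmSS is an assumption based on the one
# sample link in the question.
filter_pass_reports() {
    grep -E '^PASS_report_test_[0-9]{4}-[0-9]{2}-[0-9]{2}_[0-9]{2}h[0-9]{2}m[0-9]{2}\.html$'
}

# Usage with the htmlq command above:
#   htmlq -f file.html -a href 'a' | filter_pass_reports
```

Because the filter reads one href per line from STDIN, it works the same way behind any tool that prints the href values.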


3

With an XML/HTML parser and valid HTML.

Get only the link whose URL starts with the string PASS_report_ and print its href value:

xmlstarlet select --html --template --value-of '//a[starts-with(@href, "PASS_report_")]/@href' -n file.html

Output:

PASS_report_test_2025-03-24_17h07m10.html


2

First of all, regex isn't the right tool for parsing xml/html.

If you are sure the input is always formatted like the example, this one-liner might help you:

grep -o 'PASS_report_[^"]*' file | grep -i 'html$'

For example:

$ cat f
<td><a href="test-2025-03-24_17-05.log">test-2025-03-24_17-05.log</a></td>
<td><a href="PASS_report_test_2025-03-24_17h07m10.html">PASS_report_test_2025-03-24_17h07m10.html</a></td>
<td><a href="TESTS-test_01.xml">TESTS-test_01.xml</a></td>
<td><a href="TESTS-test_02.xml">TESTS-test_02.xml</a></td>
 
$ grep -o 'PASS_report_[^"]*' f|grep -i 'html$'
PASS_report_test_2025-03-24_17h07m10.html
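
If the page ever lists more than one PASS report (an assumption; the sample has only one), the same grep idea can be extended to pick the newest. This sketch relies on the zero-padded date and time in the file name sorting chronologically as plain text; the function name latest_pass_report is made up:

```shell
# Hypothetical helper: print the newest PASS_report_*.html name in a file.
# Assumes the timestamp in the name is zero-padded (as in the sample),
# so a byte-wise sort is also a chronological sort.
latest_pass_report() {
    grep -o 'PASS_report_[^"]*\.html' "$1" | LC_ALL=C sort -u | tail -n 1
}

# Usage:
#   latest_pass_report f
```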


1

You may use Raku/Sparrow for that:

within:  "<td><a href=" \" (.*?) ".html" \" ">"
regexp: ^^ "PASS_report_test_" (\S+)
end:

code: <<RAKU
!raku
for captures()<> -> $c {
    say "PASS_report_test_", $c[0], ".html";
}
RAKU

The first statement, within:, keeps only the href lines; the second, regexp:, keeps only the lines with pass report links and captures the href data.

The code: block iterates over the captured data and prints it line by line.

