Issue
I am using awk to extract data from a file, specifically the quoted string in <a href="DATA">.
Source file.
...
<!-- Page 18 -->
<p style="position:absolute;top:956px;left:485px;white-space:nowrap" class="ft1829"><a href="145041">145041</a></p>
<p style="position:absolute;top:586px;left:246px;white-space:nowrap" class="ft1829"><a href="145042">145042</a></p>
<p style="position:absolute;top:156px;left:446px;white-space:nowrap" class="ft1829"><a href="440332">440332</a></p>
<!-- Page 19 -->
<p style="position:absolute;top:1205px;left:53px;white-space:nowrap" class="ft1938"><b>1 790,- </b>|<a href="457710"> 457710</a></p>
<p style="position:absolute;top:1205px;left:634px;white-space:nowrap" class="ft1938"><b>2 290,- </b>|<a href="464429"> 464429</a></p>
<p style="position:absolute;top:924px;left:353px;white-space:nowrap" class="ft1938"><b>2 590,- </b>|<a href="464430"> 464430</a></p>
...
Command (written with help from this forum).
awk '/Page/ {h=$3} /-- Page 1 --/ {h="Title"} /href=/ && h {split($0,a,"\"");print h","a[6]}'
Results.
...
18,145041
18,145042
18,440332
19,457710
19,464429
...
The problem is that when several links are on the same line, only the first link is processed.
Example.
`<a href="457710"> 457710</a></p> | <a href="464429"> 464429</a></p>`
Output.
...
18,457710,
...
Expected output.
...
18,457710,
18,464429,
...
What is wrong with the awk command?
Thanks for any ideas.
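A note on the likely cause: `split($0,a,"\"")` followed by `a[6]` picks one fixed field, so at most one href per line is printed. Below is a minimal sketch of one possible fix, assuming POSIX awk and that page markers always look like `<!-- Page N -->`. The function name `extract_links` is made up for illustration. It loops with `match()` and cuts the already processed part off the line until no href is left.

```shell
# extract_links is a hypothetical helper name, not part of the original command
extract_links() {
  awk '
    /Page/         { page = $3 }        # remember the current page number
    /-- Page 1 --/ { page = "Title" }   # kept from the original command
    page != "" {
      line = $0
      # consume the line one href at a time
      while (match(line, /href="[^"]*"/)) {
        # skip the 6 characters of href=" and drop the closing quote
        print page "," substr(line, RSTART + 6, RLENGTH - 7)
        line = substr(line, RSTART + RLENGTH)
      }
    }
  ' "$@"
}
```

It can be called as `extract_links file.html` (the file name is a placeholder) or fed from a pipe.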
Update 1
I need to extract all hrefs from this input.
<!-- Page 1 -->
<p style="position:absolute;top:397px;left:23px;white-space:nowrap" class="ft116"><a href="237002">237002 </a>|<a href="237003"> 237003</a></p>
<p style="position:absolute;top:831px;left:666px;white-space:nowrap" class="ft124"><a href="230041">230041</a></p>
<p style="position:absolute;top:855px;left:447px;white-space:nowrap" class="ft116"><a href="467173">467173</a></p>
<p style="position:absolute;top:910px;left:36px;white-space:nowrap" class="ft116">Hmotnost: 6 kg | <a href="464431">464431</a></p>
<!-- Page 2 -->
<p style="position:absolute;top:1176px;left:561px;white-space:nowrap" class="ft216"><a href="318417">318417</a></p>
<p style="position:absolute;top:963px;left:561px;white-space:nowrap" class="ft216"><a href="338701">338701</a></p>
...
Command.
awk 'match($0,/class=\"[a-zA-Z]+[0-9]+/){val=substr($0,RSTART,RLENGTH);sub(/[^0-9]*/,"",val)} match($0,/<a href=\"[0-9]+/){val1=substr($0,RSTART,RLENGTH);sub(/[^"]*\"/,"",val1);print substr(val,1,2)","val1}' test.html
Output.
11,237002
12,230041
11,467173
11,464431
21,318417
...
But I need this (for example, 1,237003 is missing from the result above, and the page number in the first column is wrong).
1,237002
1,237003
1,230041
1,467173
1,464431
2,318417
...
Thanks.
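Two things seem to go wrong in the command from Update 1: `match()` only reports the leftmost match, so one href per line is printed, and the first column is built from the digits of the class name (`ft116` gives `11`) rather than from the page comment. A hedged sketch of an alternative, assuming POSIX awk and that every page starts with a `<!-- Page N -->` line. The name `page_links` is hypothetical. It splits each line on `href="` so that every piece after the first begins with a link value.

```shell
# page_links is a hypothetical helper name used only for this sketch
page_links() {
  awk '
    # page marker lines look like: <!-- Page 1 -->
    /<!-- Page [0-9]+ -->/ { page = $3; next }
    page != "" {
      # every piece after the first starts right after href="
      n = split($0, piece, /href="/)
      for (i = 2; i <= n; i++) {
        sub(/".*/, "", piece[i])   # keep only the text before the closing quote
        print page "," piece[i]
      }
    }
  ' "$@"
}
```

Used as `page_links test.html`, this takes the page number from the comment line instead of the class attribute.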