
Issue

I am extracting data from a file with awk, specifically the string inside the quotes of <a href="DATA">.

Source file.

...

<!-- Page 18 -->
<p style="position:absolute;top:956px;left:485px;white-space:nowrap" class="ft1829"><a href="145041">145041</a></p>
<p style="position:absolute;top:586px;left:246px;white-space:nowrap" class="ft1829"><a href="145042">145042</a></p>
<p style="position:absolute;top:156px;left:446px;white-space:nowrap" class="ft1829"><a href="440332">440332</a></p>
<!-- Page 19 -->
<p style="position:absolute;top:1205px;left:53px;white-space:nowrap" class="ft1938"><b>1&#160;790,-&#160;</b>|<a href="457710">&#160;457710</a></p>
<p style="position:absolute;top:1205px;left:634px;white-space:nowrap" class="ft1938"><b>2 290,-&#160;</b>|<a href="464429">&#160;464429</a></p>
<p style="position:absolute;top:924px;left:353px;white-space:nowrap" class="ft1938"><b>2 590,-&#160;</b>|<a href="464430">&#160;464430</a></p>

...

Command (written with help from this forum).

awk '/Page/ {h=$3} /-- Page 1 --/ {h="Title"} /href=/ && h {split($0,a,"\"");print h","a[6]}'

Results.

...

18,145041
18,145042
18,440332
19,457710
19,464429

...

The problem is that when several links are on the same line, only the first link is processed.

Example.

`<a href="457710">&#160;457710</a></p> | <a href="464429">&#160;464429</a></p>`

Output.

...

18,457710,

...

Expected output.

...

18,457710,
18,464429,

...

What is wrong with the awk command?

Thanks for any ideas.

Update 1

I need to extract all hrefs from this input.

<!-- Page 1 -->
<p style="position:absolute;top:397px;left:23px;white-space:nowrap" class="ft116"><a href="237002">237002&#160;</a>|<a href="237003">&#160;237003</a></p>
<p style="position:absolute;top:831px;left:666px;white-space:nowrap" class="ft124"><a href="230041">230041</a></p>
<p style="position:absolute;top:855px;left:447px;white-space:nowrap" class="ft116"><a href="467173">467173</a></p>
<p style="position:absolute;top:910px;left:36px;white-space:nowrap" class="ft116">Hmotnost:&#160;6&#160;kg&#160;|&#160;<a href="464431">464431</a></p>
<!-- Page 2 -->
<p style="position:absolute;top:1176px;left:561px;white-space:nowrap" class="ft216"><a href="318417">318417</a></p>
<p style="position:absolute;top:963px;left:561px;white-space:nowrap" class="ft216"><a href="338701">338701</a></p>

...

Command.

awk 'match($0,/class=\"[a-zA-Z]+[0-9]+/){val=substr($0,RSTART,RLENGTH);sub(/[^0-9]*/,"",val)} match($0,/<a href=\"[0-9]+/){val1=substr($0,RSTART,RLENGTH);sub(/[^"]*\"/,"",val1);print substr(val,1,2)","val1}' test.html

Output.

11,237002
12,230041
11,467173
11,464431
21,318417
...

But I need this (for example, 1,237003 is missing from the result above, and the page number in the first column is wrong).

1,237002
1,237003
1,230041
1,467173
1,464431
2,318417

...

Thanks.

    Don't use line-oriented tools for parsing HTML/XML. There are syntax-aware programs for doing that, like pup. Commented Aug 29, 2019 at 7:09

2 Answers


Since the awk command only processes the first hyperlink on each line, first split the input so that each hyperlink starts on its own line, then pipe the result to awk:

sed 's/\(a href=\)/\n\1/g' data-file | awk '/Page/ ....'
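A possible reason the second column goes missing after this split (an observation, not something confirmed in the thread): once each `a href=` starts its own line, the quoted value is no longer the sixth `"`-separated piece of the line but the second, so `a[6]` comes out empty. Below is a minimal sketch of the pipeline with that index adjusted. It assumes GNU sed, where `\n` in the replacement produces a newline, and the function name `page_links_sed` is only illustrative.

```shell
# Hypothetical helper: reads the HTML on stdin and prints "page,href" lines.
# Assumes GNU sed, which turns \n in the replacement into a real newline.
page_links_sed() {
  # Put every "a href=" at the start of its own line.
  sed 's/\(a href=\)/\n\1/g' |
  # After the split the href value is the 2nd quote-separated piece, not the 6th.
  awk '/Page/ {h=$3} /-- Page 1 --/ {h="Title"} /href=/ && h {split($0,a,"\""); print h","a[2]}'
}
```

It could be called as `page_links_sed < data-file`.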

3 Comments

Hi, thanks for the idea, but now it outputs only the page numbers, not the second column (the content of the href quotes).
@genderbee, it worked fine for me; I'm not sure why it didn't for you. Does the line you showed, with two hrefs on the same line, actually occur in your Input_file?
@RavinderSingh13 Could you try with the updated input in the question and let me know? Thanks.

Tested with the given example; could you please try the following.

awk '
{
  gsub("</p> | ","&\n")
  $1=$1
}
match($0,/class=\"[a-zA-Z]+[0-9]+/){
  val=substr($0,RSTART,RLENGTH)
  sub(/[^0-9]*/,"",val)
}
match($0,/<a href=\"[0-9]+/){
  val1=substr($0,RSTART,RLENGTH)
  sub(/[^"]*\"/,"",val1)
  print substr(val,1,2)","val1
  val=val1=""
}
'  Input_file
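For reference, a sketch of why a snippet like the one above can still print one href per line: `match()` only reports the leftmost match, and newlines inserted into `$0` with `gsub` do not make awk treat the pieces as separate records. In addition, `substr(val,1,2)` takes the first two digits of the class number (`ft116` gives `11`), not the page. One way around both is to loop over the line and read the page from the comment. This assumes the page number always appears in a `<!-- Page N -->` comment and that the hrefs are purely numeric, as in the samples. The function name `page_links` is hypothetical.

```shell
# Hypothetical helper: reads the HTML on stdin, prints "page,href" for every numeric href.
page_links() {
  awk '
    # Remember the page number from comment lines such as <!-- Page 18 -->.
    /<!-- Page [0-9]+ -->/ { page = $3; next }
    {
      line = $0
      # match() reports only the leftmost match, so cut it off and search again.
      while (match(line, /<a href="[0-9]+/)) {
        # The prefix <a href=" is 9 characters long; the rest of the match is the number.
        print page "," substr(line, RSTART + 9, RLENGTH - 9)
        line = substr(line, RSTART + RLENGTH)
      }
    }
  '
}
```

Called as `page_links < test.html`, this should print one line per href, with the page number from the preceding comment in the first column.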

3 Comments

Nope, it still skips the other hrefs on the line.
@genderbee, could you please try awk 'match($0,/class=\"[a-zA-Z]+[0-9]+/){val=substr($0,RSTART,RLENGTH);sub(/[^0-9]*/,"",val)} match($0,/<a href=\"[0-9]+/){val1=substr($0,RSTART,RLENGTH);sub(/[^"]*\"/,"",val1);print substr(val,1,2)","val1}' Input_file once? I tried with your shown samples and it worked fine for me.
I tried, but the result is still the same. Could you look at Update 1 in the question and let me know whether the output is the same for you, please? Thank you in advance.
