Bash, awk, get specific string from file

Question

Issue

I am extracting data from a file with awk, specifically the string inside the quotes of <a href="DATA">.

Source file.

...

<!-- Page 18 -->
<p style="position:absolute;top:956px;left:485px;white-space:nowrap" class="ft1829"><a href="145041">145041</a></p>
<p style="position:absolute;top:586px;left:246px;white-space:nowrap" class="ft1829"><a href="145042">145042</a></p>
<p style="position:absolute;top:156px;left:446px;white-space:nowrap" class="ft1829"><a href="440332">440332</a></p>
<!-- Page 19 -->
<p style="position:absolute;top:1205px;left:53px;white-space:nowrap" class="ft1938"><b>1&#160;790,-&#160;</b>|<a href="457710">&#160;457710</a></p>
<p style="position:absolute;top:1205px;left:634px;white-space:nowrap" class="ft1938"><b>2 290,-&#160;</b>|<a href="464429">&#160;464429</a></p>
<p style="position:absolute;top:924px;left:353px;white-space:nowrap" class="ft1938"><b>2 590,-&#160;</b>|<a href="464430">&#160;464430</a></p>

...

Command (written with help from this forum).

awk '/Page/ {h=$3} /-- Page 1 --/ {h="Title"} /href=/ && h {split($0,a,"\"");print h","a[6]}'

Results.

The problem is that when several links are on the same line, only the first link is processed.

Example.

`<a href="457710">&#160;457710</a></p> | <a href="464429">&#160;464429</a></p>`

Output.

...

18,457710,

...

Expected output.

...

18,457710,
18,464429,

...

What is wrong with the awk command?

Thanks for any ideas.

Update 1

I need to extract all hrefs from this input.

<!-- Page 1 -->
<p style="position:absolute;top:397px;left:23px;white-space:nowrap" class="ft116"><a href="237002">237002&#160;</a>|<a href="237003">&#160;237003</a></p>
<p style="position:absolute;top:831px;left:666px;white-space:nowrap" class="ft124"><a href="230041">230041</a></p>
<p style="position:absolute;top:855px;left:447px;white-space:nowrap" class="ft116"><a href="467173">467173</a></p>
<p style="position:absolute;top:910px;left:36px;white-space:nowrap" class="ft116">Hmotnost:&#160;6&#160;kg&#160;|&#160;<a href="464431">464431</a></p>
<!-- Page 2 -->
<p style="position:absolute;top:1176px;left:561px;white-space:nowrap" class="ft216"><a href="318417">318417</a></p>
<p style="position:absolute;top:963px;left:561px;white-space:nowrap" class="ft216"><a href="338701">338701</a></p>

...

Command.

awk 'match($0,/class=\"[a-zA-Z]+[0-9]+/){val=substr($0,RSTART,RLENGTH);sub(/[^0-9]*/,"",val)} match($0,/<a href=\"[0-9]+/){val1=substr($0,RSTART,RLENGTH);sub(/[^"]*\"/,"",val1);print substr(val,1,2)","val1}' test.html

Output.

But I need this (for example, 1,237003 is missing from the result above, and the page number in the first column is wrong).

Thanks.

Don't use line-oriented tools for parsing HTML/XML. There are syntax-aware programs for doing that, like pup.
– oguz ismail, Commented Aug 29, 2019 at 7:09

suspectus · Accepted Answer · 2019-08-29 07:07:19Z

1

Because the awk command only processes the first hyperlink on each line, preprocess the file so that each hyperlink starts on its own line:

sed 's/\(a href=\)/\n\1/g' data-file | awk '/Page/ ....'
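A minimal sketch of this approach, assuming GNU sed (which accepts `\n` in the replacement) and the page-tracking awk command from the question. Once each link starts its own line, the href value sits between the first pair of quotes on that line, so the sketch prints `a[2]` rather than `a[6]`; keeping `a[6]` after the split may be why a run yields page numbers with an empty second column. The function name `split_links` and the shortened sample input are illustrative only.

```shell
# split_links: read HTML on stdin, print "page,href" for every link.
# Assumes GNU sed, where \n in the replacement inserts a newline.
split_links() {
  sed 's/<a href=/\n&/g' |
  awk '
    /Page/         { h = $3 }        # remember the current page number
    /-- Page 1 --/ { h = "Title" }   # the question labels page 1 as Title
    /^<a href=/ && h {
      split($0, a, "\"")             # a[2] is the text inside the first quotes
      print h "," a[2]
    }
  '
}

# Illustrative input: one page marker, then two links on a single line.
printf '%s\n' \
  '<!-- Page 19 -->' \
  '<p class="ft1938"><a href="457710">&#160;457710</a>|<a href="464429">&#160;464429</a></p>' |
  split_links
# prints:
# 19,457710
# 19,464429
```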


3 Comments

genderbee Over a year ago

Hi, thanks for the idea, but now it outputs only the page numbers, not the second column (the quoted href value).

RavinderSingh13 Over a year ago

@genderbee, it worked fine for me; I'm not sure why it didn't for you. Does the line you showed, with two hrefs on the same line, actually occur in your Input_file?

genderbee Over a year ago

@RavinderSingh13 Could you try with input updated in question and let me know? Thanks.

RavinderSingh13 · Accepted Answer · 2019-08-29 07:11:31Z

1

Tested with the given example; could you please try the following?

awk '
{
  gsub("</p> | ","&\n")
  $1=$1
}
match($0,/class=\"[a-zA-Z]+[0-9]+/){
  val=substr($0,RSTART,RLENGTH)
  sub(/[^0-9]*/,"",val)
}
match($0,/<a href=\"[0-9]+/){
  val1=substr($0,RSTART,RLENGTH)
  sub(/[^"]*\"/,"",val1)
  print substr(val,1,2)","val1
  val=val1=""
}
'  Input_file


3 Comments

genderbee Over a year ago

Nope, it still skips the other hrefs on the line.

RavinderSingh13 Over a year ago

@genderbee, could you please try

awk 'match($0,/class=\"[a-zA-Z]+[0-9]+/){val=substr($0,RSTART,RLENGTH);sub(/[^0-9]*/,"",val)} match($0,/<a href=\"[0-9]+/){val1=substr($0,RSTART,RLENGTH);sub(/[^"]*\"/,"",val1);print substr(val,1,2)","val1}'  Input_file

once? I tried it with the samples you showed and it worked fine for me.

genderbee Over a year ago

I tried, but the result is still the same. Could you look at Update 1 in the question and let me know whether you get the same output, please? Thank you in advance.
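Both awk commands above make a single `match()` call per line, and `match()` reports only the leftmost match, which is consistent with the later links being skipped. A minimal sketch of one alternative is to loop over the line, cutting off everything up to each match before searching again. It takes the page number from the `<!-- Page N -->` comment rather than from the `class` attribute, on the assumption that the comment is the intended source of the first column. The name `all_links` and the shortened sample lines are illustrative only.

```shell
# all_links: read HTML on stdin, print "page,href" for every numeric href.
all_links() {
  awk '
    /<!-- Page [0-9]+ -->/ { page = $3; next }   # remember the current page number
    {
      line = $0
      # match() finds only the leftmost hit, so consume the line piece by piece
      while (match(line, /<a href="[0-9]+/)) {
        href = substr(line, RSTART + 9, RLENGTH - 9)   # skip the 9 characters of <a href="
        print page "," href
        line = substr(line, RSTART + RLENGTH)          # continue after this match
      }
    }
  '
}

# Illustrative input modelled on Update 1 (style attributes omitted).
printf '%s\n' \
  '<!-- Page 1 -->' \
  '<p class="ft116"><a href="237002">237002&#160;</a>|<a href="237003">&#160;237003</a></p>' \
  '<!-- Page 2 -->' \
  '<p class="ft216"><a href="318417">318417</a></p>' |
  all_links
# prints:
# 1,237002
# 1,237003
# 2,318417
```

As the comment on the question notes, a syntax-aware HTML parser is likely the more robust choice if the markup varies beyond these samples.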
