Get multiple words after a specific word in HTML using Linux/Unix scripting

Question

I have a file 'movie.html':

<html>
<head><title>Index of /Data/Movies/Hollywood/2016_2017/</title></head>
<body bgcolor="white">
<h1>Index of /Data/Movies/Hollywood/2016_2017/</h1><hr><pre><a href="../">../</a>
<a href="1%20Buck%20%282017%29/">1 Buck (2017)/</a>                                     25-Nov-2019 10:25       -
<a href="1%20Mile%20to%20You%20%282017%29/">1 Mile to You (2017)/</a>                              25-Nov-2019 10:26       -
<a href="1%20Night%20%282016%29/">1 Night (2016)/</a>                                    25-Nov-2019 10:27       -
</pre><hr></body>
</html>

I want to extract the title and link of each entry, pipe-delimited, like this:

title | link
1 Buck (2017) | 1%20Buck%20%282017%29/
1 Mile to You (2017) | 1%20Mile%20to%20You%20%282017%29/
1 Night (2016) | 1%20Night%20%282016%29/

I tried this code:

awk -F'[><]' 'BEGIN{ print "title","link" } /%29/ {print $3,$2}' movie.html > output.txt

but the output isn't what I expected. Please help me; I am still a beginner.

jared_mamrot · Accepted Answer · 2021-04-09 01:23:58Z

2

Parsing HTML with regex is not advised for several reasons (see https://stackoverflow.com/a/1732454/12957340), but here is one potential solution:

awk -F'[<>/"]' 'BEGIN{ print "title | link" }; /\(.*\)/ {print $6 " | " $3 "/"}' movie.html
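
To see why fields 3 and 6 are the ones to print, it may help to dump every field that this separator produces for one sample line. This is only a sketch: `show_fields` is a hypothetical helper name, and the sample line is copied from the question with the trailing columns shortened. Note that "/" is itself a separator, so the trailing slash of the link is not part of field 3.

```shell
# Sample line copied from movie.html (trailing columns shortened).
line='<a href="1%20Buck%20%282017%29/">1 Buck (2017)/</a>   25-Nov-2019 10:25   -'

# show_fields is a hypothetical helper: print every field with its index,
# using the same field separator as the one-liner above.
show_fields() {
  printf '%s\n' "$1" |
    awk -F'[<>/"]' '{ for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i }'
}

show_fields "$line"
```

Empty fields appear wherever two separator characters sit next to each other, which is why the title ends up in field 6 rather than field 4.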

RavinderSingh13 · Accepted Answer · 2021-04-09 03:34:15Z

With your shown samples, please try the following. I prefer to do this with the match function.

awk '
BEGIN{
  OFS=" | "
  print "title | link"
}
match($0,/^<a href="[^"]*/){
  val=substr($0,RSTART+9,RLENGTH-9)
  match($0,/>.*<\/a>/)
  print substr($0,RSTART+1,RLENGTH-6),val
}' Input_file

Explanation: Adding detailed explanation for above.

awk '                                      ## Start the awk program.
BEGIN{                                     ## The BEGIN section runs once, before any input is read.
  OFS=" | "                                ## Set the output field separator to space, pipe, space.
  print "title | link"                     ## Print the header line.
}
match($0,/^<a href="[^"]*/){               ## Match from <a href=" at the start of the line up to, but not including, the closing quote.
  val=substr($0,RSTART+9,RLENGTH-9)        ## Store the link in val, skipping the 9 characters of <a href=".
  match($0,/>.*<\/a>/)                     ## Match from the first > through </a>.
  print substr($0,RSTART+1,RLENGTH-6),val  ## Print the title (without the leading > and the trailing /</a>), then val, joined by OFS.
}
' Input_file                               ## The input file name.
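
The offsets +9, +1 and -6 are easier to follow when the values of RSTART and RLENGTH are printed for a concrete line. The following is a minimal sketch; `trace_match` is a hypothetical helper name and the sample line is taken from the question without its date columns.

```shell
line='<a href="1%20Night%20%282016%29/">1 Night (2016)/</a>'

# trace_match is a hypothetical helper: show what each match() call found
# and which substring the offsets select.
trace_match() {
  printf '%s\n' "$1" | awk '
  match($0, /^<a href="[^"]*/) {
    # RSTART+9 skips the 9 characters of <a href="
    printf "link  RSTART=%d RLENGTH=%d -> %s\n", RSTART, RLENGTH, substr($0, RSTART+9, RLENGTH-9)
    match($0, />.*<\/a>/)
    # +1 skips ">", and -6 drops ">" plus the trailing "/</a>"
    printf "title RSTART=%d RLENGTH=%d -> %s\n", RSTART, RLENGTH, substr($0, RSTART+1, RLENGTH-6)
  }'
}

trace_match "$line"
```

Because the second regex uses a greedy `.*`, this relies on each line containing a single `</a>`, as in the sample listing.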

Jetchisel · Accepted Answer · 2021-04-09 13:11:14Z

2

If ed is available/acceptable, and you understand the risk of using a non-HTML parser to parse HTML files:

script.ed

0a
title | link
.
p
g/^<a href=.\{1,\}/s/^.\{1,\}="//\
s/\/[[:blank:]]*<\/a>.*$//\
s/">/ /\
s/^\([^ ]\{1,\}\) \(.\{1,\}\)/\2 | \1/p
Q

Then

ed -s file.html < script.ed
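
If ed is not installed, roughly the same chain of substitutions could be expressed with sed. This is a sketch under assumptions, not part of the ed solution: `extract_with_sed` is a hypothetical function name, the temporary file only recreates part of the sample listing, and the same caveat about parsing HTML with regular expressions applies.

```shell
# Recreate part of the sample listing in a temporary file (path is arbitrary).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<h1>Index of /Data/Movies/Hollywood/2016_2017/</h1><hr><pre><a href="../">../</a>
<a href="1%20Buck%20%282017%29/">1 Buck (2017)/</a>                                     25-Nov-2019 10:25       -
<a href="1%20Night%20%282016%29/">1 Night (2016)/</a>                                    25-Nov-2019 10:27       -
</pre><hr></body>
EOF

# extract_with_sed is a hypothetical helper mirroring the ed substitutions.
extract_with_sed() {
  printf 'title | link\n'
  sed -n '/^<a href="/{
    # Drop the leading <a href="
    s/^<a href="//
    # Drop the trailing slash of the title, the closing tag, and the date columns
    s/\/[[:blank:]]*<\/a>.*$//
    # Turn the "> between link and title into a single space
    s/">/ /
    # Swap the two parts and print: title | link
    s/^\([^ ]\{1,\}\) \(.\{1,\}\)$/\2 | \1/p
  }' "$1"
}

extract_with_sed "$sample"
```

As with the ed script, the swap in the last substitution depends on the link containing no literal spaces, which holds here because they are encoded as %20.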

Victor Lee · Accepted Answer · 2021-04-09 04:03:01Z

1

Another way: I think you could extract the relevant parts of each line with grep, and then use awk to format the output.

grep -oP 'href="([^".]*)">([^</.]*)' movie.html | awk -F'[">]' 'BEGIN{print "title | link"}{print $4" | "$2}'

grep will output lines like these:

href="1%20Buck%20%282017%29/">1 Buck (2017)
href="1%20Mile%20to%20You%20%282017%29/">1 Mile to You (2017)
href="1%20Night%20%282016%29/">1 Night (2016)
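
The pattern uses no Perl-only constructs, so on a system whose grep lacks -P, the -E option should work as well, assuming that grep still supports -o (GNU and BSD grep both do, although -o is not required by POSIX). A minimal sketch, where `list_links` is a hypothetical function name and the temporary file recreates part of the sample listing:

```shell
# Recreate part of the sample listing in a temporary file (path is arbitrary).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<h1>Index of /Data/Movies/Hollywood/2016_2017/</h1><hr><pre><a href="../">../</a>
<a href="1%20Buck%20%282017%29/">1 Buck (2017)/</a>                                     25-Nov-2019 10:25       -
<a href="1%20Mile%20to%20You%20%282017%29/">1 Mile to You (2017)/</a>                              25-Nov-2019 10:26       -
</pre><hr></body>
EOF

# list_links is a hypothetical helper: same pipeline, with -E instead of -P.
# The capture groups are dropped because only the whole match is used.
list_links() {
  grep -oE 'href="[^".]*">[^</.]*' "$1" |
    awk -F'[">]' 'BEGIN { print "title | link" } { print $4 " | " $2 }'
}

list_links "$sample"
```

Since both bracket expressions exclude ".", the parent-directory entry "../" is skipped, but any title or link containing a dot would be skipped or truncated too.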

Carlos Pascual · Accepted Answer · 2021-04-09 05:49:10Z

1

Adding sub() and gsub() calls to your code:

awk -F'[><]' 'BEGIN{ print "title","|", "link" } /%29/ {sub(/\//, " |", $3);gsub(/^a href="|"$/, "", $2);print $3,$2}' file
title | link
1 Buck (2017) | 1%20Buck%20%282017%29/
1 Mile to You (2017) | 1%20Mile%20to%20You%20%282017%29/
1 Night (2016) | 1%20Night%20%282016%29/

To write the result to a file instead:

awk -F'[><]' 'BEGIN{ print "title","|", "link" } /%29/ {sub(/\//, " |", $3);gsub(/^a href="|"$/, "", $2);print $3,$2}' file > output.txt

edited Apr 9, 2021 at 5:49

answered Apr 9, 2021 at 5:07
