I have a file, movie.html:

<html>
<head><title>Index of /Data/Movies/Hollywood/2016_2017/</title></head>
<body bgcolor="white">
<h1>Index of /Data/Movies/Hollywood/2016_2017/</h1><hr><pre><a href="../">../</a>
<a href="1%20Buck%20%282017%29/">1 Buck (2017)/</a>                                     25-Nov-2019 10:25       -
<a href="1%20Mile%20to%20You%20%282017%29/">1 Mile to You (2017)/</a>                              25-Nov-2019 10:26       -
<a href="1%20Night%20%282016%29/">1 Night (2016)/</a>                                    25-Nov-2019 10:27       -
</pre><hr></body>
</html>

I want to extract each title and its link, pipe-delimited, like this:

title | link
1 Buck (2017) | 1%20Buck%20%282017%29/
1 Mile to You (2017) | 1%20Mile%20to%20You%20%282017%29/
1 Night (2016) | 1%20Night%20%282016%29/

I tried this code:

awk -F'[><]' 'BEGIN{ print "title","link" } /%29/ {print $3,$2}' movie.html > output.txt

but the output isn't what I expected. Please help me; I am still a beginner.

5 Answers

2

Parsing HTML with regex is not advised for several reasons (see https://stackoverflow.com/a/1732454/12957340), but here is one potential solution:

awk -F'[<>/"]' 'BEGIN{ print "title | link" }; /\(.*\)/ {print $6 " | " $3 "/"}' movie.html
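With -F'[<>/"]', every <, >, / and " splits the line, so it is not obvious which field number each piece lands in. A minimal sketch for checking this, assuming one line copied from the question as test input (show_fields is an illustrative helper name, not part of the answer): it prints each field next to its index. Note that the / ending the link is itself a separator, so it is not part of $3.

```shell
# Sample line copied from movie.html in the question (test input only)
line='<a href="1%20Buck%20%282017%29/">1 Buck (2017)/</a>    25-Nov-2019 10:25       -'

# show_fields is a hypothetical helper: split on the same separators as the
# answer and print every field with its index, so the right $N can be picked
show_fields() {
  printf '%s\n' "$1" |
    awk -F'[<>/"]' '{ for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i }'
}

show_fields "$line"
```

Empty brackets in the output mark the empty fields that appear wherever two separators sit next to each other, which is why the title ends up as far along as $6.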

2

With your shown samples, please try the following. I prefer to do this with the match function.

awk '
BEGIN{
  OFS=" | "
  print "title | link"
}
match($0,/^<a href="[^"]*/){
  val=substr($0,RSTART+9,RLENGTH-9)
  match($0,/>.*<\/a>/)
  print substr($0,RSTART+1,RLENGTH-6),val
}' Input_file

Explanation: the same program with a comment on each line.

awk '                                      ## Start the awk program.
BEGIN{                                     ## The BEGIN block runs once, before any input is read.
  OFS=" | "                                ## Set the output field separator to space, pipe, space.
  print "title | link"                     ## Print the header line.
}
match($0,/^<a href="[^"]*/){               ## Match from <a href=" at the start of the line up to (not including) the closing ".
  val=substr($0,RSTART+9,RLENGTH-9)        ## Store the link in val, skipping the 9-character prefix <a href=".
  match($0,/>.*<\/a>/)                     ## Match from the first > through </a>.
  print substr($0,RSTART+1,RLENGTH-6),val  ## Print the title (dropping the leading > and the trailing /</a>) followed by val.
}
' Input_file                               ## Name of the input file.
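The offsets 9 and 6 in the substr calls come from the lengths of the fixed text around each match. A small sketch that prints RSTART and RLENGTH for both match calls, assuming one anchor copied from the question as test input (trace_match is an illustrative helper name, not part of the answer):

```shell
# Anchor copied from movie.html in the question (test input only)
line='<a href="1%20Buck%20%282017%29/">1 Buck (2017)/</a>'

# trace_match is a hypothetical helper: run the same two match() calls as the
# answer and show the positions they report next to the extracted text
trace_match() {
  printf '%s\n' "$1" | awk '
  match($0, /^<a href="[^"]*/) {
    # 9 is the length of the literal prefix <a href="
    printf "link:  RSTART=%d RLENGTH=%d -> %s\n", RSTART, RLENGTH, substr($0, RSTART + 9, RLENGTH - 9)
    match($0, />.*<\/a>/)
    # Skip the leading > and drop 6 characters in total: that > plus the trailing /</a>
    printf "title: RSTART=%d RLENGTH=%d -> %s\n", RSTART, RLENGTH, substr($0, RSTART + 1, RLENGTH - 6)
  }'
}

trace_match "$line"
```

If the markup around the link ever changes (for example extra attributes before href), these hard-coded offsets would need to change with it.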

2

If ed is available/acceptable and you understand the risk of using a non-HTML parser to parse HTML files:

script.ed

0a
title | link
.
p
g/^<a href=.\{1,\}/s/^.\{1,\}="//\
s/\/[[:blank:]]*<\/a>.*$//\
s/">/ /\
s/^\([^ ]\{1,\}\) \(.\{1,\}\)/\2 | \1/p
Q

Then run:

ed -s file.html < script.ed

1

Another way: extract the relevant parts of each line with grep, then use awk to format the output.

grep -oP 'href="([^".]*)">([^</.]*)' movie.html | awk -F'[">]' 'BEGIN{print "title | link"}{print $4" | "$2}'

grep extracts lines like these:

href="1%20Buck%20%282017%29/">1 Buck (2017)
href="1%20Mile%20to%20You%20%282017%29/">1 Mile to You (2017)
href="1%20Night%20%282016%29/">1 Night (2016)

1

Adding the sub() and gsub() functions to your code:

awk -F'[><]' 'BEGIN{ print "title","|", "link" } /%29/ {sub(/\//, " |", $3);gsub(/^a href="|"$/, "", $2);print $3,$2}' file
title | link
1 Buck (2017) | 1%20Buck%20%282017%29/
1 Mile to You (2017) | 1%20Mile%20to%20You%20%282017%29/
1 Night (2016) | 1%20Night%20%282016%29/

With the output redirected to a file:

awk -F'[><]' 'BEGIN{ print "title","|", "link" } /%29/ {sub(/\//, " |", $3);gsub(/^a href="|"$/, "", $2);print $3,$2}' file > output.txt
