I'm trying to parse some HTML subtitles into a Bash array using html-xml-utils. I also tried a Lynx dump to tidy the markup first, but I hit the same problem: I can't get my sed pipeline to put more than one word at a time into the array.
Code:
array=($(echo $PAGE |
hxselect -i ".sub_info_container .sub_title" |
sed -r 's/.*\">(.*)<\/a>.*/\1/' ))
echo $array
This gets piped into sed:
<div class="sub_title"><a class="sub_title" href="/link">Some Random Title.</a></div><div class="sub_title"><a class="sub_title" href="/link2">Another subtitle I want.</a>
Output of echo $array:
Some
What I'm trying to get:
Some Random Title
Dropping the trailing punctuation would be nice (the subtitles often end in ? or ! instead of a period), but keeping the punctuation would work too.
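For context, here is a small sketch of what seems to produce the single word. It assumes the cause is Bash word splitting: an unquoted command substitution inside array=( ... ) is split on whitespace, and $array on its own expands to element 0 only. The variable names text, words and lines are made up for this illustration.

```bash
#!/usr/bin/env bash
# Hypothetical sample standing in for one extracted subtitle
text='Some Random Title.'

# Unquoted $(...) is split on whitespace, so each word becomes its own element
words=($(printf '%s\n' "$text"))
echo "${#words[@]}"   # element count after word splitting
echo "$words"         # $words is shorthand for ${words[0]}

# mapfile -t stores each input line as one element, spaces included
mapfile -t lines < <(printf '%s\n' "$text")
echo "${#lines[@]}"
echo "${lines[0]}"
```

Printing the whole array takes "${array[@]}" rather than $array.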
Things I've tried:
- Using Lynx to pretty up the code, then using awk to grab the elements
- A lot of different sed and awk methods of grabbing the text
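A minimal sketch of the result I'm after, written against the sample markup above rather than real hxselect output. It assumes GNU sed (for \n in the replacement) and Bash 4+ (for mapfile); the names html and titles are placeholders. Each closing </a> is turned into a line break first, because a greedy .* on a single line would keep only the last link's text.

```bash
#!/usr/bin/env bash
# Sample input copied from the question; stands in for the hxselect output
html='<div class="sub_title"><a class="sub_title" href="/link">Some Random Title.</a></div><div class="sub_title"><a class="sub_title" href="/link2">Another subtitle I want.</a>'

# 1) break the stream after every </a> so each link sits on its own line
# 2) keep only the text that follows the opening <a ...> tag
# 3) strip trailing ., ? or ! characters
mapfile -t titles < <(
  printf '%s\n' "$html" |
    sed 's/<\/a>/\n/g' |
    sed -n 's/.*<a [^>]*>\(.*\)$/\1/p' |
    sed 's/[.?!]*$//'
)

# One title per array element, spaces preserved
printf '%s\n' "${titles[@]}"
```

With the sample input this gives two elements, one full title each. Dropping the third sed keeps the punctuation.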