I'm trying to parse some HTML subtitles into an array using Bash and html-xml-utils. I also tried a Lynx dump to tidy up the output, but I hit the same problem: I can't get sed to put more than one word at a time into each array element.

Code:

    array=($(echo $PAGE |
       hxselect -i ".sub_info_container .sub_title" |
       sed -r 's/.*\">(.*)<\/a>.*/\1/' ))

    echo $array

This gets piped into sed:

<div class="sub_title"><a class="sub_title" href="/link">Some Random Title.</a></div><div class="sub_title"><a class="sub_title" href="/link2">Another subtitle I want.</a>

Output of echo $array:

Some

What I'm trying to get:

Some Random Title

Dropping the punctuation would be nice (the subtitles often end in ? or ! instead of a period), but keeping it would work too.

Things I've tried:

  • Using Lynx to pretty up the code, then using awk to grab the elements
  • A lot of different sed and awk methods of grabbing the text
Comments

  • Remove the first set of parentheses. Commented Oct 6, 2015 at 1:14
  • Use an XML parser (xmlstarlet, xmllint, ...). Commented Oct 6, 2015 at 1:17

2 Answers

I'm not sure why, but my code ended up splitting the text on spaces into separate array items. The solution was the following code:

    array=($(echo $PAGE |
           hxselect -i ".sub_info_container .sub_title" |
           lynx -stdin -dump | tr " " - ))

I used tr to turn the spaces into dashes so that each title is passed into the array as a single element. Removing the extra parentheses, as everybody suggested, actually removes the array assignment, which I stated was my intention. After the code completes, I simply convert all the dashes back to spaces. It's not pretty, but it works!
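For what it's worth, the splitting comes from Bash itself: an unquoted `$(...)` inside `array=( ... )` is word-split on whitespace, so every word becomes its own element, and `echo $array` prints only element 0. Below is a minimal sketch of a way to avoid the dash round trip. It assumes Bash 4 or later (for `mapfile`) and that the pipeline emits one subtitle per line. The `get_titles` function and the `titles` name are hypothetical stand-ins, not part of the original code.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the real pipeline (hxselect ... | lynx -stdin -dump),
# assumed to emit one subtitle per line.
get_titles() {
    printf '%s\n' 'Some Random Title.' 'Another subtitle I want?'
}

# mapfile -t stores one line per array element and strips the trailing
# newline, so spaces inside a title are preserved.
mapfile -t titles < <(get_titles)

# Optionally drop a single trailing ., ? or ! from every element.
titles=("${titles[@]%[.?!]}")

# Quote the expansion so each element is printed intact.
printf '%s\n' "${titles[@]}"
```

With the real pipeline, `get_titles` would be replaced by the `echo "$PAGE" | hxselect ... | lynx -stdin -dump` chain. Any leading indentation that Lynx adds would still need trimming.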


Try this:

    s='<div class="sub_title"><a class="sub_title" href="/link">Some Random Title.</a></div><div class="sub_title"><a class="sub_title" href="/link2">Another subtitle I want.</a>'

    # Put every div on its own line (\n in the replacement needs GNU sed),
    # then keep only the link text on each line.
    array=$(echo "$s" | sed 's/<\/div><div /\n/g' | sed -r 's/.*\">(.*)<\/a>.*/\1/g')

    echo "$array"

I had to add a newline between the divs to match both. I'm not that good with sed and couldn't figure out how to do it without that.
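The newline is needed because sed works one line at a time and `.*` is greedy, so on a single line only the last link survives the substitution. One possible alternative, sketched here under the assumptions that your grep supports `-o` (GNU and BSD grep do) and that the link text contains no nested tags, is to let grep print each match on its own line. The `links` name is only illustrative.

```shell
#!/usr/bin/env bash
s='<div class="sub_title"><a class="sub_title" href="/link">Some Random Title.</a></div><div class="sub_title"><a class="sub_title" href="/link2">Another subtitle I want.</a>'

# grep -o prints each <a ...>text</a> match on its own line;
# sed then removes the tags, leaving only the link text.
links=$(printf '%s\n' "$s" | grep -o '<a [^>]*>[^<]*</a>' | sed 's/<[^>]*>//g')

printf '%s\n' "$links"
```

This is still regex-based HTML handling, so it is only as robust as the markup is regular.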

Your main problem was the extra set of parentheses, which makes Bash split the output into one array element per word:

    array=($(echo .....))
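To make the difference concrete, here is a small sketch comparing the two assignment forms. The variable names `out`, `as_string` and `as_words` are made up for illustration.

```shell
#!/usr/bin/env bash
out='Some Random Title.'

as_string=$(echo "$out")     # plain string, not an array
as_words=($(echo "$out"))    # unquoted: split on whitespace into an array

echo "$as_string"            # the whole title as one string
echo "${#as_words[@]}"       # number of array elements
echo "$as_words"             # same as ${as_words[0]}, the first word only
```

Note that without the outer parentheses the result is a single string rather than an array, so this form only helps if an array is not actually required.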
