Parsing HTML to array only returns one word

Question

I'm trying to parse some HTML subtitles into an array using Bash and html-xml-utils. I also tried using a Lynx dump to tidy the output, but I had the same problem: I can't get my sed command to put more than one word at a time into the array.

Code:

    array=($(echo $PAGE |
       hxselect -i ".sub_info_container .sub_title" |
       sed -r 's/.*\">(.*)<\/a>.*/\1/' ))

    echo $array

This gets piped into sed:

<div class="sub_title"><a class="sub_title" href="/link">Some Random Title.</a></div><div class="sub_title"><a class="sub_title" href="/link2">Another subtitle I want.</a>

Output of echo $array:

Some

What I'm trying to get:

Some Random Title

Dropping the trailing punctuation would be nice (the subtitles often end in ? or ! instead of a period), but keeping the punctuation would work too.

Things I've tried:

Using Lynx to pretty up the code, then using awk to grab the elements
A lot of different sed and awk methods of grabbing the text

Remove the first set of parentheses.

– ElefantPhace, Commented Oct 6, 2015 at 1:14
Use an XML parser (xmlstarlet, xmllint, ...).

– Cyrus, Commented Oct 6, 2015 at 1:17

hydrix · Accepted Answer · 2015-10-06 04:49:30Z

1

I'm not sure why, but my code ended up splitting the text at every space, so each word became a separate array item. The solution was the following code:

array=($(echo $PAGE |
       hxselect -i ".sub_info_container .sub_title" |
       lynx -stdin -dump | tr " " - ))

I used tr to turn the spaces into dashes, so each title could be stored in the array as a single item. Taking off the extra parentheses, as everybody suggested, actually removed the assignment of the values into an array, which I had stated was my intention. After the code completed, I simply converted all the dashes back to spaces. It's not pretty, but it works!
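For what it's worth, the splitting described above is Bash's word splitting: an unquoted `$(...)` inside `array=( ... )` is broken at every space, tab, and newline, so each word becomes its own element. Below is a minimal sketch of an alternative to the tr workaround. It assumes Bash 4 or later (for `mapfile`) and assumes the pipeline emits one title per line; the `raw` variable is a hypothetical stand-in for the real hxselect/lynx output.

```bash
#!/usr/bin/env bash
# Hypothetical sample: one title per line, standing in for the pipeline output.
raw=$'Some Random Title.\nAnother subtitle I want.'

# Unquoted $(...) inside array=( ) is split on spaces, tabs, and newlines.
words=( $(printf '%s\n' "$raw") )

# mapfile -t splits on newlines only and drops each trailing newline.
mapfile -t titles < <(printf '%s\n' "$raw")

echo "${#words[@]}"    # 7 (one element per word)
echo "${#titles[@]}"   # 2 (one element per line)
echo "$words"          # Some (a bare $name expands to element 0 only)
echo "${titles[0]}"    # Some Random Title.
```

Note that `echo $array` only ever prints element 0, so it would show a single word even when the array holds more; `printf '%s\n' "${array[@]}"` prints every element.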

ElefantPhace · Accepted Answer · 2015-10-06 01:34:31Z

0

Try this:

s='<div class="sub_title"><a class="sub_title" href="/link">Some Random Title.</a></div><div class="sub_title"><a class="sub_title" href="/link2">Another subtitle I want.</a>'

array=$(echo "$s" | sed 's/<\/div><div /\n/' | sed -r 's/.*\">(.*)<\/a>.*/\1/g')

echo "$array"

I had to add a newline between the divs to match both. I'm not that good with sed and couldn't figure out how to do it without that.

Your main problem was with the extra parentheses:

array=($(echo .....))
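If an actual array is still wanted (without the outer parentheses, `array` holds one multi-line string), the lines can be read into an array afterwards. The following is only a sketch against the sample string `s` from above, assuming Bash 4 or later and a `grep` that supports `-o`. It is a swapped-in grep/sed approach rather than the sed command from the question, and regex extraction like this is fragile on real-world HTML, so a proper parser may be the safer choice. The array name `titles` is made up for illustration.

```bash
#!/usr/bin/env bash
s='<div class="sub_title"><a class="sub_title" href="/link">Some Random Title.</a></div><div class="sub_title"><a class="sub_title" href="/link2">Another subtitle I want.</a>'

# grep -o prints each <a ...>text</a> match on its own line.
# The first sed expression strips the tags; the second drops trailing . ? or !
mapfile -t titles < <(
  printf '%s\n' "$s" |
    grep -o '<a [^>]*>[^<]*</a>' |
    sed -e 's/<[^>]*>//g' -e 's/[.?!]*$//'
)

# Quote the expansion so each element stays intact, spaces included.
printf '%s\n' "${titles[@]}"
```

With the sample input, `${titles[0]}` is `Some Random Title` and `${titles[1]}` is `Another subtitle I want`.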
