I need to extract the video names from YouTube's index.html. I have been able to break the file into small chunks, each containing one video listing, but I cannot extract the video title. My professor provided the following command, but I cannot get it to work in this situation.

number=`expr "$s" : ".*\/\([0-9,]*\)\/"`; echo $number # will print 250,4211

Although I'm not completely sure, I think I'm having trouble getting this method to work because there aren't spaces between the video title and surrounding text. Here is a sample of what I would need to extract the title from:

<li class="video-list-item "><a href="/watch?v=9BbgvlgDQMg&amp;feature=g-sptl&amp;cid=inp-hs-edt" class="video-list-item-link yt-uix-sessionlink" data-sessionlink="ei=CMzmroaB5bICFRiXIQoda3kX5g%3D%3D&amp;feature=g-sptl%26cid%3Dinp-hs-edt" ><span class="ux-thumb-wrap contains-addto "><span class="video-thumb ux-thumb yt-thumb-default-120 "><span class="yt-thumb-clip"><span class="yt-thumb-clip-inner"><img src="http://s.ytimg.com/yt/img/pixel-vfl3z5WfW.gif" alt="Lil&#39; Buck &quot;Golden Gateway&quot; Venice Beach California YAK FILMS Super Bowl 2012 Madonna Memphis Jookin" data-thumb="//i2.ytimg.com/vi/9BbgvlgDQMg/default.jpg" width="120" ><span class="vertical-align"></span></span></span></span><span class="video-time">3:51</span>

Out of this chunk of text, I would need to extract "Lil' Buck "Golden Gateway" Venice Beach California YAK FILMS Super Bowl 2012 Madonna Memphis Jookin", without the quotes.

2 Answers

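You can use the bash regex \<img.*alt=\"([^\"]*)\" with the =~ operator inside [[ ]] to extract the alt text from the img element. The text matched by the parenthesized group ends up in BASH_REMATCH[1].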

Example:

$ cat file
<li class="video-list-item "><a href="/watch?v=9BbgvlgDQMg&amp;feature=g-sptl&amp;cid=inp-hs-edt" class="video-list-item-link yt-uix-sessionlink" data-sessionlink="ei=CMzmroaB5bICFRiXIQoda3kX5g%3D%3D&amp;feature=g-sptl%26cid%3Dinp-hs-edt" ><span class="ux-thumb-wrap contains-addto "><span class="video-thumb ux-thumb yt-thumb-default-120 "><span class="yt-thumb-clip"><span class="yt-thumb-clip-inner"><img src="http://s.ytimg.com/yt/img/pixel-vfl3z5WfW.gif" alt="Lil&#39; Buck &quot;Golden Gateway&quot; Venice Beach California YAK FILMS Super Bowl 2012 Madonna Memphis Jookin" data-thumb="//i2.ytimg.com/vi/9BbgvlgDQMg/default.jpg" width="120" ><span class="vertical-align"></span></span></span></span><span class="video-time">3:51</span>

$ line="$(cat file)"

$ if [[ "$line" =~ \<img.*alt=\"([^\"]*)\" ]]
then
  echo "${BASH_REMATCH[1]}"
fi
Lil&#39; Buck &quot;Golden Gateway&quot; Venice Beach California YAK FILMS Super Bowl 2012 Madonna Memphis Jookin

Update:

Using expr:

$ expr "$line" : '.*<img.*alt="\([^"]*\)"'
Lil&#39; Buck &quot;Golden Gateway&quot; Venice Beach California YAK FILMS Super Bowl 2012 Madonna Memphis Jookin
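
Note that the extracted text still contains HTML entities (&#39; and &quot;), while the question asks for the decoded title. Here is a minimal sketch of one way to decode them with sed, assuming only the handful of entities listed below can occur. The function name decode_entities is a hypothetical helper, and the shortened sample line is made up for illustration:

```shell
# Hypothetical helper: decodes only the few entities listed here.
# &amp; is handled last so that "&amp;quot;" is not decoded twice.
decode_entities() {
  sed -e 's/&quot;/"/g' \
      -e "s/&#39;/'/g" \
      -e 's/&lt;/</g' \
      -e 's/&gt;/>/g' \
      -e 's/&amp;/\&/g'
}

# Shortened, made-up sample in the same shape as the real chunk.
line='<img src="x.gif" alt="Lil&#39; Buck &quot;Golden Gateway&quot; Venice Beach" width="120" >'

# Extract the alt text with expr, then decode the entities.
raw=$(expr "$line" : '.*<img.*alt="\([^"]*\)"')
title=$(printf '%s\n' "$raw" | decode_entities)
printf '%s\n' "$title"
```

For anything beyond these few entities (numeric references in general, named entities such as &eacute;), a real HTML-aware tool would be a safer choice than extending the sed list by hand.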

1 Comment

Thank you, worked perfectly. I really appreciate the help. Now to implement it.

I suppose using a regex is mandatory for your assignment. If not, I would go for an XML parser.

But if it is, I suggest you have a go with RegexBuddy.

RegexBuddy makes it easier than ever for you to create regular expressions that do what you intend, without any guesswork. Still, you need to test your regex patterns to be 100% sure that they match what you want, and don't match what you don't want.

2 Comments

Thank you for your reply. Do you know if there is a way to do it simply by using the 'expr' command, as mentioned by my professor?
Yes, you can, but it is easier to use the tool to find the right regex string.
