
I am fairly confident with bash scripting; however, this seems a little over my head.

What I am attempting to do is take a string, e.g.

page_content=<div class="contact_info_wrap"><img src="http://example.com/UserMedia/gafgallery/icons/email_icon.png" style="border-width: 0px; border-style: solid;" width="40" /><img alt="" src="example.com/UserMedia/gafgallery/icons/loc_icon.png" style="border-width: 0px; border-style: solid;" width="40" />

which was found by using this:

 pageCheck="example.com"
 if test "${page_content#*$pageCheck}" != "$page_content"

Within the then branch, I am attempting to take each of the URLs in $page_content that contains http://example.com and add them to an array, though I honestly don't know where to start! I would like to end up with something like:

This[0]='http://example.com/the/first/url/containing/example.com'
This[1]='http://example.com/the/second/url/containing/example.com'
This[2]='etc ... '
This[3]='etc ... '

Is there a simple efficient way to get this done?
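As an aside, the containment test in the question can be sketched in isolation like this (the sample string and variable values here are illustrative, not from the real data):

```shell
#!/bin/bash
page_content='<img src="http://example.com/a.png">'
pageCheck="example.com"

# ${page_content#*$pageCheck} strips the shortest prefix ending in $pageCheck.
# If the result differs from the original, the substring was present.
if test "${page_content#*$pageCheck}" != "$page_content"; then
  echo "match"
else
  echo "no match"
fi
```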

2 Comments
  • Could you provide the whole code (including the portion where you read the content you associate with each URL)? Commented Feb 25, 2017 at 0:00
  • Not much to show ---- Literally: mysql --login-path=myhostalias -Dywpadmin_current_content -e"SELECT page_id, page_content FROM client_content WHERE client_section_id = '$client_section_id'" | while read page_id page_content; do -- Then the code you see above Commented Feb 25, 2017 at 0:07

1 Answer


Try something like this:

#!/bin/bash

# Fetch page_id / page_content pairs for the requested section
sql_request()
{
  mysql --login-path=myhostalias -Dywpadmin_current_content -e"SELECT page_id, page_content FROM client_content WHERE client_section_id = '$client_section_id'"
}

# Extract quoted href/src URLs containing the given pattern, one per line
filter_urls()
{
  grep -E -o "(href|src)=\"[^\"]*$1[^\"]*" | cut -d'"' -f2 | sort -u
}

declare -a array=()
while read page_id page_content
do
  while read url
  do
     array+=("$url")
  done < <(filter_urls "example.com" <<<"$page_content")
done < <(sql_request)

printf "%s\n" "${array[@]-}" # Just to show array content

I am not an expert with mysql; I just copy/pasted your command assuming it works. I assumed you want one array with the URLs from all pages read, but the solution can easily be adjusted if you are looking for something else.

Furthermore, I assume your data is read correctly by read without changing IFS or adding the common -r option, but these are things you may want to do.
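To illustrate what the -r option changes (the sample string is made up): without -r, read treats a backslash as an escape character and drops it; with -r, the input is kept as-is.

```shell
#!/bin/bash
# Without -r, the backslash in 'a\tb' is consumed as an escape; with -r it survives.
printf '%s\n' 'a\tb' | { read line;    echo "without -r: $line"; }
printf '%s\n' 'a\tb' | { read -r line; echo "with -r:    $line"; }
```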

Some points of interest:

  • Note the use of process substitution < <(), which allows reading from the command inside it, a bit like a pipe. The big difference is that it leaves the loop body in the main shell context, so variables assigned inside the loop keep their values after the loop exits.

  • I allowed URLs introduced by either src or href, but I assumed they are always quoted. If this assumption is not safe, you will need to rework the regular expression.

  • The script sorts URLs with -u to make them unique on a per-page basis, which is a bit lazy: if you need them to be unique at all, they probably need to be unique across the whole array. Not knowing what you really need, I do not want to add code without being sure it helps rather than obscures.
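The process-substitution point above can be demonstrated with a small counter (bash-specific; the data fed in is illustrative):

```shell
#!/bin/bash
# Piping into while runs the loop in a subshell: count is lost afterwards.
count=0
printf '1\n2\n3\n' | while read -r line; do count=$((count+1)); done
echo "after pipe: $count"

# Process substitution keeps the loop in the current shell: count survives.
count=0
while read -r line; do count=$((count+1)); done < <(printf '1\n2\n3\n')
echo "after process substitution: $count"
```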


2 Comments

You are correct about the quotes ... Your answer looks like it'd work ... I will test it tomorrow and if it works I will accept your answer -- otherwise I will continue this conversation .. Thanks for your answer!
With a little tweaking, this ended up being exactly what I needed .. Thank you!
