
I am fairly confident with bash scripting; however, this seems a little over my head.

What I am attempting to do is take a string, e.g.

page_content=<div class="contact_info_wrap"><img src="http://example.com/UserMedia/gafgallery/icons/email_icon.png" style="border-width: 0px; border-style: solid;" width="40" /><img alt="" src="example.com/UserMedia/gafgallery/icons/loc_icon.png" style="border-width: 0px; border-style: solid;" width="40" />

which was found by using this:

 pageCheck="example.com"
 if test "${page_content#*$pageCheck}" != "$page_content"

Within the then branch, I am attempting to take each of the URLs in $page_content that contains http://example.com and add them to an array, though I honestly don't know where to start! I would like to end up with something like:

This[0]='http://example.com/the/first/url/containing/example.com'
This[1]='http://example.com/the/second/url/containing/example.com'
This[2]='etc ... '
This[3]='etc ... '

Is there a simple efficient way to get this done?
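As an aside, the containment test in the question can be sketched in isolation like this (the sample string and variable values here are illustrative, not from the real data):

```shell
#!/bin/bash
page_content='<img src="http://example.com/a.png">'
pageCheck="example.com"

# ${page_content#*$pageCheck} strips the shortest prefix ending in $pageCheck.
# If the result differs from the original, the substring was present.
if test "${page_content#*$pageCheck}" != "$page_content"; then
  echo "match"
else
  echo "no match"
fi
```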

2 Comments
  • Could you provide the whole code (including the portion where you read the content you associate with each URL)? Commented Feb 25, 2017 at 0:00
  • Not much to show ---- Literally: mysql --login-path=myhostalias -Dywpadmin_current_content -e"SELECT page_id, page_content FROM client_content WHERE client_section_id = '$client_section_id'" | while read page_id page_content; do -- Then the code you see above Commented Feb 25, 2017 at 0:07

1 Answer


Try something like this:

#!/bin/bash

# Fetch page_id / page_content pairs for the requested section
sql_request()
{
  mysql --login-path=myhostalias -Dywpadmin_current_content -e"SELECT page_id, page_content FROM client_content WHERE client_section_id = '$client_section_id'"
}

# Extract quoted href/src URLs containing the given pattern, one per line
filter_urls()
{
  grep -E -o "(href|src)=\"[^\"]*$1[^\"]*" | cut -d'"' -f2 | sort -u
}

declare -a array=()
while read page_id page_content
do
  while read url
  do
     array+=("$url")
  done < <(filter_urls "example.com" <<<"$page_content")
done < <(sql_request)

printf "%s\n" "${array[@]-}" # Just to show array content

I am not an expert with mysql; I just copy/pasted your command assuming it works. I assumed you want one array with the URLs from all pages read, but the solution can easily be adjusted if you are looking for something else.

Furthermore, I assume your data is read correctly by read without changing IFS or adding the common -r option, but these are things you may want to do.
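To illustrate what the -r option changes (the sample string is made up): without -r, read treats a backslash as an escape character and drops it; with -r, the input is kept as-is.

```shell
#!/bin/bash
# Without -r, the backslash in 'a\tb' is consumed as an escape; with -r it survives.
printf '%s\n' 'a\tb' | { read line;    echo "without -r: $line"; }
printf '%s\n' 'a\tb' | { read -r line; echo "with -r:    $line"; }
```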

Some points of interest:

  • Note the use of process substitution < <(), which allows reading from the command inside it, a bit like a pipe. The big difference is that it leaves the loop body in the main shell context, so variables assigned inside the loop keep their values after the loop exits.

  • I allowed URLs introduced by either src or href, but I assumed they are always quoted. If this assumption is not safe, you will need to rework the regular expression.

  • The script sorts URLs with -u to make them unique on a per-page basis, which is a bit lazy: if you need them to be unique at all, they probably need to be unique across the whole array. Not knowing what you really need, I do not want to add code without being sure it helps rather than obscures.
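The process-substitution point above can be demonstrated with a small counter (bash-specific; the data fed in is illustrative):

```shell
#!/bin/bash
# Piping into while runs the loop in a subshell: count is lost afterwards.
count=0
printf '1\n2\n3\n' | while read -r line; do count=$((count+1)); done
echo "after pipe: $count"

# Process substitution keeps the loop in the current shell: count survives.
count=0
while read -r line; do count=$((count+1)); done < <(printf '1\n2\n3\n')
echo "after process substitution: $count"
```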


2 Comments

You are correct about the quotes ... Your answer looks like it'd work ... I will test it tomorrow and if it works I will accept your answer -- otherwise I will continue this conversation .. Thanks for your answer!
With a little tweaking, this ended up being exactly what I needed .. Thank you!
