
I have this bash script that I wrote to analyse the HTML of any given web page. What it is actually supposed to do is return the domains on that page. Currently it is returning the number of URLs on that web page.

#!/bin/sh

echo "Enter a url eg www.bbc.com:"
read url
content=$(wget "$url" -q -O -)
echo "Enter file name to store URL output"
read file
echo $content > $file
echo "Enter file name to store filtered links:"
read links
found=$(cat $file | grep -o -E 'href="([^"#]+)"' | cut -d '"' -f2 | sort | uniq | awk   '/http/' > $links)
output=$(egrep -o '^http://[^/]+/' $links | sort | uniq -c > out)
cat out

How can I get it to return the domains instead of the URLs? From my programming knowledge I know it is supposed to parse from the right, but I am a newbie at bash scripting. Can someone please help me? This is as far as I have gone.

  • You lose line breaks and whitespace with an unquoted echo. Actually, I would obtain the URL and the filename, then wget -O "$filename" "$url".

4 Answers


I know there's a better way to do this in awk, but you can do it with sed by appending this after your awk '/http/':

| sed -e 's;https\?://;;' | sed -e 's;/.*$;;'

Then you want to move your sort and uniq to the end of that.

So that the whole line will look like:

found=$(cat $file | grep -o -E 'href="([^"#]+)"' | cut -d '"' -f2 | awk   '/http/' | sed -e 's;https\?://;;' | sed -e 's;/.*$;;' | sort | uniq -c > out)
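For example, the added sed stages on their own turn a full link into a bare host name (the sample URL below is only an illustration, using the answer's own sed expressions):

echo 'https://news.bbc.co.uk/some/page.html' | sed -e 's;https\?://;;' | sed -e 's;/.*$;;'

gives

news.bbc.co.uk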

You can get rid of this line:

output=$(egrep -o '^http://[^/]+/' $links | sort | uniq -c > out)

1 Comment

Yes, this is close, but when I run it, it doesn't return how many times each domain occurs on the queried web page, like it was doing with the URLs. Its output is like this: advertising.bbcworldwide.com news.bbc.co.uk newsvote.bbc.co.uk purl.org static.bbci.co.uk www.bbcamerica.com www.bbc.com www.bbc.co.uk www.browserchoice.eu www.omniture.com

EDIT 2: Please note that you might want to adapt the search patterns in the sed expressions to your needs. This solution considers only the http[s]?:// protocol and www. servers...

EDIT:
If you want count and domains:

lynx -dump -listonly http://zelleke.com | \
  sed -n '4,$ s@^.*http[s]?://\([^/]*\).*$@\1@p' | \
   sort | \
     uniq -c | \
       sed 's/www.//'

gives

2 wordpress.org
10 zelleke.com

Original Answer:

You might want to use lynx for extracting links from a URL:

lynx -dump -listonly http://zelleke.com

gives

# blank line at the top of the output
References

   1. http://www.zelleke.com/feed/
   2. http://www.zelleke.com/comments/feed/
   3. http://www.zelleke.com/
   4. http://www.zelleke.com/#content
   5. http://www.zelleke.com/#secondary
   6. http://www.zelleke.com/
   7. http://www.zelleke.com/wp-login.php
   8. http://www.zelleke.com/feed/
   9. http://www.zelleke.com/comments/feed/
  10. http://wordpress.org/
  11. http://www.zelleke.com/
  12. http://wordpress.org/

Based on this output, you can achieve the desired result with:

lynx -dump -listonly http://zelleke.com | \
  sed -n '4,$ s@^.*http://\([^/]*\).*$@\1@p' | \
   sort -u | \
     sed 's/www.//'

gives

wordpress.org
zelleke.com

2 Comments

Yes, this also gives me a result close to what I want. Thank you.
@theodros-zelleke sed doesn't support the ? quantifier here. Use http[s]*:// instead.

You can remove the path from the URL with sed:

sed 's@http://@@; s@/.*@@'
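For example, checking the expression on a single URL (the address below is just a sample value):

echo 'http://www.bbc.com/news/world' | sed 's@http://@@; s@/.*@@'

gives

www.bbc.com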

I also want to point out that these two lines are wrong:

found=$(cat $file | grep -o -E 'href="([^"#]+)"' | cut -d '"' -f2 | sort | uniq | awk   '/http/' > $links)
output=$(egrep -o '^http://[^/]+/' $links | sort | uniq -c > out)

You must use either redirection (> out) or command substitution $(...), but not both at the same time, because in that case the variables will be empty.

This part

content=$(wget "$url" -q -O -)
echo $content > $file

would also be better written this way:

wget "$url" -q -O - > $file
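For example, with the command substitutions removed, the two broken lines from the script might become something like this (a sketch only, keeping the question's own variable names and pipelines, with the variables quoted and the cat dropped in favour of passing the file name to grep):

grep -o -E 'href="([^"#]+)"' "$file" | cut -d '"' -f2 | sort | uniq | awk '/http/' > "$links"
egrep -o '^http://[^/]+/' "$links" | sort | uniq -c > out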

2 Comments

Yes, I have removed the line output=$(egrep -o '^http://[^/]+/' $links | sort | uniq -c > out), but it is still not totalling the domains that belong to the same top-level domain and domain name.
I removed output=$(egrep -o '^http://[^/]+/' $links | sort | uniq -c > out) and replaced it with @JonLin's solution. Currently it returns the domains, but it does not total them by which ones belong to the same top-level domain.

You may be interested in this:

https://www.rfc-editor.org/rfc/rfc3986#appendix-B

It explains the way to parse a URI using a regex.

So you can parse a URI from the left this way and extract the "authority" component, which contains the domain and subdomain names.

sed -r 's_^([^:/?#]+:)?(//([^/?#]*))?.*_\3_g';
grep -Eo '[^\.]+\.[^\.]+$' # piped after the first line, this gives what you need
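For example (a quick check, not part of the original answer), piping the two lines together on one of the URLs shown in the lynx output earlier on this page:

echo 'http://www.zelleke.com/feed/' | sed -r 's_^([^:/?#]+:)?(//([^/?#]*))?.*_\3_g' | grep -Eo '[^\.]+\.[^\.]+$'

gives

zelleke.com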

This is also interesting:

http://www.scribd.com/doc/78502575/124/Extracting-the-Host-from-a-URL

Assuming that a URL always begins this way

https?://(www\.)?

is really hazardous.
