
I have this bash script that I wrote to analyse the HTML of any given web page. What it is actually supposed to do is return the domains on that page. Currently it is returning the number of URLs on that web page.

#!/bin/sh

echo "Enter a url eg www.bbc.com:"
read url
content=$(wget "$url" -q -O -)
echo "Enter file name to store URL output"
read file
echo $content > $file
echo "Enter file name to store filtered links:"
read links
found=$(cat $file | grep -o -E 'href="([^"#]+)"' | cut -d '"' -f2 | sort | uniq | awk   '/http/' > $links)
output=$(egrep -o '^http://[^/]+/' $links | sort | uniq -c > out)
cat out

How can I get it to return the domains instead of the URLs? From my programming knowledge I know it is supposed to parse from the right, but I am a newbie at bash scripting. Can someone please help me? This is as far as I have gone.

  • You lose line breaks and whitespace with an unquoted echo. Actually, I would obtain the URL and the filename, then wget -O "$filename" "$url".

4 Answers


I know there's a better way to do this in awk, but you can do it with sed by appending this after your awk '/http/':

| sed -e 's;https\?://;;' | sed -e 's;/.*$;;'

Then you want to move your sort and uniq to the end of that.

So that the whole line will look like:

found=$(cat $file | grep -o -E 'href="([^"#]+)"' | cut -d '"' -f2 | awk   '/http/' | sed -e 's;https\?://;;' | sed -e 's;/.*$;;' | sort | uniq -c > out)
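For example, the added sed stages on their own turn a full link into a bare host name (the sample URL below is only an illustration, using the answer's own sed expressions):

echo 'https://news.bbc.co.uk/some/page.html' | sed -e 's;https\?://;;' | sed -e 's;/.*$;;'

gives

news.bbc.co.uk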

You can get rid of this line:

output=$(egrep -o '^http://[^/]+/' $links | sort | uniq -c > out)

1 Comment

Yes, this is close, but when I run it, it doesn't return how many times each domain occurs on the queried web page, like it was doing with the URLs. Its output is like this: advertising.bbcworldwide.com news.bbc.co.uk newsvote.bbc.co.uk purl.org static.bbci.co.uk www.bbcamerica.com www.bbc.com www.bbc.co.uk www.browserchoice.eu www.omniture.com

EDIT 2: Please note that you might want to adapt the search patterns in the sed expressions to your needs. This solution considers only the http[s]?:// protocol and www. servers...

EDIT:
If you want count and domains:

lynx -dump -listonly http://zelleke.com | \
  sed -n '4,$ s@^.*http[s]?://\([^/]*\).*$@\1@p' | \
   sort | \
     uniq -c | \
       sed 's/www.//'

gives

2 wordpress.org
10 zelleke.com

Original Answer:

You might want to use lynx for extracting links from a URL:

lynx -dump -listonly http://zelleke.com

gives

# blank line at the top of the output
References

   1. http://www.zelleke.com/feed/
   2. http://www.zelleke.com/comments/feed/
   3. http://www.zelleke.com/
   4. http://www.zelleke.com/#content
   5. http://www.zelleke.com/#secondary
   6. http://www.zelleke.com/
   7. http://www.zelleke.com/wp-login.php
   8. http://www.zelleke.com/feed/
   9. http://www.zelleke.com/comments/feed/
  10. http://wordpress.org/
  11. http://www.zelleke.com/
  12. http://wordpress.org/

Based on this output, you can achieve the desired result with:

lynx -dump -listonly http://zelleke.com | \
  sed -n '4,$ s@^.*http://\([^/]*\).*$@\1@p' | \
   sort -u | \
     sed 's/www.//'

gives

wordpress.org
zelleke.com

2 Comments

Yes, this also gives me a result close to what I want. Thank you.
@theodros-zelleke sed doesn't support the ? quantifier here. Use http[s]*:// instead.

You can remove the path from the URL with sed:

sed 's@http://@@; s@/.*@@'
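For example, checking the expression on a single URL (the address below is just a sample value):

echo 'http://www.bbc.com/news/world' | sed 's@http://@@; s@/.*@@'

gives

www.bbc.com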

I also want to point out that these two lines are wrong:

found=$(cat $file | grep -o -E 'href="([^"#]+)"' | cut -d '"' -f2 | sort | uniq | awk   '/http/' > $links)
output=$(egrep -o '^http://[^/]+/' $links | sort | uniq -c > out)

You must use either redirection (> out) or command substitution $(...), but not both at the same time, because in that case the variables will be empty.

This part

content=$(wget "$url" -q -O -)
echo $content > $file

would also be better written this way:

wget "$url" -q -O - > $file
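For example, with the command substitutions removed, the two broken lines from the script might become something like this (a sketch only, keeping the question's own variable names and pipelines, with the variables quoted and the cat dropped in favour of passing the file name to grep):

grep -o -E 'href="([^"#]+)"' "$file" | cut -d '"' -f2 | sort | uniq | awk '/http/' > "$links"
egrep -o '^http://[^/]+/' "$links" | sort | uniq -c > out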

2 Comments

Yes, I have removed the line output=$(egrep -o '^http://[^/]+/' $links | sort | uniq -c > out), but it is still not totalling the domains that belong to the same top-level domain and domain name.
I removed output=$(egrep -o '^http://[^/]+/' $links | sort | uniq -c > out) and replaced it with @JonLin's solution. Currently it returns the domains, but it does not total them by which ones belong to the same top-level domain.

You may be interested in this:

https://www.rfc-editor.org/rfc/rfc3986#appendix-B

It explains the way to parse a URI using a regex.

So you can parse a URI from the left this way and extract the "authority" component, which contains the domain and subdomain names.

sed -r 's_^([^:/?#]+:)?(//([^/?#]*))?.*_\3_g';
grep -Eo '[^\.]+\.[^\.]+$' # piped after the first line, this gives what you need
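For example (a quick check, not part of the original answer), piping the two lines together on one of the URLs shown in the lynx output earlier on this page:

echo 'http://www.zelleke.com/feed/' | sed -r 's_^([^:/?#]+:)?(//([^/?#]*))?.*_\3_g' | grep -Eo '[^\.]+\.[^\.]+$'

gives

zelleke.com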

This is also interesting:

http://www.scribd.com/doc/78502575/124/Extracting-the-Host-from-a-URL

Assuming that a URL always begins this way

https?://(www\.)?

is really hazardous.
