Bash: extract the scheme and host part from a URL string

Question

I need to extract the scheme and host part of a URL from a given string. The only solution I have is:

_url="http://download.enlightenment.org/rel/apps/econnman/econnman-1.1.tar.gz"
_l=${_url%.*/*}        #  http://download.enlightenment
_l=${#_l}              #  29
_url=${_url:0:${_l}+4}  #  http://download.enlightenment.org

But this fails for every TLD that is not 3 characters long, and it just feels like the wrong approach.

Thanks a lot.

bash only? Can you use awk or sed?

– dawg, Commented Nov 6, 2016 at 21:17

dawg · Accepted Answer · 2016-11-06 23:30:23Z

6

You can use grep:

$ echo "$_url" | grep -Eo '^https?://[^/]+'
http://download.enlightenment.org

You can use expr with a regex:

$ expr "$_url" : '\(http://[^/]*\)'
http://download.enlightenment.org

Or, use awk:

$ echo "$_url" | awk -F/ 'BEGIN{OFS=FS} {print $1 OFS OFS $3}'
http://download.enlightenment.org

You can use cut:

$ echo "$_url" | cut -d/ -f1-3
http://download.enlightenment.org

cut is probably the easiest way to get the rest of the URL as well:

$ echo "$_url" | cut -d/ -f4-
rel/apps/econnman/econnman-1.1.tar.gz
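
Because cut works line by line, the same command should also handle a whole list of URLs in one process. A minimal sketch, assuming made-up sample input (the second URL is purely illustrative):

```bash
# Hypothetical sample input; in practice the URLs might come from a file.
urls='http://download.enlightenment.org/rel/apps/econnman/econnman-1.1.tar.gz
https://example.com/index.html'

# One cut process handles every line.
# Splitting on "/" gives field 1 = "scheme:", field 2 = "" and field 3 = host.
servers=$(printf '%s\n' "$urls" | cut -d/ -f1-3)
printf '%s\n' "$servers"
```

This prints one scheme-and-host string per input line.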

Or, completely internal to Bash:

$ [[ $_url =~ ^([^:]+://[^/]+)/?(.*)$ ]] && server="${BASH_REMATCH[1]}"
$ echo "$server"
http://download.enlightenment.org

"${BASH_REMATCH[2]}" then holds the rest of the URL.
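
If both parts are needed, the match can be wrapped in a small function that also reports when the input does not look like a URL. This is only a sketch: the name split_url and the global variables server and rest are assumptions made for illustration, not part of the answer above.

```bash
# split_url is a hypothetical helper name.
# On success it sets the globals "server" and "rest"; otherwise it returns 1.
split_url() {
  # Keeping the regex in a variable avoids quoting surprises inside [[ ]].
  local re='^([^:]+://[^/]+)/?(.*)$'
  if [[ $1 =~ $re ]]; then
    server=${BASH_REMATCH[1]}   # scheme://host
    rest=${BASH_REMATCH[2]}     # everything after the first "/" following the host
  else
    return 1
  fi
}

_url='http://download.enlightenment.org/rel/apps/econnman/econnman-1.1.tar.gz'
if split_url "$_url"; then
  echo "server: $server"
  echo "rest:   $rest"
fi
```

Checking the return status avoids silently reusing a stale value of server when the string contains no "://".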

edited Nov 6, 2016 at 23:30

answered Nov 6, 2016 at 21:40

dawg


Community · Accepted Answer · 2017-05-23 12:33:26Z

3

Note: dawg's answer contains solutions that are better suited to input from a file or stdin with multiple inputs. They incur startup cost because they involve child processes, but for collections of inputs that is well worth it, because external utilities are much more efficient at processing larger input sets.

To extract a substring from a value already contained in a shell variable, use Bash's regex-matching operator, =~, which supports extended regular expressions:

_url='http://download.enlightenment.org/rel/apps/econnman/econnman-1.1.tar.gz'
[[ $_url =~ ^https?://[^/]+ ]] && _url="${BASH_REMATCH[0]}"
echo "$_url"   # -> 'http://download.enlightenment.org'

^https?://[^/]+ matches any string that starts (^) with a literal http:// or https://, followed by the longest nonempty (+) run of characters that doesn't include / ([^/]+).
The built-in array BASH_REMATCH contains the results of the most recent application of the =~ operator; its first element (index 0) holds whatever the regex matched as a whole.
(Subsequent elements would contain what parenthesized subexpressions (a.k.a. capture groups) matched, but in this case we're not using any.)
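
To illustrate the higher indices, here is a sketch that adds two capture groups to split the scheme from the host. The regex is an illustrative variation, and the variable names scheme and host are assumptions.

```bash
_url='https://download.enlightenment.org/rel/apps/econnman/econnman-1.1.tar.gz'

# Group 1 captures the scheme, group 2 the host.
if [[ $_url =~ ^(https?)://([^/]+) ]]; then
  whole=${BASH_REMATCH[0]}    # the entire match: scheme://host
  scheme=${BASH_REMATCH[1]}   # first parenthesized group
  host=${BASH_REMATCH[2]}     # second parenthesized group
  echo "whole match: $whole"
  echo "scheme:      $scheme"
  echo "host:        $host"
fi
```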

edited May 23, 2017 at 12:33

CommunityBot


answered Nov 6, 2016 at 21:31

mklement0


2 Comments

wfx Over a year ago

Works perfectly for my needs. I only changed it a little, from ^http://[^/]+ to ^https*://[^/]+, so I can use both http and https.

mklement0 Over a year ago

@wfx: Glad to hear it; ^https?://[^/]+ is a little more robust, though - I've updated the answer.

Nicolas · Accepted Answer · 2016-11-06 21:14:49Z

0

I don't know if this works in Bash, but it works with the PCRE regex engine.

(?<=:\/\/)(.*)(?=\/)

Finds the text between :// and the last / (the .* is greedy, so with several slashes in the path it runs up to the final one). Works for https://google.com/ but not for google.com/ or https://google.com. Depends on what you need.
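
Bash's =~ operator uses POSIX extended regular expressions, which have no lookbehind or lookahead, so the pattern cannot be used there as written. A rough Bash equivalent replaces the lookarounds with a capture group and .* with [^/]+. This assumes the intent is to grab only the host, and extract_host is a hypothetical helper name:

```bash
# extract_host is a hypothetical helper name used only for this sketch.
extract_host() {
  # "://" anchors the match; the group captures everything up to the next "/".
  local re='://([^/]+)'
  [[ $1 =~ $re ]] && printf '%s\n' "${BASH_REMATCH[1]}"
}

extract_host 'https://google.com/'              # prints google.com
extract_host 'https://google.com'               # also works without a trailing slash
extract_host 'google.com/' || echo 'no match'   # no "://", so the match fails
```

Unlike the lookaround version, this variant does not require a trailing slash, but it still needs the "://" to be present.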


answered Nov 6, 2016 at 21:14

Nicolas

