I need to extract the domain (including any subdomain) from each URL stored in a column, that is, the whole website name without the scheme and without the leading www.

My DB table looks like this:

+----------+------------------------+
|    id    |    website             |
+----------+------------------------+
| 1        | https://www.google.com |
+----------+------------------------+
| 2        | http://www.google.co.in|
+----------+------------------------+
| 3        | www.google.com         |
+----------+------------------------+
| 4        | www.google.co.in       |
+----------+------------------------+
| 5        | google.com             |
+----------+------------------------+
| 6        | google.co.in           |
+----------+------------------------+
| 7        | http://google.co.in    |
+----------+------------------------+

Expected output:

google.com
google.co.in
google.com
google.co.in
google.com
google.co.in
google.co.in

My Postgres query looks like this:

select id, substring(website from '.*://([^/]*)') as website_domain from contacts

But the query above returns blank values for some websites. How can I get the desired output?

  • google.com has no subdomain AFAIK...so why does it appear in your expected output? Commented Nov 28, 2017 at 10:21
  • Blank websites? For which rows? Commented Nov 28, 2017 at 10:25
  • Basically, I want the whole website after www. Commented Nov 28, 2017 at 10:26
  • But in some cases the website doesn't contain www., and then I need the whole thing, like google.com or google.co.in. Commented Nov 28, 2017 at 10:28

2 Answers

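Use non-capturing groups, written (?:...), and make them optional with ? so that websites without "http://" still match. Because those groups do not capture, the only capturing group left is the host, and that is what substring returns.

For example: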

select 
  id, 
  substring(website from '(?:.*://)?(?:www\.)?([^/?]*)') as website_domain     
from contacts;

SQL Fiddle: http://sqlfiddle.com/#!17/f890c/2/0

PostgreSQL's regular expressions: https://www.postgresql.org/docs/9.3/functions-matching.html#POSIX-ATOMS-TABLE
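
To try the pattern without the contacts table, here is a minimal, self-contained sketch that runs it over an inline VALUES list. The last row, with a path and a query string, is a hypothetical addition that is not in the question; it is there to show that the capturing group [^/?]* stops at the first / or ?. The sketch assumes the default standard_conforming_strings = on, so that \. reaches the regex engine unchanged.

```sql
-- Sketch only: inline sample rows stand in for the real contacts table.
SELECT website,
       substring(website from '(?:.*://)?(?:www\.)?([^/?]*)') AS website_domain
FROM (VALUES
        ('https://www.google.com'),                   -- google.com
        ('www.google.co.in'),                         -- google.co.in
        ('google.com'),                               -- google.com
        -- hypothetical row (not in the question): URL with a path and query string
        ('https://www.google.com/search?q=postgres')  -- google.com
     ) AS t(website);
```

Note that .*:// is greedy and accepts any scheme, not only http and https.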


3 Comments

In this case the result still contains www., which also needs to be removed.
Removing www. is an extension of the same technique; I've updated the answer.
Which technique is better, substring or REGEXP_REPLACE (as in the other answer)? The query needs to be optimized because it runs on a large amount of data.

You may use

SELECT REGEXP_REPLACE(website, '^(https?://)?(www\.)?', '') from tbl;

See the regex demo.

Details

  • ^ - start of string
  • (https?://)? - 1 or 0 occurrences of http:// or https://
  • (www\.)? - 1 or 0 occurrences of www.
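
As a quick check of the two optional groups, the sketch below applies the same pattern to string literals. The ftp:// input is a hypothetical case that is not in the question: it shows that this pattern strips only http:// or https://, so a value with any other scheme comes back untouched.

```sql
-- Each call uses the pattern from the answer; expected results are in the trailing comments.
SELECT regexp_replace('https://www.google.com', '^(https?://)?(www\.)?', '') AS both_prefixes, -- google.com
       regexp_replace('www.google.co.in',       '^(https?://)?(www\.)?', '') AS www_only,      -- google.co.in
       regexp_replace('google.com',             '^(https?://)?(www\.)?', '') AS no_prefix,     -- google.com
       -- hypothetical input: neither optional group matches, so nothing is removed
       regexp_replace('ftp://www.google.com',   '^(https?://)?(www\.)?', '') AS other_scheme;  -- ftp://www.google.com
```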

See the PostgreSQL demo:

CREATE TABLE tb1
    (website character varying)
;

INSERT INTO tb1
    (website)
VALUES
    ('https://www.google.com'),
    ('http://www.google.co.in'),
    ('www.google.com'),
    ('www.google.co.in'),
    ('google.com'),
    ('google.co.in'),
    ('http://google.co.in')
;

SELECT REGEXP_REPLACE(website, '^(https?://)?(www\.)?', '') from tb1;

Result: each row is reduced to google.com or google.co.in, matching the expected output in the question.
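
The pattern above removes only the prefix, so a path or query string after the host would be kept. If the column can contain such URLs (an assumption; the sample data does not), one possible variant captures the host and replaces the whole string with it. The extra row below is hypothetical, and the website_domain alias is borrowed from the question.

```sql
-- Sketch: keep only the host by replacing the entire string with capture group 1.
SELECT website,
       regexp_replace(website,
                      '^(?:https?://)?(?:www\.)?([^/?]*).*$',
                      '\1') AS website_domain
FROM (VALUES
        ('http://www.google.co.in'),                  -- google.co.in
        ('google.com'),                               -- google.com
        -- hypothetical row (not in the question): URL with a path and query string
        ('https://www.google.com/search?q=postgres')  -- google.com
     ) AS t(website);
```

The scheme and www. groups are non-capturing here so that \1 refers to the host.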

2 Comments

Thanks, this also works for me. One more thing: my friend and I are working on the same project, but because he is using C# he is on SQLite (MySQL). Can you help with that as well? REGEXP_REPLACE is not supported there.
@ShubhamSrivastava In C#, with MS SQL, you need to use a UDF to be able to use regex. See this answer, which should help you get started. Also, you might want to use this script.
