Replacing ONLY domain in Python

Question

I have been using a regex that searches a document for all URLs and replaces them, but now I want to replace only the domain name, leaving the subdomain and every other part of the URL intact.

For example, I want https://ftp.website.com to become https://ftp.mything.com.

This is a tool I am writing to sanitize documents, and I am fairly new to some of this. Any help would be greatly appreciated. Thanks!

This is my quick and dirty find and replace so far:

import fileinput
import re

for line in fileinput.input():
    line = re.sub(
        r'^(?:http:\/\/|www\.|https:\/\/)([^\/]+)',
        r'client.com', line.rstrip())
    line = re.sub(
        r'\b(\d{1,3}\.){2}\d{1,3}\b',
        r'1.33.7', line.rstrip())
    print(line)

I realize that urlparse can accomplish this, but I want the script to find the URLs in the document itself; I do not want to supply them. Maybe I just need help using a regex to find the URLs and then passing each one to urlparse so I can change only the part I want. I hope this clarifies things.

I do not want to specify a URL, I want to search for all URLs in the document and just replace the domain.
– zek, Commented Oct 6, 2017 at 23:40

malioboro · Accepted Answer · 2017-10-07 02:06:06Z

0

My solution below separates the URL into three groups: the part before the hostname, the hostname itself, and the part after the hostname:

import re

# Group 1: scheme plus any subdomains; group 2: the label to replace;
# group 3: the TLD and any path.
regex = r"^(http[:\/\w\.]*[/.])(\w+)(\.[\w\/]+)$"

target = "http://olddomain.com"
print(re.sub(regex, r"\1newdomain\3", target))
# http://newdomain.com

target = "http://ftp.olddomain.com"
print(re.sub(regex, r"\1newdomain\3", target))
# http://ftp.newdomain.com

target = "https://sub.sub.olddomain.com/sub/sub"
print(re.sub(regex, r"\1newdomain\3", target))
# https://sub.sub.newdomain.com/sub/sub

# Not a URL, so the string is left unchanged.
target = "how.about.this"
print(re.sub(regex, r"\1newdomain\3", target))
# how.about.this
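The pattern above is anchored with ^ and $, so it only matches when the whole string is a URL. For the document-scanning case in the question, the same three-group idea can be applied without anchors. This is a minimal sketch, not a definitive implementation: the names URL_RE and sanitize_line are hypothetical, and it assumes that the label to replace is always the second-to-last one in the host (so it would mishandle suffixes such as co.uk, hosts that are IP addresses, and URLs containing user@ credentials).

```python
import re

# Hypothetical pattern: group 1 = scheme plus subdomains, group 2 = the label
# to replace, group 3 = the final label (TLD). No ^/$ anchors, so it can match
# URLs embedded anywhere in a line.
URL_RE = re.compile(r"(https?://(?:[\w-]+\.)*)([\w-]+)(\.[\w-]+)")


def sanitize_line(line, new_domain="newdomain"):
    """Replace the second-to-last host label of every http(s) URL in line."""
    # A function replacement avoids backslash handling in new_domain.
    return URL_RE.sub(lambda m: m.group(1) + new_domain + m.group(3), line)


print(sanitize_line("see https://sub.sub.olddomain.com/sub/sub and http://olddomain.com."))
# see https://sub.sub.newdomain.com/sub/sub and http://newdomain.com.
print(sanitize_line("how.about.this"))
# how.about.this
```

Because group 1 is greedy, it absorbs as many subdomain labels as it can while still leaving two labels for groups 2 and 3, which is what keeps the subdomains intact.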

edited Oct 7, 2017 at 2:06

answered Oct 7, 2017 at 1:40


Calvin Taylor · Accepted Answer · 2017-10-07 09:16:29Z

0

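import fileinput
import re

# Group 1: everything up to and including "http://" and any "www." prefixes.
# Group 2: the remaining host labels plus the path (a "/" after the host is required).
# The lazy \S+? between the two groups is the label that gets replaced.
regex = r"(^.*http\://(?:www\.)*)\S+?((?:\.\S+?)*/.*$)"

for line in fileinput.input():
    # Strip the newline so print() does not add a blank line after each line.
    print(re.sub(regex, r"\1newdomain\2", line.rstrip("\n")))

# targets = ["http://olddomain.com/test/test", "this urel http://www.olddomain.com/test/test dends"]
#
# for target in targets:
#     print(re.sub(regex, r"\1newdomain\2", target))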

With the commented-out test lines enabled and the fileinput loop disabled, the script prints the output below. I've left the fileinput version active so that it works as requested.

python /tmp/test2.py
http://newdomain.com/test/test
this urel http://www.newdomain.com/test/test dends
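The question also raises the idea of finding URLs with a regex and handing each one to urlparse. A rough sketch of that combination follows, using urlsplit and urlunsplit from the standard library's urllib.parse. The names URL_RE, replace_domain and sanitize_text are hypothetical, the URL-matching pattern is a deliberately loose assumption, and the "second-to-last label" rule is again an assumption that will not suit multi-part suffixes such as co.uk or hosts that are IP addresses.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: a URL is "http(s)://" followed by non-space, non-quote characters.
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def replace_domain(url, new_domain):
    """Return url with the second-to-last host label swapped for new_domain."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return url                      # malformed URL: leave it untouched
    # Split netloc by hand so userinfo and port are carried over unchanged.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    host, colon, port = hostport.partition(":")
    bare = host.rstrip(".")             # tolerate a sentence-ending dot
    labels = bare.split(".")
    if len(labels) < 2:
        return url                      # e.g. "localhost": nothing to replace
    labels[-2] = new_domain
    new_host = ".".join(labels) + host[len(bare):]
    netloc = userinfo + at + new_host + colon + port
    return urlunsplit(parts._replace(netloc=netloc))


def sanitize_text(text, new_domain="newdomain"):
    """Rewrite the domain of every URL that URL_RE finds in text."""
    return URL_RE.sub(lambda m: replace_domain(m.group(0), new_domain), text)


print(sanitize_text("this urel http://www.olddomain.com/test/test dends"))
# this urel http://www.newdomain.com/test/test dends
print(sanitize_text("ftp at https://user@ftp.olddomain.com:8080/a?b=1"))
# ftp at https://user@ftp.newdomain.com:8080/a?b=1
```

Compared with a single regex, this keeps the URL-finding pattern simple and leaves the splitting into scheme, host, path and query to the parser, at the cost of a few more lines.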

answered Oct 7, 2017 at 9:16
