1

I have been using a regex that searches a document for all URLs and replaces them, but now I want to replace only the hostname, not the subdomain or any other part of the URL.

For example, I want https://ftp.website.com to become https://ftp.mything.com.

This is a tool I am writing to sanitize documents, and I am fairly new to some of this. Any help would be greatly appreciated. Thanks!

This is my quick and dirty find and replace so far:

import fileinput
import re

for line in fileinput.input():
    line = re.sub(
        r'^(?:http:\/\/|www\.|https:\/\/)([^\/]+)',
        r'client.com', line.rstrip())
    line = re.sub(
        r'\b(\d{1,3}\.){2}\d{1,3}\b',
        r'1.33.7', line.rstrip())
    print(line)

I realize that urlparse can accomplish this, but I want the tool to find the URLs in the document itself; I do not want to supply them. Maybe I just need help using a regex to find the URLs and passing them to urlparse to change only the part I want. Hope this clarifies.

3
  • This question is identical to this Commented Oct 6, 2017 at 22:35
  • Possible duplicate of Changing hostname in a url Commented Oct 6, 2017 at 23:14
  • I do not want to specify a URL; I want to search for all URLs in the document and replace just the domain. Commented Oct 6, 2017 at 23:40

2 Answers

0

My solution below splits the URL into three groups: everything before the hostname, the hostname itself, and everything after it:

import re

# Group 1: scheme plus any subdomains; group 2: the label to replace; group 3: the rest.
regex = r"^(http[:/\w.]*[/.])(\w+)(\.[\w/]+)$"

target = "http://olddomain.com"
print(re.sub(regex, r"\1newdomain\3", target))
# http://newdomain.com

target = "http://ftp.olddomain.com"
print(re.sub(regex, r"\1newdomain\3", target))
# http://ftp.newdomain.com

target = "https://sub.sub.olddomain.com/sub/sub"
print(re.sub(regex, r"\1newdomain\3", target))
# https://sub.sub.newdomain.com/sub/sub

# No scheme, so the pattern does not match and the string is unchanged.
target = "how.about.this"
print(re.sub(regex, r"\1newdomain\3", target))
# how.about.this
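
Because the pattern above is anchored with ^ and $, it only matches when the whole string is a URL. Here is a minimal sketch of how the same three-group idea might be applied to URLs embedded in a line of text. It assumes the label to replace is the second-to-last one in the host (which is wrong for multi-part suffixes such as .co.uk), and the names URL_RE and replace_domain are made up for this example:

```python
import re

# Unanchored variant of the three-group idea, so it can match inside a line.
URL_RE = re.compile(
    r"\b(https?://(?:[\w-]+\.)*?)"   # group 1: scheme plus any subdomains (lazy)
    r"([\w-]+)"                       # group 2: the label to replace
    r"(\.[A-Za-z]{2,})"               # group 3: final label (assumed one-part suffix)
    r"(?![\w-])(?!\.[\w-])"           # the host must end here
)

def replace_domain(text, new_domain="newdomain"):
    """Replace the second-to-last host label of every http(s) URL in text."""
    return URL_RE.sub(lambda m: m.group(1) + new_domain + m.group(3), text)

print(replace_domain("see https://sub.sub.olddomain.com/sub/sub and http://ftp.olddomain.com."))
```

Hosts that are IP addresses do not match, because group 3 requires letters, so they pass through unchanged.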

0
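import fileinput
import re

# Group 1: everything up to and including "http://" and an optional "www.";
# group 2: whatever follows the replaced label (remaining labels, "/", rest of line).
regex = r"(^.*http://(?:www\.)*)\S+?((?:\.\S+?)*/.*$)"

for line in fileinput.input():
    # end="" avoids doubling the newline that each input line already carries
    print(re.sub(regex, r"\1newdomain\2", line), end="")

# targets = ["http://olddomain.com/test/test", "this urel http://www.olddomain.com/test/test dends"]
#
# for target in targets:
#     print(re.sub(regex, r"\1newdomain\2", target))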

With the commented-out test lines enabled and the fileinput loop commented out, this gives the output below. I've left the fileinput loop active so that it works as requested.

python /tmp/test2.py
http://newdomain.com/test/test
this urel http://www.newdomain.com/test/test dends
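
The question also mentions finding the URLs with a regex and handing each one to urlparse. Below is a minimal sketch of that approach using urllib.parse from the Python 3 standard library. The URL-finding pattern, the function names swap_domain and sanitize, and the rule "replace the second-to-last host label" are all assumptions made for illustration, not a full URL grammar:

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Rough pattern for http(s) URLs embedded in text (an assumption, not a full grammar).
URL_RE = re.compile(r"https?://[^\s<>\"']+")

def swap_domain(url, new_domain="newdomain"):
    """Return url with the second-to-last host label replaced by new_domain."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return url  # e.g. a malformed bracketed host; leave it untouched
    # Split netloc by hand so user info, port and letter case are preserved.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    host, colon, port = hostport.partition(":")
    labels = host.split(".")
    # Leave single-label hosts, IPv4 addresses and bracketed IPv6 literals alone.
    if len(labels) < 2 or labels[-1].isdigit() or host.startswith("["):
        return url
    labels[-2] = new_domain
    netloc = userinfo + at + ".".join(labels) + colon + port
    return urlunsplit(parts._replace(netloc=netloc))

def sanitize(text):
    """Rewrite the domain of every http(s) URL found in text."""
    def repl(match):
        url = match.group(0)
        core = url.rstrip(".,;:!?)")  # keep trailing sentence punctuation out of the URL
        return swap_domain(core) + url[len(core):]
    return URL_RE.sub(repl, text)

print(sanitize("this urel http://www.olddomain.com/test/test dends"))
```

Letting urlsplit separate the host from the path means the regex only has to locate URLs, not understand their structure; the same caveat about multi-part suffixes such as .co.uk still applies.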
