0

Given a percent-encoded URL like this:

http%3a%2f%2fwww.google.com%2fpagead%2fconversion%2f1001680686%2f%3flabel%3d4dahCKKczAYQrt7R3QM%26value%3d%26muid%3d_0RQqV8nf-ENh3b4qRJuXQ%26bundleid%3dcom.google.android.youtube%26appversion%3d5.10

I want to replace the

%3a%2f%2f

with

://

and get rid of all the content after ".com", so that in the end I just get

http://www.google.com

How can I implement this in Java using a regex?

2

3 Answers

2

You can use:

String u = URLDecoder.decode(url, "UTF-8").replaceFirst("(\\.[^/]+).*$", "$1");
// http://www.google.com
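
For anyone who wants to run the one-liner above, here is a minimal self-contained sketch. The class name `DecodeAndTrim` and the method name `schemeAndHost` are made up for illustration, and the sample URL is a shortened version of the one in the question.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class DecodeAndTrim {
    // Decodes a percent-encoded URL, then drops everything after the host.
    static String schemeAndHost(String encodedUrl) {
        try {
            String decoded = URLDecoder.decode(encodedUrl, "UTF-8");
            // Capture from the first dot up to the next slash; discard the rest.
            return decoded.replaceFirst("(\\.[^/]+).*$", "$1");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String url = "http%3a%2f%2fwww.google.com%2fpagead%2fconversion%2f1001680686%2f%3flabel%3d4dahCKKczAYQrt7R3QM";
        System.out.println(schemeAndHost(url)); // http://www.google.com
    }
}
```

Note that this relies on the host containing at least one dot; an input such as `http://localhost/path` would be left unchanged.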

8 Comments

When using your method, it works for most URLs, but I got one result like this: "http\://nfl.demdex.net/event?d_uuid=78914312359887357297063319424411977817&d_dpid=1327&d_dpuuid=2A05BC680507A5C5-60000108200532A8&d_ptfm=android&d_dst=1&d_rtbd=json". I think I should only have "nfl.demdex.net", right? What's the issue? Is there anything you can improve or modify?
Yes sure, try my answer now.
This looks really good. Can you give some brief explanation about your answer? I will accept this answer after that, thanks.
Yes sure. This regex finds the first dot using \\. and then, using the negated character class [^/]+, matches text until it hits a slash /. (\\.[^/]+) captures this in group #1, and .*$ matches everything up to the end of the URL. In the replacement part we just use a back-reference $1 to captured group #1, thus giving us http://nfl.demdex.net and discarding the rest.
What if I get a URL that, when decoded, becomes www.google.com:80/other part...? I do not want the port information (:80) either. How could I change the regex to handle this? @anubhava
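
The port question above has no reply in this thread. One possible tweak, offered as an assumption rather than as the answerer's method, is to add a colon to the negated character class so the capture also stops at the port separator. The class name `HostWithoutPort` and the method name `stripPathAndPort` are hypothetical.

```java
class HostWithoutPort {
    // Expects an already-decoded URL. Keeps everything up to the end of the
    // host and drops an optional ":port" plus the path and query.
    static String stripPathAndPort(String decodedUrl) {
        // [^/:]+ stops at either a slash or a colon after the first dot.
        return decodedUrl.replaceFirst("(\\.[^/:]+).*$", "$1");
    }

    public static void main(String[] args) {
        System.out.println(stripPathAndPort("www.google.com:80/other part"));
        // www.google.com
        System.out.println(stripPathAndPort("http://www.google.com:80/pagead/"));
        // http://www.google.com
    }
}
```

The colon in `http://` is not affected, because matching only starts at the first dot. This sketch does not handle IPv6 literals or hosts without a dot.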
1

So you have a URL of this form after decoding it (e.g. with java.net.URLDecoder.decode()):

http://www.google.com/here/is/some/content

To get the Domain and the Protocol from the input, you can use a regex like this:

String input = URLDecoder.decode("http%3a%2f%2fwww.google.com%2fpagead%2fconversion%2f1001680686%2f%3flabel%3d4dahCKKczAYQrt7R3QM%26value%3d%26muid%3d_0RQqV8nf-ENh3b4qRJuXQ%26bundleid%3dcom.google.android.youtube%26appversion%3d5.10");
Matcher m = Pattern.compile("(http[s]?)://([^/]+)(/.*)?").matcher(input);
if (!m.matches()) return;
String protocol = m.group(1);
String domain   = m.group(2);
System.out.println(protocol + "://" + domain);

Explanation of the regex:

(http[s]?)://([^/]+)(/.*)?
|---1----|-2-|--3--|--4---|
  1. Matches the protocols http and https
  2. Matches the :// after the protocol
  3. Matches the domain name ([^/]+ is any string that doesn't contain a slash)
  4. Matches everything after the domain (must start with a slash)
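
The four parts listed above can be tried out with a small wrapper. This is only a sketch: the class name `ProtocolAndDomain` and the method name `extract` are invented here, and the method expects an already-decoded URL.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ProtocolAndDomain {
    // Group 1: protocol, group 2: domain, group 3: optional path and query.
    private static final Pattern URL_PATTERN =
            Pattern.compile("(http[s]?)://([^/]+)(/.*)?");

    // Returns "protocol://domain", or an empty Optional if the input
    // is not an http or https URL.
    static Optional<String> extract(String decodedUrl) {
        Matcher m = URL_PATTERN.matcher(decodedUrl);
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(m.group(1) + "://" + m.group(2));
    }

    public static void main(String[] args) {
        System.out.println(extract("http://www.google.com/here/is/some/content"));
        // Optional[http://www.google.com]
        System.out.println(extract("ftp://example.org/file"));
        // Optional.empty
    }
}
```

Because group 2 is `[^/]+`, a port such as `:80` stays part of the captured domain.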

2 Comments

I'm just leaving a note that you're using the deprecated overload URLDecoder.decode(String) ...
@Tom I linked to the JavaDoc, so I know it is deprecated. If you know which encoding applies here, feel free to add it.
0

One way:

java.net.URI uri = new java.net.URI(java.net.URLDecoder.decode(url, "UTF-8"));

System.out.println( uri.getScheme() + "://" + uri.getHost() );
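
A runnable version of the two lines above might look like the following sketch. The class name `UriHost` and the method name `schemeAndHost` are hypothetical, and wrapping the checked exceptions in an `IllegalArgumentException` is a choice made here for brevity.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;

class UriHost {
    // Decodes the input, parses it as a URI, and rebuilds scheme + host.
    static String schemeAndHost(String encodedUrl) {
        try {
            URI uri = new URI(URLDecoder.decode(encodedUrl, "UTF-8"));
            // getHost() excludes the port, so ":80" is dropped as well.
            return uri.getScheme() + "://" + uri.getHost();
        } catch (UnsupportedEncodingException | URISyntaxException e) {
            throw new IllegalArgumentException("Not a parsable URL: " + encodedUrl, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(schemeAndHost("http%3a%2f%2fwww.google.com%3a80%2fpagead%2f"));
        // http://www.google.com
    }
}
```

One caveat: `new URI(...)` throws a `URISyntaxException` if the decoded string contains characters that are illegal in a URI, such as a space or a backslash.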

1 Comment

Since the OP added the regex tag, I think they want a regex that solves the problem.
