I have an RDD containing an edge list, where each edge is a comma-separated pair (source_URL, destination_URL). I need to extract the source host from source_URL. I tried the following code:

val edges = links.flatMap { case (src, dst) =>
  if (!src.startsWith("http://") || !src.startsWith("https://")) {
    val src_url = "http://" + src
    val url = new java.net.URL(src_url)
    val uri = url.getHost
    scala.util.Try {
      Some(uri, dst)
    }.getOrElse(None)
  } else {
    val src_url = src
    val url = new java.net.URL(src_url)
    val uri = url.getHost
    scala.util.Try {
      Some(uri, dst)
    }.getOrElse(None)
  }
}

Input sample:

http://www.belvini.de/weingut/mID/2530/max-markert.html,http://www.belvini.de/content.php/coID/299/kundenmeinungen.html
http://www.belvini.de/weingut/mID/2530/max-markert.html,http://www.belvini.de/weingueter
http://www.belvini.de/weingut/mID/2530/max-markert.html,http://www.belvini.de/filter/cID/10/country/suedafrika.137.html

Required output:

www.belvini.de,http://www.belvini.de/content.php/coID/299/kundenmeinungen.html
www.belvini.de,http://www.belvini.de/weingueter
www.belvini.de,http://www.belvini.de/filter/cID/10/country/suedafrika.137.html

While running the code, I am getting an exception:

 Job aborted due to stage failure: Task 935 in stage 3.0 failed 4 times, most recent failure: Lost task 935.3 in stage 3.0 (TID 1883, node27.ib, executor 248): 
java.net.MalformedURLException: For input string: "RC-a-shops.de"
at java.net.URL.<init>(URL.java:627)
at java.net.URL.<init>(URL.java:490)
at java.net.URL.<init>(URL.java:439)

The RDD has around 1 million edges, and I'm running the job on a cluster. Can someone please suggest how to get rid of this exception?

  • Can you also provide the data? A sample from edges would help. Especially helpful if you can isolate the row that throws the exception. Commented Feb 5, 2018 at 20:44
  • It's a daring assumption, but... Have you tried prepending "http://" to the url? Commented Feb 5, 2018 at 20:45
  • Hi @Metropolis, as I said, I'm running it in a cluster. I'm not exactly sure where this exception is happening. Commented Feb 6, 2018 at 7:13
  • @AndreyTyukin I have updated the post now. Even if the URL begins with http, it's throwing an exception. Commented Feb 6, 2018 at 7:14
  • What, is that the original exception message, copied byte by byte? The code new URL("<this is your url>") leads to the message java.net.MalformedURLException: no protocol: <this is your url>. That means that something in your program is feeding the string "someURL" to the URL constructor. This is obviously nonsense. If this is the case, your code excerpt and your examples seem irrelevant for the problem. We already knew that "someURL" is not a valid URL. The question should be: where does the string "someURL" come from? Commented Feb 6, 2018 at 7:45

2 Answers

EDIT: The question was edited so that the MalformedURLException now contains what looks like a well-formed URL. Regardless, my answer stands. The docs for URL suggest the constructor only throws MalformedURLException when the URL is invalid in some way. More complete output would help in debugging this issue.

MalformedURLException - if no protocol is specified, or an unknown protocol is found, or spec is null.

It looks like your src doesn't include the protocol of the URL. You need something like

http://whatever.com/nlp-agm.php

not just nlp-agm.php.

A URL must be of the form

<scheme>://<authority><path>?<query>#<fragment>

where <scheme> is required. new java.net.URL will throw MalformedURLException if the scheme is invalid or not specified. See more here: https://docs.oracle.com/javase/7/docs/api/java/net/URL.html#URL(java.lang.String)
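As a minimal sketch of how malformed rows could be skipped instead of failing the job: in the code from the question, `new java.net.URL(...)` is called outside `scala.util.Try`, so the exception is never caught, and the condition `!src.startsWith("http://") || !src.startsWith("https://")` is true for every string, so "http://" is always prepended. The helper names `sourceHost` and `toHostEdges` below are hypothetical, and the code works on a plain `Seq` so it can be tried without a cluster. Note also that a message of the form `For input string: "..."` may point to a failed number parse, for example a non-numeric port after a colon in the authority; that is an assumption, since the offending row is not shown.

```scala
import java.net.URL
import scala.util.Try

object EdgeHosts {
  // Hypothetical helper: returns the host of `src`, or None if it cannot be parsed.
  def sourceHost(src: String): Option[String] = {
    val trimmed = src.trim
    // Prepend a scheme only when neither http:// nor https:// is present.
    val withScheme =
      if (trimmed.startsWith("http://") || trimmed.startsWith("https://")) trimmed
      else "http://" + trimmed
    // The URL constructor runs inside Try, so MalformedURLException becomes None.
    Try(new URL(withScheme).getHost).toOption
      .flatMap(host => Option(host)) // guard against a null host
      .filter(_.nonEmpty)            // drop URLs without a host
  }

  // Hypothetical helper: keeps only the edges whose source URL parses.
  def toHostEdges(links: Seq[(String, String)]): Seq[(String, String)] =
    links.flatMap { case (src, dst) => sourceHost(src).map(host => (host, dst)) }
}
```

On the RDD from the question, the same function could presumably be applied as `links.flatMap { case (src, dst) => EdgeHosts.sourceHost(src).map(host => (host, dst)) }`. To find the offending rows rather than drop them, filtering on `EdgeHosts.sourceHost(src).isEmpty` and inspecting a few results would be one option.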

2 Comments

Hi, I'm running it in a cluster. I don't have any output right now. Please let me know how I can resolve this.
@ashwini I really can't help you further. You haven't posted the exact error you get. I have pointed you to documentation on why MalformedURLException will be thrown. Study that to understand where this error comes from.

The java.net.MalformedURLException: no protocol exception is also thrown when you have quotes in your string:

new URL("\"http:www.example.com\"")
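If stray quotes are the cause, one possible approach is to strip them before parsing. This is only a sketch: `stripQuotes` and `hostOf` are hypothetical helper names, and it assumes at most one pair of surrounding double quotes per value.

```scala
import java.net.URL
import scala.util.Try

object QuoteCleanup {
  // Hypothetical helper: drops one pair of surrounding double quotes, if present.
  def stripQuotes(s: String): String = {
    val t = s.trim
    if (t.length >= 2 && t.startsWith("\"") && t.endsWith("\"")) t.substring(1, t.length - 1)
    else t
  }

  // Parses the cleaned string; a malformed URL yields None instead of an exception.
  def hostOf(s: String): Option[String] =
    Try(new URL(stripQuotes(s)).getHost).toOption
}
```

With this helper, a quoted value such as `"\"http://www.example.com\""` should parse after cleanup, while passing the same string to `new URL` directly fails.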
