I have an RDD containing an edge list, where each record is a comma-separated pair (source_URL, destination_URL). I need to extract the source host from source_URL. I tried the following code:
val edges = links.flatMap { case (src, dst) =>
  if (!src.startsWith("http://") || !src.startsWith("https://")) {
    val src_url = "http://" + src
    val url = new java.net.URL(src_url)
    val uri = url.getHost
    scala.util.Try {
      Some(uri, dst)
    }.getOrElse(None)
  } else {
    val src_url = src
    val url = new java.net.URL(src_url)
    val uri = url.getHost
    scala.util.Try {
      Some(uri, dst)
    }.getOrElse(None)
  }
}
Input sample:
http://www.belvini.de/weingut/mID/2530/max-markert.html,http://www.belvini.de/content.php/coID/299/kundenmeinungen.html
http://www.belvini.de/weingut/mID/2530/max-markert.html,http://www.belvini.de/weingueter
http://www.belvini.de/weingut/mID/2530/max-markert.html,http://www.belvini.de/filter/cID/10/country/suedafrika.137.html
Required output:
www.belvini.de,http://www.belvini.de/content.php/coID/299/kundenmeinungen.html
www.belvini.de,http://www.belvini.de/weingueter
www.belvini.de,http://www.belvini.de/filter/cID/10/country/suedafrika.137.html
While running the code, I am getting an exception:
Job aborted due to stage failure: Task 935 in stage 3.0 failed 4 times, most recent failure: Lost task 935.3 in stage 3.0 (TID 1883, node27.ib, executor 248):
java.net.MalformedURLException: For input string: "RC-a-shops.de"
at java.net.URL.<init>(URL.java:627)
at java.net.URL.<init>(URL.java:490)
at java.net.URL.<init>(URL.java:439)
The RDD has around 1 million edges, and I'm running this on a cluster. Can someone please suggest how to get rid of this exception?
new URL("<this is your url>") leads to the message java.net.MalformedURLException: no protocol: <this is your url>. That means that something in your program is feeding the string "someURL" to the URL constructor. This is obviously nonsense. If this is the case, your code excerpt and your examples seem irrelevant for the problem. We already knew that "someURL" is not a valid URL. The question should be: where does the string "someURL" come from?
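One way to keep a single bad record from aborting the job is sketched below. It rests on two observations about the code in the question: the java.net.URL constructor runs outside scala.util.Try, so the Try can never catch the exception, and the condition with || is true for every string, so "http://" is prepended even to URLs that already have a scheme. The message For input string: "RC-a-shops.de" looks like a number-parsing failure, which would fit a source string that has a colon followed by non-numeric text where a port is expected, though that is only a guess from the trace. The names EdgeHosts, sourceHost and toHostEdges are hypothetical, and the sketch uses a plain Seq instead of an RDD so that it runs without Spark.

```scala
import java.net.URL
import scala.util.Try

object EdgeHosts {
  // Hypothetical helper: returns the host of `src`, or None if `src`
  // cannot be parsed as a URL or has no host part.
  def sourceHost(src: String): Option[String] = {
    // Prepend a scheme only when neither known scheme is present.
    val hasScheme = src.startsWith("http://") || src.startsWith("https://")
    val withScheme = if (hasScheme) src else "http://" + src
    // The constructor is the call that throws, so it must sit inside the Try.
    Try(new URL(withScheme).getHost).toOption
      .filter(host => host != null && host.nonEmpty)
  }

  // Same shape as the flatMap in the question, on a plain collection:
  // edges whose source cannot be parsed are dropped instead of failing.
  def toHostEdges(links: Seq[(String, String)]): Seq[(String, String)] =
    links.flatMap { case (src, dst) =>
      sourceHost(src).map(host => (host, dst))
    }
}
```

On the RDD from the question, the same body should carry over as links.flatMap { case (src, dst) => EdgeHosts.sourceHost(src).map(host => (host, dst)) }, assuming links is an RDD[(String, String)]. Dropping unparseable records silently may hide data problems, so it can be worth counting or logging the None cases to see where strings like "RC-a-shops.de" come from.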