
I need to read a file stored in my project's resources; the path is src/main/resources/dataset/dataset.dat. I'm using the following Scala code to read it as a text file and parse it into a Spark RDD of dataset objects:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// init Spark context
val conf: SparkConf = new SparkConf().setAppName("mydataset").setMaster("local")
val sc: SparkContext = new SparkContext(conf)

// read the .dat file from the classpath resources
val resource = this.getClass.getClassLoader.getResource("dataset/dataset.dat")
val dsRdd: RDD[DatasetObject] = sc.textFile(resource.toString, 1).map(line => DatasetData.parse(line))

but the following error occurred:

class java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: jar:file:/grader/grader.jar!/dataset/dataset.dat
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: jar:file:/grader/grader.jar!/dataset/dataset.dat

I tried to read the file in another way but the error keeps occurring:

val dsRdd: RDD[DatasetObject] = sc.textFile("src/main/resources/dataset/dataset.dat").map(line => DatasetData.parse(line))

Important: the unit tests run successfully locally; the problem occurs only in the remote test environment.

  • Can you describe your remote test environment? Cloud? Remember that the workers try to load the file; is it available to them? Commented Dec 14, 2021 at 12:11
  • @jgp Sorry, but I don't have details about the remote environment because it is the Coursera online lab used for the assignments. Commented Dec 14, 2021 at 15:53
  • I think your issue is with the path nevertheless… is it an old course? RDDs are so 2018 :) Commented Dec 14, 2021 at 23:39
  • src/main does not exist in your JAR or after the code compiles. There is a class called SparkFiles, I believe, which you should be using here. Commented Dec 15, 2021 at 0:15
  • @OneCricketeer Thanks for your time, I found a solution and posted it below :) Commented Dec 15, 2021 at 8:36

1 Answer


The problem was the combination of getResource and textFile; I had to use getResourceAsStream and sc.parallelize instead, as follows:

import scala.io.Source

def lines: List[String] = {
  Option(getClass.getResourceAsStream("/dataset/dataset.dat")) match {
    case None => sys.error("Please download the dataset as explained in the assignment instructions")
    case Some(resource) => Source.fromInputStream(resource).getLines().toList
  }
}

and then parse the lines into a Spark RDD of dataset objects:

val dsRdd: RDD[DatasetObject] = sc.parallelize(lines).map(line => DatasetData.parse(line))
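
Note that sc.parallelize distributes a collection that already lives in driver memory, so the whole file is read on the driver first; this is fine for a small dataset bundled as a resource.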

1 Comment

Depending on the size of the file, it may be preferable to ship it with spark-submit ... --files rather than bundling it in the JAR
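
A minimal sketch of that --files approach, assuming the dataset file is shipped alongside the JAR at submit time (MyApp and myapp.jar are hypothetical names; DatasetObject and DatasetData.parse come from the question):

// Illustrative submit command; class, JAR, and file paths depend on your project:
//   spark-submit --class MyApp --files dataset/dataset.dat myapp.jar

import scala.io.Source
import org.apache.spark.SparkFiles
import org.apache.spark.rdd.RDD

// SparkFiles.get resolves the local copy of a file distributed with --files
// (or registered via sc.addFile); the driver reads the lines and parallelizes them.
val datasetLines: List[String] =
  Source.fromFile(SparkFiles.get("dataset.dat")).getLines().toList

val dsRdd: RDD[DatasetObject] = sc.parallelize(datasetLines).map(line => DatasetData.parse(line))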
