
I am trying to write a JSON file from a Spark/Scala program and then read it into a DataFrame. This is my code:

    val analysisWriter = new BufferedWriter(new FileWriter("analysis.json"))
    for (i <- 0 to 10) {
      val obj = arr.get(i).asInstanceOf[JSONObject]
      currentAnalysis("" + obj.get("id"))
    }
    analysisWriter.close()
    val df = hiveContext.read.json("file:///data/home/test/analysis.json")
    df.show(10)
  }

  def currentAnalysis(id: String): Unit = {
    val arrCurrentAnalysis: JSONObject = acc.getCurrentAnalysis("" + id)

    if (arrCurrentAnalysis != null) {
      analysisWriter.append(arrCurrentAnalysis.toString())
      analysisWriter.newLine()
    }
  }

I get the following error when I try to run this code:

java.io.FileNotFoundException: File file:/data/home/test/analysis.json does not exist

I can see the file being created in the same directory where the jar is present (I am running the jar using spark-submit). Why is the code not able to find the file?

Initially, I was getting java.io.IOException: No input paths specified in job.

As pointed out here: Spark SQL "No input paths specified in jobs" when create DataFrame based on JSON file

and here: Spark java.io.IOException: No input paths specified in job,

I added file:// to the path I read the JSON file from, and now I get the FileNotFoundException.

I am running Spark 1.6 on a YARN cluster. Could it be that the file is not available to the executors because it was created after the program was launched?

3 Answers


From what I understand, your application depends on a local file for some of its business logic.

You can read the file by referring to it as file:///, but for this to work a copy of the file must be present on every worker, or every worker must have access to a common shared drive, such as an NFS mount.

To solve this, spark-submit provides the --files flag, which uploads files to the executors' working directories. Use it if you have small files that do not change.

Alternatively, as others have suggested, put the file in HDFS.
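A rough sketch of the --files approach (the jar name, main class, and file names below are placeholders, not taken from the question):

```shell
# Ship analysis.json with the job; YARN places a copy in each
# executor's working directory, so it can be opened by its bare name.
spark-submit \
  --master yarn \
  --files analysis.json \
  --class com.example.Analysis \
  analysis-job.jar

# Inside the job, the shipped copy can also be located with
# SparkFiles.get("analysis.json").
```

Note that --files is only practical for small, static files; it re-uploads them on every submission.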



So I was right about the file not being available to all the executors. I was able to solve it by copying the file to a location in HDFS, and I don't see the error anymore. I added the following lines to the code:

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.get(new URI("hdfs://nameservice1"), sc.hadoopConfiguration)
    fs.copyFromLocalFile(new Path("local_path"), new Path("hdfs_path"))

and then passed the hdfs_path to hiveContext.read.json(). It now creates the DataFrame without any issues.
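Wherever the file ends up, read.json in Spark 1.6 expects newline-delimited JSON: one complete object per line, with no wrapping array. A minimal standard-library sketch of producing and reading back that format (the record contents and temp-file name are made up for illustration):

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import scala.collection.JavaConverters._

// Spark 1.6's read.json expects one complete JSON object per line
// (newline-delimited JSON), not a single JSON array.
val records = Seq(
  """{"id":"1","score":0.5}""",
  """{"id":"2","score":0.9}"""
)

// Write the records, one per line, to a temporary file.
val path = Files.createTempFile("analysis", ".json")
Files.write(path, records.asJava, StandardCharsets.UTF_8)

// Each line read back is an independent JSON document.
val lines = Files.readAllLines(path, StandardCharsets.UTF_8).asScala
```

Appending with BufferedWriter.newLine() after each object, as the question's code does, produces exactly this layout.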



You can also get this error message when there are white spaces in the file path or file names (i.e. /Folder1/My Images/...):

java.io.FileNotFoundException: File file:/.../314_100.jpg does not exist

In my case I was reading image files with Spark. Replacing "My Images" with "My_Images" fixed it.
