0

How can I create a single DataFrame from all my JSON files when, after reading each file, I need to add the file name as a column in the DataFrame? It seems that a variable declared inside the for loop is not visible outside the loop. How can I overcome this?

for (jsonfilename <- fileArray) {
      var df = hivecontext.read.json(jsonfilename)
      var tblLanding = df.withColumn("source_file_name", lit(jsonfilename))

    }

   // trying to create temp table from dataframe created in loop

tblLanding.registerTempTable("LandingTable") // ERROR here: cannot resolve tblLanding

Thanks in advance,
Hossain

2 Answers

3

I think you are new to programming itself. Anyways here you go.

Basically, you declare the variable with its type and initialise it before the loop, so that it is still in scope after the loop ends.

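import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Declared before the loop, so it is still visible after the loop
var df: DataFrame = null
for (jsonfilename <- fileArray) {
      // keep the result of withColumn; otherwise the new column is discarded
      df = hivecontext.read.json(jsonfilename)
            .withColumn("source_file_name", lit(jsonfilename))
    }

df.registerTempTable("LandingTable") // registers only the last file that was read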
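
The scoping rule behind the original error can be seen without Spark. Here is a minimal sketch in plain Scala, with strings standing in for DataFrames (the names `files`, `lastTagged` and `allTagged` are made up for illustration). Anything declared inside the braces of a `for` loop is gone once the loop ends, so the result either has to be assigned to a variable declared before the loop, or the loop itself has to produce a value via `yield`.

```scala
object LoopScopeSketch {
  // Hypothetical file names; plain strings stand in for DataFrames here.
  val files: Seq[String] = Seq("1.json", "2.json", "3.json")

  // Option 1: declare the variable before the loop and assign inside it.
  // Only the value from the last iteration survives.
  def lastTagged: String = {
    var result: String = null
    for (f <- files) {
      val tagged = s"data($f)" // visible only inside the loop body
      result = tagged
    }
    result
  }

  // Option 2: let the loop produce a value, one element per file.
  def allTagged: Seq[String] =
    for (f <- files) yield s"data($f)"

  def main(args: Array[String]): Unit = {
    println(lastTagged)
    println(allTagged.mkString(", "))
  }
}
```

Option 1 also shows why a variable initialised to `null` stays `null` when the collection is empty: the loop body never runs.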

Update

Ok you are completely new to programming, even loops.

Suppose fileArray contains [1.json, 2.json, 3.json, 4.json].

The loop then creates 4 DataFrames by reading 4 JSON files. Which one do you want to register as a temp table?

If all of them:

var df: DataFrame = null
var count = 0
for (jsonfilename <- fileArray) {
      df = hivecontext.read.json(jsonfilename)
      val tblLanding = df.withColumn("source_file_name", lit(jsonfilename))
      // register the DataFrame that carries the source_file_name column
      tblLanding.registerTempTable(s"LandingTable_$count")
      count += 1 // Scala has no ++ operator
    }

If df was still null before this update, the reason is that either your fileArray is empty, so the loop body never ran, or Spark failed to read the file. Print fileArray and check.

To query any of the registered LandingTable_N tables:

val df2 = hivecontext.sql("SELECT * FROM LandingTable_0")

Update: the question has changed to building a single DataFrame from all the JSON files.

var dataFrame: DataFrame = null
for (jsonfilename <- fileArray) {
   // read one file and tag its rows with the file name
   val eachDataFrame = hivecontext.read.json(jsonfilename)
      .withColumn("source_file_name", lit(jsonfilename))
   if (dataFrame == null)
      dataFrame = eachDataFrame
   else
      dataFrame = eachDataFrame.unionAll(dataFrame)
}
dataFrame.registerTempTable("LandingTable")

Ensure that fileArray is not empty and that all JSON files in fileArray have the same schema.
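
The empty-array case can also be handled in code rather than by inspection. A sketch of the idea, assuming nothing beyond the Scala standard library: `reduceOption` returns `None` for an empty collection instead of throwing, so the caller has to decide what happens when there are no files. Lists of integers stand in for DataFrames and `++` stands in for `unionAll`; `combineAll` is a hypothetical helper name, not part of Spark.

```scala
object SafeUnionSketch {
  // Combine all parts with the given function; None when there is nothing to combine.
  def combineAll[A](parts: Seq[A])(combine: (A, A) => A): Option[A] =
    parts.reduceOption(combine)

  def main(args: Array[String]): Unit = {
    // Stand-ins for the per-file DataFrames.
    val parts = Seq(List(1, 2), List(3), List(4, 5))
    combineAll(parts)(_ ++ _) match {
      case Some(all) => println(s"combined: $all")
      case None      => println("no input files")
    }
  }
}
```

With real DataFrames the same pattern would presumably be a direct call to `reduceOption(_ unionAll _)` on the collection of per-file DataFrames, followed by a match on the result before registering the temp table.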


3 Comments

Thanks for your response. Please note, I am trying to create the temp table outside the for loop, using the DataFrame created inside the for loop. I believe the var df:DataFrame = null at the beginning is not the same df that I created in the for loop. I got a NullPointerException when I tried df.show(). Expecting a prudent response.
Thanks again for your time. I feel I could not make you understand my query. First, I need ONE table with all the JSON files. Second, I need to access that table outside the for loop block. Note, I can't avoid the loop, because I need it to add the file name as a DataFrame column.
2
// Create list of dataframes with source-file-names
val dfList = fileArray.map{ filename =>
  hivecontext.read.json(filename)
             .withColumn("source_file_name", lit(filename))
}

// union the dataframes (assuming all are same schema)
val df = dfList.reduce(_ unionAll _)  // or use union if spark 2.x

// register as table
df.registerTempTable("LandingTable")
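
On Spark 2.x the per-file loop can probably be dropped altogether. The sketch below assumes a `SparkSession` (rather than the `hivecontext` used above) and relies on two things from the Spark SQL API: `DataFrameReader.json` accepts several paths at once, and the built-in `input_file_name()` function yields the file each row was read from. The object and method names (`SourceFileColumnSketch`, `readWithSourceFile`) are made up for illustration. Note that `input_file_name()` returns the full file URI (for example `file:///...`), not the exact string that was passed in.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.input_file_name

object SourceFileColumnSketch {
  // Read every path in one call and tag each row with the file it came from.
  def readWithSourceFile(spark: SparkSession, paths: Seq[String]): DataFrame =
    spark.read
      .json(paths: _*)
      .withColumn("source_file_name", input_file_name())
}
```

A call such as `SourceFileColumnSketch.readWithSourceFile(spark, fileArray).createOrReplaceTempView("LandingTable")` would then take the place of the map, reduce and register steps, again assuming all files share one schema.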

2 Comments

Thanks, @Shyamendra Solanki! This is what I was looking for. I have tested your code. It works!
Glad it was helpful. Please consider accepting the answer: stackoverflow.com/help/accepted-answer
