
I have n files in a directory, all with the same .txt extension, and I want to load them in a loop and then create a separate DataFrame for each of them.

I have read this, but in my case all my files have the same extension and I want to iterate over them one by one, making a DataFrame for every file.

I started by counting the files in the directory with the following line of code:

sc.wholeTextFiles("/path/to/dir/*.txt").count()

but I don't know how to proceed further. Please guide me.

I am using Spark 2.3 and Scala.

Thanks.

Comments

  • Why do you want a DataFrame for each file? It makes little sense in Spark. Would it not be better to have a single DataFrame where each row keeps track of the document it comes from? (Aug 6, 2018 at 17:56)
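
For reference, the single-DataFrame approach suggested in that comment can be sketched as follows (a minimal sketch, assuming a SparkSession named spark is in scope; input_file_name, available in Spark 2.3, records each row's source file):

import org.apache.spark.sql.functions.input_file_name

// One DataFrame over all files: one row per line, plus a column
// recording which file each line came from.
val allLines = spark.read
  .textFile("/path/to/dir/*.txt")                // Dataset[String]
  .withColumn("source_file", input_file_name())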

2 Answers


wholeTextFiles returns a pair RDD, with the following signature:

def wholeTextFiles(path: String, minPartitions: Int): rdd.RDD[(String, String)]

You can map over the RDD: the key of each pair is the path of the file and the value is the content of the file.

sc.wholeTextFiles("/path/to/dir/*.txt").take(2)

sc.wholeTextFiles("/path/to/dir/*.txt").map { case (path, content) =>
  // some logic on path and content
}
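
To get one DataFrame per file, as the question asks, one option (just a sketch, assuming a SparkSession named spark; wholeTextFiles is intended for directories of many small files, so collecting the pairs to the driver is reasonable here) is:

import org.apache.spark.sql.DataFrame
import spark.implicits._

// Build one DataFrame per file, keyed by its path (one row per line).
val perFileDFs: Map[String, DataFrame] =
  sc.wholeTextFiles("/path/to/dir/*.txt")
    .collect()                                   // Array[(path, content)]
    .map { case (path, content) =>
      path -> content.split("\n").toSeq.toDF("line")
    }
    .toMap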



You could use the Hadoop FileSystem API to get the list of files under the directory, then iterate over it and load each file into a different DataFrame.

Something like the below:

// Hadoop FS
import org.apache.hadoop.fs.{FileSystem, Path}

val fileFullPath = "/path/to/dir"   // directory containing the .txt files
val hadoop_fs = FileSystem.get(sc.hadoopConfiguration)

// Get the list of files under the directory
val fs_status = hadoop_fs.listLocatedStatus(new Path(fileFullPath))
while (fs_status.hasNext) {
  val filepath = fs_status.next.getPath.toString
  // spark.read.text returns a DataFrame (one "value" column, one row per line);
  // sc.textFile would return an RDD[String] instead
  val df = spark.read.text(filepath)
  // ... use df here, or store it in a collection (see the sketch below)
}
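
If you want to keep every per-file DataFrame around rather than process each one inside the loop, a small variation (again just a sketch, reusing hadoop_fs and fileFullPath from above and assuming a SparkSession named spark) collects them into a map keyed by path:

import scala.collection.mutable
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.DataFrame

val dfsByPath = mutable.Map.empty[String, DataFrame]
val files = hadoop_fs.listLocatedStatus(new Path(fileFullPath))
while (files.hasNext) {
  val p = files.next.getPath.toString
  dfsByPath(p) = spark.read.text(p)   // one DataFrame per file
}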

