
I have n files in a directory, all with the same .txt extension, and I want to load them in a loop and then create a separate DataFrame for each of them.

I have read this, but in my case all my files have the same extension and I want to iterate over them one by one, making a DataFrame for every file.

I started by counting the files in the directory with the following line of code:

sc.wholeTextFiles("/path/to/dir/*.txt").count()

but I don't know how to proceed further. Please guide me.

I am using Spark 2.3 and Scala.

Thanks.

Comments

  • Why do you want a DataFrame for each file? It makes little sense in Spark. Would it not be better to have a single DataFrame where each row keeps track of the document it comes from? (Aug 6, 2018 at 17:56)
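
For reference, the single-DataFrame approach suggested in that comment can be sketched as follows (a minimal sketch, assuming a SparkSession named spark is in scope; input_file_name, available in Spark 2.3, records each row's source file):

import org.apache.spark.sql.functions.input_file_name

// One DataFrame over all files: one row per line, plus a column
// recording which file each line came from.
val allLines = spark.read
  .textFile("/path/to/dir/*.txt")                // Dataset[String]
  .withColumn("source_file", input_file_name())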

2 Answers


wholeTextFiles returns a pair RDD, with the following signature:

def wholeTextFiles(path: String, minPartitions: Int): rdd.RDD[(String, String)]

You can map over the RDD: the key of each pair is the path of the file and the value is the content of the file.

sc.wholeTextFiles("/path/to/dir/*.txt").take(2)

sc.wholeTextFiles("/path/to/dir/*.txt").map { case (path, content) =>
  // some logic on path and content
}
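
To get one DataFrame per file, as the question asks, one option (just a sketch, assuming a SparkSession named spark; wholeTextFiles is intended for directories of many small files, so collecting the pairs to the driver is reasonable here) is:

import org.apache.spark.sql.DataFrame
import spark.implicits._

// Build one DataFrame per file, keyed by its path (one row per line).
val perFileDFs: Map[String, DataFrame] =
  sc.wholeTextFiles("/path/to/dir/*.txt")
    .collect()                                   // Array[(path, content)]
    .map { case (path, content) =>
      path -> content.split("\n").toSeq.toDF("line")
    }
    .toMap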



You could use the Hadoop FileSystem API to get the list of files under the directory, then iterate over it and load each file into a different DataFrame.

Something like the below:

// Hadoop FS
import org.apache.hadoop.fs.{FileSystem, Path}

val fileFullPath = "/path/to/dir"   // directory containing the .txt files
val hadoop_fs = FileSystem.get(sc.hadoopConfiguration)

// Get the list of files under the directory
val fs_status = hadoop_fs.listLocatedStatus(new Path(fileFullPath))
while (fs_status.hasNext) {
  val filepath = fs_status.next.getPath.toString
  // spark.read.text returns a DataFrame (one "value" column, one row per line);
  // sc.textFile would return an RDD[String] instead
  val df = spark.read.text(filepath)
  // ... use df here, or store it in a collection (see the sketch below)
}
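
If you want to keep every per-file DataFrame around rather than process each one inside the loop, a small variation (again just a sketch, reusing hadoop_fs and fileFullPath from above and assuming a SparkSession named spark) collects them into a map keyed by path:

import scala.collection.mutable
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.DataFrame

val dfsByPath = mutable.Map.empty[String, DataFrame]
val files = hadoop_fs.listLocatedStatus(new Path(fileFullPath))
while (files.hasNext) {
  val p = files.next.getPath.toString
  dfsByPath(p) = spark.read.text(p)   // one DataFrame per file
}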

