
I have a directory called data. Within this directory there are four subdirectories: 01, 02, 03 and 04. Within these directories are hundreds of JSON files that I want to load into a Spark DataFrame, one per subdirectory. What is the best way to do this?

I've tried this so far:

directories = ['01', '02', '03', '04']
for directory in directories:
    # build the glob path for this subdirectory's compressed JSON files
    filepath = '/home/jovyan/data/{}/*.json.gz'.format(directory)
    df = spark.read.format('json').option("header", "true").schema(schema).load(filepath)
    # execute rest of the code here
Try loading '/home/jovyan/data/{01,02,03,04}/*.json.gz'. No need to use a for loop. (Commented May 6, 2021 at 15:56)
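
That comment relies on Hadoop-style brace globbing, so a single read picks up all four subdirectories. A minimal sketch of that idea, assuming the same spark session and schema as in the question; note it produces one combined DataFrame rather than one per subdirectory:

# one read over all four subdirectories via the brace glob
filepath = '/home/jovyan/data/{01,02,03,04}/*.json.gz'
df = spark.read.format('json').schema(schema).load(filepath)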

1 Answer


You can use os.walk() to find all files and directories in your data folder recursively. For example, if you add a new folder 07 in the future, you won't have to change your current code.

import os

path = './data/'
for root, directories, files in os.walk(path):
    for file in files:
        filepath = os.path.join(root, file)
        # only pick up plain or gzipped JSON files
        if filepath.endswith('.json') or filepath.endswith('.json.gz'):
            df = spark.read.format('json').option("header", "true").schema(schema).load(filepath)
            # execute rest of the code here
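
If the goal is one DataFrame per subdirectory rather than per file, a variation of the above is to point Spark at each subdirectory with a wildcard so it reads all of that directory's files in a single pass. A sketch under the same assumptions (spark and schema already defined):

import os

path = './data/'
# load one DataFrame per subdirectory using a wildcard path
for entry in sorted(os.listdir(path)):
    subdir = os.path.join(path, entry)
    if os.path.isdir(subdir):
        df = spark.read.format('json').schema(schema).load(os.path.join(subdir, '*.json.gz'))
        # process this subdirectory's DataFrame here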