
I have a directory called data. Within this directory there are four subdirectories: 01, 02, 03 and 04. Within these directories are hundreds of JSON files that I want to load into a Spark DataFrame, one per subdirectory. What is the best way to do this?

I've tried this so far:

directories = ['01', '02', '03', '04']
for directory in directories:
    # build the glob path for this subdirectory's compressed JSON files
    filepath = '/home/jovyan/data/{}/*.json.gz'.format(directory)
    df = spark.read.format('json').option("header", "true").schema(schema).load(filepath)
    # execute rest of the code here
Try loading '/home/jovyan/data/{01,02,03,04}/*.json.gz'. No need to use a for loop. (Commented May 6, 2021 at 15:56)
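
That comment relies on Hadoop-style brace globbing, so a single read picks up all four subdirectories. A minimal sketch of that idea, assuming the same spark session and schema as in the question; note it produces one combined DataFrame rather than one per subdirectory:

# one read over all four subdirectories via the brace glob
filepath = '/home/jovyan/data/{01,02,03,04}/*.json.gz'
df = spark.read.format('json').schema(schema).load(filepath)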

1 Answer


You can use os.walk() to find all files and directories in your data folder recursively. For example, if you add a new folder 07 in the future, you won't have to change your current code.

import os

path = './data/'
for root, directories, files in os.walk(path):
    for file in files:
        filepath = os.path.join(root, file)
        # only pick up plain or gzipped JSON files
        if filepath.endswith('.json') or filepath.endswith('.json.gz'):
            df = spark.read.format('json').option("header", "true").schema(schema).load(filepath)
            # execute rest of the code here
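
If the goal is one DataFrame per subdirectory rather than per file, a variation of the above is to point Spark at each subdirectory with a wildcard so it reads all of that directory's files in a single pass. A sketch under the same assumptions (spark and schema already defined):

import os

path = './data/'
# load one DataFrame per subdirectory using a wildcard path
for entry in sorted(os.listdir(path)):
    subdir = os.path.join(path, entry)
    if os.path.isdir(subdir):
        df = spark.read.format('json').schema(schema).load(os.path.join(subdir, '*.json.gz'))
        # process this subdirectory's DataFrame here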