I have a directory called data. Within this directory there are four subdirectories: 01, 02, 03 and 04. Within these subdirectories are hundreds of JSON files that I want to load into a Spark DataFrame, one DataFrame per subdirectory. What is the best way to do this?
I've tried this so far:
directories = ['01', '02', '03', '04']
for directory in directories:
    # Fill the placeholder with the current subdirectory name
    filepath = '/home/jovyan/data/{}/*.json.gz'.format(directory)
    df = spark.read.format('json').schema(schema).load(filepath)
    # execute rest of the code here
Use a glob pattern: '/home/jovyan/data/{01,02,03,04}/*.json.gz'. No need to use a for loop.
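For reference, a minimal sketch of this single-read approach, assuming an existing SparkSession named spark; the question's schema is not shown, so the StructType below is a placeholder:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Placeholder schema; substitute the real one from the question.
schema = StructType([StructField("example_field", StringType(), True)])

# Brace globbing expands to all four subdirectories in a single read,
# and Spark decompresses the .json.gz files transparently.
path = '/home/jovyan/data/{01,02,03,04}/*.json.gz'
df = spark.read.schema(schema).json(path)

If you do want a separate DataFrame per subdirectory, the corrected loop in the question already does that.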