I have a large Python project in which the driver program has a function that uses a for loop to traverse every file in my GCP (Google Cloud Platform) bucket. I use the CLI to submit the job to GCP and let it run there.
For each file traversed in this for loop, I invoke a function parse_file(...) that parses the file and calls a series of other functions to process it.
The whole run takes a few minutes, which is slow, and the driver program barely uses PySpark so far. The issue is that each parse_file(...) call in that file-level for loop executes sequentially. Is it possible to use PySpark to parallelize that loop so parse_file(...) runs on all the files in parallel, reducing execution time? If so, given that the program doesn't use PySpark yet, how much code modification would that require?
The relevant part of the program looks like this:
# ... some other code
attributes_table = ....
for obj in gcp_bucket.objects(path):
    if obj.key.endswith('sys_data.txt'):
        # ... some other code
        file_data = (d for d in obj.download().decode('utf-8').split('\n'))
        parse_file(file_data, attributes_table)
# ... some other code
How do I use PySpark to parallelize this part instead of traversing the files one at a time with a for loop?