I am trying to read a list of files uploaded to a Google Cloud Storage bucket and load them into a file/buffer so that I can perform some aggregation on them.
So far, I am able to read the contents of all the files serially (one blob object at a time from the iterator that contains all the files in the bucket). However, I have uploaded thousands of files to Google Cloud Storage, and even just reading them takes a considerable amount of time.
from google.cloud import storage
import json
import time
import multiprocessing
from multiprocessing import Pool, Manager

cpu_count = multiprocessing.cpu_count()
manager = Manager()
finalized_list = manager.list()

# Explicitly use service account credentials by specifying the private key file.
storage_client = storage.Client.from_service_account_json('.serviceAccountCredentials.json')
bucket_name = "bucket-name"

def list_blobs():
    blobs = storage_client.list_blobs(bucket_name)
    return blobs

def read_blob(blob):
    bucket = storage_client.bucket(bucket_name)
    blob_object = bucket.blob(blob)
    with blob_object.open("r") as f:
        converted_string = f.read()
        print(converted_string)
        finalized_list.append(converted_string)

def main():
    start_time = time.time()
    print("Start time: ", start_time)
    pool = Pool(processes=cpu_count)
    blobs = list_blobs()
    pool.map(read_blob, [blob for blob in blobs])
    end_time = time.time()
    elapsed_time = end_time - start_time
    print("Time taken: ", elapsed_time, " seconds")

if __name__ == "__main__":
    main()
As the code snippet above shows, I tried using multiprocessing in Python to read each blob object in the bucket in parallel. However, since the blob object returned by Google Cloud Storage is not a standard iterator/list object, I am getting an error that says: Pickling client objects is not explicitly supported
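The pickling failure can be reproduced without any Google library, which may help show where it comes from. The sketch below is only an illustration: FakeClient and FakeBlob are hypothetical stand-ins, written on the assumption that a real blob keeps a reference to its client and that the client refuses to be pickled. pool.map has to pickle every argument it sends to a worker, so an object that drags the client along fails, while a plain blob name goes through.

```python
import pickle


class FakeClient:
    """Hypothetical stand-in for storage.Client: it refuses to be pickled."""

    def __getstate__(self):
        raise pickle.PicklingError("Pickling client objects is not explicitly supported")


class FakeBlob:
    """Hypothetical stand-in for a blob: a name plus a reference to its client."""

    def __init__(self, name, client):
        self.name = name
        self.client = client


def can_pickle(obj):
    """Return True if pickle.dumps(obj) succeeds, False if it raises PicklingError."""
    try:
        pickle.dumps(obj)
        return True
    except pickle.PicklingError:
        return False


client = FakeClient()
blobs = [FakeBlob("a.json", client), FakeBlob("b.json", client)]

# A blob-like object carries the client with it, so it cannot be sent to a worker.
print(can_pickle(blobs[0]))

# Plain strings pickle fine; a worker can receive a name and look the blob up itself.
names = [b.name for b in blobs]
print(can_pickle(names))
```

If the assumption holds for the real classes, passing blob.name to the workers instead of the blob itself would at least get past this particular error.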
Is there any other way to fetch and read thousands of files from Cloud Storage quickly using a Python script?
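One possible direction, which is not from the post itself, is a thread pool instead of a process pool. Downloading is mostly waiting on the network, and threads share memory, so nothing needs to be pickled. This is a minimal sketch under stated assumptions: read_all and make_gcs_fetch are hypothetical helper names, max_workers=16 is an arbitrary starting point, and sharing a single client across threads is assumed to be acceptable (creating one client per thread would be the more cautious variant).

```python
from concurrent.futures import ThreadPoolExecutor


def read_all(names, fetch, max_workers=16):
    """Call fetch(name) for every name on a thread pool.

    Results come back in the same order as the input names.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch, names))


def make_gcs_fetch(bucket_name, credentials_path):
    """Build a fetch(name) function backed by one storage client (hypothetical helper)."""
    # Imported lazily so the rest of the sketch runs without google-cloud-storage.
    from google.cloud import storage

    client = storage.Client.from_service_account_json(credentials_path)
    bucket = client.bucket(bucket_name)

    def fetch(name):
        # Same read call as in the question, applied to a blob name.
        with bucket.blob(name).open("r") as f:
            return f.read()

    return fetch


# Usage against a real bucket (not executed here):
#   client = storage.Client.from_service_account_json(".serviceAccountCredentials.json")
#   names = [blob.name for blob in client.list_blobs("bucket-name")]
#   fetch = make_gcs_fetch("bucket-name", ".serviceAccountCredentials.json")
#   contents = read_all(names, fetch)
```

Returning the contents from the workers, rather than appending to a shared list, keeps the results in the caller and avoids the Manager entirely. Whether this is fast enough for thousands of files would have to be measured.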
(1) You can replace [blob for blob in blobs] with just blobs. (2) Having manager = Manager(); finalized_list = manager.list() at global scope is a disaster if you run this under an OS that creates child processes using the spawn method, such as Windows (each process will be appending to its own list, assuming your blob could be pickled).