Can pyspark be used to efficiently read and process many .csv files? As a minimal example, the data are many .csv files, each with 5 rows and 2 columns. My real use case is many thousands of files, each with a few million rows and hundreds of columns (approx. 10 GB per file), on a filesystem or a cluster.
A quick and dirty pandas implementation is as follows (assuming fns is a list of .csv filenames, and the processing is the max of the column means), but it will be slow because the files are read serially and the processing uses a single core.
import pandas as pd

result = []
for fn in fns:
    df = pd.read_csv(fn, header=None)
    # per-file statistic: max of the column means
    result.append(df.agg("mean").max())
My expectation is that pyspark can both read and process files in parallel.

Use spark.read.csv to read all the files in the folder (specify wildcards too if needed); the read will be done in parallel (assuming all the files have the same schema, otherwise you'll have to think about how to merge the schemas). Also make sure you have enough executors and appropriate executor memory for the application.
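A minimal sketch of what that could look like, assuming the files share one schema and live under a single directory (the path/to/csvs glob is a placeholder); input_file_name is used to keep the per-file grouping of the pandas loop:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("csv-aggregation").getOrCreate()

# One read covers every file; Spark splits the work across executor cores.
# header=False mirrors header=None in the pandas snippet; for wide files an
# explicit schema is cheaper than inferSchema, which adds an extra pass.
df = (spark.read.csv("path/to/csvs/*.csv", header=False, inferSchema=True)
      .withColumn("source_file", F.input_file_name()))

value_cols = [c for c in df.columns if c != "source_file"]

# Per-file column means, then the max of those means -- the same statistic
# as the pandas loop, but computed in parallel on the executors.
per_file = df.groupBy("source_file").agg(
    F.greatest(*[F.mean(c) for c in value_cols]).alias("max_of_col_means")
)
result = {row["source_file"]: row["max_of_col_means"] for row in per_file.collect()}

Because the per-file result is a single number, the final collect is cheap; the expensive parts (reading the files and computing the column means) run in parallel on the executors.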