
Can pyspark be used to efficiently read and process many .csv files? As a minimal example, the data are many .csv files, each with 5 rows and 2 columns. My real use case is many thousands of files, each with a few million rows and hundreds of columns (approx. 10 GB per file), on a filesystem or a cluster.

A quick-and-dirty pandas implementation is below (assuming fns is a list of .csv filenames, and the processing is the max of the column means), but it will be slow because the files are read serially and the processing uses a single core.

import pandas as pd

result = []
for fn in fns:
    df = pd.read_csv(fn, header=None)
    # processing: max of the column means
    result.append(df.mean().max())

My expectation is that pyspark can both read and process files in parallel.

  • You can use spark.read.csv to read all the files in a folder (specify wildcards too if needed), and the read will be done in parallel, assuming all the files have the same schema; otherwise you'll have to think about how to merge the schemas. Also make sure you have enough executors and appropriate executor memory for the application (a sketch follows below). Commented Jun 12, 2020 at 2:06
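A minimal sketch of what the comment describes, assuming a hypothetical folder layout and cluster; the wildcard path, memory, and core settings are illustrative, not prescriptive:

from pyspark.sql import SparkSession

# Resource settings are illustrative; tune executors/memory for your cluster.
spark = (SparkSession.builder
         .appName("csv-aggregation")
         .config("spark.executor.memory", "8g")
         .config("spark.executor.cores", "4")
         .getOrCreate())

# A wildcard path reads all matching files in parallel,
# assuming they all share the same schema.
df = spark.read.csv("data/part-*.csv", header=False, inferSchema=True)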

1 Answer


If all your files have the same schema, you can read them all directly with spark.read.csv.

Since it seems your files don't have a header, you can also provide a custom schema:

import pyspark.sql.types as t

# define the schema explicitly since the files have no header row
schema = t.StructType([t.StructField('id', t.IntegerType(), True),
                       t.StructField('name', t.StringType(), True)])

# reads every .csv file under the folder in parallel
df = spark.read.csv('path/to/folder', schema=schema)

# perform your aggregations on df now
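For the processing described in the question (max of the column means), one possible sketch on top of the resulting DataFrame might look like this; the dtype filter is an assumption to restrict the mean to numeric columns:

import pyspark.sql.functions as f

# mean of each numeric column, then the maximum of those means
numeric_cols = [name for name, dtype in df.dtypes
                if dtype in ('int', 'bigint', 'float', 'double')]
means = df.agg(*[f.mean(c).alias(c) for c in numeric_cols]).first()
result = max(means)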