
Can pyspark be used to efficiently read and process many .csv files? As a minimal example, the data are many .csv files, each with 5 rows and 2 columns. My real use case is many thousands of files, each with a few million rows and hundreds of columns (approx. 10 GB per file), on a filesystem or a cluster.

A quick-and-dirty pandas implementation is below (assuming fns is a list of .csv filenames, and the processing is the max of the column means), but it will be slow because the files are read serially and the processing uses a single core.

import pandas as pd

result = []
for fn in fns:
    df = pd.read_csv(fn, header=None)
    # processing: max of the column means
    result.append(df.mean().max())

My expectation is that pyspark can both read and process files in parallel.

  • You can use spark.read.csv to read all the files in a folder (specify wildcards too if needed), and the read will be done in parallel, assuming all the files have the same schema; otherwise you'll have to think about how to merge the schemas. Also make sure you have enough executors and appropriate executor memory for the application (a sketch follows below). Commented Jun 12, 2020 at 2:06
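A minimal sketch of what the comment describes, assuming a hypothetical folder layout and cluster; the wildcard path, memory, and core settings are illustrative, not prescriptive:

from pyspark.sql import SparkSession

# Resource settings are illustrative; tune executors/memory for your cluster.
spark = (SparkSession.builder
         .appName("csv-aggregation")
         .config("spark.executor.memory", "8g")
         .config("spark.executor.cores", "4")
         .getOrCreate())

# A wildcard path reads all matching files in parallel,
# assuming they all share the same schema.
df = spark.read.csv("data/part-*.csv", header=False, inferSchema=True)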

1 Answer


If all your files have the same schema, you can read them all directly with spark.read.csv.

Since it seems your files don't have a header, you can also provide a custom schema:

import pyspark.sql.types as t

# define the schema explicitly since the files have no header row
schema = t.StructType([t.StructField('id', t.IntegerType(), True),
                       t.StructField('name', t.StringType(), True)])

# reads every .csv file under the folder in parallel
df = spark.read.csv('path/to/folder', schema=schema)

# perform your aggregations on df now
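For the processing described in the question (max of the column means), one possible sketch on top of the resulting DataFrame might look like this; the dtype filter is an assumption to restrict the mean to numeric columns:

import pyspark.sql.functions as f

# mean of each numeric column, then the maximum of those means
numeric_cols = [name for name, dtype in df.dtypes
                if dtype in ('int', 'bigint', 'float', 'double')]
means = df.agg(*[f.mean(c).alias(c) for c in numeric_cols]).first()
result = max(means)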