
In my Spark application I read a directory with many CSVs once, but in the DAG I see multiple CSV reads.

  • Why does Spark read the CSVs multiple times? Or is the DAG not a real representation, and Spark actually reads them once?

Spark UI screenshot: [image]

1 Answer


Spark will read them multiple times if the DataFrame is not cached.


    val df1 = spark.read.csv("path")
    val df2_result = df1.filter(...).write.save(...)
    val df3_result = df1.map(...).groupBy(...).agg(...).write.save(...)

Here both df2_result and df3_result cause df1 to be rebuilt from the CSV files. To avoid this, cache df1: it is then built from the CSVs only once, and the second action reads it from the cache instead of the files.


    val df1 = spark.read.csv("path")
    df1.cache()
    val df2_result = df1.filter(...).write.save(...)
    val df3_result = df1.map(...).groupBy(...).agg(...).write.save(...)
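The re-read behaviour above comes from lazy evaluation: a DataFrame is a plan, not materialised data, so each action re-runs the plan from the source unless an intermediate result is cached. As a toy illustration of that idea (plain Python, not Spark's actual API; the names are made up for this sketch), a deferred "read" is re-executed by every action that forces it:

```python
# Toy model of lazy evaluation (NOT Spark's API): the "read" is a thunk,
# so nothing is scanned until an "action" forces it.
read_count = 0

def read_csv():
    """Simulates a lazy source; returns a callable that performs the scan."""
    def scan():
        global read_count
        read_count += 1
        return [1, 2, 3, 4]
    return scan

# Uncached: two "actions" on the same logical DataFrame.
df1 = read_csv()
result_a = [x for x in df1() if x % 2 == 0]  # action 1 forces a scan
result_b = [x * 10 for x in df1()]           # action 2 forces another scan
print(read_count)  # 2 -- the source was scanned twice

# "Cached": materialise once, then both actions reuse the stored rows.
read_count = 0
df1 = read_csv()
cached = df1()                               # roughly: cache + first use
result_a = [x for x in cached if x % 2 == 0]
result_b = [x * 10 for x in cached]
print(read_count)  # 1 -- the source was scanned once
```

The same logic explains why `df1.cache()` in the answer's second snippet stops the second job from touching the CSV files.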


2 Comments

I would understand this behavior if I had multiple actions in my DAG, but I have only one action. I will add a cache after the read and look at the DAG again.
I found the problem: I have only one action, but I create intermediate DataFrames and assign them to Python variables, so Spark creates a different branch in the DAG for each of them and reads the CSVs again. So yes, caching solves the problem in my case and the CSVs are read only once.
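The situation in the last comment can be modelled the same way (again a toy sketch in plain Python with invented names, not Spark's API): even a single "action" whose plan reuses the same uncached source in two branches scans that source once per branch.

```python
# Toy model (NOT Spark's API): one action, two plan branches over the
# same uncached source -> the source is scanned once per branch.
read_count = 0

def read_csv():
    def scan():
        global read_count
        read_count += 1
        return [1, 2, 3]
    return scan

df1 = read_csv()
branch_a = lambda: [x + 1 for x in df1()]  # intermediate "DataFrame" 1
branch_b = lambda: [x * 2 for x in df1()]  # intermediate "DataFrame" 2

# A single "action" (a union of both branches), but each branch forces
# its own scan of the uncached source.
result = branch_a() + branch_b()
print(read_count)  # 2 -- one action, two reads
```

Caching df1 before building the branches collapses the two scans into one, which matches what the commenter observed in the DAG.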
