I have a very large PySpark dataframe and I would like to calculate the number of rows, but the count() method is too slow. Is there any faster method?
1 Answer
If you don't mind getting an approximate count, you could try sampling the dataset first and then scaling by your sampling factor:
>>> df = spark.range(10)
>>> df.sample(0.5).count()
4
In this case, you would scale the count() result by 2 (i.e., 1/0.5). Obviously, there is a statistical error with this approach.
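The scale-up step can be sketched as follows. The helper function name is mine, not part of the Spark API, and the commented-out lines assume a SparkSession named `spark` and a dataframe `df`:

```python
# Estimate the full row count from a sampled count by dividing by the
# sampling fraction. scale_sampled_count is an illustrative helper,
# not a Spark API.
def scale_sampled_count(sampled_count, fraction):
    """Scale a count taken on df.sample(fraction) back up to an
    estimate of the full dataframe's row count."""
    return int(round(sampled_count / fraction))

# With the numbers above: sample(0.5) counted 4 rows, so estimate ~8.
print(scale_sampled_count(4, 0.5))  # → 8

# Full pattern (requires a SparkSession `spark` and a dataframe `df`):
# fraction = 0.1
# estimate = scale_sampled_count(df.sample(fraction).count(), fraction)
```

Remember that the estimate carries sampling error: the smaller the fraction, the noisier the scaled-up count.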
6 Comments
Luigi
I'm trying it, but the running time is still rather long, even though I am using a factor of 0.1.
kamprath
Is the data partitioned well? If not, you might not be leveraging all of your executors. For that matter, what is your partition to executor ratio?
Luigi
I don't understand what you mean. However, I use Google Colab to run the code, and I simply replaced the df.count() operation with df.sample(0.1).count() and reran the code. Is there anything else to set?
kamprath
To get the partition count for your dataframe, call df.rdd.getNumPartitions(). If that value is 1, your data has not been parallelized, so you aren't getting the benefit of multiple nodes or cores in your Spark cluster. If you get a value greater than 1 (ideally, closer to 200), then the next thing to look at is the number of available executors your Spark cluster has. You can find this on the Spark status web page for your cluster.
Luigi
I am trying to set the number of partitions with the df.coalesce() method, but Colab doesn't generate more than four partitions. There is only one executor, and I don't know how to increase the executor count on Google Colab. However, Colab uses a six-core processor.
df.rdd.countApprox(), perhaps?