I have a very large PySpark dataframe and I would like to count its rows, but the count() method is too slow. Is there a faster method?

  • Possible duplicate of "Getting the count of records in a data frame quickly" and maybe "Count on Spark Dataframe is extremely slow". Commented Apr 9, 2019 at 15:01
  • Short answer is no, but if you cache it will speed up subsequent calls to count. Commented Apr 9, 2019 at 15:03
  • Aren't there even approximate methods? Commented Apr 10, 2019 at 15:54
  • try df.rdd.countApprox() perhaps Commented Apr 10, 2019 at 16:16

1 Answer

If you don't mind getting an approximate count, you could try sampling the dataset first and then scaling by your sampling factor:

>>> df = spark.range(10)
>>> df.sample(0.5).count()
4

In this case, you would scale the count() result by 2 (that is, 1/0.5), giving an estimate of 8 against the true count of 10. This approach obviously introduces a statistical sampling error.
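The scaling step can be sketched as a small helper that also gives a rough idea of the error. This is a minimal sketch, not part of Spark: `estimate_row_count` is a hypothetical name, and the error margin assumes each row is kept independently with probability `fraction`, which is my understanding of how sample() behaves without replacement.

```python
import math


def estimate_row_count(sample_count, fraction, z=1.96):
    """Scale a sampled count up to an estimate of the full row count.

    Hypothetical helper. Assumes every row was kept independently with
    probability `fraction` (Bernoulli sampling).
    Returns (estimate, margin), where margin is roughly a 95% error
    margin when z=1.96.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    estimate = sample_count / fraction
    # The sampled count is Binomial(N, fraction), so
    # Var(sample_count / fraction) = N * (1 - fraction) / fraction.
    # N is unknown, so the estimate is plugged in for it.
    std_err = math.sqrt(estimate * (1 - fraction) / fraction)
    return estimate, z * std_err
```

With the numbers above, estimate_row_count(4, 0.5) gives an estimate of 8.0 with a margin of about 5.5, which covers the true count of 10. On a real dataframe you would pass df.sample(fraction).count() as the first argument. The relative error shrinks as the dataframe grows, so the approach is more useful on large data than this tiny example suggests.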


6 Comments

I'm trying, but the running time continues to be rather long, although I am using a factor of 0.1.
Is the data partitioned well? If not, you might not be leveraging all of your executors. For that matter, what is your partition to executor ratio?
I didn't understand what you mean. However, I use Google Colab to run the code, and I simply replaced the df.count() operation with df.sample(0.1).count() and reran the code. Is there anything else I need to set?
To get the partition count for your dataframe, call df.rdd.getNumPartitions(). If that value is 1, your data has not been parallelized, so you aren't getting the benefit of multiple nodes or cores in your Spark cluster. If you do get a value greater than 1 (ideally, closer to 200), the next thing to check is the number of executors available in your Spark cluster. You can find this on the Spark status web page for your cluster.
I am trying to set the number of partitions with the df.coalesce() method, but Colab doesn't generate more than four partitions. There is only one executor, and I don't know how to add more on Google Colab. However, Colab uses a hexa-core processor.
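One detail worth noting on the last comment: coalesce() can only reduce the number of partitions, so asking it for more leaves the count unchanged, while repartition() can increase it at the cost of a shuffle. A minimal sketch of the partition check described above follows. `ensure_min_partitions` is a hypothetical helper name, and the only things it relies on are the existing df.rdd.getNumPartitions() and df.repartition() calls.

```python
def ensure_min_partitions(df, min_partitions):
    """Return df with at least `min_partitions` partitions.

    Hypothetical helper: `df` is assumed to be a PySpark DataFrame.
    """
    current = df.rdd.getNumPartitions()
    if current < min_partitions:
        # repartition() shuffles the data and can raise the partition
        # count; coalesce() would leave it at `current`.
        return df.repartition(min_partitions)
    return df
```

Because repartition() triggers a full shuffle, it may not pay off for a single count(). In a single-machine environment such as Colab, the number of tasks that run at once is also bounded by the cores Spark was given (for example with a master of local[*]), so going far beyond that number of partitions is unlikely to add parallelism.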