I have a very large PySpark dataframe and I would like to calculate the number of rows, but the count() method is too slow. Is there any faster method?
1 Answer
If you don't mind getting an approximate count, you could try sampling the dataset first and then scaling by your sampling factor:
>>> df = spark.range(10)
>>> df.sample(0.5).count()
4
In this case, you would scale the count() result by 2 (i.e., 1/0.5). Obviously, there is a statistical error with this approach.
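The scale-up step can be sketched as follows. The helper function name is mine, not part of the Spark API, and the commented-out lines assume a SparkSession named `spark` and a dataframe `df`:

```python
# Estimate the full row count from a sampled count by dividing by the
# sampling fraction. scale_sampled_count is an illustrative helper,
# not a Spark API.
def scale_sampled_count(sampled_count, fraction):
    """Scale a count taken on df.sample(fraction) back up to an
    estimate of the full dataframe's row count."""
    return int(round(sampled_count / fraction))

# With the numbers above: sample(0.5) counted 4 rows, so estimate ~8.
print(scale_sampled_count(4, 0.5))  # → 8

# Full pattern (requires a SparkSession `spark` and a dataframe `df`):
# fraction = 0.1
# estimate = scale_sampled_count(df.sample(fraction).count(), fraction)
```

Remember that the estimate carries sampling error: the smaller the fraction, the noisier the scaled-up count.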
6 Comments
Luigi
I'm trying it, but the running time is still rather long, even though I am using a factor of 0.1.
kamprath
Is the data partitioned well? If not, you might not be leveraging all of your executors. For that matter, what is your partition to executor ratio?
Luigi
I don't understand what you mean. However, I use Google Colab to run the code, and I simply replaced the df.count() operation with df.sample(0.1).count() and reran the code. Is there anything else to set?
kamprath
To get the partition count for your dataframe, call df.rdd.getNumPartitions(). If that value is 1, your data has not been parallelized, so you aren't getting the benefit of multiple nodes or cores in your Spark cluster. If you get a value greater than 1 (ideally, closer to 200), then the next thing to look at is the number of available executors your Spark cluster has. You can find this on the Spark status web page for your cluster.
Luigi
I am trying to set the number of partitions with the df.coalesce() method, but Colab doesn't generate more than four partitions. There is only one executor, and I don't know how to increase the executor count on Google Colab. However, Colab uses a six-core processor.
df.rdd.countApprox(), perhaps?