Why does the quantile function in R give unequal counts of values in each group?

Question

I am trying to split a continuous variable into tertiles, using the quantile function to get the break points. This is my code:

# vTert holds the tertile break points (its definition is not shown in the question)
dd$wbc_tert = with(dd, 
               cut(wbc, 
                   vTert, 
                   include.lowest = TRUE, 
                   labels = c("Low", "Medium", "High")))

Isn't this supposed to give an equal count of values in each group? I am getting different counts:

> dd %>% filter(wbc_tert == 'High') %>% select('wbc')  %>% nrow() 
[1] 143
> dd %>% filter(wbc_tert == 'Low') %>% select('wbc')  %>% nrow()
[1] 148
> dd %>% filter(wbc_tert == 'Medium') %>% select('wbc')  %>% nrow()
[1] 139

This is the dput of the values

c(10.9, 5.4, 9.1, 7.4, 6.6, 5.5, 4.4, 6.7, 7.8, 6.7, 6.6, 8.6, 
8.4, 4.8, 7, 5.2, 7, 6.7, 10.4, 7.5, 8.5, 6.8, 8.5, 9.4, 4.6, 
6.8, 10.2, 6.7, 4.6, 4.9, 6.7, 8.9, 5.9, 5.9, 9.9, 4.1, 8.4, 
9, 7.7, 8.2, 5.7, 8.4, 7.7, 4.6, 6.5, 7.3, 4.9, 3.8, 6.2, 7.9, 
5.3, 8.9, 6, 4.8, 5.9, 5.4, 8.6, 6.1, 9.5, 5.8, 6.2, 5.8, 7.9, 
9.6, 6.6, 9.6, 7, 10.1, 9, 6.9, 9.1, 6.8, 8.4, 9.6, 4.4, 10.5, 
7.9, 5.6, 5.1, 6.6, 6.5, 12.7, 5.3, 7.7, 4.8, 4.7, 6.1, 4.3, 
6.1, 11.6, 5.9, 7.4, 5.7, 4.7, 4.8, 8.5, 5.6, 7.9, 9.1, 7.8, 
5.3, 5, 8.1, 8.3, 4.7, 5.4, 7.6, 7.2, 5.7, 7.9, 7.9, 6.4, 3.8, 
4.7, 6.2, 5, 7.6, 5.8, 5.4, 4.3, 6, 4.7, 6, 6.1, 5.8, 5.6, 4.7, 
5, 11.5, 6.3, 4.4, 6.8, 6.6, 6.8, 6.1, 4.8, 5.4, 5.8, 5.2, 7.1, 
5.4, 9.1, 6.9, 5.4, 8.5, 5.3, 7.3, 6.9, 9, 6.3, 8.4, 7.8, 5.7, 
6.4, 5.3, 9.6, 6.4, 9.9, 8.9, 7.7, 6.2, 7.2, 4.6, 5.4, 4.6, 11.2, 
3.1, 12.3, 5.9, 11.1, 6.2, 6.6, 4.1, 7.4, 9.4, 4.1, 6.7, 6.7, 
6.1, 6.3, 5.6, NA, 3.7, 6.8, 6.7, 6.4, 7.3, 5.7, 6.7, 6.9, 5.7, 
5.3, 4, 5.6, 4.8, 5.5, 6, 6.6, 3.6, 5.6, 8.9, 6.3, 5.8, 8.2, 
8.6, 8.5, 5.7, 8.6, 6, 5.1, 5.7, 8.2, 5.4, 6.9, 6.9, 8.3, 9.5, 
5.4, 10.2, 8.8, 7.2, 4.8, 9.8, 4.6, 6.3, 5.8, 4.9, 12.7, 7.5, 
10.6, 9.3, 5.5, 10.7, 6.2, 9.3, 8.3, 7.8, 8.05, 9.57, 6.62, 6.21, 
5.34, 6.11, 10.37, 4.45, 5.55, 8.05, 8.31, 5.06, 6.05, 4.76, 
9.09, 9.11, 9.04, 6.99, 6.33, 9.47, 6.48, 4.46, 9.44, 6.88, 7.09, 
5.75, 10.89, 6.68, 3.64, 6.55, 8.69, 5.89, 9.05, 6.38, 11.62, 
9.11, 9.22, 7.97, 9.64, 12.76, 8.39, 6.57, 8.1, 7.3, 10.1, 4.7, 
6.4, 7.2, 5.5, 3.7, 5.1, 9.8, 7.6, 7.7, 6, 3.9, 6.8, 5.4, 5.4, 
9.7, 9, 6, 7.3, 6.3, 5.8, 8.3, 7, 4.1, 11.2, 5, 7.6, 6.5, 4.8, 
8, 10.1, 7.1, 7.4, 4.3, 4, 10.12, 4.3, 7.26, 8.84, 8.44, 8.44, 
8.12, 6.5, 8.58, 8.55, 8.82, 4.53, 9.51, 4.93, 4.42, 4.69, 8.69, 
5.77, 3.37, 6.58, 3.72, 3.09, 7.13, 8.11, 7.2, 12.18, 6.52, 7.91, 
5.69, 8.24, 7.67, 5.69, 4.85, 7.03, 4.16, 3.57, 8.1, 4.61, 5.98, 
5.13, 7.68, 5.47, 5.54, 4.59, 6, 11.62, 7.38, 7.06, 8.74, 8.02, 
6.73, 7.19, 6.36, 4.86, 6.55, 8.4, 7.76, 4.73, 4.8, 5.73, 8.53, 
4.6, 7.96, 9.48, 6.59, 5.75, 6.61, 6.49, 7.91, 6.92, 7.14, 6.24, 
12.53, 7.03, 4.73, 8.05, 7.26, 4.07, 6.7, 5.7, 7.39, 5.2, 6.61, 
6.8, 6.77, 5.65, 6.08, 7.24, 6.13, 7.92, 7.37, 7.99, 3.31, 9.72, 
8.71, 8.35, 5.05, 8.15, 5.1, 5.4, 8.8, 4.9, 5, 7.43, 10.3, 6.3, 
9.5, 6.9, 6.7, 5.4, 7.7, 8, 6.5, 5.6, 9.7)

Can someone please explain what the reason could be?

I also tried counting the unique elements:
> dd %>% filter(wbc_tert == '5.9-7.7') %>% select('wbc') %>% unique() %>% nrow()
[1] 56
> dd %>% filter(wbc_tert == '3.1-5.9') %>% select('wbc') %>% unique() %>% nrow()
[1] 56
> dd %>% filter(wbc_tert == '7.7-12.8') %>% select('wbc') %>% unique() %>% nrow()
[1] 78
– arshad, Commented Aug 18, 2022 at 7:16
> quantile(dd$wbc, 1:3/3, na.rm = T)
33.33333% 66.66667%      100%
     5.90      7.68     12.76
– arshad, Commented Aug 18, 2022 at 7:19
Quantile groups need not contain equal numbers of values. The p-th percentile is simply the value of the data below which p % of your values fall; it is found by counting up to that point, so repeated values can make the groups uneven.
– Baraliuh, Commented Aug 18, 2022 at 7:33

Baraliuh · Accepted Answer · 2022-08-18 07:41:29Z

2

Here is an example showing that quantiles do not necessarily split the data into groups with equal numbers of values.

# Define some data
x <- 1:10
y <-  rep(1:2, 10)
# Look at the quantiles
quantile(x)
#>    0%   25%   50%   75%  100% 
#>  1.00  3.25  5.50  7.75 10.00
# Due to the added y we now have asymmetry in sizes
quantile(c(y,x))
#>    0%   25%   50%   75%  100% 
#>  1.00  1.00  2.00  2.75 10.00
# Notice how the number of values below 50 % and 75 % changes.
## Without y we get roughly the same bin size
sum(x<quantile(x, .5))
#> [1] 5
sum(x<quantile(x, .75))
#> [1] 7
## But when we add y, the count doubles even though we only raise
## the percentile by 25 percentage points
sum(c(y,x)<quantile(c(y,x), .5))
#> [1] 11
sum(c(y,x)<quantile(c(y,x), .75))
#> [1] 22

Created on 2022-08-18 by the reprex package (v2.0.1)
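
The same effect can be reproduced with cut on tertile breaks. The following is only a minimal sketch on hypothetical toy data (the vector x is made up for illustration and is not the asker's wbc column): when several values are tied at a break point, cut has to put all of them in the same group, so the counts cannot come out equal.

```r
# Hypothetical toy data: nine values, four of them tied at 2
x <- c(1, 2, 2, 2, 2, 3, 4, 5, 6)

# Tertile break points (type = 7 is the default algorithm)
breaks <- quantile(x, probs = (0:3) / 3)

# cut() uses right-closed intervals (a, b] by default, so every value
# equal to the first inner break (2) lands in the lowest group
groups <- cut(x, breaks, include.lowest = TRUE,
              labels = c("Low", "Medium", "High"))

counts <- table(groups)
print(breaks)
print(counts)  # Low = 5, Medium = 1, High = 3
```

Nine values would divide evenly into three groups of three, but the four tied 2s all sit exactly on the first break, which leaves five values in Low and only one in Medium.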

Rui Barradas · Accepted Answer · 2022-08-18 08:01:02Z

1

The quantile function can compute quantiles in 9 different ways, selected with the type argument; the default is type = 7.
Use findInterval instead of cut to assign the groups.

vTert <- quantile(dd$wbc, (0:3)/3, na.rm = TRUE)

dd$wbc_tert <- findInterval(dd$wbc, vTert, rightmost.closed = TRUE, all.inside = TRUE)
dd$wbc_tert <- factor(dd$wbc_tert, labels = c("Low", "Medium", "High"))

table(dd$wbc_tert, useNA = "always")
#> 
#>    Low Medium   High   <NA> 
#>    143    143    144      1

Created on 2022-08-18 by the reprex package (v2.0.1)
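
A likely reason the counts change is interval closure: by default cut uses right-closed intervals (a, b], while findInterval uses left-closed intervals [a, b), so a value exactly equal to a break moves from the lower group to the upper one. The sketch below uses hypothetical toy data (x is not the asker's data) to show that this only relocates the ties and does not by itself guarantee equal counts. It also includes a rank-based split, an alternative that is not part of the answer above, which forces equal sizes at the cost of separating tied values into different groups.

```r
# Hypothetical toy data: nine values, four of them tied at 2
x <- c(1, 2, 2, 2, 2, 3, 4, 5, 6)
breaks <- quantile(x, probs = (0:3) / 3)

# cut(): right-closed intervals, ties at a break go to the lower group
by_cut <- cut(x, breaks, include.lowest = TRUE, labels = FALSE)

# findInterval(): left-closed intervals, ties at a break go to the upper group
by_find <- findInterval(x, breaks, rightmost.closed = TRUE, all.inside = TRUE)

# cut() with right = FALSE uses the same left-closed intervals
by_cut_left <- cut(x, breaks, right = FALSE, include.lowest = TRUE,
                   labels = FALSE)

cut_counts  <- tabulate(by_cut, nbins = 3)   # 5 1 3
find_counts <- tabulate(by_find, nbins = 3)  # 1 5 3

# Rank-based split (an assumption, not the answer's method): ties are broken
# by order of appearance, so group sizes are equal but equal values can
# end up in different groups
r <- rank(x, ties.method = "first")
by_rank <- ((r - 1) * 3) %/% length(x) + 1
rank_counts <- tabulate(by_rank, nbins = 3)  # 3 3 3

print(rbind(cut = cut_counts, findInterval = find_counts, rank = rank_counts))
```

On the toy data neither interval convention yields equal groups; only the rank-based split does. Whether splitting tied values across groups is acceptable depends on the analysis, so this is a design choice rather than a fix.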