Binning with quantiles adding exception in r

Question

I need to create 10 bins with approximately equal frequencies. For this, I am using the function "classIntervals" from the library classInt with the style 'quantile' to bin some data. This works for most columns, but when a column has one number repeated too many times, I get an error saying that some breaks are not unique. That makes sense: if, say, the last 30% or more of the column's data is the same number, the function doesn't know how to split the bins.
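The failure can be reproduced without classInt. The following is a minimal sketch that assumes base R's quantile() and cut() are a fair stand-in for the 'quantile' style (an analogy, not classInt's actual code path). With the sample column shown further down, several decile breaks fall on the same value, and cut() rejects them.

```r
# Sample column from the question (30 values, 11 of them equal to 4)
x <- c(5, 29, 4, 26, 4, 17, 4, 4, 4, 25, 4, 4, 5, 14, 18,
       13, 29, 4, 13, 6, 26, 11, 2, 23, 4, 21, 7, 4, 18, 4)

# Decile breaks: 11 break points for 10 bins
brks <- quantile(x, probs = seq(0, 1, by = 0.1), names = FALSE)

# TRUE when at least two break points coincide
has_duplicates <- anyDuplicated(brks) > 0

# cut() refuses duplicated breaks; capture the message instead of stopping
msg <- tryCatch(
  {
    cut(x, breaks = brks, include.lowest = TRUE)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

print(has_duplicates)
print(msg)
```

Here the 10%, 20% and 30% quantiles are all 4, which is why the breaks collide.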

What I would like to do is this: if a single value accounts for more than 10% of the length of the column, treat it as its own bin; otherwise, use the function as it is.

For example, let's assume we have this data frame:

df <- read.table(text="
    X
1   5
2   29
3   4
4   26
5   4
6   17
7   4
8   4
9   4
10  25
11  4
12  4
13  5
14  14
15  18
16  13
17  29
18  4
19  13
20  6
21  26
22  11
23  2
24  23
25  4
26  21
27  7
28  4
29  18
30  4", header = TRUE, stringsAsFactors = FALSE)

In this case, 10% of the length is 3. A frequency table of the values shows that 4 appears 11 times, while every other value appears at most twice.

With this info, "4" should first be treated as its own bin.
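The counts can be checked directly. This is a minimal sketch, assuming the sample values are held in a plain vector called x (a name chosen here for illustration) instead of df$X, and using a strict "greater than" comparison to follow the wording above.

```r
# Sample column from the question
x <- c(5, 29, 4, 26, 4, 17, 4, 4, 4, 25, 4, 4, 5, 14, 18,
       13, 29, 4, 13, 6, 26, 11, 2, 23, 4, 21, 7, 4, 18, 4)

# 10% of the number of rows, rounded down
threshold <- floor(length(x) / 10)

# Frequency of each distinct value
freq <- table(x)

# Values that occur more often than the threshold
frequent <- as.numeric(names(freq)[freq > threshold])

print(threshold)
print(freq)
print(frequent)
```

Only the value 4 passes the threshold in this data, so it is the only candidate for a dedicated bin.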

The final output would then look more or less like this:

    X   Bins
1   5   [2,6)
2   29  [27,30)
3   4   [4]
4   26  [26,27)
5   4   [4]
6   17  [15,19)
7   4   [4]
8   4   [4]
9   4   [4]
10  25  [19,26)
11  4   [4]
12  4   [4]
13  5   [2,6)
14  14  [12,15)
15  18  [15,19)
16  13  [12,15)
17  29  [27,30)
18  4   [4]
19  13  [12,15)
20  6   [6,12)
21  26  [26,27)
22  11  [6,12)
23  2   [2,6)
24  23  [19,26)
25  4   [4]
26  21  [19,26)
27  7   [6,12)
28  4   [4]
29  18  [15,19)
30  4   [4]

So far, my approach has been something like this:

Moda <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

Binner <- function(df) {
  library(classInt)
  # Input is a data frame whose numeric columns should be binned
  for (c in 1:ncol(df)) {
    # is.numeric() also accepts integer columns, which read.table produces here
    if (is.numeric(df[, c])) {
      VectorTest <- df[, c]

      # Here I get 10% of the number of values
      TenPer <- floor(length(VectorTest) / 10)
      # Number of values removed so far
      Counter <- 0

      while (sum(VectorTest == Moda(VectorTest)) >= TenPer) {
        # In this loop I manage to remove the values that are repeated
        # more than 10% of the time, but I still don't know how to add
        # them as a special bin
        VectorTest <- VectorTest[VectorTest != Moda(VectorTest)]
        Counter <- Counter + 1
      }

      binsTest <- classIntervals(VectorTest, 10 - Counter, style = 'quantile')
      binsBrakets <- cut(VectorTest, breaks = binsTest$brks)
      df[, paste0("Binned_", colnames(df)[c])] <- binsBrakets
    }
  }
  return(df)
}

Can someone help me?

moodymudskipper · Accepted Answer · 2018-10-10 20:17:15Z

You could use cutr::smart_cut:

# devtools::install_github("moodymudskipper/cutr")
library(cutr)
df$Bins <- smart_cut(df$X,list(10,"balanced"),"g",simplify = F)
table(df$Bins)
# 
#   [2,4)   [4,5)   [5,6)  [6,11) [11,14) [14,18) [18,21) [21,25) [25,29) [29,29] 
#       1      11       2       2       3       2       2       2       3       2

1 Comment

moodymudskipper Over a year ago

This doesn't produce the exact expected output, but I think this might be an XY problem. As I understand it, treating the value 4 as a separate factor level is just an idea the OP had, and I'm not sure it's a good one, because the bins can no longer be sorted afterwards with that option.

struggles · Accepted Answer · 2018-10-10 20:20:44Z

You can create two different data frames: one for the values that each make up at least 10% of the column, and one for the rest, with bins created by cut. Then bind them together (make sure the bins are strings).

library(magrittr)

# Let's find the numbers that appear at least 10% of the time
large <- table(df$X) %>% 
  .[. >= length(df$X)/10] %>%
  names()

# These numbers appear less than 10% of the time
left_over <- df$X[!df$X %in% large]

# We want a total of 10 bins, so we'll cut the rest into
# 10 minus the number of frequent values
left_over_bins <- cut(left_over, 10 - length(large))

#Let's combine the information into a single data frame
numbers_bins <- rbind(
  data.frame(
    n = left_over,
    bins = left_over_bins %>% as.character,
    stringsAsFactors = F
  ),
  data.frame(
    n = df$X[df$X %in% large],
    bins = df$X[df$X %in% large] %>% as.character,
    stringsAsFactors = F
  )
)

If you tabulate the result, you'll get something like this:

table(numbers_bins$bins) %>% sort(decreasing = TRUE)

       4 (1.97,5]  (11,14]  (23,26]  (17,20] 
      11        3        3        3        2 
 (20,23]  (26,29]    (5,8]  (14,17]   (8,11] 
       2        2        2        1        1
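Note that cut(left_over, k) builds equal-width bins, and the rbind step changes the row order. If quantile-based bins in the original row order are wanted, as the question asks, one possible variation is sketched below. It assumes base R's quantile() in place of classInt and a column without missing values; the helper name bin_with_exceptions and its arguments n_bins and share are hypothetical, introduced only for this illustration.

```r
# Hypothetical helper: frequent values get their own bin, the rest get
# quantile-based bins; the result keeps the order of x.
bin_with_exceptions <- function(x, n_bins = 10, share = 0.10) {
  # How often each element's value occurs in x
  idx <- match(x, unique(x))
  counts <- tabulate(idx)[idx]

  # Same ">=" rule as the answer above
  is_frequent <- counts >= length(x) * share
  n_frequent <- length(unique(x[is_frequent]))

  bins <- character(length(x))
  # Each frequent value becomes its own bin, labelled "[value]"
  bins[is_frequent] <- paste0("[", x[is_frequent], "]")

  rest <- x[!is_frequent]
  if (length(rest) > 0) {
    k <- max(n_bins - n_frequent, 1)
    # unique() guards against any quantiles that are still tied
    brks <- unique(quantile(rest, probs = seq(0, 1, length.out = k + 1),
                            names = FALSE))
    if (length(brks) < 2) {
      # All remaining values are identical: give them a single bin
      bins[!is_frequent] <- paste0("[", rest, "]")
    } else {
      bins[!is_frequent] <- as.character(
        cut(rest, breaks = brks, include.lowest = TRUE, right = FALSE)
      )
    }
  }
  bins
}

# Usage with the sample column from the question
x <- c(5, 29, 4, 26, 4, 17, 4, 4, 4, 25, 4, 4, 5, 14, 18,
       13, 29, 4, 13, 6, 26, 11, 2, 23, 4, 21, 7, 4, 18, 4)
bins <- bin_with_exceptions(x)
print(table(bins))
```

The bins are returned as strings, as in the answer, so they have no natural sort order. With ties left in the remaining data, the quantile bins will only be approximately equal in size.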
