I need to create 10 bins with frequencies as close to equal as possible. For this, I am using the function "classIntervals" from the library classInt with the style 'quantile' to bin some data. This works for most columns, but when a column has one number repeated many times, I get an error saying that some breaks are not unique. That makes sense: if, say, the last 30% or more of the column is the same number, the function doesn't know how to split the bins.
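A minimal sketch of how duplicated quantile breaks arise, using base R's quantile() and cut() rather than classInt. The vector x is made up for illustration, and classIntervals may report the problem with different wording:

```r
# Hypothetical column in which one value (9) makes up 70% of the data
x <- c(1, 2, 3, rep(9, 7))

# Decile breaks: several of the upper deciles collapse onto the same value
brks <- quantile(x, probs = seq(0, 1, by = 0.1))
has_duplicates <- anyDuplicated(brks) > 0

# cut() refuses duplicated breaks; capture the error message instead of stopping
res <- tryCatch(
  cut(x, breaks = brks, include.lowest = TRUE),
  error = function(e) conditionMessage(e)
)
res
```

Here has_duplicates is TRUE, and res holds the error message instead of a factor.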

What I would like to do is this: if a number accounts for more than 10% of the length of the column, treat it as its own bin; otherwise, use the function as it is.

For example, let's assume we have this DF:

df <- read.table(text="
    X
1   5
2   29
3   4
4   26
5   4
6   17
7   4
8   4
9   4
10  25
11  4
12  4
13  5
14  14
15  18
16  13
17  29
18  4
19  13
20  6
21  26
22  11
23  2
24  23
25  4
26  21
27  7
28  4
29  18
30  4",h=T,strin=F)

In this case 10% of the length is 3, so a table containing the frequency of each number would look something like this:

2   1
4   11
5   2
6   1
7   1
11  1
13  2
14  1
17  1
18  2
21  1
23  1
25  1
26  2
29  2

With this info, first we should treat "4" as a unique bin.
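A minimal sketch of that check in base R, using the X values from the data frame above. The names x, freq, threshold, and frequent are only illustrative:

```r
# X values from the example data frame, in row order
x <- c(5, 29, 4, 26, 4, 17, 4, 4, 4, 25, 4, 4, 5, 14, 18,
       13, 29, 4, 13, 6, 26, 11, 2, 23, 4, 21, 7, 4, 18, 4)

# Frequency of each distinct value
freq <- table(x)

# Values that account for more than 10% of the rows
threshold <- length(x) / 10
frequent <- as.numeric(names(freq)[as.vector(freq) > threshold])
frequent
```

With this data only the value 4 (11 occurrences against a threshold of 3) passes the check.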

So we have a final output more or less like this:

    X   Bins
1   5   [2,6)
2   29  [27,30)
3   4   [4]
4   26  [26,27)
5   4   [4]
6   17  [15,19)
7   4   [4]
8   4   [4]
9   4   [4]
10  25  [19,26)
11  4   [4]
12  4   [4]
13  5   [2,6)
14  14  [12,15)
15  18  [15,19)
16  13  [12,15)
17  29  [27,30)
18  4   [4]
19  13  [12,15)
20  6   [6,12)
21  26  [26,27)
22  11  [6,12)
23  2   [2,6)
24  23  [19,26)
25  4   [4]
26  21  [19,26)
27  7   [6,12)
28  4   [4]
29  18  [15,19)
30  4   [4]

So far, my approach has been something like this:

Moda <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

Binner <- function(df) {
  library(classInt)
  # Input is a data frame whose numeric columns should be binned
  for (c in 1:ncol(df)) {
    # is.numeric() also covers integer columns, unlike class(...) == "numeric"
    if (is.numeric(df[[c]])) {
      VectorTest <- df[,c]
      # Working copy: frequent values are removed from it, the original is kept
      VectorTest_Fixed <- VectorTest
      Counter <- 0

# Here I get the 10% of the values
      TenPer <- floor(length(VectorTest)/10)

      while((sum(VectorTest_Fixed == Moda(VectorTest_Fixed)))>=TenPer) {
# in this loop I manage to remove the values that 
# are repeated more than 10% but I still don't know how to add it as a special bin
        VectorTest_Fixed <- VectorTest_Fixed[VectorTest_Fixed!=Moda(VectorTest_Fixed)]
        Counter <- Counter +1
      }

      binsTest <- classIntervals(VectorTest_Fixed, 10- Counter, style = 'quantile')
      # Cut the full column so the result has one entry per row
      binsBrakets <- cut(VectorTest, breaks = binsTest$brks, include.lowest = TRUE)
      df[ , paste0("Binned_", colnames(df)[c])]   <- binsBrakets
    }
  }
  return (df)
}

Can someone help me?

2 Answers

You could use cutr::smart_cut:

# devtools::install_github("moodymudskipper/cutr")
library(cutr)
df$Bins <- smart_cut(df$X,list(10,"balanced"),"g",simplify = F)
table(df$Bins)
# 
#   [2,4)   [4,5)   [5,6)  [6,11) [11,14) [14,18) [18,21) [21,25) [25,29) [29,29] 
#       1      11       2       2       3       2       2       2       3       2 

more on cutr and smart_cut

1 Comment

This doesn't reproduce the exact expected output, but I think this might be an XY problem: as I understand it, treating the value 4 as a separate factor level is just an idea the OP had, and I'm not sure it's a good one, since the bins can no longer be sorted with that option.

You can create two different data frames: one for the numbers that make up at least 10% of the column (each becomes its own bin), and one for the remaining numbers, binned with cut. Then bind them together (make sure the bins are strings).

library(magrittr)

#lets find the numbers that appear more than 10% of the time
large <- table(df$X) %>% 
  .[. >= length(df$X)/10] %>%
  names()

#these numbers appear less than 10% of the time
left_over <- df$X[!df$X %in% large]



#we want a total of 10 bins, so we'll cut the data into 10 - the number of 10%
left_over_bins <- cut(left_over, 10 - length(large))

#Let's combine the information into a single data frame
numbers_bins <- rbind(
  data.frame(
    n = left_over,
    bins = left_over_bins %>% as.character,
    stringsAsFactors = F
  ),
  data.frame(
    n = df$X[df$X %in% large],
    bins = df$X[df$X %in% large] %>% as.character,
    stringsAsFactors = F
  )
)

If you tabulate the result, you'll get something like this:

table(numbers_bins$bins) %>% sort(T)

       4 (1.97,5]  (11,14]  (23,26]  (17,20] 
      11        3        3        3        2 
 (20,23]  (26,29]    (5,8]  (14,17]   (8,11] 
       2        2        2        1        1 
