5

I'm currently filtering a Java List to keep approximately a percentage of its elements using a random approach:

import java.util.List;
import java.util.Random;

public class Main {
    public static void main(String[] args) {
        List<String> list = List.of("1", "2", "3", "4", "5", "6", "7", "8", "9", "10");

        // Percentage of elements to retain (between 0 and 100)
        int percentageToRetain = 30; // Change this value to adjust the percentage

        Random random = new Random();
        List<String> result = list.stream()
                .filter(s -> random.nextInt(100) < percentageToRetain)
                .toList();

        System.out.println(result);
    }
}

While this works for approximating the percentage, each run produces different results, and the actual percentage retained can vary significantly from my target, especially with smaller lists.

For example: When targeting 30% of a 10-element list, I might get 2 elements (20%) on one run and 4 elements (40%) on another.

My question: How can I modify my code to consistently retain exactly the specified percentage of elements from a List? Ideally, I'd like a solution that:

  • Keeps exactly percentageToRetain percent of the elements (rounding down as needed)

  • Still provides randomness in which specific elements are selected

  • Works efficiently for both small and large lists

How should I handle this kind of "exact percentage sampling" problem?

3
  • 1
    Hi user, welcome. The idea of an “exact percentage” is already flawed. Imagine a list such as [ 10, 5, 1 ] that you want to reduce to 50%: what would you keep, an item and a half? No, you have to set a tolerance, which will be determined by the number of items. Commented May 28 at 5:04
  • 1
    Random numbers are exactly that, random. With a set of 10 and therefore 10 rolls of the dice, you cannot expect equally spread results. For better understanding, write a little test program to print 10 random numbers in the range 0 to 99. Commented May 28 at 5:42
  • What are your constraints about list size, is the initial list a mutable one and so on? Calculating the required number of elements and rounding down is easy, but depending on the constraints you may use the standard library, or you may need to write your own algorithm. Commented May 28 at 6:47

3 Answers 3

6

Your Approach

@Marce Puente correctly points out in the question comments that retaining an exact percentage of elements is not always possible, at least for smaller list sizes.

The idea of an “exact percentage” is already flawed. Imagine a list such as [ 10, 5, 1 ] that you want to reduce to 50%: what would you keep, an item and a half? No, you have to set a tolerance, which will be determined by the number of items.

For larger input sizes the rounding error decreases.

Another problem, as the busybee pointed out, is that discarding each element with a fixed probability is not guaranteed to return a stable number of elements.

To illustrate that, here are 10 sets of 10 random numbers from 0 to 99 (in brackets, I noted the number of elements that would be retained given percentageToRetain = 30, i.e. the values below 30):

0 64 93 11 77 13 82 87 8 68 (4 retains)
77 60 39 97 64 46 33 42 87 79 (0 retains)
26 91 0 32 49 31 15 7 2 7 (6 retains)
95 90 38 10 70 37 32 17 82 98 (2 retains)
11 43 96 58 10 40 30 4 40 54 (3 retains)
2 66 86 87 17 88 92 64 10 79 (3 retains)
3 86 75 51 30 37 2 96 49 89 (2 retains)
9 10 23 84 52 28 32 14 12 33 (6 retains)
53 91 54 66 28 61 67 35 45 70 (1 retains)
3 77 79 24 37 60 6 80 8 39 (4 retains)
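
To reproduce this kind of table yourself, here is a minimal sketch that counts how many elements the filter-based approach would keep per run. The class name RetainCountDemo and the helper countRetained are made up for illustration; the logic mirrors random.nextInt(100) < percentageToRetain from the question.

```java
import java.util.Random;

class RetainCountDemo {
    // Counts how many of `size` draws from [0, 100) fall below the threshold,
    // i.e. how many elements the filter-based approach would keep in one run.
    static int countRetained(Random random, int size, int percentageToRetain) {
        int retained = 0;
        for (int i = 0; i < size; i++) {
            if (random.nextInt(100) < percentageToRetain) {
                retained++;
            }
        }
        return retained;
    }

    public static void main(String[] args) {
        Random random = new Random();
        // Ten independent runs over a 10-element list with a 30% target
        for (int run = 0; run < 10; run++) {
            System.out.println(countRetained(random, 10, 30) + " retains");
        }
    }
}
```

The counts scatter around 3 from run to run, because each element is decided independently of the others.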

My suggestion

  1. calculate the number of elements you wish to retain: int numberOfElementsToRetain = (int) (source.size() * percentageToRetain / 100.0);
  2. make a copy of the input list to a known mutable type (you can skip this step if the input is guaranteed to be mutable & you don't mind ruining it)
  3. shuffle it
  4. copy the first numberOfElementsToRetain elements to a new list & return that.

Thus we get:

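import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public static <E> List<E> retainNumberOfElementsRandomly2(List<E> source, int percentageToRetain){
    // Target size, rounded down by the cast
    int numberOfElementsToRetain = (int) (source.size() * percentageToRetain / 100.0);
    // Work on a mutable copy so the caller's list is left untouched
    ArrayList<E> defcopy = new ArrayList<>(source);

    Random r = new Random();

    // After shuffling, the first n elements form a random selection
    Collections.shuffle(defcopy, r);

    List<E> result = new ArrayList<>(numberOfElementsToRetain);

    for(int i = 0; i < numberOfElementsToRetain; i++){
        result.add(defcopy.get(i));
    }

    return result;
}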

1 Comment

Nice, could return defcopy.subList(0, numberOfElementsToRetain) or wrapped with List.copyOf()
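
The variant suggested in this comment could look roughly like the sketch below. The class name SubListSampling, the method name retainPercentage and the extra Random parameter are assumptions made for illustration; only the subList and List.copyOf calls come from the comment.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class SubListSampling {
    static <E> List<E> retainPercentage(List<E> source, int percentageToRetain, Random random) {
        // Target size, rounded down by the cast
        int numberOfElementsToRetain = (int) (source.size() * percentageToRetain / 100.0);

        // Shuffle a mutable copy so the input list is not modified
        List<E> copy = new ArrayList<>(source);
        Collections.shuffle(copy, random);

        // subList is only a view of the shuffled copy; List.copyOf (Java 10+)
        // detaches it into an unmodifiable list. Note that List.copyOf
        // rejects null elements.
        return List.copyOf(copy.subList(0, numberOfElementsToRetain));
    }
}
```

Returning the bare subList would also work, but it keeps the whole shuffled copy alive for as long as the view is referenced.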
2

This algorithm preserves the ordering of entries. The probability of keeping each item is adjusted by the number of items selected so far: for the item at index i it is (retain - result.size()) / (list.size() - i). Consider list size 10 and percentage 30.0, which suggests retaining 3 elements. The first item is kept with probability 3 / 10 = 30%. If it was kept, the second item is kept with probability (3 - 1) / (10 - 1) => 22.22%, whereas if it was not kept, the second item is kept with probability (3 - 0) / (10 - 1) => 33.33%. And so on:

// needs java.util.ArrayList, java.util.List and java.util.Random
Random random = new Random();
List<String> list = List.of("1", "2", "3", "4", "5", "6", "7", "8", "9", "10");
double percentageToRetain = 30.0; // Change this value to adjust the percentage
double retain = percentageToRetain / 100.0 * list.size();

ArrayList<String> result = new ArrayList<>((int)Math.ceil(retain));
for (int i = 0, sz = list.size(); i < sz; i++) {
    // chance for item i = items still needed / items still to visit
    double pc = (retain - result.size()) / (sz - i);
    if (random.nextDouble() < pc)
        result.add(list.get(i));
}
System.out.println(percentageToRetain+"% #"+result.size()+" = "+result);

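When retain is a whole number, the result has exactly retain elements; otherwise its size lands on one of the two neighbouring integers. Adjusting the probability based on earlier results can look less random than other methods, such as the shuffle approach: if many items are selected early, fewer from the tail are included, and vice versa. That adjustment is what keeps the total fixed, though, and the scheme is known as selection sampling (Knuth's Algorithm S), under which every subset of the target size is equally likely.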
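
As a method-shaped sketch of the same idea, assuming the target is first rounded down to a whole number (that rounding, the class name OrderedSampling, the method name retainInOrder and the Random parameter are additions for illustration, not part of the answer above):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class OrderedSampling {
    static <E> List<E> retainInOrder(List<E> source, int percentageToRetain, Random random) {
        // Whole-number target, rounded down by the cast
        int retain = (int) (source.size() * percentageToRetain / 100.0);
        List<E> result = new ArrayList<>(retain);

        int index = 0;
        for (E element : source) {
            int remaining = source.size() - index; // elements not yet visited, including this one
            int needed = retain - result.size();   // elements still to be picked
            // Keep this element with probability needed / remaining.
            // When needed == remaining every further element is kept;
            // when needed == 0 none is.
            if (random.nextInt(remaining) < needed) {
                result.add(element);
            }
            index++;
        }
        return result;
    }
}
```

With an integer target the loop always finishes with exactly retain elements, and they appear in the same relative order as in source.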

1

I'm not a Java guy, but this is a fairly simple algorithm to implement.

  1. Calculate how many elements you want to save. You can use the percentage for this of course, but we want an actual number at the end of the day. Something along the lines of var resultSize = list.size() * .3.
  2. Start by copying the first resultSize elements into result. This seeds your result with the number you want.
  3. Iterate over the remaining elements and, for each one, roll a random number between 0 and list.size(). If the number is less than resultSize, use it as an index and replace the existing value at that position in result with the current element (List probably lacks random access, so adjust as needed). The aim is that every element has an equal chance of making it into the final result.
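
A minimal Java sketch of these steps, with hypothetical names (ReservoirSampling, retainPercentage) and a Random parameter added for illustration. One detail is an assumption about the intent: the sketch draws the random index from 0 up to the current element's position, as standard reservoir sampling (Algorithm R) does, because that bound is what gives every element the same chance of ending up in the result.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class ReservoirSampling {
    static <E> List<E> retainPercentage(List<E> source, int percentageToRetain, Random random) {
        // Step 1: target size, rounded down by the cast
        int resultSize = (int) (source.size() * percentageToRetain / 100.0);
        List<E> result = new ArrayList<>(resultSize);

        int seen = 0; // number of elements visited before the current one
        for (E element : source) {
            if (seen < resultSize) {
                // Step 2: seed the result with the first resultSize elements
                result.add(element);
            } else {
                // Step 3: roll an index in [0, seen]; if it falls inside the
                // result, the current element replaces the one stored there
                int index = random.nextInt(seen + 1);
                if (index < resultSize) {
                    result.set(index, element);
                }
            }
            seen++;
        }
        return result;
    }
}
```

This needs only a single pass and no copy of the input, so it also works when the source is a stream or iterator of unknown length, provided resultSize is known up front. Unlike the shuffle approach, the result does not come out in a uniformly random order.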

3 Comments

sounds good actually. May I ask if you have a snippet?
The third point should probably be "Iterate over the remaining elements and roll a random number between 0 and list.size() [...]". As written, 10 is guaranteed to make it into the list, somewhere.
@JannikS. - That's what I get for coming on stack overflow when I can't fall asleep. Thanks, fixed.
