2

I've been trying to figure out the most efficient way where many threads are altering a very big byte array on bit level. For ease of explaining I'll base the question around a multithreaded Sieve of Eratosthenes to ease explaining the question. The code though should not be expected to fully completed as I'll omit certain parts that aren't directly related. The sieve also wont be fully optimised as thats not the direct question. The sieve will work in such a way that it saves which values are primes in a byte array, where each byte contains 7 numbers (we can't alter the first bit due to all things being signed).

Lets say our goal is to find all the primes below 1 000 000 000 (1 billion). As a result we would need an byte array of length 1 000 000 000 / 7 +1 or 142 857 143 (About 143 million).

class Prime {
    int max = 1000000000;    
    byte[] b = new byte[(max/7)+1];

    Prime() {
        for(int i = 0; i < b.length; i++) {
            b[i] = (byte)127;  //Setting all values to 1 at start
        }
        findPrimes();
    }

    /*
     * Calling remove will set the bit value associated with the number
     * to 0 signaling that isn't an prime
     */
    void remove(int i) {
        int j = i/7; //gets which array index to access
        b[j] = (byte) (b[j] & ~(1 << (i%7)));
    }

    void findPrimes() {
        remove(1); //1 is not a prime and we wanna remove it from the start
        int prime = 2;
        while (prime*prime < max) {
            for(int i = prime*2; i < max; i = prime + i) {
                remove(i);
            }
            prime = nextPrime(prime); //This returns the next prime from the list
        }
    }

... //Omitting code, not relevant to question
}

Now we got a basic outline where something runs through all numbers for a certain mulitplication table and calls remove to remove numbers set bits that fits the number to 9 if we found out they aren't primes.

Now to up the ante we create threads that do the checking for us. We split the work so that each takes a part of the removing from the table. So for example if we got 4 threads and we are running through the multiplication table for the prime 2, we would assign thread 1 all in the 8 times tables with an starting offset of 2, that is 4, 10, 18, ...., the second thread gets an offset of 4, so it goes through 6, 14, 22... and so on. They then call remove on the ones they want.

Now to the real question. As most can see that while the prime is less than 7 we will have multiple threads accessing the same array index. While running through 2 for example we will have thread 1, thread 2 and thread 3 will all try to access b[0] to alter the byte which causes an race condition which we don't want.

The question therefore is, whats the best way of optimising access to the byte array.

So far the thoughts I've had for it are:

  1. Putting synchronized on the remove method. This obviously would be very easy to implement but an horrible ideas as it would remove any type of gain from having threads.
  2. Create an mutex array equal in size to the byte array. To enter an index one would need the mutex on the same index. This Would be fairly fast but would require another very big array in memory which might not be the best way to do it
  3. Limit the numbers stored in the byte to prime number we start running on. So if we start on 2 we would have numbers per array. This would however increase our array length to 500 000 000 (500 million).

Are there other ways of doing this in a fast and optimal way without overusing the memory?

(This is my first question here so I tried to be as detailed and thorough as possible but I would accept any comments on how I can improve the question - to much detail, needs more detail etc.)

7
  • Which is the problem with race condition? if 1 or 10 threads are changing the array position to zero it doesn't matter. Right!? Commented Mar 6, 2014 at 16:15
  • The issue is how we bitshift. We are reading b[i] and altering it, which means that thread 1 might read it, thread 2 might read it and then thread 1 alters it followed by thread 2 altering it and thread 2 would of course write the result from its alteration on what it read which means the alteration from thread 1 would get lost. Maybe thats a solution though, can we do the changing of bits in an atomic way? Commented Mar 6, 2014 at 16:21
  • it is not possible. =( you could try to isolate thread by range. Commented Mar 6, 2014 at 16:38
  • Just a question: I suppose the data are persisted by the threads. If not, why using several threads? Commented Mar 6, 2014 at 17:15
  • @C.Champagne Might be an language barrier so not sure I fully understood the question. But yes, the threads would be run inside an inner class with the data saved outside for use after the threads are done. Commented Mar 6, 2014 at 18:04

2 Answers 2

5

You can use an array of atomic integers for this. Unfortunately there isn't a getAndAND, which would be ideal for your remove() function, but you can CAS in a loop:

java.util.concurrent.atomic.AtomicIntegerArray aia;

....

void remove(int i) {
    int j = i/32; //gets which array index to access
    do {
        int oldVal = aia.get(j);
        int newVal = oldVal & ~(1 << (i%32));
        boolean updated = aia.weakCompareAndSet(j, oldVal, newVal);
    } while(!updated);
}

Basically you keep trying to adjust the slot to remove that bit, but you only succeed if nobody else modifies it out from under you. Safe, and likely to be very efficient. weakCompareAndSet is basically an abstracted Load-link/Store conditional instruction.

BTW, there's no reason not to use the sign bit.

Sign up to request clarification or add additional context in comments.

15 Comments

Why use weakCompareAndSet(...)? To quote the javadocs: May fail spuriously and does not provide ordering guarantees, so is only rarely an appropriate alternative to compareAndSet(...).
@Gray And this is exactly one of those situations. Spurious failures are A-OK, because we can just loop and try it again. Guaranteeing no spurious failures is not worth any extra time, because we don't rely on such a guarantee.
I'm really not sure you want to use weak... unless you know specifically what it does. It seems like it is very OS dependent on the difference between it and the "strong". See here: stackoverflow.com/questions/2443239/…
@Sneftel: what about happens-before relationship that isn't guaranteed with the weak version? a thread reading the array may see inconsistent values in it, due to thread visibility problems.
@Sneftel: remember than once the code completes, some thread has to go over the array and check the set bits. How can it guarantee seeing the most up to date values, if happens-before isn't established between the writes and the reads? Java Memory Model spec becomes tricky once you want to prove correctness on code with data races.
|
1

I think you could avoid synchronizing threads...

For example, this task:

for(int i = prime*2; i < max; i = prime + i) {
            remove(i);
}

it could be partitioned in small tasks.

for (int i =0; i < thread_poll; i++){
    int totalPos =  max/8; // dividing virtual array in bytes
    int partitionSize  = totalPos /thread_poll; // dividing bytes by thread poll
    removeAll(prime, partitionSize*i*8,  (i + 1)* partitionSize*8);
}
....

// no colisions!!!

void removeAll(int prime, int initial; int max){
    k = initial / prime;
    if (k < 2) k = 2;
    for(int i = k * prime; i < max; i = i + prime) {
        remove(i);
    }
}

3 Comments

This might work yeah, would have to include a check so that the splitting isn't inside the same index byte array.
I tested implementing this yesterday into an sorting through an very big array (2 billion numbers) and an optimised sieve (no even numbers etc) and it showed a speed up with the threads of about 1.5 (I can optimise it even better with better logic on starting number for each thread in the remove loop so we could get a better speed up). The Atomic array has a speed up of about 3 though. +1 on the idea though, I liked it.
Unfortunately, I don't have the whole code here to test by myself. But, take a look in this RecursiveAction... omsn.de/blog/…. In C, I will suggest to try openmp.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.