Concurrent and scalable data structure in Java to handle tasks?

Question

For my current development, I have many threads (Producers) that create Tasks and many threads that consume these Tasks (Consumers).

Each Producer is identified by a unique name. A Task is made of:

- the name of its Producer
- a name
- data

My question concerns the data structure shared by the Producers and the Consumers.

Concurrent Queue?

Naively, we could imagine that Producers populate a concurrent queue with Tasks and Consumers read/consume the Tasks stored in that queue.

I think this solution would scale rather well, but one case is problematic: if a Producer very quickly creates two Tasks with the same name but different data (tasks T1 and T2 have the same name, but T1 has data D1 and T2 has data D2), it is theoretically possible that they are consumed in the order T2 then T1!

Task Map + Queue?

Now, I imagine creating my own data structure (let's say MyQueue) based on a Map plus a Queue. Like a queue, it would have a pop() and a push() method.

- The pop() method would be quite simple.
- The push() method would check whether a matching Task is already present in MyQueue (doing a find() in the Map):
  - if found: the data of the Task to be inserted would be merged with the data stored in the found Task
  - if not found: the Task would be inserted in the Map and an entry would be added to the Queue

Of course, I'll have to make it safe for concurrent access... and that will certainly be my problem; I am almost sure that this solution won't scale.
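For illustration only, the Map + Queue idea described above could be sketched as follows. This is a minimal sketch, not a recommended design: the class name `MergingQueue`, the key type (for example producer name plus task name) and the `merger` function are all hypothetical, and a single intrinsic lock guards everything, which is exactly the contention point the scalability worry refers to.

```java
import java.util.AbstractMap;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BinaryOperator;

// Hypothetical sketch of "MyQueue": a FIFO of keys plus a map of pending data.
class MergingQueue<K, V> {
    private final Map<K, V> pending = new HashMap<>(); // key -> merged data not yet consumed
    private final Deque<K> order = new ArrayDeque<>(); // FIFO order of pending keys
    private final BinaryOperator<V> merger;            // assumed: how to merge old and new data

    MergingQueue(BinaryOperator<V> merger) {
        this.merger = merger;
    }

    // If the key is already pending, merge the data; otherwise enqueue a new entry.
    synchronized void push(K key, V data) {
        if (pending.containsKey(key)) {
            pending.merge(key, data, merger); // merger receives (existing, incoming)
        } else {
            pending.put(key, data);
            order.addLast(key);
            notifyAll(); // wake consumers blocked in pop()
        }
    }

    // Blocks until an entry is available, then removes and returns the oldest one.
    synchronized Map.Entry<K, V> pop() throws InterruptedException {
        while (order.isEmpty()) {
            wait();
        }
        K key = order.removeFirst();
        return new AbstractMap.SimpleImmutableEntry<>(key, pending.remove(key));
    }
}
```

Note that this only coalesces Tasks that are still pending. Once a Task has been popped, a later Task with the same name can still be picked up by another consumer while the first one is being processed.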

So What?

So my question is now: what is the best data structure to use in order to fulfill my requirements?

"it is theoretically possible that they are consumed in the order "T2" then "T1"!" No, it is more accurate to say that it's highly likely that they are consumed in parallel (aka concurrently) by different consumers, and will be processed in parallel. In concurrent programming, order is a very fluid concept.
– Andreas, Commented Dec 19, 2017 at 7:15
Since elements are unique, you could also try sets such as ConcurrentSkipListSet, which will enforce uniqueness. You can also avoid synchronization entirely by having each thread write to its own structure and merging them at the end. This could keep things clean.
– Hemang Rindani, Commented Dec 19, 2017 at 7:19
So you're not actually looking to use multiple consumers. Maybe not even multiple producers. This question has nothing to do with scalability. You've got a different design problem.
– Kayaman, Commented Dec 19, 2017 at 7:58
If you want two tasks to be dependent on each other as you describe, you need to bundle them for execution; otherwise you'll never be able to reliably achieve what you're trying to do in a multithreaded environment. That's what @Kayaman means by "different design problem". Maybe look into ForkJoinPool. Also note that "increase the number of Consumers" will usually decrease throughput, not increase it (unless you run them on a different machine).
– daniu, Commented Dec 19, 2017 at 8:15

OldCurmudgeon · Accepted Answer · 2017-12-19 08:33:06Z

Heinz Kabutz's Striped Executor Service is a possible candidate.

This magical thread pool would ensure that all Runnables with the same stripeClass would be executed in the order they were submitted, but StripedRunners with different stripedClasses could still execute independently.
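To make the idea concrete, here is a minimal sketch of per-key ordering on top of an ordinary thread pool. It is not Kabutz's implementation or API; the class `StripedSketch` and its `submit(stripe, task)` method are hypothetical names, and the approach (chaining each task onto the previous `CompletableFuture` of the same stripe) is just one way to get the same guarantee.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;

// Hypothetical sketch: tasks sharing a stripe run one after another in
// submission order; tasks on different stripes may run in parallel.
class StripedSketch {
    // Last scheduled task per stripe. Entries are never removed in this sketch,
    // so a real implementation would need some cleanup strategy.
    private final ConcurrentHashMap<String, CompletableFuture<Void>> tails =
            new ConcurrentHashMap<>();
    private final ExecutorService pool;

    StripedSketch(ExecutorService pool) {
        this.pool = pool;
    }

    CompletableFuture<Void> submit(String stripe, Runnable task) {
        // compute() is atomic per key, so chaining happens in submission order.
        return tails.compute(stripe, (key, tail) -> {
            if (tail == null) {
                return CompletableFuture.runAsync(task, pool);
            }
            // handle() swallows a failure of the previous task so the chain continues.
            return tail.handle((result, error) -> null).thenRunAsync(task, pool);
        });
    }
}
```

With the stripe set to something like producer name plus task name (an assumption about how the question's Tasks would be keyed), T1 always runs before T2, while unrelated Tasks still use every pool thread.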


snovelli · Accepted Answer · 2017-12-25 19:51:54Z

Instead of making a data structure safe for concurrent access, why not opt out of concurrency and go for parallelism?

Functional programming models such as MapReduce are a very scalable way to solve this kind of problem.

I understand that D1 and D2 can be analyzed either together or in isolation, and that the only constraint is that they shouldn't be analyzed in the wrong order. (I am making some assumptions here.) If the real problem is only the way the results are combined, there might be an easy solution.

You could remove the constraint altogether, allowing them to be analyzed separately, and then have a reduce function that is able to recombine them in a sensible way.

In this case you'd have the first step as map and the second as reduce.
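As a rough sketch of that map-then-reduce shape, assuming each Task carries a per-producer sequence number (`seq`, which is not in the question) and that "a sensible way" to recombine means "the latest submission wins" (also an assumption), one could write something like the following. The class and method names are hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class MapReduceSketch {
    // Hypothetical Task: the seq field is an assumption added for recombination.
    static final class Task {
        final String name; final long seq; final String data;
        Task(String name, long seq, String data) {
            this.name = name; this.seq = seq; this.data = data;
        }
    }

    static final class Result {
        final long seq; final String value;
        Result(long seq, String value) { this.seq = seq; this.value = value; }
    }

    // "Map" step: analyze one task in isolation (placeholder computation).
    static Result analyze(Task t) {
        return new Result(t.seq, t.data.toUpperCase());
    }

    // "Reduce" step: per task name, keep the result with the highest seq,
    // regardless of the order in which the parallel map step finished.
    static Map<String, Result> process(List<Task> tasks) {
        return tasks.parallelStream().collect(Collectors.toConcurrentMap(
                t -> t.name,
                MapReduceSketch::analyze,
                (a, b) -> a.seq >= b.seq ? a : b));
    }
}
```

Because the merge function only looks at `seq`, the order in which D1 and D2 happen to be analyzed no longer matters.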

Even if the computation is more efficient when done in a single pass, a big part of scaling, especially scaling out, is accomplished by denormalization.

buggy · Accepted Answer · 2017-12-26 09:28:46Z

If consumers are running in parallel, I doubt there is a way to make them execute tasks with the same name sequentially. In your example (from comments):

BlockingQueue can really be a problem (unfortunately) if a Producer "P1" adds a first task "T" with data D1 and quickly a second task "T" with data D2. In this case, the first task can be handled by a thread and the second task by another thread; if the thread handling the first task is interrupted, the thread handling the second one can complete first

It makes no difference if P1 submits D2 less quickly. Consumer 1 could still be too slow, so consumer 2 would be able to finish first. Here is an example of such a scenario:

P1: submit D1
C1: read D1
P1: submit D2
C2: read D2
C2: process D2
C1: process D1
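The interleaving above can be reproduced deterministically. The sketch below uses two `CountDownLatch`es to force consumer C1 to be "slow" after it has read D1; the latches stand in for an unlucky scheduler and are not something a real consumer would contain. The class name `OutOfOrderDemo` is made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;

class OutOfOrderDemo {
    // Returns the order in which the two tasks finished processing.
    static List<String> run() throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(Arrays.asList("D1", "D2"));
        List<String> completed = Collections.synchronizedList(new ArrayList<>());
        CountDownLatch c1HasRead = new CountDownLatch(1);
        CountDownLatch c2Done = new CountDownLatch(1);

        Thread c1 = new Thread(() -> {
            try {
                String task = queue.take(); // C1: read D1
                c1HasRead.countDown();
                c2Done.await();             // simulate C1 being slow or descheduled
                completed.add(task);        // C1: process D1
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread c2 = new Thread(() -> {
            try {
                c1HasRead.await();          // make sure C1 read first
                String task = queue.take(); // C2: read D2
                completed.add(task);        // C2: process D2
                c2Done.countDown();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        c1.start();
        c2.start();
        c1.join();
        c2.join();
        return completed;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run()); // prints [D2, D1]
    }
}
```

The queue hands the tasks out in FIFO order, yet D2 finishes before D1, so the queue's ordering guarantee alone says nothing about completion order.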

To solve it, you would have to introduce some kind of completion detection, which I believe would overcomplicate things.

If you have enough load and tasks with different names do not need to be processed in any particular order relative to each other, you can use one queue per consumer and put tasks with the same name into the same queue.

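import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ParallelQueue {

    // One queue per consumer; tasks with the same name always land in the same queue.
    private final BlockingQueue<Task<?>>[] queues;
    private final int consumersCount;

    @SuppressWarnings("unchecked") // arrays of a generic type cannot be created directly
    public ParallelQueue(int consumersCount) {
        this.consumersCount = consumersCount;

        queues = new BlockingQueue[consumersCount];
        for (int i = 0; i < consumersCount; i++) {
            queues[i] = new LinkedBlockingQueue<>();
        }
    }

    public void push(Task<?> task) {
        // hashCode() may be negative; floorMod keeps the index in [0, consumersCount).
        int index = Math.floorMod(task.name.hashCode(), consumersCount);
        queues[index].add(task);
    }

    // Blocks until a task is available in the queue owned by this consumer.
    public Task<?> pop(int consumerId) throws InterruptedException {
        int index = Math.floorMod(consumerId, consumersCount);
        return queues[index].take();
    }

    // Public so that producers outside this class can create tasks.
    public static final class Task<T> {
        private final String name;
        private final T data;

        public Task(String name, T data) {
            this.name = name;
            this.data = data;
        }
    }
}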
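As a self-contained illustration of the queue-per-consumer routing, the sketch below enqueues numbered tasks for several names and lets each consumer thread drain only its own queue. The names `PerConsumerQueuesDemo`, `route` and `run` are hypothetical, and tasks are modelled as simple (name, sequence number) pairs rather than the Task class above.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;

class PerConsumerQueuesDemo {
    // Same name -> same queue index; floorMod avoids negative indices.
    static int route(String name, int consumers) {
        return Math.floorMod(name.hashCode(), consumers);
    }

    // Returns, per task name, the sequence numbers in the order they were processed.
    static Map<String, List<Integer>> run(int consumers, List<String> names, int perName)
            throws InterruptedException {
        List<BlockingQueue<Map.Entry<String, Integer>>> queues = new ArrayList<>();
        for (int i = 0; i < consumers; i++) {
            queues.add(new LinkedBlockingQueue<>());
        }
        // Producer side: submit tasks 0..perName-1 for every name, interleaved.
        for (int seq = 0; seq < perName; seq++) {
            for (String name : names) {
                queues.get(route(name, consumers))
                      .add(new AbstractMap.SimpleEntry<>(name, seq));
            }
        }
        // Consumer side: one thread per queue, each draining only its own queue.
        Map<String, List<Integer>> seen = new ConcurrentHashMap<>();
        List<Thread> threads = new ArrayList<>();
        for (BlockingQueue<Map.Entry<String, Integer>> q : queues) {
            Thread t = new Thread(() -> {
                Map.Entry<String, Integer> e;
                while ((e = q.poll()) != null) {
                    seen.computeIfAbsent(e.getKey(), k -> new CopyOnWriteArrayList<>())
                        .add(e.getValue());
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            t.join();
        }
        return seen;
    }
}
```

Every name is handled by exactly one thread, so its tasks are processed in submission order. The trade-off is that a hot name, or an unlucky hash distribution, can leave some consumers idle while one queue backs up.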
