0

I want to count the frequency of words within a large string.

The simple single threaded solution looks like this

use hashbrown::HashMap;

fn main() {
    let buffer = String::from("Hello World Hello Rust");
    let mut frequency: HashMap<&str, u32> = HashMap::new();

    for word in buffer.split_whitespace() {
        *frequency.entry(word).or_insert(0) += 1;
    }
}

Then I tried to add some multi threading capabilities and ended up with the code below:

extern crate crossbeam;

use hashbrown::HashMap;
use std::sync::{Arc, Mutex};

fn main() {
    let buffer = Arc::new(String::from("Hello World Hello Rust"));
    let frequency: Arc<Mutex<HashMap<&str, u32>>> = Arc::new(Mutex::new(HashMap::new()));

    crossbeam::scope(|scope| {
        for _ in 0..1 {
            let buffer = buffer.clone();
            let frequency = frequency.clone();

            scope.spawn(move |_| {
                for word in buffer.split_whitespace() {
                    let mut frequency = frequency.lock().unwrap();
                    *frequency.entry(word).or_insert(0) += 1;
                }
            });
        }
    });
}

The compiler fails with the following message:

error[E0597]: `buffer` does not live long enough
  --> src/main.rs:16:29
   |
13 |             let frequency = frequency.clone();
   |                 --------- lifetime `'1` appears in the type of `frequency`
...
16 |                 for word in buffer.split_whitespace() {
   |                             ^^^^^^ borrowed value does not live long enough
17 |                     let mut frequency = frequency.lock().unwrap();
18 |                     *frequency.entry(word).or_insert(0) += 1;
   |                      --------------------- argument requires that `buffer` is borrowed for `'1`
19 |                 }
20 |             });
   |             - `buffer` dropped here while still borrowed

To simplify the code I removed string chunking and only spawn 1 thread within the for loop.

2
  • 1
    Why is your input string in an Arc if you are using crossbeam? Why are you going out of your way to use hashbrown? Commented Mar 28, 2019 at 11:55
  • The problem here is that parallelization doesn't help you at all, because a thread needs to own exclusivly the string. You can't operate with multiple threads on the same string at the same time. You have to split it beforehand. Your examples isn't very well thought. Commented Mar 28, 2019 at 11:56

1 Answer 1

2

Scoped threads allow you to borrow something outside the thread and use it inside the thread. They cannot allow you to do the opposite (borrow something inside the thread and let it escape).

buffer.split_whitespace() borrows buffer, which has been moved into the inner closure and is therefore owned by the current thread. Each word is a reference with a lifetime dependent on buffer, which will go out of scope when the thread exits. (The underlying String isn't destroyed, but word can only borrow from the Arc, which has a shorter lifetime. The same thing would be true if you just cloned a String).

Arc and scoped threads are somewhat at odds. Arc is used when you're sharing a thing between threads and you want the thing to be destroyed whenever the last thread exits. You don't generally know or care which thread is the one to destroy it, only that it gets destroyed. On the other hand, scoped threads are used when you do know where a thing should be destroyed, and all the threads that want access to it must definitely exit before that. Because the lifetimes are statically verified with scoped threads, you can use ordinary & references instead of Arc. This goes both for the String and the Mutex.

So let's apply this:

let buffer = String::from("Hello World Hello Rust");
let frequency: Mutex<HashMap<&str, u32>> = Mutex::new(HashMap::new());

crossbeam::scope(|scope| {
    for _ in 0..1 {
        scope.spawn(|_| {
            for word in buffer.split_whitespace() {
                let mut frequency = frequency.lock().unwrap();
                *frequency.entry(word).or_insert(0) += 1;
            }
        });
    }
});

Oh, that was easy. Note there are no moves, no Arcs and no clone()s, and frequency will contain references to buffer, which is presumably what you wanted. For this to work, whatever string chunking approach you use must also borrow from the original str; you can't have a separate String for each thread.

Caveat

I'm unsure exactly how similar your example is meant to be to the original code. The above solution fixes the compile problem, but as Shepmaster points out:

I'd add that the original algorithm is not very efficient as the amount of contention for the HashMap is going to be extreme. It would likely be much more efficient for each thread to have its own HashMap and merge them at the end. [...] Something like this

See also

Sign up to request clarification or add additional context in comments.

2 Comments

I'd add that the original algorithm is not very efficient as the amount of contention for the HashMap is going to be extreme. It would likely be much more efficient for each thread to have its own HashMap and merge them at the end. See also Is it possible to share a HashMap between threads without locking the entire HashMap?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.