
I'm trying to build scientific computing software in Rust, which requires manipulating matrices while the program runs. A typical operation is appending (accumulating) sub-blocks of small matrices at specific positions of a large matrix. Let me demonstrate the process with the following example:

Suppose I have an 8 × 8 matrix M_{master} whose elements are all 0 in the initial state:

[image: M_{master}, an 8 × 8 matrix of zeros]

There are also two 6 × 6 matrices, M_{slave1} and M_{slave2}:

[image: M_{slave1} (all elements 9)] [image: M_{slave2} (all elements 90)]

Step 1) Write the 2 × 2 sub-blocks of M_{slave1} into M_{master}; call the result M_{master1}:

[image: M_{master1}, M_{master} after the sub-blocks of M_{slave1} have been added]

Step 2) Write the 2 × 2 sub-blocks of M_{slave2} into M_{master1}; call the result M_{master12}:

[image: M_{master12}, the final matrix; it matches the last matrix printed below]

The elements highlighted in green in M_{master12} (the cells whose value is 99 in the output below) lie at positions where both M_{slave1} and M_{slave2} are written, so the two contributions are summed there.
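In other words (this is my restatement of the rule implemented by the code below): each 6 × 6 slave matrix is treated as a 3 × 3 grid of 2 × 2 blocks, and a vector cpld of block positions decides where each block row/column lands in the master, so that

    M_{master}[2*cpld[idx] + row][2*cpld[idy] + col] += M_{slave}[2*idx + row][2*idy + col]

for idx, idy in 0..cpld.len() and row, col in {0, 1}.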

For the problem described above, I have implemented a single-threaded solution using nested for loops. Here are my code and its results. (You can run the single-threaded version directly in the Rust Playground and see the results: https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=cc5db512a4445ae3a92b1668e0611eba)

use rayon::prelude::*; // not used in the single-threaded version; the goal is a Rayon port
use std::time::Instant;

fn main() {
    const R: usize = 8;
    const C: usize = 8;

    const R_SUB1: usize = 6;
    const C_SUB1: usize = 6;

    const R_SUB2: usize = 6;
    const C_SUB2: usize = 6;

    // The master matrix; all elements start at 0.
    let mut arr1: [[u8; C]; R] = [[0; C]; R];
    arr2d_print(arr1);

    let sub1: [[u8; C_SUB1]; R_SUB1] = [[9; C_SUB1]; R_SUB1];
    arr2d_print(sub1);
    let cpld1: Vec<usize> = vec![0, 1, 3]; // block-position indices (see the mapping above), not direct element positions

    let sub2: [[u8; C_SUB2]; R_SUB2] = [[90; C_SUB2]; R_SUB2];
    arr2d_print(sub2);
    let cpld2: Vec<usize> = vec![0, 2, 3];

    let subs = vec![sub1, sub2];
    let cplds = vec![cpld1, cpld2];

    let t1 = Instant::now();
    cplds
        .iter()
        .zip(subs.iter())
        .for_each(|(cpld, sub)| sqare_matrix_block_fill(&mut arr1, sub, cpld));
    let tot1 = t1.elapsed();
    arr2d_print(arr1);
    println!("The single-threaded version takes time: {:?}\n", tot1);
}

fn sqare_matrix_block_fill<const D1: usize, const D2: usize>(
    master: &mut [[u8; D1]; D1],
    slave: &[[u8; D2]; D2],
    cplds: &[usize],
) {
    // Each entry of `cplds` maps one 2x2 block row/column of the slave
    // to a 2x2 block row/column of the master.
    if D2 != 2 * cplds.len() {
        panic!(
            "slave dimension {} must equal 2 * cplds.len() = {}",
            D2,
            2 * cplds.len()
        );
    }

    for idx in 0..cplds.len() {
        for idy in 0..cplds.len() {
            for row in 0..2 {
                for col in 0..2 {
                    // Add element (row, col) of slave block (idx, idy) onto
                    // master block (cplds[idx], cplds[idy]).
                    master[cplds[idx] * 2 + row][cplds[idy] * 2 + col] +=
                        slave[idx * 2 + row][idy * 2 + col];
                }
            }
        }
    }
}

fn arr2d_print<const D: usize>(arr: [[u8; D]; D]) {
    for i in 0..D {
        for j in 0..D {
            print!("{:4} ", arr[i][j]);
        }
        println!();
    }
    println!();
}

The result of running the above code:

Standard Output
   0    0    0    0    0    0    0    0 
   0    0    0    0    0    0    0    0 
   0    0    0    0    0    0    0    0 
   0    0    0    0    0    0    0    0 
   0    0    0    0    0    0    0    0 
   0    0    0    0    0    0    0    0 
   0    0    0    0    0    0    0    0 
   0    0    0    0    0    0    0    0 

   9    9    9    9    9    9 
   9    9    9    9    9    9 
   9    9    9    9    9    9 
   9    9    9    9    9    9 
   9    9    9    9    9    9 
   9    9    9    9    9    9 

  90   90   90   90   90   90 
  90   90   90   90   90   90 
  90   90   90   90   90   90 
  90   90   90   90   90   90 
  90   90   90   90   90   90 
  90   90   90   90   90   90 

  99   99    9    9   90   90   99   99 
  99   99    9    9   90   90   99   99 
   9    9    9    9    0    0    9    9 
   9    9    9    9    0    0    9    9 
  90   90    0    0   90   90   90   90 
  90   90    0    0   90   90   90   90 
  99   99    9    9   90   90   99   99 
  99   99    9    9   90   90   99   99 

The single-threaded version takes time: 4.561µs

Now my question is: how can I build a parallel version of the sqare_matrix_block_fill function using Rayon?

I know that the positions written by more than one slave matrix mean a naive parallelization would have conflicting writes, but hey, this is exactly why I came to Stack Overflow, sincerely begging for help!

  • The reason I have this requirement: when a large number of append operations must be performed on sub-blocks at different positions of a large matrix, the single-threaded program takes a lot of time. I hope to use Rayon to implement a parallel matrix-assembly function for such large-scale sub-block operations.
  • My first thought would be to use Rayon with an array of AtomicU8 elements. You'd have to test whether the overhead of threads is worth it. I would also consider breaking the main matrix into chunks, each protected by a Mutex, and locking a whole chunk at a time. (See the sketch after these comments.)
  • @BallpointBen Great idea! Could you please provide some code that solves the example problem above, so I can understand your idea more concretely?
  • You can use split_at_mut to split the data into non-overlapping parts that you can distribute to scoped threads.
  • See e.g. this answer for an example of combining split_at_mut() with Rayon parallel processing using safe code.
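To make the comment suggestions concrete, here is a minimal sketch of the AtomicU8 idea, written against the example above. The flat row-major Vec<AtomicU8> layout, the name parallel_block_fill, and its signature are my assumptions, not code from the post; only the block-placement logic is taken from sqare_matrix_block_fill. Because every element is atomic, the overlapping writes accumulate safely without locks:

use rayon::prelude::*;
use std::sync::atomic::{AtomicU8, Ordering};

// Sketch: the master matrix is stored as a flat, row-major slice of
// AtomicU8 so overlapping sub-blocks can be accumulated concurrently.
fn parallel_block_fill<const D2: usize>(
    master: &[AtomicU8],     // rows * cols atomics, row-major
    cols: usize,             // number of columns in the master matrix
    subs: &[[[u8; D2]; D2]], // slave matrices
    cplds: &[Vec<usize>],    // placement parameters, as in the question
) {
    cplds
        .par_iter()
        .zip(subs.par_iter())
        .for_each(|(cpld, sub)| {
            for idx in 0..cpld.len() {
                for idy in 0..cpld.len() {
                    for row in 0..2 {
                        for col in 0..2 {
                            let m = (cpld[idx] * 2 + row) * cols + cpld[idy] * 2 + col;
                            // fetch_add is an atomic `+=`, so two slave
                            // matrices hitting the same cell cannot race.
                            master[m].fetch_add(
                                sub[idx * 2 + row][idy * 2 + col],
                                Ordering::Relaxed,
                            );
                        }
                    }
                }
            }
        });
}

Hypothetical usage with the example data from the question:

let master: Vec<AtomicU8> = (0..R * C).map(|_| AtomicU8::new(0)).collect();
parallel_block_fill(&master, C, &subs, &cplds);
// Read the plain values back out, e.g. for printing:
let flat: Vec<u8> = master.iter().map(|a| a.load(Ordering::Relaxed)).collect();

Note that for an 8 × 8 master with two slaves, the thread overhead will almost certainly exceed the saved work; the approach only pays off at the large scales described in the first comment. The split_at_mut route from the later comments instead requires partitioning the master into disjoint regions up front, which is awkward here precisely because the sub-blocks overlap.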
