1

In Rust Polars(might apply to python pandas as well) assigning values in a new column with a complex logic involving values of other columns can be achieved in two ways. The default way is using a nested WhenThen expression. Another way to achieve same thing is with LeftJoin. Naturally I would expect When Then to be much faster than Join, but it is not the case. In this example, When Then is 6 times slower than Join. Is that actually expected? Am I using When Then wrong?

In this example the goal is to assign weights/multipliers column based on three other columns: country, city and bucket.

use std::collections::HashMap;

use polars::prelude::*;
use rand::{distributions::Uniform, Rng}; // 0.6.5

pub fn bench() {
    // PREPARATION
    // This MAP is to be used for Left Join
    let mut weights = df![
        "country"=>vec!["UK"; 5],
        "city"=>vec!["London"; 5],
        "bucket" => ["1","2","3","4","5"],
        "weights" => [0.1, 0.2, 0.3, 0.4, 0.5]
    ].unwrap().lazy();
    weights = weights.with_column(concat_lst([col("weights")]).alias("weihts"));

    // This MAP to be used in When.Then
    let weight_map = bucket_weight_map(&[0.1, 0.2, 0.3, 0.4, 0.5], 1);


    // Generate the DataSet itself
    let mut rng = rand::thread_rng();
    let range = Uniform::new(1, 5);
    let b: Vec<String> = (0..10_000_000).map(|_| rng.sample(&range).to_string()).collect();
    let rc = vec!["UK"; 10_000_000];
    let rf = vec!["London"; 10_000_000];
    let val = vec![1; 10_000_000];
    let frame = df!(
        "country" => rc,
        "city" => rf,
        "bucket" => b,
        "val" => val,
    ).unwrap().lazy();

    // Test with Left Join
    use std::time::Instant;
    let now = Instant::now();
    let r = frame.clone()
        .join(weights, [col("country"), col("city"), col("bucket")], [col("country"), col("city"), col("bucket")], JoinType::Left)
        .collect().unwrap();
    let elapsed = now.elapsed();
    println!("Left Join took: {:.2?}", elapsed);

    // Test with nested When Then
    let now = Instant::now();
    let r1 = frame.clone().with_column(
        when(col("country").eq(lit("UK")))
            .then(
                when(col("city").eq(lit("London")))
                .then(rf_rw_map(col("bucket"),weight_map,NULL.lit()))
                .otherwise(NULL.lit())
            )
            .otherwise(NULL.lit())
        )
        .collect().unwrap();
    let elapsed = now.elapsed();
    println!("Chained When Then: {:.2?}", elapsed);

    // Check results are identical
    dbg!(r.tail(Some(10)));
    dbg!(r1.tail(Some(10)));
}

/// All this does is building a chained When().Then().Otherwise()
fn rf_rw_map(col: Expr, map: HashMap<String, Expr>, other: Expr) -> Expr {
    // buf is a placeholder
    let mut it = map.into_iter();
    let (k, v) = it.next().unwrap(); //The map will have at least one value

    let mut buf = when(lit::<bool>(false)) // buffer WhenThen
        .then(lit::<f64>(0.).list()) // buffer WhenThen, needed to "chain on to"
        .when(col.clone().eq(lit(k)))
        .then(v);

    for (k, v) in it {
        buf = buf
            .when(col.clone().eq(lit(k)))
            .then(v);
    }
    buf.otherwise(other)
}

fn bucket_weight_map(arr: &[f64], ntenors: u8) -> HashMap<String, Expr> {
    let mut bucket_weights: HashMap<String, Expr> = HashMap::default();
    for (i, n) in arr.iter().enumerate() {
        let j = i + 1;
        bucket_weights.insert(
            format!["{j}"],
            Series::from_vec("weight", vec![*n; ntenors as usize])
                .lit()
                .list(),
        );
    }
    bucket_weights
}

The result is surprising to me: Left Join took: 561.26ms vs Chained When Then: 3.22s

Thoughts?

UPDATE

This does not make much difference. Nested WhenThen is still over 3s

// Test with nested When Then
    let now = Instant::now();
    let r1 = frame.clone().with_column(
        when(col("country").eq(lit("UK")).and(col("city").eq(lit("London"))))
            .then(rf_rw_map(col("bucket"),weight_map,NULL.lit()))
            .otherwise(NULL.lit())
        )
        .collect().unwrap();
    let elapsed = now.elapsed();
    println!("Chained When Then: {:.2?}", elapsed);

1 Answer 1

1

The joins are one of the most optimized algorithms in polars. A left join will be executed fully in parallel and has many performance related fast paths. If you want to combine data based on equality, you should almost always choose a join.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.