I have a DataFrame with a string column that looks like this:

    let df = df!(
        "names" => &["None", "0", "1", "15", "1|2", "5 ??", "293 ", "XX"]);

I want to filter this down to only the rows that contain a single integer (multiple digits are fine) that is greater than 0. I also want to strip leading and trailing spaces, unless parse() does that for me. None of the numbers will be very high (nothing over 3000). In the above case, the rows that get through the filter would be those at indexes 2, 3, and 6.

I've found this other answer, but it doesn't quite have what I need. The filtering page of the Polars user guide only shows very simple cases. Perhaps I haven't found the right page of the docs?

This successfully removes all the "0"s, but I don't want to have to exclude things one by one:

    let filtered = df
        .lazy()
        .filter(col("names").neq(lit("0")))
        .collect()?;
    println!("filtered: {}", filtered);

Thanks in advance!

Update: It looks like I want to create a new column by casting the string to an integer column, presumably also using CastOptions::NonStrict. But I can't figure out how to do that… When I try to use the CastOptions enum, the compiler complains that it's private?? Also, I'm getting the error "no method named cast_with_options found for enum Expr in the current scope" on the call following col("names"), but that's exactly what the docs do with the plain cast(), so I'm really confused now. The below DOES NOT WORK yet.

use polars::chunked_array::cast::CastOptions; // Error

// ...

    let out = df
        .clone()
        .lazy()
        .select([col("names")
            .cast_with_options(DataType::UInt16, CastOptions::NonStrict) // Error
            .alias("int_names")])
        .collect()?;
    println!("post-cast: {}", out);

Update 2: Here are the full errors:

 1  error[E0603]: struct `CastOptions` is private
  --> rust/orphaned_splits.rs:5:34
   |
 5 | use polars::chunked_array::cast::CastOptions;
   |                                  ^^^^^^^^^^^ private struct
   |
 note: the struct `CastOptions` is defined here
  --> /Users/nick/.cargo/registry/src/index.crates.io-6f17d22bba15001f/polars-core-0.38.3/src/chunked_array/cast.rs:3:5
   |
 3 | use arrow::compute::cast::CastOptions;
   |     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 help: import `CastOptions` directly
   |
 5 | use polars_arrow::compute::cast::CastOptions;
   |     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

 2  error[E0599]: no method named `cast_with_options` found for enum `Expr` in the current scope
    --> rust/orphaned_splits.rs:151:14
     |
 150 |           .select([col("names")
     |  __________________-
 151 | |             .cast_with_options(DataType::UInt16, CastOptions::NonStrict)
     | |_____________-^^^^^^^^^^^^^^^^^
     |
 help: there is a method `over_with_options` with a similar name
     |
 151 |             .over_with_options(DataType::UInt16, CastOptions::NonStrict)
     |              ~~~~~~~~~~~~~~~~~

Update 3: Here's the dependency:

polars = { version = "0.38.3", features = ["parquet", "lazy", "dtype-categorical", "dtype-i16"] }
  • Please post the full error from cargo check. Commented Jun 23, 2024 at 1:39
  • What is the version of polars you're using? Commented Jun 23, 2024 at 1:42
  • You're using an outdated Polars (current is 0.41.0). Update it. Commented Jun 23, 2024 at 1:44
  • Thanks for the suggestion. I have installed the latest build, but it fails to compile inside polars. 🙄 So I'll have to wait for that to be fixed before I can try this new version on the above code. Commented Jun 23, 2024 at 2:12
  • I think they only test with no features and with all features, so failures like that are kind of expected. Try with more features. Commented Jun 23, 2024 at 2:15

1 Answer

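There is a string_to_integer feature flag that enables str().to_integer(), which parses integers from strings. Values that fail to parse become null when the strict argument is false, so you can filter them out afterwards: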

use polars::{lazy::dsl::is_not_null, prelude::*};
use std::error::Error;

fn main() -> Result<(), Box<dyn Error>> {
    let df = df!(
        "names" => &["None", "0", "1", "15", "1|2", "5 ??", "293 ", "XX"])?
        .lazy()
        .select([col("names")
            .str()
            .to_integer(lit("10"), false)
            .alias("parsed_int")])
        .filter(is_not_null(col("parsed_int")).and(col("parsed_int").gt(lit(0))))
        .collect()?;

    println!("{df}");
    //  shape: (2, 1)
    // ┌────────────┐
    // │ parsed_int │
    // │ ---        │
    // │ i64        │
    // ╞════════════╡
    // │ 1          │
    // │ 15         │
    // └────────────┘
    Ok(())
}

Here's my Cargo.toml (I had to add some additional features to avoid compilation errors):

[dependencies]
polars = { version = "0.41.0", features = [
    "lazy",
    "parquet",
    "regex",
    "string_to_integer",
    "strings",
], git = "https://github.com/pola-rs/polars" }
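
For comparison, the row predicate described in the question can be sketched in plain Rust, without Polars. This only illustrates the intended semantics (trim, parse, keep values greater than 0) and is not the Polars API; the helper name parse_positive is hypothetical. Note that the standard library's str::parse does not strip whitespace for integer types, so the sketch calls trim() explicitly:

```rust
/// Hypothetical helper (not part of Polars): returns the parsed value only if
/// the trimmed string is a single integer greater than 0.
fn parse_positive(s: &str) -> Option<u16> {
    // str::parse does not strip whitespace, so trim explicitly first.
    // u16 is enough here because the question says nothing exceeds 3000.
    s.trim().parse::<u16>().ok().filter(|&n| n > 0)
}

fn main() {
    let names = ["None", "0", "1", "15", "1|2", "5 ??", "293 ", "XX"];

    // Collect the indexes of the rows that pass the predicate.
    let kept: Vec<usize> = names
        .iter()
        .enumerate()
        .filter(|(_, s)| parse_positive(s).is_some())
        .map(|(i, _)| i)
        .collect();

    println!("{:?}", kept); // [2, 3, 6]
}
```

Running this over the sample data is a quick way to confirm which rows a Polars pipeline is expected to keep before comparing against its output.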

1 Comment

Thank you very much, this works perfectly. I swapped in with_columns() because I want the rest of the columns too, and I discovered that you can use lit(10) for the base, without the quotes. Also, "293 " is getting parsed as I hoped it would.
