What is the Rust equivalent of the following Python code?

import numpy as np
import pandas as pd

X = pd.read_csv('./myfile.tsv', sep='\t')
X1 = (X > 0).astype(np.float64).T
X2 = X1.to_numpy()

I've seen that polars can be used as a Rust equivalent of pandas, but perhaps there is a better way, since I don't intend to do any data frame manipulation: the rest of my Python code operates on a numpy array, and pandas is used only as a convenient way of parsing a TSV file.

Related
How can I create an array from a CSV column encoded as a string using Polars in Rust? (not answered)
How to read local csv file as ndarray using Rust (this seems to be the standard way, see also here, but it looks very verbose; perhaps the code provided does a lot more than I need)

  • Why would you even want to use either of these to read a CSV? There's a csv crate available in Rust, and you can then cast the fields to whatever you need. There's also serde to serialize/deserialize streams. Commented Jan 30 at 9:18
  • @ticktalk I am only beginning to learn Rust, so I am genuinely unaware of how to do this and in which direction to look. So far I see three ways: reading and parsing the file line by line, using polars or something pandas-like, and the csv crate that you brought up in your comment. Commented Jan 30 at 9:21
  • For Polars, the user guide has code examples. docs.pola.rs/user-guide/io/csv (click the Rust tab) Commented Jan 30 at 14:43
  • If it literally is just a file with a floating point number on each line, it'd be much more straightforward to just .lines() and .map(|line| line.parse::<f64>().unwrap()) or something similar. Commented Jan 30 at 17:39
  • I agree that our discussion here is off topic. That's why I will remove the comments. Commented Feb 2 at 8:03

1 Answer

From the discussion in the comments, I assume that you use the following input file:

0   1   2
1.0279113360087446  -1.2832284218816041 -0.9511599763983775
-1.1854242089984073 -0.008517913446124657   -1.3300888479137685
-0.17521484641409052    -0.12088194195850789    -0.08723124550308935
0.061450180456309234    0.6382691829655216  -0.3221706205270814
-0.17264583969234573    0.3906165503608199  -0.7023512952269605
-0.5688680458505405 0.7629597902952466  0.1591487223247267
-0.2866576739505336 0.8416529504197675  -0.21334731046185212
-0.3731653844853498 -0.03664374978977539    1.0659217203299267
0.2522037897994046  -1.2283963325279825 0.582406079711755
1.066724230889717   -0.630727285387302  0.9536582516683726
0.629243230148583   -0.6960460436000655 0.4521748684016147
-1.5540598822950011 0.9873241509921236  0.6415246342947979
-0.0284157295982256 -0.18702110687157966    1.7770271811904519
1.2382847672051143  -0.3760108519487906 -0.16110341746476323
-0.2808874342459878 0.6487504756926984  1.9778474878186199
-0.37522505843289716    1.7209367591622693  -0.19706519630516914
-0.33530410802770294    -0.04999186121022599    -0.675375947654844
-2.0252464624551956 -0.27944625298143044    1.385051832284722
1.2957606931360681  0.7365431841643268  1.3572525489892076
-1.3877762318274933 1.166958182611204   0.685506702653605

Which, combined with this code:

import numpy as np
import pandas as pd

X = pd.read_csv('./myfile.tsv', sep='\t')
X1 = (X > 0).astype(np.float64).T
X2 = X1.to_numpy()

print(X2)

produces the following output:

[[1. 0. 0. 1. 0. 0. 0. 0. 1. 1. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0. 1. 1.]
 [0. 0. 0. 0. 0. 1. 0. 1. 1. 1. 1. 1. 1. 0. 1. 0. 0. 1. 1. 1.]]

Algorithmically, I assume:

  • You read a tab-separated file, treating the first row as a header rather than as data (this is what pd.read_csv(.., sep='\t') does by default)
  • You convert the data into 1.0 or 0.0, depending on whether the value is larger than zero ((X > 0).astype(np.float64))
  • You transpose the data (.T)

And I assume you want the data to be stored somewhat efficiently, so that not every row/column is its own object.


That said, there are many ways to achieve this in Rust. The fact that you use numpy and pandas in Python suggests that you would rather build on existing high-level data manipulation libraries than implement everything yourself.

That choice may well have been made for performance reasons, though: iterating over data with loops is highly inefficient in Python. Be aware that in Rust, a hand-written for loop over the data typically performs comparably to library code, because Rust values and primitives carry very little overhead compared to their Python counterparts.
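To illustrate the point about hand-written loops, here is a minimal sketch that uses only the standard library. The function name binarize_transposed and the inline sample text are made up for this example; it assumes a well-formed tab-separated input whose first line is a header, and it returns the transposed result as one flat buffer together with its shape.

```rust
/// Parses tab-separated text, skips the header line, maps every value to
/// 1.0 (if > 0) or 0.0, and returns the transposed data as a flat row-major
/// buffer together with its (rows, cols) shape.
/// Hypothetical helper, written only for illustration.
fn binarize_transposed(text: &str) -> Result<(Vec<f64>, usize, usize), String> {
    let mut values = Vec::new(); // row-major, n_rows x n_cols
    let mut n_rows = 0;
    let mut n_cols = 0;

    // skip(1) drops the header; i + 2 is the 1-based line number in the file.
    for (i, line) in text.lines().skip(1).enumerate() {
        if line.trim().is_empty() {
            continue;
        }
        let before = values.len();
        for field in line.split('\t') {
            let v: f64 = field
                .trim()
                .parse()
                .map_err(|e| format!("line {}: {}", i + 2, e))?;
            values.push(if v > 0.0 { 1.0 } else { 0.0 });
        }
        let width = values.len() - before;
        if n_rows == 0 {
            n_cols = width;
        } else if width != n_cols {
            return Err(format!("line {}: expected {} fields, got {}", i + 2, n_cols, width));
        }
        n_rows += 1;
    }

    // Transpose by copying: element (r, c) moves to (c, r).
    let mut out = vec![0.0; values.len()];
    for r in 0..n_rows {
        for c in 0..n_cols {
            out[c * n_rows + r] = values[r * n_cols + c];
        }
    }
    Ok((out, n_cols, n_rows))
}

fn main() {
    // Tiny inline sample instead of a file, so the sketch is self-contained.
    let text = "0\t1\t2\n1.5\t-0.2\t0.0\n-3.0\t4.0\t2.5\n";
    let (data, _rows, cols) = binarize_transposed(text).expect("parse failed");
    for row in data.chunks(cols) {
        println!("{:?}", row);
    }
}
```

To read from a file instead, the text could come from std::fs::read_to_string. Unlike the library-based approach, this version copies the data once for the transpose.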

There are a couple of high-level Rust crates that fill similar roles to your Python libraries. For data deserialization, I recommend serde and its implementations, in this case probably csv.

Then for data representation, I recommend ndarray.

You seem to have found those two already yourself, but I just wanted to confirm that those two are good choices.


Here is a possible Rust equivalent of your Python code.

Dependencies (in Cargo.toml):

[dependencies]
csv = "1.3.1"
ndarray = "0.16.1"
ndarray-csv = "0.5.3"

Code:

use ndarray_csv::Array2Reader;

fn main() {
    let arr = csv::ReaderBuilder::new() // Configure your own CSV reader (required because tab separated)
        .delimiter(b'\t') // Specify tab separated
        .from_path("./myfile.tsv") // Open file
        .expect("Unable to open input file!") // Handle file open error
        .deserialize_array2_dynamic::<f64>() // Deserialize as f64 (f64 is the equivalent to Python floats, so I assume you want this)
        .expect("Unable to parse file content!") // Handle file parsing error
        .mapv_into(|val| if val > 0.0 { 1.0 } else { 0.0 }) // Perform the `X > 0` conversion
        .reversed_axes(); // Transpose in-place without actually copying any data

    println!("{:?}", arr); // Debug print
}

Output:

[[1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
 [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0],
 [0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0]], shape=[3, 20], strides=[1, 3], layout=Ff (0xa), const ndim=2
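The strides=[1, 3] in the debug output reflects that the transpose only swapped the axes of a 20×3 row-major buffer instead of moving any values. The following sketch shows the general idea with a hypothetical Matrix type; it is a simplified illustration of strided storage, not ndarray's actual implementation.

```rust
/// Hypothetical minimal 2-D array: one contiguous buffer plus shape and
/// strides, so that no row or column is its own heap object.
struct Matrix {
    data: Vec<f64>,
    shape: [usize; 2],
    strides: [usize; 2],
}

impl Matrix {
    /// Wraps a row-major buffer of `rows * cols` values.
    fn from_rows(rows: usize, cols: usize, data: Vec<f64>) -> Self {
        assert_eq!(data.len(), rows * cols, "buffer length must match shape");
        Matrix { data, shape: [rows, cols], strides: [cols, 1] }
    }

    /// Looks up element (r, c) through the strides.
    fn get(&self, r: usize, c: usize) -> f64 {
        assert!(r < self.shape[0] && c < self.shape[1], "index out of bounds");
        self.data[r * self.strides[0] + c * self.strides[1]]
    }

    /// Transposes by swapping shape and strides; the buffer is untouched.
    fn transposed(mut self) -> Self {
        self.shape.swap(0, 1);
        self.strides.swap(0, 1);
        self
    }
}

fn main() {
    let m = Matrix::from_rows(2, 3, vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0]);
    let t = m.transposed();
    println!("shape={:?} strides={:?}", t.shape, t.strides);
    println!("t[2][0] = {}", t.get(2, 0));
}
```

A 2×3 row-major buffer starts with strides [3, 1]; after the swap, the transposed view has shape [3, 2] and strides [1, 3], which is the same stride pattern as in the output above.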