From the discussion in the comments, I assume that you use the following input file:
0 1 2
1.0279113360087446 -1.2832284218816041 -0.9511599763983775
-1.1854242089984073 -0.008517913446124657 -1.3300888479137685
-0.17521484641409052 -0.12088194195850789 -0.08723124550308935
0.061450180456309234 0.6382691829655216 -0.3221706205270814
-0.17264583969234573 0.3906165503608199 -0.7023512952269605
-0.5688680458505405 0.7629597902952466 0.1591487223247267
-0.2866576739505336 0.8416529504197675 -0.21334731046185212
-0.3731653844853498 -0.03664374978977539 1.0659217203299267
0.2522037897994046 -1.2283963325279825 0.582406079711755
1.066724230889717 -0.630727285387302 0.9536582516683726
0.629243230148583 -0.6960460436000655 0.4521748684016147
-1.5540598822950011 0.9873241509921236 0.6415246342947979
-0.0284157295982256 -0.18702110687157966 1.7770271811904519
1.2382847672051143 -0.3760108519487906 -0.16110341746476323
-0.2808874342459878 0.6487504756926984 1.9778474878186199
-0.37522505843289716 1.7209367591622693 -0.19706519630516914
-0.33530410802770294 -0.04999186121022599 -0.675375947654844
-2.0252464624551956 -0.27944625298143044 1.385051832284722
1.2957606931360681 0.7365431841643268 1.3572525489892076
-1.3877762318274933 1.166958182611204 0.685506702653605
Which, combined with this code:
import numpy as np
import pandas as pd
X = pd.read_csv('./myfile.tsv', sep='\t')
X1 = (X > 0).astype(np.float64).T
X2 = X1.to_numpy()
print(X2)
produces the following output:
[[1. 0. 0. 1. 0. 0. 0. 0. 1. 1. 1. 0. 0. 1. 0. 0. 0. 0. 1. 0.]
[0. 0. 0. 1. 1. 1. 1. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0. 1. 1.]
[0. 0. 0. 0. 0. 1. 0. 1. 1. 1. 1. 1. 1. 0. 1. 0. 0. 1. 1. 1.]]
Algorithmically, I assume:
- You read a CSV with tab separators, treating the first row as column headers rather than data (this is what pd.read_csv(.., sep='\t') does by default)
- You convert the data into 1.0 or 0.0, depending on whether the value is larger than zero ((X > 0).astype(np.float64))
- You transpose the data (.T)
And I assume you want the data to be stored somewhat efficiently, so that not every row/column is its own object.
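To make "not every row/column is its own object" concrete, here is a minimal std-only sketch of a matrix that keeps all values in a single contiguous buffer. The name `FlatMatrix` and its methods are hypothetical and exist only for this illustration; a crate like ndarray provides the same idea with far more functionality.

```rust
/// A dense row-major matrix backed by one contiguous `Vec<f64>`.
/// `FlatMatrix` is a hypothetical name used only for this sketch.
#[derive(Debug, Clone, PartialEq)]
struct FlatMatrix {
    data: Vec<f64>,
    rows: usize,
    cols: usize,
}

impl FlatMatrix {
    /// Builds the matrix from row vectors; returns `None` if the rows are ragged.
    fn from_rows(rows: &[Vec<f64>]) -> Option<Self> {
        let cols = rows.first().map_or(0, |r| r.len());
        if rows.iter().any(|r| r.len() != cols) {
            return None;
        }
        // Copy every value into a single allocation, row after row.
        let data = rows.iter().flatten().copied().collect();
        Some(Self { data, rows: rows.len(), cols })
    }

    /// Element at (row, col); the offset into the buffer is `row * cols + col`.
    fn get(&self, row: usize, col: usize) -> f64 {
        assert!(row < self.rows && col < self.cols, "index out of bounds");
        self.data[row * self.cols + col]
    }
}

fn main() {
    let m = FlatMatrix::from_rows(&[vec![1.0, 2.0, 3.0], vec![4.0, 5.0, 6.0]])
        .expect("rows must all have the same length");
    println!("{}", m.get(1, 2));
}
```

Compared to a `Vec<Vec<f64>>`, this layout needs only one heap allocation for the data and keeps neighbouring values next to each other in memory.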
That said, there are many ways to achieve this in Rust. But the fact that you use numpy and pandas in Python suggests that you probably want to base your code mostly on existing high-level data manipulation libraries rather than implement things yourself.
That choice may also have been made for performance reasons, since iterating through data with loops is highly inefficient in Python. Be aware that in Rust, a manually written for loop will typically perform about as well as a library call, because Rust values and primitives carry very little overhead compared to their Python counterparts.
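As a rough illustration of the manual-loop approach, the following std-only sketch performs the `X > 0` conversion and the transpose in one pass over a flat row-major buffer. The function name `binarize_transposed` and the buffer layout are assumptions made for this example, not part of any library.

```rust
/// Thresholds a row-major `rows x cols` buffer at zero and returns the
/// transposed result (`cols x rows`, also row-major).
/// The name and layout are assumptions made for this sketch.
fn binarize_transposed(data: &[f64], rows: usize, cols: usize) -> Vec<f64> {
    assert_eq!(data.len(), rows * cols, "buffer size must match the shape");
    // Start with all zeros and only write the ones.
    let mut out = vec![0.0; rows * cols];
    for r in 0..rows {
        for c in 0..cols {
            if data[r * cols + c] > 0.0 {
                // Element (r, c) of the input lands at (c, r) of the output.
                out[c * rows + r] = 1.0;
            }
        }
    }
    out
}

fn main() {
    // Rounded values loosely based on the first two data rows of the sample file.
    let data = [1.03, -1.28, -0.95, -1.19, -0.01, -1.33];
    let out = binarize_transposed(&data, 2, 3);
    println!("{:?}", out);
}
```

Unlike the equivalent Python loop, this compiles down to plain indexed memory accesses, so there is no per-element interpreter overhead to avoid.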
There are a couple of high-level Rust crates that fill a similar role to your Python libraries. For data deserialization, I recommend serde and the crates built on it, in this case probably csv.
Then for data representation, I recommend ndarray.
You seem to have found those two already yourself, but I just wanted to confirm that those two are good choices.
Here is a possible Rust equivalent of your Python code.
Dependencies (in Cargo.toml):
[dependencies]
csv = "1.3.1"
ndarray = "0.16.1"
ndarray-csv = "0.5.3"
Code:
use ndarray_csv::Array2Reader;
fn main() {
    let arr = csv::ReaderBuilder::new() // Configure your own CSV reader (required because tab separated)
        .delimiter(b'\t') // Specify tab separated
        .from_path("./myfile.tsv") // Open file
        .expect("Unable to open input file!") // Handle file open error
        .deserialize_array2_dynamic::<f64>() // Deserialize as f64 (f64 is the equivalent to Python floats, so I assume you want this)
        .expect("Unable to parse file content!") // Handle file parsing error
        .mapv_into(|val| if val > 0.0 { 1.0 } else { 0.0 }) // Perform the `X > 0` conversion
        .reversed_axes(); // Transpose in-place without actually copying any data

    println!("{:?}", arr); // Debug print
}
Output:
[[1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
[0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0],
[0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 1.0]], shape=[3, 20], strides=[1, 3], layout=Ff (0xa), const ndim=2
.lines() and .map(|line| line.parse::<f64>().unwrap()) or something similar.
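For reference, a hand-rolled parser along the lines of `.lines()` plus `.parse::<f64>()` could look roughly like the sketch below. It uses only the standard library, skips the first line as a header, and returns an error instead of calling `unwrap()`. The function name `parse_tsv` and the `(data, rows, cols)` return shape are assumptions chosen for this example.

```rust
/// Parses tab-separated text into a flat row-major buffer plus its shape,
/// skipping the first line as a header.
/// `parse_tsv` is a hypothetical helper written for this sketch.
fn parse_tsv(text: &str) -> Result<(Vec<f64>, usize, usize), String> {
    let mut data = Vec::new();
    let mut rows = 0;
    let mut cols = 0;
    // `skip(1)` drops the header line; `i + 1` is the 1-based line number.
    for (i, line) in text.lines().enumerate().skip(1) {
        if line.trim().is_empty() {
            continue; // tolerate blank lines
        }
        let before = data.len();
        for field in line.split('\t') {
            let value = field
                .trim()
                .parse::<f64>()
                .map_err(|e| format!("line {}: {:?}: {}", i + 1, field, e))?;
            data.push(value);
        }
        let width = data.len() - before;
        if rows == 0 {
            cols = width; // the first data row fixes the column count
        } else if width != cols {
            return Err(format!(
                "line {}: expected {} fields, found {}",
                i + 1,
                cols,
                width
            ));
        }
        rows += 1;
    }
    Ok((data, rows, cols))
}

fn main() {
    // Small made-up input in the same shape as the sample file.
    let text = "0\t1\t2\n1.5\t-2.0\t0.25\n-0.5\t3.0\t-1.0\n";
    match parse_tsv(text) {
        Ok((data, rows, cols)) => println!("{}x{}: {:?}", rows, cols, data),
        Err(e) => eprintln!("parse error: {}", e),
    }
}
```

To read from a file instead, the text could be loaded first with `std::fs::read_to_string` and then passed to the function.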