3

I have a desktop application where the majority of calculations (>90%) happen on the Rust side of it. But I want the user to be able to write scripts in Python that will operate on the df.

Can this be done without serializing the dataframe between runtimes to a file?

A simple invocation could be this:

Rust: agg -> Rust: calculate new column -> Python: groupby -> Rust: count results

The serializing approach works for small datasets. It doesn't really scale to larger ones. The optimal solution would somehow be to be able to tell the python side: Here is a lazy dataframe in-memory. You can manipulate it.

I've read the documentation and the only solution I could see is to use Apache IPC.

1 Answer 1

3

The Lazyframes are (mostly) serializable to json. The serialize operations are fallible, so depending on the type of query, it may not serialize. The languages all have slightly different apis for this operation though.

This serializes the LogicalPlan itself & doesnt perform any computations on the dataframes. This makes the operation very inexpensive.

Python

lf = get_lf_somehow()
json_buf = lf.write_json(to_string=True)
lf = pl.LazyFrame.from_json(json)

Node.js

let lf = get_lf_somehow()
const json_buf = lf.serialize('json')
lf = pl.LazyFrame.deserialize(buf , 'json')

Rust

let lf = get_lf_somehow();
let json_buf = serde_json::to_vec(&lf)?;
let lf = serde_json::from_slice(&json_buf)?;
Sign up to request clarification or add additional context in comments.

7 Comments

When deserializing, how do you "restore the connection" between the lazyframe and the actual data?
not sure I understand what you mean by "restore the connection". Are you referring to when get_lf_somehow would be constructed from a literal dataframe via df.lazy()?
yes that's what I mean
when serializing a lazyframe that has a literal dataframe as the source, it will serialize the entire dataframe, which may not be practical for large datasets. If you want to pass the literal dataframe across boundaries, IPC will be the most performant.
I think that would be the way to go. you would need to create some abstraction that created the lazyframe separately from the dataframe. it may also be possible to use the arrow metadata spec to directly pass the entire body as IPC. I'm not quite sure how flexible it is though. Ideally one should be able to serialize directly to IPC just like JSON or Bincode.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.