Hacker News | ritchie46's comments

> but quite a bit of pandas code will run as-is with polars

I highly doubt this. Aside from dataframe generation and series assignment, almost everything in the API surface is different.

Strictness is also not something you can transplant easily. It means checking data types at the IR query-planning level before you run the query, and being able to resolve schemas independently of the data. In pandas, schemas do depend on the data within operations, so it isn't uncommon for data types to change when data gets missing values, nor can pandas check that a correct type is passed to an operation without running the compute.
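
Roughly, for illustration (a sketch, not exact output; `collect_schema` and the exact error type may differ per Polars version):

```
import pandas as pd
import polars as pl

# pandas: the dtype depends on the data. Introducing a missing value
# silently upcasts int64 to float64.
s = pd.Series([1, 2, 3])        # dtype: int64
s2 = s.reindex([0, 1, 2, 3])    # index 3 is missing -> NaN -> dtype: float64

# Polars: the schema is resolved from the query plan, not from the data.
lf = pl.LazyFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
print(lf.collect_schema())      # {'a': Int64, 'b': String}, no compute ran

# A dtype mismatch is caught when the plan's schema is resolved,
# before any rows are touched:
bad = lf.select(pl.col("b").dt.year())  # .dt on a String column
# bad.collect()  # raises a schema / invalid-operation error
```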


Depends on how you use pandas. Pre-Polars I would do a lot of single column/series manipulation, which works the same way (though it is heavily discouraged by Polars because you lose out on optimization opportunities). There are plenty of surface-level keyword API changes (merge vs join, sort_values vs sort), but you can operate Polars in a very pandas-esque manner, and the two do not feel all that alien to each other.
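
A rough, untested sketch of the kind of renames I mean (`pl.from_pandas` assumes pyarrow is installed):

```
import pandas as pd
import polars as pl

pdf = pd.DataFrame({"k": [1, 2, 2], "v": [10.0, 20.0, 30.0]})
pldf = pl.from_pandas(pdf)

# Same intent, different method names:
pdf.sort_values("v")        # pandas
pldf.sort("v")              # polars

pdf.merge(pdf, on="k")      # pandas
pldf.join(pldf, on="k")     # polars

# Single column/series manipulation reads almost identically:
pdf["v"] * 2
pldf["v"] * 2
```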

Strictness, I understand you cannot just slap it in, more just an idle thought.


He means that he wants our Rust library to be as easy to use as our Python lib. Which I understand, as our focus has mostly been on Python.

It is where most of our userbase is, and it is very hard for us to have a stable Rust API: we have a lot of internal moving parts which Rust users typically want access to (as they like to be closer to the metal), but which have no stability guarantees from us.

In Python, we are able to abstract and provide a stable API.


I understand the user pool comment, but I don't understand why you wouldn't be able to have a Rust layer that's the same as the Python one API-wise.

I say this as a user of neither - just that I don’t see any inherent validity to that statement.

If you are saying Rust consumers want something lower level than you’re willing to make stable, just give them a higher level one and tell them to be happy with it because it matches your design philosophy.


The issue with Rust is that, as a strict language with no function overloading (except via traits) and no keyword arguments, things get very verbose. For instance, in Python you can treat a string as a list of columns, as in `df.select('date')`, whereas in Rust you need to write `df.select([col("date")])`. Say you want to map a function over three columns; it's going to look something like this:

```
df.with_column(
    map_multiple(
        |columns| {
            // downcast each input column to its concrete type
            let col1 = columns[0].i32()?;
            let col2 = columns[1].str()?;
            let col3 = columns[2].f64()?;
            Ok(Some(
                col1.into_iter()
                    .zip(col2)
                    .zip(col3)
                    .map(|((x1, x2), x3)| {
                        let (x1, x2, x3) = (x1?, x2?, x3?);
                        Some(func(x1, x2, x3))
                    })
                    .collect::<StringChunked>()
                    .into_column(),
            ))
        },
        [col("a"), col("b"), col("c")],
        GetOutput::from_type(DataType::String),
    )
    .alias("new_col"),
);
```

Not much polars can do about that in Rust; that's just what the language requires. But in Python it would look something like this:

```
df.with_columns(
    pl.struct("a", "b", "c")
    .map_elements(
        lambda row: func(row["a"], row["b"], row["c"]),
        return_dtype=pl.String,
    )
    .alias("new_col")
)
```

Obviously the performance is nowhere close to comparable because you're calling a python function for each row, but this should give a sense of how much cleaner Python tends to be.


> Not much polars can do about that in Rust

I'm ignorant about the exact situation in Polars, but it seems like this is the same problem that web frameworks have to handle to enable registering arbitrary functions, and they generally do it with a FromRequest trait plus macros that implement it for functions of up to N arguments. I'm curious whether there were attempts at something like FromDataframe that failed, to enable at least |c: Col<i32>("a"), c2: Col<f64>("b")| {...}

https://github.com/tokio-rs/axum/blob/86868de80e0b3716d9ef39...

https://github.com/tokio-rs/axum/blob/86868de80e0b3716d9ef39...


You'd still have problems.

1. There are no variadic functions so you need to take a tuple: `|(Col<i32>("a"), Col<f64>("b"))|`

2. Turbofish! `|(Col::<i32>("a"), Col::<f64>("b"))|`. This is already getting quite verbose.

3. This needs to be general over all expressions (such as `col("a").str().to_lowercase()`, `col("b") * 2`, etc.), so while you could pass a type such as Col if it were IntoExpr, its conversion into an expression would immediately drop the generic type information because Expr doesn't store that (at least not in a generic parameter; the type of the underlying series is always discovered at runtime). So you can't really skip those `.i32()?` calls.

Polars definitely made the right choice here — if Expr had a generic parameter, then you couldn't store Expr of different output types in arrays because they wouldn't all have the same type. You'd have to use tuples, which would lead to abysmal ergonomics compared to a Vec (can't append or remove without a macro; need a macro to implement functions for tuples up to length N for some gargantuan N). In addition to the ergonomics, Rust’s monomorphization would make compile times absolutely explode if every combination of input Exprs’ dtypes required compiling a separate version of each function, such as `with_columns()`, which currently is only compiled separately for different container types.

The reason web frameworks can do this is because of `$( $ty: FromRequestParts<S> + Send, )*`. All of the tuple elements share the generic parameter `S`, which would not be the case in Polars — or, if it were, would make `map` too limited to be useful.


Thanks for the insight!


Ah, of course. Slightly ambiguous English tricked me there. Thank you Ritchie!


I apologize for that, English isn't my first language. Glad it was explained so well!


With Polars Cloud you don't have to choose those either. You can pick cpu/memory and we will offer autoscaling in a few months.

Cluster configuration is optional; it's there if you want that control. Anyhow, this doesn't have much to do with the query API, be it SQL or DataFrame.


We also target ad-hoc analysis. If your data doesn't fit on your laptop, you can spin up a larger box or a cluster and run interactive queries.


On-premises support is in the works. We expect this in a couple of months. Currently it is managed on AWS only.


Thanks! Will it be paid or open source?


Paid


I am not an expert on Spark RDDs, but AFAIK they are a lower-level data structure that offers resilience and a lower-level map-reduce API.

Polars Cloud maps the Polars API/DSL to distributed compute. This is more akin to Spark's high level DataFrame API.

With regard to implementation, we create stages that run parts of Polars IR (internal representation) on our OSS streaming engine. Those stages run on one or many workers and produce data that is shuffled between stages. The scheduler is responsible for creating the distributed query plan and distributing the work.


Can you tell us a little about the status of Iceberg write support? Partitioning, maintenance, etc.


We have full Iceberg read support, and we have done some preliminary work on Iceberg write support. I think we will ship that once we have decided which catalog we will add; the Iceberg write API is intertwined with that.
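
For the read side, it looks roughly like this (a sketch; the path and column names are placeholders, and options may differ per version):

```
import polars as pl

# Lazily scan an Iceberg table; projection/predicate pushdown applies
# before any data is materialized.
lf = pl.scan_iceberg("s3://my-bucket/warehouse/db/table/metadata/v1.metadata.json")
df = lf.filter(pl.col("year") == 2024).select("id", "value").collect()
```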


You don't have to. Passing cpu and memory works as well.

    pc.ComputeContext(
        cpus=4,
        memory=16,
    )
We are working on a minimal cluster and auto-scaling based on the query.


Nice!

Ritchie, curious: you mentioned in other responses that the SQL context stuff is out of scope for now. But I thought the SQL side was basically syntactic sugar over the dataframes; in other words, they both “compile” down to the same thing. If that's true, shouldn't running arbitrary SQL queries be doable out of the box?


Not right now. Our current SQLContext locally inspects schemas to convert the SQL to Polars LazyFrames (DSL).

However, this should happen during IR resolving. That is, the SQL should translate directly to Polars IR, not to LazyFrames. That way we can inspect/resolve all schemas server-side.

It requires a rewrite of our SQL translation in OSS. This should not be too hard, but it is quite some work. Work we will eventually get to.
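
For context, the current OSS path looks roughly like this (a sketch; exact keyword names may differ per version):

```
import polars as pl

lf = pl.LazyFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# SQLContext converts the SQL into a regular LazyFrame (DSL) locally,
# using the registered frame's schema to resolve the query.
ctx = pl.SQLContext(my_table=lf)
out = ctx.execute("SELECT a, b FROM my_table WHERE a > 1", eager=False)
print(out.collect())
```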


Thanks for the context.


Your billing partner is AWS. Polars' markup is on your AWS bill.


Hi, I am the original author and CEO of Polars. We are not focused on SQL at this time and provide a DataFrame-native API.

Polars Cloud will, for the moment, only support our DataFrame API. SQL might come later on the roadmap, but since that market is very saturated, we don't feel there is much need there.


Polars | Rust Engineer | Amsterdam ONSITE / HYBRID

At Polars we're building a fast distributed query engine for Polars DataFrames. Our mission is to scale DataFrame processing and offer a modern, versatile API to process data quickly and easily.

https://hiring.pola.rs/o/database-engineer

