I prefer a SQL-like format. It’s not as complete, but it covers most of the day-to-day use cases. Take a look at https://github.com/dcmoura/spyql (I am the author). Congrats on fq!
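Something along these lines, as a minimal sketch (check the README for the exact syntax; as I recall it, the `->` accessor reads JSON fields and TO picks the output format):

    # one JSON object per line in, CSV out
    echo '{"name": "world", "value": 10}' \
      | spyql "SELECT json->name, json->value * 2 AS doubled FROM json TO csv"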
Thanks! I actually experimented a bit with an SQL-like interface for a while, dumping things into SQLite and using it as the query engine. The problem usually was that file formats tend to be a mix of array and tree structures more than relational ones, and standard SQL, at least, is not great for that. Maybe some graph-SQL dialect could work?
Author of the benchmark and of SPyQL here.
ClickHouse is fantastic. Amazing performance. SPyQL is built on top of Python but can still be faster than jq and several other tools, as shown in the benchmark. SPyQL can handle large datasets, but clickhouse-local should always show better performance.
The SPyQL CLI is oriented more toward working in harmony with the shell (piping), being very simple to use, and leveraging the Python ecosystem (you can import Python libs and use them in your queries).
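For instance, a rough sketch of pulling a Python module into a query (the IMPORT clause and field accessor are as I recall them from the docs; events.json and its ts field are hypothetical):

    # events.json: hypothetical JSON-lines file with a numeric "ts" field
    # use Python's datetime inside the query via the IMPORT clause
    cat events.json \
      | spyql "IMPORT datetime AS dt
               SELECT dt.date.fromtimestamp(json->ts) AS day FROM json TO csv"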
I am the author of SPyQL [1]. By combining JC with SPyQL, you can easily query the JSON output and run Python commands on top of it from the command line :-) You can do aggregations and so forth in a much simpler and more intuitive way than with jq.
I just wrote a blog post [2] that illustrates it. It is more focused on CSV, but the commands would be the same if you were working with JSON.
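To give the idea, here is a hedged sketch of such a pipeline (jc emits a JSON array, which I am assuming spyql can iterate with EXPLODE; the sum_agg aggregate name and GROUP BY support are from memory, so check the docs):

    # processes -> JSON via jc -> per-user memory totals via spyql
    # (user/rss field names per jc's ps parser)
    ps aux | jc --ps \
      | spyql "SELECT json->user, sum_agg(json->rss) AS total_rss
               FROM json EXPLODE json
               GROUP BY 1
               TO pretty"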
Thank you all for your feedback. The benchmark was updated and the fastest tool is NOT written in Python. Here are the highlights:
* Added ClickHouse (written in C++) to the benchmark: I was unaware that the clickhouse-local tool could handle these tasks (a sketch of an invocation appears after this list). ClickHouse is now the fastest (together with OctoSQL);
* OctoSQL (written in Go) was updated in response to the benchmark: updates included switching to fastjson, short-circuiting LIMIT, and eagerly printing when outputting JSON and CSV. Now, OctoSQL is one of the fastest and its memory usage is stable;
* SPyQL (written in Python) is now third: SPyQL leverages orjson (written in Rust) to parse JSON, while the query engine is written in Python. When processing 1GB of input data, SPyQL takes 4x-5x more time than the best, while still achieving up to 2x higher performance than jq (written in C);
* I removed Pandas from the benchmark and focused on command-line tools. I am planning a separate benchmark on Python libs where Pandas, Polars and Modin (and eventually others) will be included.
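For reference, a sketch of a clickhouse-local invocation over JSON-lines input (not the exact benchmark command; data.json and the user/rss columns are made up, and newer ClickHouse versions can infer the schema instead of requiring --structure):

    # data.json: hypothetical JSON-lines file
    # aggregate one-JSON-object-per-line input without running a server
    cat data.json | clickhouse-local \
      --input-format JSONEachRow \
      --structure "user String, rss Int64" \
      --query "SELECT user, sum(rss) FROM table GROUP BY user"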
The initial idea was to focus on command-line tools... I added Pandas for comparison, as it is one of the most used libs for working with datasets. I will either remove Pandas from the equation or add Polars. By the way, I ran some benchmarks and Polars seems a bit faster than SPyQL for the aggregation challenge, but it does not scale (it loads everything into memory).