I prefer a SQL-like format. It’s not as complete, but it covers most of the day-to-day use cases. Take a look at https://github.com/dcmoura/spyql (I am the author). Congrats on fq!
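Something along these lines, as a minimal sketch (check the README for the exact syntax; as I recall it, the `->` accessor reads JSON fields and TO picks the output format):

    # one JSON object per line in, CSV out
    echo '{"name": "world", "value": 10}' \
      | spyql "SELECT json->name, json->value * 2 AS doubled FROM json TO csv"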
Thanks! I actually experimented a bit with an SQL-like interface for a while, dumping things into SQLite and using it as the query engine. The problem usually was that file formats tend to be a mix of array and tree structures more than relational ones, and standard SQL, at least, is not great for that. Maybe some graph-SQL dialect could work?
Author of the benchmark and of SPyQL here.
ClickHouse is fantastic. Amazing performance. SPyQL is built on top of Python but can still be faster than jq and several other tools, as shown in the benchmark. SPyQL can handle large datasets, but clickhouse-local should always show better performance.
The SPyQL CLI is oriented more toward working in harmony with the shell (piping), being very simple to use, and leveraging the Python ecosystem (you can import Python libs and use them in your queries).
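For instance, a rough sketch of pulling a Python module into a query (the IMPORT clause and field accessor are as I recall them from the docs; events.json and its ts field are hypothetical):

    # events.json: hypothetical JSON-lines file with a numeric "ts" field
    # use Python's datetime inside the query via the IMPORT clause
    cat events.json \
      | spyql "IMPORT datetime AS dt
               SELECT dt.date.fromtimestamp(json->ts) AS day FROM json TO csv"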
I am the author of SPyQL [1]. By combining JC with SPyQL, you can easily query the JSON output and run Python commands on top of it from the command line :-) You can do aggregations and so forth in a much simpler and more intuitive way than with jq.
I just wrote a blog post [2] that illustrates it. It is more focused on CSV, but the commands would be the same if you were working with JSON.
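To give the idea, here is a hedged sketch of such a pipeline (jc emits a JSON array, which I am assuming spyql can iterate with EXPLODE; the sum_agg aggregate name and GROUP BY support are from memory, so check the docs):

    # processes -> JSON via jc -> per-user memory totals via spyql
    # (user/rss field names per jc's ps parser)
    ps aux | jc --ps \
      | spyql "SELECT json->user, sum_agg(json->rss) AS total_rss
               FROM json EXPLODE json
               GROUP BY 1
               TO pretty"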
Thank you all for your feedback. The benchmark was updated and the fastest tool is NOT written in Python. Here are the highlights:
* Added ClickHouse (written in C++) to the benchmark: I was unaware that the clickhouse-local tool could handle these tasks (a sketch of an invocation appears after this list). ClickHouse is now the fastest (together with OctoSQL);
* OctoSQL (written in Go) was updated in response to the benchmark: updates included switching to fastjson, short-circuiting LIMIT, and eagerly printing when outputting JSON and CSV. Now, OctoSQL is one of the fastest and its memory usage is stable;
* SPyQL (written in Python) is now third: SPyQL leverages orjson (written in Rust) to parse JSON, while the query engine is written in Python. When processing 1GB of input data, SPyQL takes 4x-5x more time than the best, while still achieving up to 2x higher performance than jq (written in C);
* I removed Pandas from the benchmark and focused on command-line tools. I am planning a separate benchmark on Python libs where Pandas, Polars and Modin (and eventually others) will be included.
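For reference, a sketch of a clickhouse-local invocation over JSON-lines input (not the exact benchmark command; data.json and the user/rss columns are made up, and newer ClickHouse versions can infer the schema instead of requiring --structure):

    # data.json: hypothetical JSON-lines file
    # aggregate one-JSON-object-per-line input without running a server
    cat data.json | clickhouse-local \
      --input-format JSONEachRow \
      --structure "user String, rss Int64" \
      --query "SELECT user, sum(rss) FROM table GROUP BY user"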
The initial idea was to focus on command-line tools... I added Pandas for comparison, as it is one of the most used libs for working with datasets. I will either remove Pandas from the equation or add Polars. By the way, I ran some benchmarks and Polars seems a bit faster than SPyQL for the aggregation challenge, but it does not scale (it loads everything into memory).