Hi, Till here, I worked on the DuckDB-NSQL model on the MotherDuck side.
1. Definitely training data (for me). We explored about 10 different directions before settling on the current approach, and it's easy to underestimate the effect of training data on the quality of the model. The starting point was the benchmark dataset, though, which we assembled manually (to avoid data pollution, and also because there was simply no text2sql benchmark that covers anything other than plain old SQL SELECT statements with a handful of aggregate functions). Training is also not a one-off thing: with large datasets it is hard to evaluate the quality of the dataset without actually training a few epochs on it and running the benchmark.
2. I left a comment about my view on where such models are effective in a previous comment: https://news.ycombinator.com/item?id=39133155
3. No way - I see a common stack emerging (take a look at companies like https://vanna.ai/, https://www.dataherald.com/, or https://www.waii.ai) that is mainly centered around foundation models like GPT-4 with strong in-context learning capabilities (that's kind of a must to make these approaches work, and it comes with long inference times and higher costs). These solutions add things like embedding-based schema filtering, options for users to enrich metadata about tables and columns, and inclusion of previous related queries in the context around the model. I'd say it's a bit of a different problem from what we aimed at solving.
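Just to make "embedding-based schema filtering" concrete: the idea is to embed the user question plus every table/column description, and only put the most similar ones into the prompt. Purely illustrative sketch in DuckDB SQL (the schema_docs table, the embedding size, and the bound question embedding are all made up for the example; it also assumes a recent DuckDB with the fixed-size ARRAY type and array_cosine_similarity):

    SELECT table_name, column_name, description
    FROM schema_docs                       -- hypothetical table with precomputed embeddings of schema docs
    ORDER BY array_cosine_similarity(
        embedding,                         -- FLOAT[384] column
        ?::FLOAT[384]                      -- embedding of the user question, bound by the application
    ) DESC
    LIMIT 20;                              -- only these end up in the model's context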
I didn't see this in the blog post, but did you train this from scratch or finetune an existing base model?
If from scratch, quite impressive that the model is capable of understanding natural language prompts (English presumably) from such a small, targeted training set.
I've been working on the DuckDB-NSQL model on the MotherDuck side. I fully agree that general text-2-sql, in the sense of "give me a question, and I'll produce an arbitrarily complex query", is a very tough problem, and I actually believe that it's not the right problem to solve. Not necessarily because models are not (going to be) capable enough, but rather because it's way too hard for humans to express in a single prompt what they actually want. Furthermore, it's simply not the right UX for most SQL users. A much better approach IMO is to keep SQL analysts in the driver's seat, and provide nuanced support wherever/whenever they need it most. The FixIt feature we recently launched goes in the same direction: https://motherduck.com/blog/introducing-fixit-ai-sql-error-f...
In that sense, I emphasized in our blog post that users should think of it as a documentation oracle that always gives you the exact DuckDB SQL snippet you are looking for, which is a tremendous time-saver if you have an abstract idea of the query you want to write but you're just not sure about the syntax, especially with DuckDB having so many functions and SQL extensions.
Here are a few examples (a rough sketch of the DuckDB SQL they map to follows the list):
- create tmp table from test.csv
- load aws credentials from 'test' profile
- get max of all columns in rideshare table
- show query plan with runtimes for 'SELECT * FROM rideshare'
- cast hvfhs_license_num column to int
- get all columns ending with _amount from taxi table
- show summary statistics of rideshare table
- get a 10% reservoir sample of rideshare table
- get length of drivers array in taxi table
- get violation_type field from other_violations json column in taxi table
- get passenger count, trip distance and fare amount from taxi table and order by all of them
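To give an idea of what comes back, the DuckDB SQL for a few of these looks roughly like this (illustrative snippets, not the model's verbatim output):

    -- show summary statistics of rideshare table
    SUMMARIZE rideshare;

    -- get all columns ending with _amount from taxi table
    SELECT COLUMNS('.*_amount$') FROM taxi;

    -- get max of all columns in rideshare table
    SELECT MAX(COLUMNS(*)) FROM rideshare;

    -- show query plan with runtimes for 'SELECT * FROM rideshare'
    EXPLAIN ANALYZE SELECT * FROM rideshare;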
Thanks for replying, that's a perspective I didn't consider. The capability to "talk to your data" just seems so enticing as a solution that I was tunnel-visioned into that UX. If I'm understanding correctly, what you're suggesting is more of a SQL assistant to help people write the correct SQL queries instead of writing the entire SQL query from scratch to answer a generic natural-language question?
I have found LLMs to be extremely helpful in mapping between schemas, as well as in helping me formulate queries where, because of decay, the data in tables, column names, etc. don't map to what you'd think they would.
You need to provide as much context as you can to the LLM: full schema definitions, histogram-style summaries, and samples from the tables themselves.
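With DuckDB, that raw material is cheap to pull before you even build the prompt; something along these lines (taxi is just a placeholder table name):

    -- full schema definition
    DESCRIBE taxi;

    -- per-column summary statistics (min/max, approx. unique counts, null percentage, ...)
    SUMMARIZE taxi;

    -- a handful of sample rows
    SELECT * FROM taxi USING SAMPLE 5;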
I hear what you're saying, and actually agree that, unfortunately, that is the current state of LLMs today. But I think OP still has a point, and I'd argue as well that that is the aim of GAI: to be able to take a simple question, understand it, and produce the desired results.
As it stands, and as you mentioned, the current generation of LLMs seems a long way from there. Thus the need for prompt engineers. It'll be interesting to see how long it takes to crack this problem.
MotherDuck | NYC, Amsterdam, San Francisco, or Seattle (Onsite preferred) | Software Engineers in AI, Database, and Ecosystems | Full-Time | $140K – $275K + Equity | https://motherduck.com/
Don’t let the quirky name fool you! We’re on a mission to radically transform the data analyst user experience. We think we can empower and delight analysts using a novel cloud architecture built on top of DuckDB, an open-source foundation with a vibrant community. Our team is a diverse mix of passionate, collaborative, and empathetic data industry veterans from Google, Amazon, Meta, Snowflake, SingleStore, Databricks, and the like.
We're expanding our focus on AI, and are looking for Python/C++ engineers to help us bring the power of large language models into the hands of database users. Check out the open role here: https://jobs.ashbyhq.com/MotherDuck/b485b0d6-2cbe-4083-b253-...
We also have open positions in our Database and our Ecosystems teams. Come work with us and disrupt the future of data analytics. https://motherduck.com/careers/