I'm also curious. It seems like 6 months ago everyone was afraid of "model collapse", but now synthetic training data generation and teacher models are all the rage. Have we solved the problem of model collapse?
Model collapse was basically a coping idea made up by artists who were hoping AI image generators would all magically destroy themselves at some point; I don't think it was ever considered likely to happen.
It does seem to be true that clean data works better than low quality data.
We've by now reached a "probably not inevitable" - https://arxiv.org/abs/2404.01413 argues there's a finite upper bound on the error - but I'd also point out that the paper assumes the training data is strictly accumulative: its cardinality grows with each training generation, with synthetic data added to the pool rather than replacing what came before (a toy sketch of the difference is below).
To a first order, that means you'd better have a pre-2022 dataset to get started, and have archived it well.
But it's probably fair to say the current SOTA is still more or less "it's neither impossible nor inevitable".
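A minimal sketch of why the accumulate-vs-replace assumption matters, using a toy Gaussian-fitting loop rather than an actual LLM (all names and parameters here are made up for illustration): fit a distribution to the pool, sample the next "generation" from the fit, then either replace the pool with the samples or add them to it.

    import numpy as np

    rng = np.random.default_rng(0)

    def simulate(generations=200, n=100, accumulate=False):
        # Start from "real" data - think the archived pre-2022 set.
        pool = rng.normal(0.0, 1.0, n)
        for _ in range(generations):
            mu, sigma = pool.mean(), pool.std()
            synthetic = rng.normal(mu, sigma, n)  # this generation's "model output"
            # Accumulate (the arXiv:2404.01413 setting) vs. replace (classic collapse).
            pool = np.concatenate([pool, synthetic]) if accumulate else synthetic
        return pool.std() ** 2

    print("replace:    variance after 200 generations ~", simulate())
    print("accumulate: variance after 200 generations ~", simulate(accumulate=True))

With replacement, estimation noise compounds and the fitted variance decays toward zero over the generations; with accumulation, the ever-growing pool anchors each fit and the error stays bounded, which is roughly the paper's point.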
Oh, no, they definitely believe both are going to happen and ChatGPT is just going to stop working because it'll see itself on the internet. It goes with the common belief that LLMs learn from what you type into them.
> To a first order, that means you better have a pre-2022 dataset to get started, and have archived it well.
I think that will always be available, or at least, a dataset with the distribution you want will be available.
Don't know why you have such disdain for artists, but either way, the original point was that model collapse wasn't "a coping idea made up by artists" but a valid, research-backed scientific model.
> I think that [clean pre-2022 dataset] will always be available