Presumably the ChatGPT content that makes it onto the web is at the very least curated by humans, making that text on average slightly higher quality than the raw output of ChatGPT. If that's the case, then you would expect model performance to continue to improve even if the dataset is polluted.
Doesn't matter. We want high-quality text; it doesn't have to be human-written. Social signals like upvotes or PageRank will remain useful even if most text is AI-generated.
I certainly don't want most discussion-forum content to be generated by bots. I'd rather there was none of it. High-quality generated text is good for fiction and summaries, but not when you want to hear what actual humans have to say.
You just gotta get the AIs to do the upvoting, then cut the humans out of the loop altogether and only have AIs read the AI-generated text, and then everything will be fine. Just an endless death spiral of AI generation, AI filtering, and AI consumption, forever and ever.
Presumably at some point computers will become (already are for all I know?) the largest consumers of content on the internet as well as its producers.
Bold assumption that AI-generated text won't keep getting exponentially cheaper. It already costs orders of magnitude less than human-written text of the same quality.
I think you're very confused about the cost of operating a human... Or are you assuming that because the human was going to be doing it anyway, the cost is free?
I don't think this problem matters as much as people say it does, except maybe from a research perspective. The chatbot has essentially become part of human culture: it speaks human languages and could actually subtly influence the way human language works. It may develop its own idioms and communication style, and humans may adopt some of this. So yes: now that LLMs are released, everything is polluted in some way, similar to radioactive isotopes. But language is descriptive, not prescriptive: it always works as long as there is shared understanding. People will cherry-pick the ChatGPT answers they were able to understand when publishing to the internet, and ignore/ridicule the output that didn't make sense to them.
Note that GPT-3.5 and above are already intentionally polluted with their own output by the RLHF process.
I think my comment was misunderstood. I didn't mean the output text would contain some identifying information. Rather, OpenAI could generate a fingerprint from the text, similar to Apple's NeuralHash for images, and store that so they can filter out generated text later.
Well, they have all of the outputs of ChatGPT stored on their own servers. I suppose it wouldn't be out of the question to filter any future datasets they scrape against the outputs they have.
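A rough sketch of what that could look like at scrape time (purely illustrative: the normalization, shingle size, and threshold are all made-up choices, not anything OpenAI has described):

    import hashlib
    import re

    def shingle_fingerprints(text, n=8):
        # Hash every n-word shingle of lightly normalized text.
        words = re.findall(r"[a-z0-9']+", text.lower())
        return {hashlib.sha256(" ".join(words[i:i + n]).encode()).hexdigest()
                for i in range(max(len(words) - n + 1, 1))}

    # At generation time: store fingerprints of everything the model emits.
    seen_fingerprints = set()
    seen_fingerprints |= shingle_fingerprints("some text the model generated earlier")

    # At scrape time: drop documents whose shingles overlap heavily with stored output.
    def looks_generated(candidate, threshold=0.5):
        fp = shingle_fingerprints(candidate)
        return len(fp & seen_fingerprints) / max(len(fp), 1) >= threshold

The obvious weakness is the one raised below: light paraphrasing breaks exact shingle matches, so you'd need fuzzier representations, which get expensive at the scale of a web crawl.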
A watermark is absolutely possible - see for example some of the work Scott Aaronson has mentioned doing for OpenAI.
But: very fragile, especially if people are specifically trying to hide their GPT use, or have access to the watermarking algorithm or online oracle.
And: other methods – like remembering all output ever, or fuzzy summary representations of all output ever – seem to me similarly fragile, & introduce other problems & impracticalities.
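To make the watermark idea concrete: the schemes described publicly (e.g. in Aaronson's talks) bias the sampler with a keyed pseudorandom function of recent tokens, and detection is a statistical test on the output. Below is a toy illustration of that general shape only; the "model" is just a random token picker, and none of the constants come from any real system:

    import hashlib
    import math
    import random

    VOCAB = [f"tok{i}" for i in range(1000)]  # stand-in vocabulary
    KEY = b"secret-watermark-key"             # held by the model provider

    def is_green(prev_token, token):
        # Keyed PRF marking roughly half the vocabulary "green" for each context.
        h = hashlib.sha256(KEY + prev_token.encode() + token.encode()).digest()
        return h[0] < 128

    def generate(length=200, bias=0.8, seed=0):
        # Toy "model": picks tokens at random, preferring green ones with probability `bias`.
        rng = random.Random(seed)
        out = ["tok0"]
        for _ in range(length):
            pool = [t for t in VOCAB if is_green(out[-1], t)] if rng.random() < bias else VOCAB
            out.append(rng.choice(pool))
        return out

    def detect(tokens):
        # z-score of the green fraction over adjacent pairs; large => likely watermarked.
        hits = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
        n = len(tokens) - 1
        return (hits - 0.5 * n) / math.sqrt(0.25 * n)

    print(detect(generate()))                                    # large positive z-score
    print(detect([random.choice(VOCAB) for _ in range(200)]))    # roughly zero

The fragility falls straight out of this: paraphrasing or re-tokenizing the text destroys the pair statistics, and anyone with the key (or an is_green oracle) can strip the signal entirely.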
A guess: OpenAI internally initially shared the common concern that "consuming its own junk outputs" could be a problem. But their own experiments so far, private & public, may have convinced them it's not as much of a problem in practice as it seems in theory. The model outputs have a mix of good and bad text – just like the pre-LLM internet. And, the same filterings/weightings that have worked on pre-LLM content keep working. And, counter to some early intuitions, often one LLM's quality output is in fact very-useful input for other later LLMs.
These kinds of fractals actually have a 4-dimensional structure, since c_x and c_y can also be parameters. I'd love to see a 3D slice of them, but have yet to find a good way to visualize it...
> These kinds of fractals actually have a 4-dimensional structure, since c_x and c_y can also be parameters
They're not parameters in that sense.
The fractal is computed by taking each point on the plane as coordinates (c_x, c_y), and then iteratively applying the recursion relation. Then, with luminosity depending on how quickly that sequence escapes to infinity, we color in that point (c_x, c_y) in our image.
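A minimal sketch of exactly that procedure, using the plain z -> z^2 + c recursion for concreteness (the resolution, escape radius, iteration cap, and ASCII palette are arbitrary choices):

    def escape_time(c, max_iter=100, radius=2.0):
        # Iterate z -> z^2 + c from z = 0 and return the step at which |z| escapes.
        z = 0j
        for n in range(max_iter):
            if abs(z) > radius:
                return n
            z = z * z + c
        return max_iter  # never escaped: the point is (probably) in the set

    # Map each pixel to a point c = (c_x, c_y) and shade it by escape speed.
    width, height = 80, 40
    palette = " .:-=+*#%@"
    for j in range(height):
        row = ""
        for i in range(width):
            c = complex(-2.5 + 3.5 * i / width, -1.25 + 2.5 * j / height)
            row += palette[escape_time(c) * (len(palette) - 1) // 100]
        print(row)

The article's burning ship picture only changes the update step inside the loop (see the last comment below).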
That's not what kenshoen meant. For example, for the Mandelbrot set we have a function f(z) = z^2 + c for complex z, c; each pixel in the image represents (c_x, c_y) in c = c_x + i*c_y, and then you iterate f(0), f(f(0)), ... On the other hand, if you hold c constant and each pixel instead represents (z_x, z_y) in z = z_x + i*z_y, then iterating f(z), f(f(z)), ... gives you a Julia set.
But you can think of f as a function of two complex arguments, f(z,c) = z^2 + c, and iterate it on the whole domain (two complex = four real dimensions), and then have the picture be a slice through any 2D (or even 3D, which is what the parent is talking about) plane you like. In other words, the famous Mandelbrot fractal picture is a slice of f(z,c) through the plane z = 0 (i.e. starting value z_0 = 0), and Julia set pictures are slices through planes c = constant, but there is no reason one cannot make other pictures of f(z,c) (just be careful what you mean by iterating a function f: C^2 -> C).
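A small sketch of that slicing view, in the same style as the snippet above (the particular planes below are just example choices):

    def escapes_after(z0, c, max_iter=100, radius=2.0):
        # Iterate f(z, c) = z^2 + c starting from z0; c stays fixed along the orbit.
        z = z0
        for n in range(max_iter):
            if abs(z) > radius:
                return n
            z = z * z + c
        return max_iter

    def render_slice(point_at, width=80, height=40):
        # Render any 2D slice of the 4D (z0_x, z0_y, c_x, c_y) space.
        # point_at(u, v) maps image coordinates in [-2, 2]^2 to a (z0, c) pair.
        palette = " .:-=+*#%@"
        for j in range(height):
            row = ""
            for i in range(width):
                u, v = -2 + 4 * i / width, -2 + 4 * j / height
                z0, c = point_at(u, v)
                row += palette[escapes_after(z0, c) * (len(palette) - 1) // 100]
            print(row)

    # Mandelbrot picture: the z0 = 0 slice, varying c.
    render_slice(lambda u, v: (0j, complex(u, v)))
    # A Julia picture: c held constant, varying z0.
    render_slice(lambda u, v: (complex(u, v), complex(-0.8, 0.156)))
    # Nothing stops you slicing through a plane that mixes z0 and c.
    render_slice(lambda u, v: (complex(u, 0.3), complex(-0.5, v)))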
The burning ship fractal in the article is the same idea, but the function f(z,c) is a bit weirder.
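For reference, only the update step changes from the sketches above: the real and imaginary parts are folded to be non-negative before squaring.

    def burning_ship_step(z, c):
        # Take |Re z| and |Im z| before squaring, then add c as usual.
        z = complex(abs(z.real), abs(z.imag))
        return z * z + c

Swapping this in for `z = z * z + c` in the escape-time loops above gives the article's picture (aside from how you orient the axes).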