kenshoen's comments | Hacker News

I wonder how OpenAI is going to avoid this problem once the web is littered with its content?


Presumably the ChatGPT content that makes it onto the web is at the very least curated by humans, making that text on average slightly higher quality than the raw output of ChatGPT. If that's the case, then you would expect model performance to continue to improve even if the dataset is polluted.


That's a bold assumption. I can imagine a world where 99.999% of the web is filled with non-human-curated, AI-generated text.

The rate at which AI can generate text will be so much greater than the rate at which humans can write it.


Doesn't matter. We want high-quality text - it's not necessary for it to be human-written. Social signals like upvotes or PageRank will still remain useful even if most text is AI generated.


I certainly don't want most discussion forums to be generated by bots. I'd rather there was none of it. High-quality generated text is good for fiction and summaries, but not when you want to hear what actual humans have to say.


The point is that AIs will run out of human-generated text, or that they won't be able to distinguish AI-generated from human-generated text to train on.

You're already assuming PageRank and upvote systems won't break down in the future.


You just gotta get the AIs to do the upvoting, then cut the humans out of the loop altogether and only have AIs read the AI-generated text, and then everything will be fine. Just an endless death spiral of AI generation, AI filtering, and AI consumption, forever and ever.

Presumably at some point computers will become (already are for all I know?) the largest consumers of content on the internet as well as its producers.


"bold assumption" says the guy who assumes $2 worth of energy spent on AI generated text for every single written word by humans.

Now go ahead and spend $50 on AI-generated text nobody is ever going to read, just like almost nobody is going to read this comment.


Bold assumption that AI-generated text won't keep getting exponentially cheaper. It already costs orders of magnitude less than human-generated text of the same quality.


Costs a lot more than free text written by thinking humans.


I think you're very confused about the costs involved in operating a human... Or are you assuming that because the human was going to be doing it anyway, the cost is free?


I don't think this problem matters as much as people say it does, except maybe from a research perspective. The chatbot has essentially become part of human culture: it speaks human languages and could subtly influence the way human language works. It may develop its own idioms and communication style, and humans may adopt some of this. So yes: now that LLMs are released, everything is polluted in some way, similar to radioactive isotopes. But language is descriptive, not prescriptive: it always works as long as there is shared understanding. People will cherry-pick the ChatGPT answers they were able to understand when publishing to the internet, and ignore or ridicule the output that didn't make sense to them.

Note that GPT-3.5 and above are already intentionally polluted with their own output by the RLHF process.


My apologies, but as a human language model, it is unlikely that ChatGPT would have much impact on human culture.


Why not?

I'd say LLMs represent an institutionalized reinforcement of bias (much like journalism) combined with a degree of non-human autonomy.


What do we say to people who make the argument that "the web is already littered with spam blogs and SEO stuff"?


They probably fingerprint their generated content.


This has been researched, but no such thing has been implemented for ChatGPT or Bard.


I think my comment was misunderstood. I didn’t mean the output text would contain some identifying information. Rather, OpenAI could generate a fingerprint from the text, similar to Apple’s neural hash for images, and store that so they can filter out generated text later.


How could that possibly work?


Well, they have all of the outputs of ChatGPT stored on their own servers. I suppose it wouldn't be out of the question to filter any future datasets they scrape against the outputs they have.
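
A fuzzy version of that check is cheap, too. A minimal sketch (SimHash over a toy token hash; a real pipeline would use a proper hash function and an index over the fingerprints):

  import Data.Bits (popCount, setBit, testBit, xor)
  import Data.List (foldl')
  import Data.Word (Word64)

  -- toy per-token hash, just for illustration
  tokenHash :: String -> Word64
  tokenHash = foldl' (\h c -> h * 31 + fromIntegral (fromEnum c)) 5381

  -- SimHash: each fingerprint bit is the majority vote of that bit
  -- across all token hashes, so similar texts get similar fingerprints
  simHash :: String -> Word64
  simHash text = foldl' vote 0 [0 .. 63]
    where
      hs = map tokenHash (words text)
      vote acc i
        | sum [if testBit h i then 1 else -1 | h <- hs] > (0 :: Int) = setBit acc i
        | otherwise = acc

  -- near-duplicates differ in only a few fingerprint bits
  similar :: Word64 -> Word64 -> Bool
  similar a b = popCount (a `xor` b) <= 3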


Keep track of all embeddings ever emitted. While scraping, check all data against those embeddings.
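
A minimal sketch of that, assuming a hypothetical embed function that returns unit-length vectors (any sentence-embedding model would do, plus an approximate-nearest-neighbor index at scale):

  -- store holds embeddings of everything ever emitted; a real system
  -- would use an ANN index rather than this linear scan
  looksGenerated :: (String -> [Double]) -> [[Double]] -> String -> Bool
  looksGenerated embed store text = any close store
    where
      v = embed text
      -- dot product = cosine similarity, since vectors are unit-length
      close e = sum (zipWith (*) e v) >= 0.95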

So, not like a watermark, which would be impossible.


A watermark is absolutely possible - see for example some of the work Scott Aaronson has mentioned doing for OpenAI.
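
Roughly, as Aaronson has described it in public talks (my sketch of the idea, not OpenAI's actual code): derive pseudorandom numbers from a secret key plus the recent context, then pick the token maximizing r_i ** (1 / p_i). On average each token is still sampled with the model's probability p_i, but someone holding the key can statistically detect the bias:

  import System.Random (mkStdGen, randoms)

  -- toy seed derivation; a real scheme needs a keyed cryptographic PRF
  seedFrom :: String -> String -> Int
  seedFrom key context = foldl (\h c -> h * 31 + fromEnum c) 0 (key ++ context)

  -- pick the index i maximizing r_i ** (1 / p_i): this still samples
  -- token i with probability p_i, yet is detectable given the key
  watermarkedChoice :: String -> String -> [Double] -> Int
  watermarkedChoice key context probs = snd (maximum scored)
    where
      rs     = randoms (mkStdGen (seedFrom key context)) :: [Double]
      scored = zipWith3 (\r p i -> (r ** (1 / p), i)) rs probs [0 ..]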

But: very fragile, especially if people are specifically trying to hide their GPT use, or have access to the watermarking algorithm or online oracle.

And: other methods – like remembering all output ever, or fuzzy summary representations of all output ever – seem to me similarly fragile, & introduce other problems & impracticalities.

A guess: OpenAI internally initially shared the common concern that "consuming its own junk outputs" could be a problem. But their own experiments so far, private & public, may have convinced them it's not as much of a problem in practice as it seems in theory. The model outputs have a mix of good and bad text – just like the pre-LLM internet. And, the same filterings/weightings that have worked on pre-LLM content keep working. And, counter to some early intuitions, often one LLM's quality output is in fact very-useful input for other later LLMs.


Computerphile has a video that explains it very well: https://youtu.be/XZJc1p6RE78

(You can skip to the section "Verifying")


Sub-base 5 makes the multiplication table actually smaller than decimal. Nice.


Obligatory xkcd: https://xkcd.com/1285/


So is it a syntax for n-ary trees? Nice!

  newtype Jevko = Jevko ([(String, Jevko)], String)


Indeed that's a useful way to look at it and a type definition that'll do the job of storing Jevko parse trees.

See also: https://xtao.org/blog/rose.html
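
For what it's worth, a minimal recursive parser into that type could look like this (a sketch that ignores Jevko's escape character):

  newtype Jevko = Jevko ([(String, Jevko)], String)

  -- consume text until '[' (start a subjevko) or ']' (close one);
  -- returns the parsed tree plus the unconsumed remainder
  parseJevko :: String -> (Jevko, String)
  parseJevko = go [] []
    where
      go subs txt ('[' : rest) =
        let (child, rest') = parseJevko rest
        in  go (subs ++ [(reverse txt, child)]) [] rest'
      go subs txt (']' : rest) = (Jevko (subs, reverse txt), rest)
      go subs txt (c : rest)   = go subs (c : txt) rest
      go subs txt []           = (Jevko (subs, reverse txt), [])

  parse :: String -> Jevko
  parse = fst . parseJevko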


These kinds of fractals actually have a 4-dimensional structure, since c_x and c_y can also be parameters. I'd love to see 3D slices of them, but I have yet to find a good way to visualize them...

https://i.imgur.com/JRfLy6R.mp4


Here you can interactively explore the 4D mother of the Mandelbrot fractal and all its Julia fractals:

https://rawgit.com/MatthiasHu/FractalsWebGL/4d/page.html


> These kinds of fractals actually have a 4-dimensional structure, since c_x and c_y can also be parameters

They're not parameters in that sense.

The fractal is computed by taking each point on the plane as coordinates (c_x, c_y), and then iteratively applying the recursion relation. Then, with luminosity depending on how quickly that sequence escapes to infinity, we color in that point (c_x, c_y) in our image.


That's not what kenshoen meant. For example, for the Mandelbrot set we have a function f(z) = z^2 + c for complex z, c; each pixel in the image represents (c_x, c_y) in c = c_x + i*c_y, and then you iterate f(0), f(f(0)), ... If, on the other hand, you hold c constant and let each pixel represent (z_x, z_y) in z = z_x + i*z_y instead, then iterating f(z), f(f(z)), ... gives you a Julia set.

But you can think of f as a function of two complex arguments, f(z,c) = z^2 + c, and iterate it on the whole domain (two complex = four real dimensions), and then make a picture that is a slice through any 2D (or even 3D, which is what the parent is talking about) plane you like. In other words, the famous Mandelbrot fractal picture is a slice of f(z,c) through the plane z=0, and Julia set pictures are slices through planes c=constant, but there is no reason one cannot make other pictures of f(z,c) (just be careful about what you mean by iterating a function f: C^2 -> C).

The burning ship fractal in the article is the same, but the function f(z,c) is a bit weirder.
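
Concretely, here's a sketch: iterate from an arbitrary z0, and both famous pictures fall out as 2D slices of the same 4D object.

  import Data.Complex (Complex (..), magnitude)

  -- iterate f(z, c) = z^2 + c from z0 and count steps until escape;
  -- Mandelbrot = the slice z0 = 0 (pixel -> c),
  -- Julia      = a slice c = const (pixel -> z0)
  escapeTime :: Int -> Complex Double -> Complex Double -> Int
  escapeTime maxIter z0 c = go z0 0
    where
      go z n
        | n >= maxIter      = maxIter
        | magnitude z > 2.0 = n
        | otherwise         = go (z * z + c) (n + 1)

  mandelbrotPixel :: Complex Double -> Int
  mandelbrotPixel c = escapeTime 100 (0 :+ 0) c

  juliaPixel :: Complex Double -> Int
  juliaPixel z0 = escapeTime 100 z0 ((-0.8) :+ 0.156)  -- one arbitrary fixed c

  -- the burning ship variant replaces z^2 with (|Re z| + i*|Im z|)^2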


Visions of Chaos perhaps?


Yes, that looks like the right way to handle this problem without ignoring the YAML spec: define what to parse up front.

