GPTZero Case Study – Exploring False Positives (gonzoknows.com)
181 points by sungage on Feb 19, 2023 | 113 comments


I saw this[1] interview with Sam Altman touching on the interim impact of AI. I really agree with his point that detecting output from LLMs is basically going to be futile and only really relevant in the near term. Accuracy is obviously going to improve in models, and while detection isn't that difficult now, it will be in the future, especially if the output is modified or an attempt is made to obfuscate its origin.

[1]https://youtu.be/ebjkD1Om4uw


>detection isn't that difficult now

I would have thought this, but every attempt I've seen at detecting ChatGPT-generated text has failed miserably.


It fails on false positives, but you don't often get false negatives, which might be enough at the moment for quite a few use-cases.

Also, false positives are typically "this text is likely to contain parts that were AI generated" rather than "This text is highly likely to be AI generated" (which is what GPT-generated content generally produces).

When I've tried to prompt-engineer GPT to produce text that GPTZero will flag as negative it has been pretty tough!


Of course false positives matter; the naive heuristic “everything is AI generated” has zero false negatives, and mostly false positives. In the OP half of the positives are false. That’s not a useful signal IMO. You couldn’t use that to police homework for example.


I said they might not matter for quite a few use cases, not that they don't matter for all use-cases.

e.g. If you were Sam Altman at OpenAI and your use-case is mostly looking for training data and wanting to tell if it is AI-Generated or not (so you can exclude this from training data), you probably care much more about false negatives than false positives (false positives just reduce your training data set size slightly, while false negatives pollute it).

Of course they matter if you are marking homework (where conversely false negatives aren't actually that important!), but it's pretty trivial to think of use-cases where the opposite is true.


Even for that the false positives could end up mattering. E.g. if the data being incorrectly excluded turns out to be the most important part of the training set.


I've read that you can literally explain to ChatGPT what perplexity and burstiness are, and tell it to generate its answer with low perplexity and high burstiness (which is what detectors check).

I haven't tried it though.
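For context, as I understand it GPTZero's "burstiness" is roughly how much perplexity varies from sentence to sentence; a toy sketch below, where the perplexity scorer is left as a parameter (e.g. a GPT-2-based one):

  import statistics

  def burstiness(sentences, perplexity):
      # Rough notion of "burstiness": how much per-sentence perplexity varies.
      # Human writing tends to mix very predictable and very surprising sentences;
      # `perplexity` is any callable that scores a single sentence.
      scores = [perplexity(s) for s in sentences]
      return statistics.pstdev(scores)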


You don't need to prompt-engineer; this is solvable by changing sampling parameters. You need a higher temperature.
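For reference, a minimal sketch with the pre-1.0 openai Python client (the model name, prompt, and key are placeholders); raising temperature makes sampling less deterministic, which also tends to raise the measured perplexity of the output:

  import openai  # assumes the pre-1.0 openai Python client

  openai.api_key = "sk-..."  # placeholder

  response = openai.Completion.create(
      model="text-davinci-003",  # example model, not prescriptive
      prompt="Summarize the history of the periodic table.",
      max_tokens=256,
      temperature=1.2,  # higher temperature -> less predictable token choices
  )
  print(response["choices"][0]["text"])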


It's probably a short-term social phenomenon. We don't bother detecting mathematical output from calculators or spreadsheets; we just like that folks give us the right answer, even if they had easy tooling to produce it. However, watching someone do things the old way would seem bemusing. If you watched a manager notating all over a physical spreadsheet with a pencil (as was commonly done at one time) it would seem quaint or backwards depending on context. Likewise, waiting for someone to write a letter and taking more than 90 seconds because they didn't co-author it with AI might seem slow.


"Write an email that says I did X and they should do Y, but if Z then W, and we should schedule a meeting with P and Q."

I feel like for most emails I write, information density is close to a maximum. This means there's no actual gain to be had from a language model. The email I would write myself is going to be about the same length as the prompt I'd have to write anyway.


There is a huge gain if English is your second language, and you use ChatGPT to rewrite, or translate.

Even if English is your mother tongue, if your written English is crappy or you need to write in a style you are unfamiliar with (e.g. formal), then ChatGPT can help.


If you're writing in a style you're unfamiliar with, how do you know the model is doing it correctly?

I also think writing yourself might be far better practice. This tool can easily become a crutch. This is unlikely to be free anytime soon. In fact it's likely to be quite expensive.


I think we all tend to be better at picking out a correct answer than generating a correct answer from scratch.

I can ask ChatGPT to rewrite an email in the style of an American news reporter from 1950, and I can judge whether some of the cliches it generates feel correct. I cannot write in that style at all.


ChatGPT is free now, although there is a paid tier, and MS and Google are building similar capabilities right into their search interfaces.


ChatGPT won't be free forever.

Whatever LLM search stuff comes along will only be free as long as it brings in ad revenue. Which involves making the models fundamentally worse most likely. Or they'll use it to collect personal data. Probably both.

Computationally, GPT is wildly expensive. This idea people have that it's gonna be used all over the place for all sorts of tiny tasks, as if it's just another REST API, is nuts. Unless something fundamentally new comes along that makes these models much cheaper, adoption is likely going to end up much more limited than people expect. Or siloed off into expensive business-facing products.


The cost estimates I see are things like less than half a cent per query and a few cents per conversation.

If I'm making (or valuing my free time at) $20 an hour, then ChatGPT only needs to save me one or two seconds and half a minute respectively.


And like everything, the price will drop rapidly as the models become smaller through advances. There's already SparseGPT.


I hope gpt detectors will evolve into general bs detectors.


I wonder how good the best language models are at detecting bs.

So many experiments to run...


It sounds feasible to detect cases of low temperature output, as the text output this way sounds very generic (like an average of all content seen near the prompt's latent space).

However once you prompt the LLM with a higher temperature, or tell it to roleplay as someone with elaborate personas, or suggest to use certain linguistic styles, or train it on example text... then it becomes much harder.

I imagine pathological cases of formulaic word use and sentence/paragraph structure will only be detectable in longer-form text. After all, text is already pretty low-resolution; there's not much for adversarial models to work with.


Perhaps instead, the solution is to hold people to a higher standard. AI as a tool to help people in identifying logical fallacies (special pleading, anecdotal, the fallacy fallacy) written or otherwise could be very valuable. Maybe in the same way that ChatGPT can identify and explain the problems a piece of code has. It could even be integrated with moderation tools. This could in-effect make individuals less vulnerable to convincing nonsense said by professional bullshitters or parroting individuals.


> Accuracy is obviously going to improve in models

Well, to be clear, they can put rules based filters and other things on top of the neural net, but the core GPT will never get more accurate since it has no mechanism to understand what words mean.


GPT-3 is far more accurate than GPT-2. Seems reasonable that larger models trained on more data will continue to improve accuracy. I'd also expect larger models to be better at summarizing text, i.e. potentially fixing the Bing issues where it hallucinates numbers.

Our model sizes are a product of our scaling and hardware limitations. There's no reason to believe we are anywhere near optimal.


> Seems reasonable that larger models trained on more data will continue to improve accuracy.

It also seems reasonable to assume that they will eventually encounter diminishing returns, and that the current issues, such as hallucinations, are inherent to the approach and may never be resolved.

To be clear I don't have a clue which statement is true (though I don't see why scaling would solve the hallucination problem).


Scaling of models is a heavily researched area, and so far all the experiments show that scaling doesn't really hit diminishing returns - that was checked in the GPT-2 "era" with model sizes from very small up to GPT-2, and reconfirmed with GPT-3 and then with newer models. As far as we can see, scaling does not result in diminishing returns. While it's certainly possible that we will eventually encounter them, it is not reasonable to presume we will any time soon (we have literally zero evidence for that and at least some evidence to the contrary), and even if we do, there's currently no reason to assume that the eventual breaking point is somewhere around "GPT-5" rather than "GPT-15" or "GPT-55".


If significantly bigger models than today's got better results, we would have seen papers about it a long time ago so that the team/company could get more funding; lots of rich actors have worked on this for years.

If it doesn't produce better results, however, then they want their competitors to waste lots of money making the same mistakes; there is really no benefit to publishing that and lots of drawbacks.

Otherwise it seems too much of a coincidence that Google and OpenAI ended up with models of basically the same size. Google could easily have trained a model 5x-10x larger, it isn't that expensive for them, but for some reason we didn't see that, and GPT-4 just never seems to launch.


It’s not just the cost of training the model, it’s the cost of doing inference at scale. ChatGPT is borderline too expensive to operate already. It’s hard to imagine a larger model that is both economical and used by millions of people on our current hardware.


But if a larger model was good enough to replace a human, like for example a Google engineer, then it would still be worth it. So they have for sure tried to scale up, and if that extra scale gave results they would have published something about it.

Now, since the larger model wasn't good enough to replace a human engineer, we can rest easy: it won't replace programmers anytime soon. If GPT-4, for example, could replace engineers, OpenAI wouldn't need to monetise ChatGPT; they would just rent out artificial engineers to do coding for $10k a year.


It might turn out that for rule-based systems such as prescriptive grammars (for grammatically correct language rather than natural spoken language), there is still a use for a system that explicitly represents those rules.

Then again, we are a big old bulb of wetware and we can generally learn to apply grammar rules correctly most of the time (when explicitly thinking about them, anyway).

Maybe what we need is some kind of metacognition: being able to apply and evaluate rules that the current LLMs can already correctly reproduce.


The biggest problem is that scaling is non-linear. The returns might well be non-diminishing with respect to model size, but if we have to throw N^2 hardware at it to make it (best case) 2N better, we'll still hit the limit pretty quickly.


> GPT-3 is far more accurate than GPT-2

Please say more about what you mean here because I disagree.

It’s certainly more eloquent, but it still can’t multiply two 4-digit numbers…


But it did learn some basic arithmetic. If GPT-4 can multiply two 4-digit numbers, will you change your mind?


GPT saw a bunch of arithmetic and can repeat it. That’s the joke behind the Reddit usernames and /r/counting.

The four digit number thing is just the current lower bound of where it gets confused, because of a lack of training data.

Once you teach a patient 8-year-old the rules of multiplication, they can multiply any two numbers (that they’ve never seen before) with an arbitrary number of digits. An LLM cannot and will not ever be able to do that, because it is a specific tool and it is not designed to do that (doing that would be a bad outcome for an LLM, since we have different tools that can do multiplication much more efficiently).

So yes, if an LLM learns rules-based math (which it is not intended to do) I’ll eat not only my hat, but every hat in existence.
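For what it's worth, those rules really are that small: grade-school long multiplication is just a fixed per-digit procedure, something like the sketch below, and it works for any number of digits.

  def long_multiply(a: str, b: str) -> str:
      # Grade-school long multiplication on decimal digit strings.
      # The same finite rule set (digit product, carry, shift) covers
      # numbers of any length - which is the point about generalization.
      result = [0] * (len(a) + len(b))
      for i, da in enumerate(reversed(a)):
          carry = 0
          for j, db in enumerate(reversed(b)):
              total = result[i + j] + int(da) * int(db) + carry
              result[i + j] = total % 10
              carry = total // 10
          result[i + len(b)] += carry
      digits = "".join(map(str, reversed(result))).lstrip("0")
      return digits or "0"

  print(long_multiply("4821", "7309"))  # 35236689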


They probably mean that if you test it on common NLP benchmarks it performs better.


Just wrote this myself, although I did try to ChatGPT-style it a bit. I thought the final third would serve to identify it as non-AI, as it goes off on a tangent about isotopes...

> "The periodic table is a systematic ordering of elements by certain charcteristics including: the number of protons they contain, the number of electrons they usually have in their outer shells, and the nature of their partially-filled outermost orbitals."

> "Historically, there have been several different organizational approaches to classifying and grouping the elements, but the modern version originates with Dmitri Mendeleeve, a Russian chemist working in the mid-19th century."

> "However, the periodic table is also somewhat incomplete as it does not immediately reveal the distribution of isotopic variants of the individual elements, although that may be more of an issue for physicists than it is for chemists."

GPTZero says: "Your text is likely to be written entirely by AI"

Now I'm feeling existential dread... perhaps I am an AI running in a simulation and I just don't know it?


Technical writing, in order to be relatively unambiguous - the "technical" part - defines itself as a subset of English with a constrained grammar and vocabulary.

You just illustrated technical writing. Naturally, your writing style is very similar to that of other technical writing.

Take one guess what kind of writing exists in most of the text GPT is trained on.


You made a few unforced errors that move it away from the quietly authoritative, mirror-sheen AI voice.

The colon in line 1 is clunky, the combination of "but" and "with" in line 2 reads as passive, and line 3 is full person.


Perhaps you just write very predictably.

A partisan writing cliched slogans and regurgitating tired political statements has a “temperature” closer to 0, likewise an engineer stringing together typical word combinations. A poetical writer with surprising twists and counterintuitive mixtures of words has a temperature closer to 1.0.

I can see a few clichés in your writing; I also did a search on some sentence fragments, which turned up a number of results.

If you want to be less “robotic” then you could add whimsical or poetic wording, and less common turns of words (be a phrase rotator).


How has your “success” been with ChatGPT? Qualitatively, generally, positive/negative.

Being able to speak to machines would likely correlate with “sounding like one.”

There could be a different (mis)categorization here but also por que no los dos.


I mean, technically speaking you are a man-made intelligence, thus it would be fair to say that you are an artificial intelligence. ^^


As millions of people interact with ChatGPT, their writing will subtly, gradually begin to mimic its style. As future versions of the model are trained on this new text, both human and AI styles will converge until any difference between the two is infinitesimal.


One of the big complaints with LLMs is the confident hallucination of incorrect facts, like software APIs that don’t exist.

But the way I see it, if ChatGPT thinks the Python list object should have a .is_sorted() property, that’s a pretty good indication that maybe it should.

I work in PM (giant company, not Python), and one of these days my self-control will fail me and I will open a bug for “product does not support full API as specified by ChatGPT”.


ChatGPT also thinks certain coding algorithms should violate well-known information-theoretic bounds ;) I’ll put a ticket in with Claude Shannon.


I’m sure he’ll be excited to learn about this gap!


> LLMs is the confident hallucination of incorrect facts

This is a very common feature of delirium in people. Chatting with an LLM seems a lot like what it would be to talk to a clever person with encyclopedic knowledge, who is just waking up from anaesthesia or is sleep talking.


Or just the average person on reddit's r/confidentlyincorrect.


> it, if ChatGPT thinks the Python list object should have a .is_sorted() property, that’s a pretty good indication that maybe it should.

Yes! And when it hallucinates references for articles, oftentimes those articles probably should exist…


And if they don't exist, you ask the model to write them from title and link.


The year is 2145.

When a new person is born their entire life is hallucinated in its entirety by the all great and powerful GPT. Deviation from His plan is met with swift and severe consequences.


Or rather, you can just catch the missing method at runtime and patch it with a ChatGPT call.


Love it. Get ChatGPT to write the missing method, execute it just this once, then store it in a file and update the current source file with the include, caching it for next time.
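Something like this rough sketch, maybe (the prompt and the pre-1.0 openai client usage are illustrative, and exec-ing generated code like this is obviously unsafe):

  import openai  # pre-1.0 client, as an example

  def ask_llm(prompt):
      # Stand-in helper: ask the model for source code and return it as a string.
      resp = openai.ChatCompletion.create(
          model="gpt-3.5-turbo",
          messages=[{"role": "user", "content": prompt}],
      )
      return resp["choices"][0]["message"]["content"]

  class SelfWritingObject:
      # Sketch: resolve missing methods by asking an LLM, then cache them on the class.
      def __getattr__(self, name):
          source = ask_llm(
              f"Write a Python function named {name} taking (self, *args, **kwargs). "
              "Return only the code, no prose."
          )
          namespace = {}
          exec(source, namespace)            # run the generated definition just this once
          method = namespace[name]
          setattr(type(self), name, method)  # cache it; could also append `source` to a file
          return method.__get__(self, type(self))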


I can't find it, but someone already did a Python module that plugs into GPT-3 and automatically generates functions on the fly as you call them - and then the same for methods on the returned values etc.


> if ChatGPT thinks the Python list object should have a .is_sorted() property, that’s a pretty good indication that maybe it should.

Hahaha, Python language fixing itself!!!


An opposite possibility is that the commonness of ChatGPT will cause people to adopt a style as distinct from it as possible.

Of course, this might mean future Chatbots would successfully emulate that. But it's not impossible an "adversarial style" exists - this wouldn't be impossible to emulate but it might be more likely to cause the emulator to say things the reader can immediately tell are false.

One idea is to "flirt" with all things that people have come up with that AI chokes on. "Back when the golden gate bridge was carried across Egypt..."


Prediction #1: Once enough ChatGPT output gets posted online, it will inevitably find its way into the training corpus. When that happens, ChatGPT becomes stateful and develops episodic memory.

Prediction #2: As more people discuss ChatGPT online, by late 2023 discussion of Roko's Basilisk exceeds discussion of ChatGPT. (half /s)


Or, ChatGPT will overtrain on its own data and go to shit the way Google search did.


Training on its own data is already a tradition. For example, the RLHF example pairs rated by humans are generated by the model itself. So even our best models are trained on their own outputs plus ratings from human labellers. The internet is a huge rating machine; AI will distill this signal and improve even while ingesting its own text.


Meta-ChatGPT's loss function optimises for ChatGPT generating training data that maximises the shittyness of Google's LLM.


Did you see the new Bing chat?

#1 is already happening!

See here (other HN thread): https://twitter.com/tobyordoxford/status/1627414519784910849


So long as ChatGPT is forbidden from communicating in certain ways (swearing, speaking ill or positive of controversial people or topics, etc), convergence will never happen. People interact with other people more than they do ChatGPT, so the majority force will remain dominant.


AI will turn half the people into Eloi (https://en.wikipedia.org/wiki/Eloi) while the rest of us will become the Morlocks.


Sounds accurate and horrifying, I don't get the enthusiasm for this at all beyond a desire to be there first and make a ton of money. All manuscripts get a run through an AI editor, all business writing is even more soullessly devoid of purpose beyond accomplishing task X, all blogposts are finetuned for maximum engagement and therefore ad/referral revenue.

That's already happening, I know, but it will be amplified to the point that all humanity in writing is lost. All ideas in writing will be a copy of a copy of a copy and merely resemble something once meaningful. Time to go touch grass.


Too much AI for you? You can fix your problem with even more AI! Get your own AI, running on your hardware, loyal only to you. It will act like a firewall between you, the vulnerable, hackable human, and the wild internet full of other AIs. To go out on the internet without an AI is like going for a walk during COVID without a mask, or browsing without AdBlock. Their AI will talk to your AI, and that's how you can be safe.


Interesting idea, but isn’t there variance in the output? Eg I’ve seen people ask it to “write in the style of x” etc and different people also clearly have different writing styles.


You can try, but it isn’t very good at that. The style remains very ChatGPT.


+1

We train AIs but they also train us.


A related idea, in case you'd like to see that thought explored: https://medium.com/@freddavis/we-shape-our-tools-and-thereaf...


Great read, thanks for sharing!


Alternative title: Most academic papers are indistinguishable from AI generated babble.


...to AI. It's kinda funny how this is yet another area where these models suck very much in the same way that most humans do. LLMs are bad at arithmetic? So are most people. Can't tell science from babble? I already wouldn't ask a non-expert to rate any aspect of an academic paper. Trusting the average Joe who has only completed some basic form of education would be tremendously stupid. Same with these models. Maybe we can get more out of it in specific areas with fine tuning, but we're very far away from a universal expert system.


The best was the Ted Chiang article making numerous category errors and forest/trees mistakes in arguing that LLMs just store lossy copies of their training data. It was well-written, plausible, and so very incorrect.


Neural network based compression algorithms[1] are a thing, so I believe Ted Chiang's assessment is right. Memorization (albeit lossy) is also how the human brain works and develops reasoning[2].

[1] https://bellard.org/nncp/

[2] https://www.pearlleff.com/in-praise-of-memorization


The fact that some neural network architectures can compress data does not mean that data compression is the only thing any neural network can do.

It’s like saying that GPUs can render games, so GPT is a game because it uses GPU.


I felt the same way. But I’d love to read a specific critique. Have you seen one?


Here’s one from a researcher (which also links to another), though I’m not qualified to assess its content in depth.

https://twitter.com/raphaelmilliere/status/16240731504754319...


I mean, humans can't distinguish AI written text either - which is why this tool was built?

I don't see how it would be possible to build such a tool either, as the set of word combinations that can follow one another is finite.


I do agree that the most likely reason is that scientific papers tend to be highly formulaic and follow strict structures, so an LLM is able to generate something much closer to human writing than when it tries to generate narrative.

But it's still fun to deduce that the reason is that the quality of technical writing has sunk so low that it is even below the standards for AI-generated text.


>...to AI.

Perhaps we can call it the "Synthromorphic principle," the bias of AI agents to project AI traits onto conversants that are not in fact AI.


I’ve got a fun little side project that uses GPT. I tested GPTZero against 10 of my project’s writings and 10 of my own. It detected 6 out of 10 correctly in both cases (4 GPT-written bits were declared human, 4 human-written bits were declared GPT).

Which is better than 50% but not nearly good enough to base any kind of decision on.


> which is better than 50%

Unrelated: the p-value for getting 12 out of 20 correct just by chance is ~0.4; that is, there is not enough data to support the conclusion "better" in this case.

Null hypothesis: 50%/50%, i.e. the result is random; normal approximation:

  H0: p=1/2
  H1: p!=1/2 (two-tail) 


  import statistics
  
  p0 = 0.5  # proportion of successes according to null hypothesis
  n = 20   # sample size
  p_sample = 12/n  # 12 from 20 are correct
  
  sigma = (p0 * (1 - p0) / n)**.5  # std according to H0
  z_score = (p_sample - p0) / sigma  # test statistic
  p_value = 2*statistics.NormalDist().cdf(-abs(z_score))  # prob. two-tails
  # p-value -> 0.4
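For comparison, an exact two-sided binomial test (no normal approximation) gives an even larger p-value, roughly 0.5:

  import math

  n, k = 20, 12
  p_upper = sum(math.comb(n, i) for i in range(k, n + 1)) / 2**n  # P(X >= 12) ~ 0.25
  p_value_exact = 2 * p_upper  # two-sided, by symmetry of Binomial(20, 1/2) -> ~0.50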


This is testing on abstracts of medical papers. I wouldn't be surprised if the way abstracts are carefully written and reviewed by > 3 people is very different from normal text, and in some ways more similar to LLM output.


That's great! All we need to do is negate the output and it will be more accurate.


Perhaps AI generated text should be created with a specific signature in mind _specifically_ to be identifiable?


There's a large body of research into invisible text watermarking, so this would certainly be possible. Maybe the simplest to implement in LLMs would be to bias the token generation slightly, for example by making tokens that include the letter i slightly more likely. In a long enough text you could then see the deviation from normal human text characteristics.
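The detection side could then be a simple frequency test over a long enough sample; a toy sketch, where the baseline rate and threshold are made-up numbers purely for illustration:

  import math

  def looks_watermarked(text, baseline=0.35, z_threshold=4.0):
      # Toy detector: if generation was biased toward tokens containing "i",
      # the observed rate should drift measurably above the human baseline.
      tokens = text.split()  # crude whitespace tokenization, for illustration only
      n = len(tokens)
      if n == 0:
          return False
      observed = sum("i" in t.lower() for t in tokens) / n
      sigma = math.sqrt(baseline * (1 - baseline) / n)  # std error under the baseline
      z = (observed - baseline) / sigma
      return z > z_threshold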


All the main scenarios for detecting generated text that I can imagine do have to assume that the LLM isn't "cooperative" but actually is specifically designed (or fine-tuned) to avoid detection.


Yep, I think it's a good idea.


Isn’t this essentially asking anyone who runs a model to flip the evil bit[0]? People who want to misrepresent model output as human-written output will trivially be able to beat this protection by removing the signature or using a version of the model that simply doesn’t add it.

[0] https://en.m.wikipedia.org/wiki/Evil_bit


I think you're correct, but it would promote the idea that what's produced is a basis or source for further work... and would mean that effort is required.

Feels like a good basis tech for something like ChatGPT.


Detecting text generated by large language models like ChatGPT is a challenging task. One of the main difficulties is that the generated text can be highly variable and can cover a wide range of topics and styles. These models have learned to mimic human writing patterns and can produce text that is grammatically correct, semantically coherent, and even persuasive, making it difficult for humans to distinguish between the text generated by machines and the ones written by humans.

Another challenge is that large language models are highly complex and constantly evolving. GPT-3, for example, was trained on a massive dataset of text and can generate text in over 40 languages. With this level of complexity, it can be challenging to develop detection systems that can keep up with the ever-changing text generated by these models.

To implement a reliable detection system like GPTZero, which is designed to detect text generated by GPT-3, several challenges need to be addressed. First, the system needs to be highly accurate and efficient in identifying text generated by GPT-3. This requires a deep understanding of the underlying language model and the ability to analyze the text at a granular level.

Second, the system needs to be scalable to handle the vast amounts of data generated by GPT-3. The detection system should be able to analyze a large volume of text in real-time to identify any instances of generated text.

Finally, the system needs to be adaptable to the evolving nature of large language models. As these models continue to improve and evolve, the detection system needs to keep up and adapt to the changing landscape.


It was weird how easy this was to identify if you have read any amount of ChatGPT content. It has a particular writing style that is pretty obvious.

I am not sure how you would code something to detect an author based on writing style. It feels like something people would have tried to do before. Probably using a similar approach that LLMs use but with a separate predictor for specific authors.


ChatGPT in particular writes in middle school essay format: introduction, point 1, point 2, point n, conclusion.


It's because it reads exactly like every middle schooler writing their first MLA formatted five paragraph essay.


Ironic, because this was so obviously ChatGPT to me from literally the first sentence.

I think this is probably because it doesn't match the conversational style of a forum discussion.


If it's like many of these 'AI' engines, it's a statistical map of text. I'd expect it to produce no better than what it's trained on, then muddy that by randomly combining different versions of similar statements.

I'm impressed it is ever accurate.


Yes, it's really bad.

False positives and negatives all over the place.

I really wish it worked, but generally, it doesn't.


I had some fun yesterday when ChatGPT hallucinated a bibliographic reference to an article that didn't exist. But the journal existed, and it had plenty of articles that made ChatGPT's hallucination plausible. I think that at least this use case can be fixed with some pragmatic engineering[^1].

[^1]: Which may take a bit to happen, because our current crop of AI researchers have all taken "The bitter lesson"[^2] to heart.

[^2]: http://www.incompleteideas.net/IncIdeas/BitterLesson.html


Did chatgpt post the reference as a footnote (or parenthetical)?

At least for now, I was thinking it didn’t do that, and maybe the lack of references would be an indicator of unedited GPT output.


It did:

...

> However, according to a study published in the Journal of Dairy Science, the diacetyl content of butter can range from approximately 0.5 to 14 parts per million (ppm) (https://doi.org/10.3168/jds.S0022-0302(03)73775-3).

The DOI I could not find, so I'm pretty sure it is bogus.

I asked it to produce a full reference:

> Sure, the full reference for the study I mentioned is:

> Yvon, M., Chambellon, E., & Bolotin, A. (2003). Effect of pH on diacetyl and acetoin production by Lactococcus lactis subsp. lactis biovar diacetylactis. Journal of dairy science, 86(12), 4068-4076.

I went to the index of the journal 86(12), and that article is not there.


It's okay, GPT just gets confused about which of the quantum many-worlds it is currently in. Just ask it to write the article as needed. ~


I read an interesting paper about an idea of watermarking LLM output text in such a way that makes detection very accurate for long enough text. This is done by subtly changing the probabilities of the next word to be generated, based on the last word that was output. Circumventing it by manually changing words post hoc would potentially require almost as much work as writing the text from scratch.

The idea seems quite robust to me, and I can envisage a future where companies that provide access to LLMs also publish a detection tool for their models.
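Schematically, my reading of the scheme looks something like the sketch below, with a toy dict of (nonzero) token probabilities standing in for the model's full logit vector; the bias size and favored fraction are arbitrary:

  import hashlib, math, random

  def watermarked_probs(prev_token, probs, favored_fraction=0.5, bias=0.5):
      # Sketch of a context-keyed watermark: the previous token deterministically
      # selects a "favored" subset of the vocabulary, which gets a small boost.
      seed = int.from_bytes(hashlib.sha256(prev_token.encode()).digest()[:8], "big")
      rng = random.Random(seed)
      vocab = list(probs)
      favored = set(rng.sample(vocab, max(1, int(len(vocab) * favored_fraction))))
      # Nudge favored tokens up slightly in log space, then renormalize.
      logits = {t: math.log(p) + (bias if t in favored else 0.0) for t, p in probs.items()}
      total = sum(math.exp(v) for v in logits.values())
      return {t: math.exp(v) / total for t, v in logits.items()}

A detector that knows the seeding rule can re-derive the favored set for each position and count how often the chosen word lands in it; over a long text the excess shows up as a statistically very unlikely deviation from the expected rate.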


It's not hard to make a model that rewrites the text without changing the meaning, which defeats this. Our model[0], which is based on feeding ChatGPT random things from the interwebs from before 2020 and letting it wobble on about them, is pretty good and nice to play with, but it's pretty easy to change the score radically by changing just a few words. This is whack-a-mole no matter how it's done. For now you can be sure people are too lazy to do it, but in the future there will be many tools to evade detectors like this.

[0] https://filteroutai.com/validate/a07e081b71b294ba2de236441be... https://filteroutai.com/validate/2c3fa6de32845df02be7a4ff185...


Random thought: In the future the simplest method to reduce your chance of GPT usage being detected is to pretend your paper was written in 2022 or earlier.


Isn't GPTZero based on the GPT-2 output detector model by OpenAI? In the initial preview it was exactly like the demo, a text box styled differently.

[1] - https://github.com/openai/gpt-2-output-dataset/blob/master/d...


GPTZero cannot work because they don’t have all of the logits needed to judge perplexity; they only have the top five or so. Even OpenAI, which has the full set of logits, is not able to correctly classify all texts. It’s a futile effort.
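For reference, this is roughly what a perplexity measurement looks like when you do have full access to a (smaller) model; a sketch with Hugging Face's GPT-2, where the model choice and truncation length are just illustrative:

  import torch
  from transformers import GPT2LMHeadModel, GPT2TokenizerFast

  tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
  model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

  def perplexity(text):
      # Average per-token negative log-likelihood under GPT-2, exponentiated.
      # Lower values mean the text is more "predictable" to the model.
      ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024).input_ids
      with torch.no_grad():
          loss = model(ids, labels=ids).loss
      return torch.exp(loss).item()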


"Tell me more about your perspective customers."

https://projects.csail.mit.edu/films/aifilms/AIFilms.html


Who didn't see this coming when GPTZero was announced by the clueless media?

I didn't even have to look at the internals to know GPTZero and OpenAI's solutions would be pointless. I'm still very surprised OpenAI made snake oil.


Right, I mean, how could it know?

I could write in the tone of ChatGPT if I tried hard enough. It's an intractable problem and a tool like this probably does more harm than good.


Non-editorialized title: GPTZero Case Study (Exploring False Positives)

@dang


Fox News is accurate 10% of the time, at best, and it's allegedly produced by humans, so I'm not really seeing the problem here...


You've got it all wrong: it's 90% accurate and 90% successful - its goal is to be as inaccurate as possible without its audience realising.

If Fox News deleted the last 10% of reality from its broadcasting, people might start catching on. Although these days I am not sure.


It's important to shine a light on the limitations of AI detection software, and this case study on GPTZero does just that. False positives can have serious consequences, particularly in sensitive areas such as healthcare.


Thank you for your heroic effort in copying and pasting a chat log; I am left in awe of it.

...nice edit...


Lighten up mate.



