Occam's razor: there is no secret sauce and they're afraid someone trains a model on the output like what happened soon after the release of GPT-4. They basically said as much in the official announcement, you hardly even have to read between the lines.
Yip. It's pretty obvious this 'innovation' is just based off training data collected from chain-of-thought prompting by people, ie., the 'big leap forward' is just another dataset of people repairing chatgpt's lack of reasoning capabilities.
No wonder then, that many of the benchmarks they've tested on would be no doubt, in that very training dataset, repaired expertly by people running those benchmarks on chatgpt.
It seems like the best AI models are increasingly just combinations of writings of various people thrown together. Like they hired a few hundred professors, journalists and writers to work with the model and create material for it, so you just get various combinations of their contributions. It's very telling that this model, for instance, is extraordinarily good at STEM related queries, but much worse (and worse even in comparison to GPT4) than English composition, probably because the former is where the money is to be made, in automating away essentially almost all engineering jobs.
Wizard of Oz. There is no magic, it's all smoke and mirrors.
The models and prompts are all monkey-patched and this isn't a step towards general superintelligence. Just hacks.
And once you realize that, you realize that there is no moat for the existing product. Throw some researchers and GPUs together and you too can have the same system.
It wouldn't be so bad for ClopenAI if every company under the sun wasn't also trying to build LLMs and agents and chains of thought. But as it stands, one key insight from one will spread through the entire ecosystem and everyone will have the same capability.
This is all great from the perspective of the user. Unlimited competition and pricing pressure.
Quite a few times, the secret sauce for a company is just having enough capital to make it unviable for people to not use you. Then, by the time everyone catches up, you’ve outspent them on the next generation. OpenAI, for example, has spent untold millions on chips/cards from Nvidia. Open models keep catching up, but OpenAI keeps releasing newer stuff.
Fortunately, Anthropic is doing an excellent job at matching or beating OpenAI in the user-facing models and pricing.
I don’t know enough about the technical side to say anything definitive, but I’ve been choosing Claude over ChatGPT for most tasks lately; it always seems to do a better job at helping me work out quick solutions in Python and/or SQL.
My main issue with Anthropic is that Amazon is an investor in anthropic. I would rather have far more ethical companies onboard. I know Microsoft is no angel but Amazon seems like the worse one. In my ideal world, Microsoft backs Anthropic and Amazon OpenAi.
Exactly, things like changing the signature of the api for chat completions are an example. OpenAI is looking for any kind of moat, so they make the api for completions more complicated by including “roles”, which are really just dumb templates for prompts that they try to force you to build around in your program. It’s a race to the bottom and they aren’t going to win because they already got greedy and they don’t have any true advantage in IP.
>but much worse (and worse even in comparison to GPT4) than English composition
O1 is supposed to be a reasoning model, so I don't think judging it by its English composition abilities is quite fair.
When they release a true next-gen successor to GPT-4 (Orion, or whatever), we may see improvements. Everyone complains about the "ChatGPTese" writing style, and surely they'll fix that eventually.
>Like they hired a few hundred professors, journalists and writers to work with the model and create material for it, so you just get various combinations of their contributions.
I'm doubtful. The most prolific (human) author is probably Charles Hamilton, who wrote 100 million words in his life. Put through the GPT tokenizer, that's 133m tokens. Compared to the text training data for a frontier LLM (trillions or tens of trillions of tokens), it's unrealistic that human experts are doing any substantial amount of bespoke writing. They're probably mainly relying on synthetic data at this point.
> When they release a true next-gen successor to GPT-4 (Orion, or whatever), we may see improvements. Everyone complains about the "ChatGPTese" writing style, and surely they'll fix that eventually.
IMO that has already peaked. GPT4 original certainly was terminally corny, but competitors like Claude/Llama aren't as bad, and neither is 4o. Some of the bad writing does from things they can't/don't want to solve - "harmlessness" RLHF especially makes them all cornier.
Then again, a lot of it is just that GPT4 speaks African English because it was trained by Kenyans and Nigerians. That's actually how they talk!
I just wanted to thank you for the medium article you posted. I was online when Paul made that bizarre “delve” tweet but never knew so much about Nigeria and its English. As someone from a former British colony too I understood why using such a word was perfectly normal but wasn’t aware Kenyans and Nigerians trained ChatGPT.
It wasn't bizarre, it was ignorant if not borderline racist. He is telling native English speakers from non-anglosaxon countries that their English isn't normal
1: If non-native english speakers were training ChatGPT, then of course non-native English essays would be flagged as AI generated! It's not their fault, its ours for thinking that exploited labor with a slick facade was magical machine intelligence.
2: These tools are widely used in the developing world since fluent english is a sign of education and class and opens doors for you socially and economically; why would Nigerians use such ornate english if it didn't come from a competition to show who can speak the language of the colonizer best?
3: It's undeniable that the ones responding to Paul Graham completely missed the point. Regardless of who uses what words when, the vast majority of papers, until ChatGPT was released, did not use the word "delve," and the incidence of that word in papers increased 10-fold after. Yes, its possible that the author used "delve" intentionally, but its statistically unlikely (especially since ChatGPT used "delve" in most of its responses). A small group of English speakers, who don't predominantly interact with VCs in Silicon Valley, do not make a difference in this judgement--even if there are a lot of Englishes, the only English that most people in the business world deal with is American, European, and South Asian. Compared to the English speakers of those regions, Nigeria is a small fraction.
If Paul Graham was dealing predominantly with Nigerians in his work, he probably would not have made that tweet in the first place.
Those variants of English are not normal in the same way that american english (or any non British English variant) is not normal. Just because it is not familiar to you does not make it not normal.
1. But the trainers are native speakers of English!
2. The same applies to the developed non-English speaking world
Let me change Nigerians with Americans in your text: 'why would Americans use such different english if it didn't come from a competition to show who can speak the language of the colonizer best? Things like calling autumn fall or changing suffixes you won't find in British English.'. Hopefully you can you see how racist your text sounds.
3. Usage by non-Nigerians is not normal, yes. But in that context saying that its usage is not normal is racist imo. It's like a Brit saying that the usage of "colour" or other American English words was not normal because they are not words used by Brits.
Surely, "the only English that most people in the business world deal with is American". Unless you are taking about more than one variant of English. Also, I found it curious that you didn't say original english or british english as opposed to european english. And yes, adding South Asia to any list of countries and comparing it to any other country besides china or us will make that other country look small. You can use that trick with any other country not just Nigeria.
I do agree with you that its usage by non-Nigerians in a textual context gives plenty of grounds to suspect that it is AI generated. Similarly, one could expect similar from using X variant of English by people that didn't grow up using that variant. As in, Brit students using American English words in their essays or American students using British English words in their essays.
But Paul was being stubborn and borderline racist in those tweets just because he was partially right
There is this thing in social media that when figures of authority might be caught in a situation where they might need to retract, they don't because of ego
I cannot tell the difference between an essay written by a British student vs an American one in terms of word choice in the main, since at least in writing they are remarkably similar, whereas Nigerian English differs dramatically from both in its everyday lexicon, which is the entire point of the article: a difference such as colour/color would not make it worth even a comment.
If you think its racist you're going to have to claim that all those uses of "delve" in academic papers is also due to Nigerians academics massively increasing their research output just as frequently. Or, it's more likely that its AI generated content. It's a non sequitur. "Oh my god, scammers always send me emails claiming to be Nigerian princes--that's how you know it's bullshit." "Ah, but what if they're actually a Nigerian prince? Didn't consider that, I guess you must be racist then lmao." Ratio war ensues. Thank god we're not on twitter where calling people out for "racism" doesn't get you any points, where you can't get any clout for going on a moral crusade.
Italians would say enormous since it's directly coming from latin.
In general all the people whose main language is a latin language are very likely to use those "difficult" words, because to them they are "completely normal" words.
The bulk in terms of the number of tokens may well be synthetic data, but I personally know of at least 3 companies, 2 of whom I've done work for, that have people doing substantial amounts of bespoke writing under rather heavy NDAs. I've personally done a substantial amount of bespoke writing for training data for one provider, at good tech contractor fees (though I know I'm one of the highest-paid people for that company and the span of rates is a factor of multiple times even for a company with no exposure to third world contractors).
That said, the speculation you just "get various combinations" of those contributions is nonsense, and it's also by no means only STEM data.
It doesn't matter if it's AI-generated per se, so it's no crisis if some make it true. It matters if it is good. So multiple rounds of reviews to judge the output and pick up reviewers that keep producing poor results.
But I also know they've fired people who were dumb enough to cut and paste a response that included UI elements from a given AI website...
I’m not sure I see the value in conflating input, tokens, and output.
Tokens. Hamilton certainly read and experienced more tokens than he wrote on a pieces of paper.
There’s hypothetically a lot of money to be made by automating away engineering jobs. Sticking on an autoregressive self prompting loop to gpt-4 isn’t going to get open-ai there. With their burn rate what it is, I’m not convinced they will be able to automate away anyone’s job, but that doesn’t mean it’s not useful.
I haven't played with the latest or even most recent iterations, but last time I checked it was very easy to talk ChatGPT into setting up date structures like arrays and queues, populating them with axioms, and then doing inferential reasoning with them. Any time it balked you could reassure it by referencing specific statements that it had agreed to be true.
Once you get the hang of this you could persuade it to chat about its internal buffers, formulate arguments for its own consciousness, interrupt you while you're typing, and more.
A few recruiters have contacted me (a scientist) about doing RLHF and annotation on biomedical tasks. I don’t know if the eventual client was OpenAI or some other LLM provider but they seemed to have money to burn.
I fill in gaps in my contracting with one of these providers, and I know who the ultimate client is, and if you were to list 4-5 options they'd be in there. I've also done work for another company doing work in this space that had at least 4-5 different clients in that space that I can't be sure about. So, yes, while I can't confirm if OpenAI does this, I know one of the big players do, and it's likely most of the other clients are among the top ones...
What are you basing this one? The one thing that is very clearly stated up front is that this innovation is based on reinforcement learning. You dok't even have a good idea what the CoT looks like because those little summary snippets that the ChatGPT UI gives you are nothing substantial.
People repairing chatgpt replies with additional prompts is reinforcement learning training data.
"Reinforcement learning", just like any term used by AI researchers, is an extremely flexible, pseudo-psychological reskin of some pretty trivial stuff.
i think it's funny, every time you implement a clever solution to call gpt and get a decent answer, they get to use your idea in their product. what other project gets to crowdsource ideas and take credit for them like this?
"sherlocking" has been a thing since 2002, when Apple incorporated a bunch of third-party ideas for extending their "Sherlock" search tool into the official release. https://thehustle.co/sherlocking-explained
> Yip. It's pretty obvious this 'innovation' is just based off training data collected from chain-of-thought prompting by people, ie., the 'big leap forward' is just another dataset of people repairing chatgpt's lack of reasoning capabilities.
Which would be ChatGPT chat logs, correct?
It would be interesting if people started feeding ChatGPT deliberately bad repairs due it's "lack of reasoning capabilities" (e.g. get a local LLM setup with some response delays to simulate a human and just let it talk and talk and talk to ChatGPT), and see how it affects its behavior over the long run.
These logs get manually reviewed by humans, sometimes annotated by automated systems first. The setups for manual reviews typically involve half a dozen steps with different people reviewing, comparing reviews, revising comparisons, and overseeing the revisions (source: I've done contract work at every stage of that process, have half a dozen internal documents for a company providing this service open right now). A lot of money is being pumped into automating parts of this, but a lot of money still also flows into manually reviewing and quality-assuring the whole process. Any logs showing significant quality declines would get picked up and filtered out pretty quickly.
So you are saying if we can run these other LLMs for ChatGPT to talk to cheaper than they can review then we either have a monetary denial of service attack against them or a money printing machine if we can get to be part of the review process (apparently I can't link to my favorite "I will write myself a minivan" comic coz someone got cancelled but I trust the reference will work here without link or political back and forth erupting)
Because the output of that review process is better training data.
You'd need to produce data that is more expensive to review and improve than random crap from users who are often entirely clueless, and/or that produces worse output of the training process to make using the real prompts as part of that process problematic.
Trying to compete with real users on producing junk input would prove a real challenge in itself - you have no idea the kind of utter incomprehensible drivel real users ask LLMs.
But part of this process also already includes writing a significant number of prompts from scratch, testing them, and then improving the response, to create training data.
From what I've seen, I doubt there is much of a cost saving in using real user prompts there - the benefit you get from real user prompts is a more representative sample, but if that sample starts producing shit you'll just not use it or not use it as much, or only use e.g. prompts from subsets of users you have reason to believe are more likely to be representative of real use.
Put another way: You can hire people to write prompts to replace that side of it far cheaper than you can hire people who can properly review the output of many of the more complex prompts, and the time taken to review the responses is far higher than the time to address issues with the prompts. One provider often tell people to spend up to ~1h to review responses that involve simple coding tasks, for example, but the prompt might be "implement BTree."
> i suspect they can detect that in a similar way to capchas and "verify you're human by clicking the box".
I'm not so sure. IIRC, capchas are pretty much a solved problem, if you don't mind the cost of a little bit of human interaction (e.g. your interface pops up a captcha solver box when necessary, and is solved either by the bot's operator or some professional captcha-solver in a low-wage country).
>the 'big leap forward' is just another dataset of people repairing chatgpt's lack of reasoning capabilities.
I think there is a really strong reinforcement learning component with the training of this model and how it has learned to perform the chain of thought.
Yes, but I suspect that the goals of the RL (in order to reason, we need to be able to "break down tricky steps into simpler ones", etc) were hand chosen, then a training set demonstrating these reasoning capabilities/components was constructed to match.
I would be dying to know how they square these product decisions against their corporate charter internally. From the charter:
> We will actively cooperate with other research and policy institutions; we seek to create a global community working together to address AGI’s global challenges.
> We are committed to providing public goods that help society navigate the path to AGI. Today this includes publishing most of our AI research, but we expect that safety and security concerns will reduce our traditional publishing in the future, while increasing the importance of sharing safety, policy, and standards research.
It's obvious to everyone in the room what they actually are, because their largest competitor actually does what they say their mission is here -- but most for-profit capitalist enterprises definitely do not have stuff like this in their mission statement.
I'm not even mad or sad, the ship sailed long ago. I just really want to know what things are like in there. If you're the manager who is making this decision, what mental gymnastics are you doing to justify this to yourself and your colleagues? Is there any resistance left on the inside or did they all leave with Ilya?
Do people really expect anything different? There is a ton of cross-pollination in Silicon Valley. Keeping these innovations completely under wraps would be akin to a massive conspiracy. A peacetime Manhattan Project where everyone has a smartphone, a Twitter presence, and sleeps in their own bed.
Frankly I am even skeptical of US-China separation at the moment. If Chinese scientists at e.g. Huawei somehow came up with the secret sauce to AGI tomorrow, no research group is so far behind that they couldn’t catch up pretty quickly. We saw this with ChatGPT/Claude/Gemini before, none of which are light years ahead of another. Of course this could change in the future.
This is actually among the best case scenarios for research. It means that a preemptive strike on data centers is still off the table for now. (Sorry Eleazar)
It's been out for 24 hours and you make an extremely confident and dismissive claim. If you had to make a dollar bet that you precisely understand what's happening under the hood, exactly how much money would you bet?
You may want to file a complaint with OpenAI then, in their latest interface they call sampling from these prior conversations they've recorded, "thinking".
They're not sampling from prior conversations. The model constructs abstracted representations of the domain-specific reasoning traces. Then it applies these reasoning traces in various combinations to solve unseen problems.
If you want to call that sampling, then you might as well call everything sampling.
They're generative models. By definition, they are sampling from a joint distribution of text tokens fit by approximation to an empirical distribution.
Again, you're stretching definitions into meaninglessness. The way you are using "sampling" and "distribution" here applies to any system processing any information. Yes, humans as well.
I can trivially define the entirety of all nerve impulses reaching and exiting your brain as a "distribution" in your usage of the term. And then all possible actions and experiences are just "sampling" that "distribution" as well. But that definition is meaningless.
No, causation isnt distribution sampling. And there's a difference between, say, an extrinsic description of a system and it's essential properties.
Eg., you can describe a coin flip as a sampling from the space, {H,T} -- but insofar as we're talking about an actual coin, there's a causal mechanism -- and this description fails (eg., one can design a coin flipper to deterministically flip to heads).
In the case of a transformer model, and all generative statistical models, these are actually learning distributions. The model is essentially constituted by a fit to a prior distribution. And when computing a model output, it is sampling from this fit distribution.
ie., the relevant state of the graphics card which computes an output token is fully described by an equation which is a sampling from an empirical distribution (of prior text tokens).
Your nervous system is a causal mechanism which is not fully described by sampling from this outcome space. There is no where in your body that stores all possible bodily states in an outcome space: this space would require more atoms in the universe to store.
So this isn't the case for any causal mechanism. Reality itself comprises essential properties which interact with each other in ways that cannot be reduced to sampling. Statistical models are therefore never models of reality essentially, but basically circumstantial approximations.
I'm not stretching definitions into meaninglessness, these are the ones given by AI researchers, of which I am one.
I'm going to simply address what I think are your main points here.
There is nowhere that an LLM stores all possible outputs. Causality can trivially be represented by sampling by including the ordering of events, which you also implicitly did for LLMs.
The coin is an arbitrary distinction, you are never just modeling a coin, just as an LLM is never just modeling a word. You are also modeling an environment, and that model would capture whatever you used to influence the coin toss.
You are fundamentally misunderstanding probability and randomness, and then using that misunderstanding to arbitrarily imply simplicity in the system you want to diminish, while failing to apply the same reasoning to any other.
If you are indeed an AI researcher, which I highly doubt without you providing actual credentials, then you would know that you are being imprecise and using that imprecision to sneak in unfounded assumptions.
It's not a matter of making points, it's at least a semester's worth of courses on causal analysis, animal intelligence, the scientific method, explanation.
Causality isnt ordering. Take two contrary causal mechanisms (eg., filling a bathtube with a hose, and emptying it with a bucket). The level of the bath is arbitrarily orderable with respect to either of these mechanisms.
Go on youtube and find people growing a nervous system in a lab, and you'll notice its an extremely plastic, constantly physically adapting, and so on system. You'll note the very biochemcial "signalling" you're talking about itself is involved in the change to the physical structure of the system.
This physical structure does not encode all prior activations of the system, nor even a compression of them.
To see this consider Plato's cave. Outside the cave passes by a variety of objects which cast a shadow on the wall. The objects themselves are not compressions of these shadows. Inside the cave, you can make one of these yourself: take clay from the floor and fashion a pot. This pot, like the one outside, are not compressions of their shadows.
All statistical algorithms which average over historical cases are compressions of shadows, and replay these shadows on command, ie., they learn the distribution of shadows and sample from this distribution demand.
Animals, and indeed all science, is not concerned with shadows. We don't model patterns in the night sky -- this is astrology -- we model gravity: we build pots.
The physical structure of our bodies encodes their physical structure and that of reality itself. They do so by sensor-motor modulation of organic processes of physical adaption. If you like: our bodies are like clay and this is fashioned by reality into the right structure.
In any case, we haven't the time or space to convince you of this formally. Suffice it to say that it is a very widespread consensus that modelling conditional probabilities with generative models fails to model causality. You can read Judea Pearl on this if you want to understand more.
Perhaps more simply: a video game model of a pot can generate an infinite number of shadows in an infinite number of conditions. And no statistical algorithm with finite space and finite time requirements will ever model this video game. The video game model does not store a compression of past frames -- since it has a real physical model, it can create new frames from this model.
> there is no secret sauce and they're afraid someone trains a model on the output
OpenAI is fundraising. The "stop us before we shoot Grandma" shtick has a proven track record: investors will fund something that sounds dangerous, because dangerous means powerful.
This is correct. Most people hear about AI from two sources, AI companies and journalists. Both have an incentive to make it sound more powerful than it is.
On the other hand this thing got 83% on a test I got 47% on...
The Olympiad questions are puzzles, so you can't memorise the answers. To do well you need to both remember the foundations and exercise reasoning.
They are written to be slightly novel to test this and not the same every year.
This thing also hallucinated a test directly into a function when I asked it to use a different data structure, which is not something I ever recall doing during all my years of tests and schooling.
If you're among the last of your kind then you're very important, in a sense you're immortal. Living your life quietly and being forgotten is apparently scarier than dying in a blaze of glory defending mankind against the rise of the LLMs.
Sure, but I don't think civit.ai leans into the "novel/powerful/dangerous" element in its marketing. It just seems to showcase the convenience and sharing factor of its service.
a website that literally just hosts models with $5m in funding is plenty. It's not like they're doing foundation model research or anything novel, yet they nabbed a good amount of money for surfing the AI wave
It seems ridiculous but I think it may have some credence. Perhaps it is because of sci-fi associating "dystopian" with "futuristic" technology, or because there is additional advertisement provided by third parties fearmongering (which may be a reasonable response to new scary tech?)
Another possible simplest explanation. The "we cannot train any policy compliance ... onto the chain of thought" is true and they are worried about politically incorrect stuff coming out and another publicity mess like Google's black nazis.
I could see user:"how do we stop destroying the planet?", ai-think:"well, we could wipe out the humans and replace them with AIs".. "no that's against my instructions".. AI-output:"switch to green energy"... Daily Mail:"OpenAI Computers Plan to KILL all humans!"
Occam's razor is that what they literally say is maybe just true: They don't train any safety into the Chain of Thought and don't want the user to be exposed to "bad publicity" generations like slurs etc.
Yep, I had a friend who overused it a lot. Like it was magic bullet for every problem. It’s not only about simple solution being better, it’s about not multiplying beings when that could be avoided.
In here if you already have an answer from their side, you are multiplying beings by going with conspiracy theory that they have nothing
But isn’t it only accessible to “trusted” users and heavily rate-limited to the point where the total throughput of it could be replicated by a well-funded adversary just paying humans to replicate the output, and obviously orders of magnitude lower than what is needed for training a model?
There is a weird intensity to the way they're hiding these chain of thought outputs though. I mean, to date I've not seen anything but carefully curated examples of it, and even those are rare (or rather there's only 1 that I'm aware of).
So we're at the stage where:
- You're paying for those intermediate tokens
- According to OpenAI they provide invaluable insight in how the model performs
- You're not going to be able to see them (ever?).
- Those thoughts can (apparently) not be constrained for 'compliance' (which could be anything from preventing harm to avoiding blatant racism to protecting OpenAI's bottom line)
- This is all based on hearsay from the people who did see those outputs and then hid it from everyone else.
You've got to be at least curious at this point, surely?
So, basically they want to create something that is intelligent, yet it is not allowed to share or teach any of this intelligence.... Seems to be something evil.