I use a service where I have access to all the SOTA models and many open-source models, so I switch models within a chat using MCPs. E.g. start a chat with Opus doing a search with the Perplexity and Grok DeepSearch MCPs plus Google search, run the next query with GPT-5 Thinking on xhigh, the next with Gemini 3 Pro, all in the same conversation. It's fantastic! I can't imagine going back to being locked into one (or two) companies. I have nothing to do with the guys who run it (the hosts of the podcast This Day in AI), but if you're interested, have a look at the simtheory.ai Discord.
I don't know how people who use just one service can manage...
I'm amazed by how much Gemini 3 Flash hallucinates; it performs poorly on that metric (along with lots of other models). In the Hallucination Rate vs. AA-Omniscience Index chart, it's not in the most desirable quadrant; GPT-5.1 (high), Opus 4.5 and Haiku 4.5 are.
Can someone explain how Gemini 3 Pro/Flash then do so well in the overall Omniscience: Knowledge and Hallucination benchmark?
Hallucination rate is hallucinations / (hallucinations + partial + ignored), while the Omniscience Index is correct minus hallucinations.
One hypothesis is that Gemini 3 Flash refuses to answer when unsure less often than other models, but when it is sure it's also more likely to be correct. This is consistent with it having the best accuracy score.
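If it helps, here's a minimal sketch of those two metrics as described above; the answer categories (correct / hallucinated / partial / ignored) are my own shorthand, not the benchmark's official schema:

```python
def hallucination_rate(hallucinated, partial, ignored):
    # Of the questions the model didn't get right, what share did it answer
    # confidently wrong rather than hedge or skip?
    return hallucinated / (hallucinated + partial + ignored)

def omniscience_index(correct, hallucinated, total):
    # Roughly "accuracy minus hallucination": right answers reward, confidently
    # wrong answers penalise, and abstaining is neutral.
    return (correct - hallucinated) / total
```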
I'm a total noob here, but just pointing out that Omniscience Index is roughly "Accuracy - Hallucination Rate". So it simply means that their Accuracy was very high.
> In the Hallucination Rate vs. AA-Omniscience Index chart, it’s not in the most desirable quadrant
This doesn't mean much. As long as Gemini 3 has a high hallucination rate (higher than at least 50% of the others), it's not going to be in the most desirable quadrant, by definition.
For example, let's say a model answers 99 out of 100 questions correctly. The 1 wrong answer it produces is a hallucination (i.e. confidently wrong). This amazing model would have a 100% hallucination rate as defined here, and thus not be in the most desirable quadrant. But it should still have a very high Omniscience Index.
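Plugging that 99-out-of-100 example into the formulas as stated upthread (a sketch, not the official scoring code):

```python
# Hypothetical model: 99 correct, 1 confidently wrong, nothing partial or skipped.
correct, hallucinated, partial, ignored = 99, 1, 0, 0
total = correct + hallucinated + partial + ignored  # 100 questions

rate = hallucinated / (hallucinated + partial + ignored)  # 1 / 1 = 1.0
index = (correct - hallucinated) / total                  # 98 / 100 = 0.98

print(f"hallucination rate: {rate:.0%}, omniscience index: {index:.2f}")
# hallucination rate: 100%, omniscience index: 0.98
```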
This one is more powerful than OpenAI models, including GPT-5.2 (which is worse on various benchmarks than 5.1, and that's with 5.2 using xhigh whilst the others were on high, e.g. https://youtu.be/4p73Uu_jZ10?si=x1gZopegCacznUDA&t=582 )
Alright, so we have more benchmarks, including hallucination, and Flash doesn't do well on that one, though generally it beats Gemini 3 Pro, GPT-5.1 Thinking and GPT-5.2 Thinking xhigh (but then Sonnet, Grok, Opus, Gemini and 5.1 beat 5.2 xhigh) - basically everything. Crazy.
There's actually a bit of crossover. This paper is quite old, but I know that other teams and other surgeons have visited each other plenty since it was published, and there's this cross-fertilisation of ideas.
F1 is "important" in the sense that it is competitive, so teams want to iterate and improve constantly. I think the fastest pit stop in the 2025 season was 1.91 seconds, in which: the car is jacked, four tyres are removed, four new tyres are fitted and secured, the car is dropped, the lane is checked for traffic, and then the car can move. There are thousands of permutations of how to get this right and that fast. And accuracy is important: get it wrong and there's a risk of injury at worst, or a fine for an unsafe release at best.
ICU is obviously important in a different way. You can't really "experiment". Iteration needs data. So you need to go out and learn what good looks like from different disciplines, and then carefully plan the changes you want to make and get buy-in. Get it wrong, and people die. Best case scenario you're struck off, worst case you're going to prison for murder.
In dev speak, F1 can afford to be agile; ICUs need to be waterfall.
But because F1 needs to be precise, and they perceive the dangers of imprecision so acutely from a monetary perspective (where you finish in the World Constructors' Championship directly affects the profitability and viability of the team), they want to borrow ideas too.
It sounds ridiculous that surgeons and F1 garages would have so much to talk about, but it turns out they really do feed off each other's ideas sometimes.
> ICU is obviously important in a different way. You can't really "experiment". Iteration needs data. So you need to go out and learn what good looks like from different disciplines, and then carefully plan the changes you want to make and get buy-in.
I would think test runs with simulated patients offer plenty of opportunity to experiment.
> Get it wrong, and people die. Best case scenario you're struck off, worst case you're going to prison for murder.
Get it right and people may still die. The whole reason for the improvement effort is that current practice is excessively risky. No one is getting fired, let alone going to prison, for attempting a sensible, approved improvement that reduces the odds of a child dying.
I use Grok pretty heavily, and Elon doesn't factor into it any more than Sam and Sundar do when I use GPT and Gemini. A few use cases where it really shines:
* Research and planning
* Writing complex isolated modules, particularly when the task depends on using a third-party API correctly (or even choosing an API/library at its own discretion)
* Reasoning through complicated logic, particularly in cases that benefit from its eagerness to throw a ton of inference at problems where other LLMs might give a shallower or less accurate answer without more prodding
I'll often fire off an off-the-cuff message from my phone to have Grok research some obscure topic that involves finding very specific data and crunching a bunch of numbers, or write a script for some random thing that I would previously never have bothered to spend time automating, and it'll churn for ~5 minutes on reasoning before giving me exactly what I wanted with few or no mistakes.
As far as development, I personally get a lot of mileage out of collaborating with Grok and Gemini on planning/architecture/specs and coding with GPT. (I've stopped using Claude since GPT seems interchangeable at lower cost.)
For reference, I'm only referring to the Grok chatbot right now. I've never actually tried Grok through agentic coding tooling.
I can't understand why people would trust a CEO who regularly lies about product timelines, product features, his own personal life, etc. And that's before he politicized his entire kingdom by literally becoming part of the government and one of the larger donors to the current administration.
I'm using Gemini in general, but Grok too: sometimes Gemini Thinking is too slow, while Fast can get confused a lot. Grok strikes a nice balance between being quite smart (not Gemini 3 Pro level, but close) and very fast.
I use a few AIs together to examine the same code base. I find Grok better than some of the Chinese ones I've used, but it isn't in the same league as Claude or Codex.
I dislike Musk, and I use Grok. I find it most useful for analyzing text to help check whether there's anything I've missed in my own reading. Having it built into Twitter is convenient, and it has a generous free tier.
Might switch to Flash over Haiku as my MCP research/transcriber/minor-tasks model now, though (will test, of course).