I use a service where I have access to all the SOTA models and many open-source models, so I switch models within a chat using MCPs. E.g. start a chat with Opus doing a search with the Perplexity and Grok DeepSearch MCPs plus Google search, run the next query with GPT-5 Thinking on xhigh, the next with Gemini 3 Pro, all in the same conversation. It's fantastic! I can't imagine going back to being locked into one (or two) companies. I have nothing to do with the guys who run it (the hosts of the podcast This Day in AI), but if you're interested, have a look at the simtheory.ai Discord.
I don't know how people who use just one service can manage...
I'm amazed by how much Gemini 3 Flash hallucinates; it performs poorly on that metric (along with lots of other models). In the Hallucination Rate vs. AA-Omniscience Index chart, it's not in the most desirable quadrant; GPT-5.1 (high), Opus 4.5 and Haiku 4.5 are.
Can someone explain how Gemini 3 Pro/Flash then do so well in the overall Omniscience: Knowledge and Hallucination benchmark?
Hallucination rate is hallucinations / (hallucinations + partial + ignored), while the Omniscience Index is correct minus hallucinations.
One hypothesis is that Gemini 3 Flash refuses to answer when unsure less often than other models, but when it is sure it's also more likely to be correct. This is consistent with it having the best accuracy score.
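If it helps, here's a minimal sketch of those two metrics as described above; the answer categories (correct / hallucinated / partial / ignored) are my own shorthand, not the benchmark's official schema:

```python
def hallucination_rate(hallucinated, partial, ignored):
    # Of the questions the model didn't get right, what share did it answer
    # confidently wrong rather than hedge or skip?
    return hallucinated / (hallucinated + partial + ignored)

def omniscience_index(correct, hallucinated, total):
    # Roughly "accuracy minus hallucination": right answers reward, confidently
    # wrong answers penalise, and abstaining is neutral.
    return (correct - hallucinated) / total
```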
I'm a total noob here, but just pointing out that Omniscience Index is roughly "Accuracy - Hallucination Rate". So it simply means that their Accuracy was very high.
> In the Hallucination Rate vs. AA-Omniscience Index chart, it’s not in the most desirable quadrant
This doesn't mean much. As long as Gemini 3 has a high hallucination rate (higher than at least 50% of the others), it's not going to be in the most desirable quadrant, by definition.
For example, let's say a model answers 99 out of 100 questions correctly. The 1 wrong answer it produces is a hallucination (i.e. confidently wrong). This amazing model would have a 100% hallucination rate as defined here, and thus not be in the most desirable quadrant. But it should still have a very high Omniscience Index.
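Plugging that 99-out-of-100 example into the formulas as stated upthread (a sketch, not the official scoring code):

```python
# Hypothetical model: 99 correct, 1 confidently wrong, nothing partial or skipped.
correct, hallucinated, partial, ignored = 99, 1, 0, 0
total = correct + hallucinated + partial + ignored  # 100 questions

rate = hallucinated / (hallucinated + partial + ignored)  # 1 / 1 = 1.0
index = (correct - hallucinated) / total                  # 98 / 100 = 0.98

print(f"hallucination rate: {rate:.0%}, omniscience index: {index:.2f}")
# hallucination rate: 100%, omniscience index: 0.98
```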
This one is more powerful than OpenAI models, including GPT-5.2 (which is worse on various benchmarks than 5.1, and that's with 5.2 using xhigh whilst the others were on high, e.g. https://youtu.be/4p73Uu_jZ10?si=x1gZopegCacznUDA&t=582 )
Alright, so we have more benchmarks, including hallucination, and Flash doesn't do well on that one, though generally it beats Gemini 3 Pro, GPT-5.1 Thinking and GPT-5.2 Thinking xhigh (but then Sonnet, Grok, Opus, Gemini and 5.1 beat 5.2 xhigh) - basically everything. Crazy.
There's actually a bit of crossover. This paper is quite old, but I know that other teams and other surgeons have visited each other plenty since it was published, and there's this cross-fertilisation of ideas.
F1 is "important" in the sense that it is competitive, so teams want to iterate and improve constantly. I think the fastest pit stop in the 2025 season was 1.91 seconds, in which: the car is jacked, four tyres are removed, four new tyres are fitted and secured, the car is dropped, the lane is checked for traffic, and then the car can move. There are thousands of permutations of how to get this right and that fast. And accuracy is important: get it wrong and there's a risk of injury at worst, or a fine for an unsafe release at best.
ICU is obviously important in a different way. You can't really "experiment". Iteration needs data. So you need to go out and learn what good looks like from different disciplines, and then carefully plan the changes you want to make and get buy-in. Get it wrong, and people die. Best case scenario you're struck off, worst case you're going to prison for murder.
In dev speak, F1 can afford to be agile; ICUs need to be waterfall.
But because F1 needs to be precise, and they perceive the dangers of imprecision so acutely from a monetary perspective (where you finish in the World Constructors' Championship directly affects the profitability and viability of the team), they want to borrow ideas too.
It sounds ridiculous that surgeons and F1 garages would have so much to talk about, but it turns out they really do feed off each other's ideas sometimes.
> ICU is obviously important in a different way. You can't really "experiment". Iteration needs data. So you need to go out and learn what good looks like from different disciplines, and then carefully plan the changes you want to make and get buy-in.
I would think test runs with simulated patients offer plenty of opportunity to experiment.
> Get it wrong, and people die. Best case scenario you're struck off, worst case you're going to prison for murder.
Get it right and people may still die. The whole reason for the improvement effort is that current practice is excessively risky. No one is getting fired, let alone going to prison, for attempting a sensible, approved improvement that reduces the odds of a child dying.
I use Grok pretty heavily, and Elon doesn't factor into it any more than Sam and Sundar do when I use GPT and Gemini. A few use cases where it really shines:
* Research and planning
* Writing complex isolated modules, particularly when the task depends on using a third-party API correctly (or even choosing an API/library at its own discretion)
* Reasoning through complicated logic, particularly in cases that benefit from its eagerness to throw a ton of inference at problems where other LLMs might give a shallower or less accurate answer without more prodding
I'll often fire off an off-the-cuff message from my phone to have Grok research some obscure topic that involves finding very specific data and crunching a bunch of numbers, or write a script for some random thing that I would previously never have bothered to spend time automating, and it'll churn for ~5 minutes on reasoning before giving me exactly what I wanted with few or no mistakes.
As far as development, I personally get a lot of mileage out of collaborating with Grok and Gemini on planning/architecture/specs and coding with GPT. (I've stopped using Claude since GPT seems interchangeable at lower cost.)
For reference, I'm only referring to the Grok chatbot right now. I've never actually tried Grok through agentic coding tooling.
I can't understand why people would trust a CEO who regularly lies about product timelines, product features, his own personal life, etc. And that's before he politicized his entire kingdom by literally becoming part of the government and one of the larger donors to the current administration.
I'm using Gemini in general, but Grok too: sometimes Gemini Thinking is too slow, while Fast can get confused a lot. Grok strikes a nice balance between being quite smart (not Gemini 3 Pro level, but close) and very fast.
I use a few AIs together to examine the same code base. I find Grok better than some of the Chinese ones I've used, but it isn't in the same league as Claude or Codex.
I dislike Musk, and I use Grok. I find it most useful for analyzing text to help check whether there's anything I've missed in my own reading. Having it built into Twitter is convenient, and it has a generous free tier.
Might switch to Flash over Haiku as my MCP research/transcriber/minor-tasks model now, though (will test, of course).