Throwaway account here. I recently spent a few months as a trainer for a major AI company's project. The well-paid gig mainly involved crafting specialized, reasoning-heavy questions that were supposed to stump the current top models. Most of the trainers had PhDs, and the company's idea was to use our questions to benchmark future AI systems.

It was a real challenge. I managed to come up with a handful of questions that tripped up the models, but it was clear they stumbled for pretty mundane reasons—outdated info or faulty string parsing due to tokenization. A common gripe among the trainers was the project's insistence on questions with clear-cut right/wrong answers. Many of us worked in fields where good research tends to be more nuanced and open to interpretation. I saw plenty of questions from other trainers that only had definitive answers if you bought into specific (and often contentious) theoretical frameworks in psychology, sociology, linguistics, history, and so on.
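
To make the tokenization point concrete, here's a rough illustration using OpenAI's open-source tiktoken tokenizer as a stand-in (the models in the project used their own tokenizers, but the idea is the same: the model sees token IDs, not characters):

    # Rough illustration of why character-level questions trip models up.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    word = "strawberry"
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]

    # The word arrives as a few multi-character chunks rather than ten
    # separate letters, so "how many r's are in strawberry?" has to be
    # answered without ever seeing the individual characters.
    print(token_ids)
    print(pieces)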

The AI company people running the project seemed a bit out of their depth, too. Their detailed guidelines for us actually contained some fundamental contradictions that they had missed. (Ironically, when I ran those guidelines by Claude, ChatGPT, and Gemini, they all spotted the issues straight away.)

After finishing the project, I came away even more impressed by how smart the current models can be.


I'm currently pursuing a PhD in theoretical condensed matter physics. I tried submitting questions to Humanity's Last Exam [1], and it was not too hard to think of a problem that none of the top LLMs (Claude, GPT, Gemini + both o1 models) got right. What surprised me was how small my bag of tricks turned out to be. I could think of 5-6 questions in my direct area of expertise with a simple numerical answer that were hard for LLMs, and maybe another 5 that they were able to solve. But that was basically the extent of my expertise. Of course there is plenty that can't be checked with a simple numerical answer (quite a lot in my case), and there are probably additional questions that would require more effort on my part to answer correctly. But all in all, I suddenly felt like a one-trick pony, and that's even though my PhD is relatively diverse.

[1] https://agi.safe.ai/submit


  > Many of us worked in fields where good research tends to be more nuanced and open to interpretation
I've had a hard time getting people to understand this. It's always felt odd tbh. It's part of what's meant by "truth doesn't exist": truth doesn't exist with infinite precision, though there are plenty of cases where there are good answers. In our modern world I think one of the big challenges is that we've advanced far enough that low-order approximations are no longer good enough. That should make sense: as we get better, we need more complex models. We need to account for more.

In many optimization problems there is no global solution to find. That isn't because our models aren't good enough; it's just how things are. And the environment is constantly changing, the targets keep moving, so the complexity will always be there. There's beauty in that, because what fun is a game once you've beaten it? With a universe like this, there's always a new level ahead of us.
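
A toy sketch of the moving-target point (made-up objective, nothing deep): even an optimizer that works perfectly at each step never gets to declare victory, because the optimum itself keeps drifting.

    import math

    # Hypothetical non-stationary objective: f_t(x) = (x - target(t))^2,
    # where the optimum drifts each step. No fixed x is ever "the" answer.
    def target(t):
        return math.sin(0.1 * t)

    x, lr = 0.0, 0.3
    for t in range(200):
        grad = 2.0 * (x - target(t))   # d/dx of (x - target(t))^2
        x -= lr * grad                 # plain gradient descent
    # x stays close to the *current* target, but the game never ends:
    print(f"x = {x:.3f}, target now = {target(199):.3f}")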


  > Many of us worked in fields where good research tends to be more nuanced and open to interpretation
>> I've had a hard time getting people to understand this.

Why, can't you just tell them "it's not a science, it's more like performance art"?


I wouldn't look for questions with yes/no answers, but for questions where the reasoning behind an answer can be judged correct or incorrect. Of course, you can't turn those into automated benchmarks, but maybe that's kinda the point.


I think that's the point: correctness can be a sliding scale. There is Newton correct, and there's Einstein correct.


It wouldn't surprise me if Newton correct is closer to correct for 'small numbers' than Einstein correct will turn out to be in the general case.
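
To put a rough number on it (a back-of-the-envelope sketch using the standard textbook formulas, nothing specific to this thread): at everyday speeds the Newtonian kinetic energy agrees with the relativistic one to roughly one part in 10^14.

    import math

    C = 299_792_458.0  # speed of light in m/s

    def ke_newton(m, v):
        """Newtonian kinetic energy: (1/2) m v^2."""
        return 0.5 * m * v**2

    def ke_relativistic(m, v):
        """Relativistic kinetic energy: (gamma - 1) m c^2.
        gamma - 1 is computed in a cancellation-free form,
        (v/c)^2 / (s * (1 + s)) with s = sqrt(1 - (v/c)^2),
        so the tiny correction survives at everyday speeds."""
        beta2 = (v / C) ** 2
        s = math.sqrt(1.0 - beta2)
        return (beta2 / (s * (1.0 + s))) * m * C**2

    m, v = 1000.0, 30.0  # a one-tonne car at ~108 km/h
    newton = ke_newton(m, v)
    einstein = ke_relativistic(m, v)
    print(f"Newton:   {newton} J")
    print(f"Einstein: {einstein} J")
    # Fractional disagreement is about (3/4)(v/c)^2, roughly 7.5e-15 here:
    print(f"relative difference: {(einstein - newton) / einstein:.2e}")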


> when I ran those guidelines by Claude, ChatGPT, and Gemini

Did you mention this to folks running the project? I would think that pasting the "detailed guidelines" from an internal project into a competitor's tool would run afoul of some confidentiality policy. At least, this sort of restriction has been a barrier to using such LLM tools in my own professional work.


> I saw plenty of questions that only had definitive answers if you bought into specific theoretical frameworks

That kind of stuff would be great to train on. As long as the answer says something like "If you abide by x framework, then y"


What were the contradictions?

