
> we uncover that RL-trained models excel at low k (e.g., pass@1) but are consistently outperformed by base models at high k (e.g., pass@256).

This is a weak argument. I think I get what they are trying to say, but let's take this to the extreme, say pass@10^10^100. Just like a group of monkeys could write Shakespeare if given enough time, a completely random model could probably outperform an RL-trained model at pass@10^10^100. Would we then say the random model can reason too?

Of course the correct reasoning trace will be in the base model's distribution, just like any other well-formed, coherent paragraph. Kind of makes me think, maybe sampling efficiency _is_ intelligence?
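To make the pass@k discussion concrete: the standard unbiased estimator (popularized by the Codex paper) computes, from n samples of which c are correct, the probability that at least one of k draws without replacement is correct. A minimal sketch, which also shows why very large k washes out per-sample accuracy:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n: total samples drawn, c: number of correct samples,
    k: samples allowed per problem. Returns the probability that
    at least one of k randomly chosen samples is correct.
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# 4 correct out of 256 samples:
print(pass_at_k(256, 4, 1))    # -> 0.015625 (low k rewards per-sample accuracy)
print(pass_at_k(256, 4, 256))  # -> 1.0 (any nonzero c saturates at high k)
```

As k grows, any model with a nonzero success rate per problem approaches pass@k = 1, which is the intuition behind the monkeys-and-Shakespeare objection above.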



If this were just the effect you mention, you wouldn't expect the base model to surpass the RL model, though. Plus, their values of k are much smaller than that.

I think it's a very interesting and meaningful study.


The authors of the paper address this argument in the Q&A section.



