Hacker News | data_maan's comments

> these are problems of some practical interest, not just performative/competitive maths.

FrontierMath did this a year ago. Where is the novelty here?

> a solution is known, but is guaranteed to not be in the training set for any AI.

Wrong, as the questions were posed to commercial AI models, and the companies can solve them.

This paper violates basic benchmarking principles.


> Wrong, as the questions were posed to commercial AI models, and the companies can solve them.

Why does this matter? As far as I can tell, because the solution is not publicly known, this only affects the time constant (i.e. the problems were known for longer than a week). It doesn't seem that I should care about that.


Because the companies have the data and can solve them -- once the questions have been provided to a company with the necessary manpower, one can no longer guarantee that the solution is not known, and not contained in the training sample.

Nothing prevents them, and they are already doing that. I work in this field and one can be sure that now, because of the notoriety this preprint got, the questions will be solved soon.

Looks like very sloppy research.

I don't think it's that serious...it's an interesting experiment that assumes people will take it in good faith. The idea is also of course to attach the transcript log and how you prompted the LLM so that anyone can attempt to reproduce if they wish.

If you want to do this rigorously, you should run it as a competition like the guys at the AI-MO Prize are doing on Kaggle.

That way you get all the necessary data.

I still think this is bro science.


If this were a competition, some people would try hard to win it. But the goal here is exploration, not exploitation. Once the answers are revealed, it's unlikely a winner will be identified, but a bunch of mathematicians who tried prompting AI with the questions might learn something from the exercise.

But everything has been explored in other datasets already.

If only a bunch of mathematicians learn something, why are so many people talking about this, why is the NY Times posting about this?

This is the attention economy at its worst.


Very serious for mathematicians - not for ML researchers.

If the paper hadn't had the AI spin, would those 10 questions still have been interesting?

It seems to me that we have here a paper that is interesting solely because of the AI spin -- while at the same time this AI spin is really poorly executed from the point of view of AI research, where this should be a blog post at most, not an arXiv preprint.


I’m confused by this comment. I’m pretty sure that someone at all the big labs is running these questions through their models and will report back as soon as the results arrive (if not sooner, assuming they can somehow verify the answers).

The fact that you find it odd that this landed on arXiv is maybe a cultural thing… mathematicians kinda reflexively throw work up there that they think should be taken seriously. I doubt that they intend to publish it in a peer reviewed journal.


Yes, but people at those labs may be running those problems because a Fields Medalist is on the paper, and it got hype.

Not because of the problems, and not because this is new methodology.

And once the labs report back, what do we know that we didn't know before? We already know, as humans, the answer to the problems, so that is not it. We already know that LLMs can solve some hard problems and fail at easy problems, so that is not it either.

So what do we really learn?


Ah. I think the issue is that research mathematicians haven’t yet hit the point where the big models are helping them on the problems they care about.

Right now I can have Claude code write a single purpose app in a couple hours complete with a nice front end, auth, db, etc. (with a little babysitting). The models solve a lot of the annoying little issues that an experienced software developer has had to solve to get out an MVP.

These problems are representative of the types of subproblems research mathematicians have to solve to get a “research result”. They are finding that LLMs aren’t that useful for mathematical research because they can’t crush these problems along the way. And I assume they put this doc together because they want that to change :)


> These problems are representative of the types of subproblems research mathematicians have to solve to get a “research result”. They are finding that LLMs aren’t that useful for mathematical research because they can’t crush these problems along the way. And I assume they put this doc together because they want that to change :)

The same holds true for IMProofBench problems. This dataset shows nothing new.


> So what do we really learn?

We will learn if the magical capabilities attributed to these tools are really true or not. Capabilities like being able to magically solve any math problem out there. This is important because AI hype is creating the narrative that these tools can solve PhD-level problems, and this will help dispel that narrative. In my book, any test that refutes and dispels false narratives makes a huge contribution.


> We will learn if the magical capabilities attributed to these tools are really true or not.

They're not. We already know that. FrontierMath. Yu Tsumura's 553rd problem. The RealMath benchmark. The list goes on. As I said many times in this thread, there is nothing novel in this benchmark.

The fact that this benchmark is so hyped shows that the community knows nothing, NOTHING, about prior work in this space, which makes me sad.


The last "unsolved Erdős problem" proof generated by LLMs that hit the news was so uninteresting that a paper published by Erdős himself had already stated the proof

aaaaaaand no one cared enough to check.

So I think the question is: are those problems interesting by themselves, or are they just uninteresting problems no one will ever care about, except that solving them would indicate LLMs are good at solving complex novel problems that do not exist in their training set?


The timed-reveal aspect is also interesting.

How is that interesting from a scientific point of view? This seems more like a social experiment dressed as science.

Science should be about reproducibility, and almost nothing here is reproducible.


> Science should be about reproducibility, and almost nothing here is reproducible.

I can see your frustration. You are looking for reproducible "benchmarks". But you have to realize several things.

1) research level problems are those that bring the "unknown" into the "known" and as such are not reproducible. That is why "creativity" has no formula. There are no prescribed processes or rules for "reproducing" creative work. If there were, then they would not be considered "research".

2) things learnt and trained are already in the realm of the "known", ie, boiler-plate, templated and reproducible.

The problems in 2) above are where LLMs excel, but they have been hyped into excelling at 1) as well. And this experiment is trying to test that hypothesis.


DeepMind’s Nobel Prize was primarily for its performance in CASP, which is pretty much exactly this. Labs solve structures of proteins, but don’t publish them until after all the computational teams predict structures.

So I’m not sure where you’re coming from claiming that this isn’t scientific.


It wasn't like this in any way.

CASP relies on a robust benchmark (not just 10 random proteins), and has clear participation criteria, objective metrics for how the eval plays out, etc.

So I stand by my claim: This isn't scientific. If CASP is Japan, a highly organized & civilized society, this is a banana republic.


Reproducibility is just one aspect of science; logic and reasoning from principles and data are the major aspect.

There are some experiments which cannot be carried out more than once.


> There are some experiments which cannot be carried out more than once

Yes, in which case a very detailed methodology is required: which hardware, runtimes, token counts etc.

This does none of that.


As mathematically interesting as the 10 questions the paper presents may be, the paper is -- sorry for the harsh language -- garbage from the point of view of benchmarking and ML research: just 10 questions, few descriptive statistics, no interesting findings other than "can LLMs solve these uncontaminated questions", and no broad set of LLMs evaluated.

The field of AI4Math has many well-executed benchmarks -- based on the related work section, it seems the authors are barely familiar with AI4Math at all.

My belief is that this paper is even being discussed solely because a Fields Medalist, Martin Hairer, is on it.


A paper that is not about benchmarking or ML research is bad from the perspective of benchmarking. Not exactly a shocker.

The authors themselves literally state: "Unlike other proposed math research benchmarks (see Section 3), our question list should not be considered a benchmark in its current form"


On the website https://1stproof.org/#about they claim: "This project represents our preliminary efforts to develop an objective and realistic methodology for assessing the capabilities of AI systems to autonomously solve research-level math questions."

Sounds to me like a benchmark in all but name. And they failed pretty terribly at achieving what they set out to do.


> And they failed pretty terribly at achieving what they set out to do.

Why the angst? If the AI can autonomously solve these problems, isn't that a huge step forward for the field?


It's not angst. It's intense frustration that they 1) are not doing the science correctly, and 2) that others (e.g. FrontierMath) already did everything they claim to be doing, so we won't learn anything new here, but somehow 1stproof gets all the credit.

Are they really trying to do science, or are they just trying to determine pragmatically whether or not current AI is useful for a research mathematician in their day to day job?

If it's the latter case (which it has to be), it seems that attention credit (via, e.g., articles in NY Times) is very unfairly distributed.

None of the people who advanced the state of benchmarking and did the hard work on much bigger benchmarks got any, but a ridiculous benchmark of 10 questions scored big.


> are not doing the science correctly

What do you mean? These are top-notch mathematicians who are genuinely trying to see how these tools can help solve cutting-edge research problems. Not toy problems like those in AIME/AMC/IMO etc., or other similar benchmarks that are easily gamed.

> that others (e.g. FrontierMath) already did everything they claim to be doing

You are kidding, right? The FrontierMath benchmark [1] is produced by a startup whose incentives are dubious, to say the least.

[1] https://siliconreckoner.substack.com/p/the-frontier-math-sca...

Unlike the AI hypesters, these are real mathematicians trying to inject some realism and really test the boundaries of these tools. I see this as a welcome and positive development which is a win-win for the ecosystem.


> What do you mean? These are top-notch mathematicians

Yes. I didn't dispute that. What I said is that they are NOT top-notch ML specialists and have made one of the worst benchmarks of 2025-2026. A benchmark like this would maybe have worked in early 2024 at the latest. The field has moved on significantly since.

And yes, many, many other benchmarks don't use toy problems -- their names are just a prompt away.

> You are kidding, right? The FrontierMath benchmark [1] is produced by a startup whose incentives are dubious, to say the least.

They did 1) open-source some of their datapoints (on a similar order of magnitude) and 2) carry out detailed evals. There is much to learn from their blog posts, much more than from the current dataset.

But fair. If you don't like them, have a look at IMProofBench. Have a look at the AIMO competition. Have a look at HardMath. It's quite a landscape of datasets already.

> Unlike the AI hypesters, these are real mathematicians trying to inject some realism and really test the boundaries of these tools

As mentioned above, realistic benchmarks that are bigger and better already exist. Unfortunately, from a benchmarking POV, these mathematicians are the hypesters, with a preprint that wouldn't even make it into the AI&Math workshops at ICML or NeurIPS.


The concept of a pre-registered eval (by analogy with a pre-registered study) will go a long way towards fixing this.

More information:

https://mathstodon.xyz/@friederrr/114881863146859839
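For concreteness, here is a minimal sketch of one way such a pre-registration could be committed to, assuming a hypothetical sealed problem file and made-up protocol fields (none of this is from the linked post): hash the sealed problems together with the analysis plan and publish the commitment, timestamped, before any model sees the questions.

    # Minimal sketch of a pre-registration commitment; file name and fields are
    # hypothetical. The point: fix the problems, models, prompting protocol, and
    # grading criteria, and publish a hash of all of it before the eval is run,
    # so nothing can be quietly changed after the results come in.
    import hashlib
    import json
    
    with open("sealed_problems.json", "rb") as f:  # hypothetical sealed problem set
        problems_digest = hashlib.sha256(f.read()).hexdigest()
    
    preregistration = {
        "problems_sha256": problems_digest,
        "models": ["model-a", "model-b"],  # systems to be evaluated
        "prompting_protocol": "one fixed prompt, no tools, 3 attempts per problem",
        "grading": "blind review of full transcripts by two mathematicians",
        "reveal_date": "2026-06-01",
    }
    
    commitment = hashlib.sha256(
        json.dumps(preregistration, sort_keys=True).encode()
    ).hexdigest()
    
    # Post the record and its commitment hash publicly (timestamped) before
    # running anything; reveal sealed_problems.json only on the reveal date.
    print(json.dumps(preregistration, indent=2))
    print("commitment:", commitment)

Only the commitment needs to be public up front; once the sealed problems and the full record are revealed on the stated date, anyone can verify that nothing was changed after the results came in.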


I always wondered why people are so trusting (gullible?) as to use their real data


If they have enough DNA and not-so-secret genealogical data, they can derive your real name anyway.


They don't even need your DNA. Just your relatives.


Germany is a broken country, and this illustrates it on a micro-level


As someone who has lived abroad, I really do not agree.


Open source" lol

It's open-weight. As usual, you don't get the dataset, training scripts, etc.


"Open source" lol

Open-weight. As usual, you don't get the dataset, training scripts, etc.


Won't happen under the current copyright regime. It is impossible to train a SOTA model without copyrighted text, so how do you propose distributing that?


BibTeX


List the titles.


But they probably don't have the rights to actually train on them, and that's why they do not publish the list. Otherwise it may be laziness, who knows.


It's not even open-weight. It's weight-available. It uses a "modified MIT license":

    Modified MIT License
    
    Copyright (c) 2025 Moonshot AI
    
    Permission is hereby granted, free of charge, to any person obtaining a copy
    of this software and associated documentation files (the “Software”), to deal
    in the Software without restriction, including without limitation the rights
    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
    copies of the Software, and to permit persons to whom the Software is
    furnished to do so, subject to the following conditions:
    
    The above copyright notice and this permission notice shall be included in all
    copies or substantial portions of the Software.
    
    THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
    SOFTWARE.
    
    Our only modification part is that, if the Software (or any derivative works
    thereof) is used for any of your commercial products or services that have
    more than 100 million monthly active users, or more than 20 million US dollars
    (or equivalent in other currencies) in monthly revenue, you shall prominently
    display "Kimi K2" on the user interface of such product or service.


This seems significantly more permissive than GPL. I think it's reasonable to consider it open-weight.


4-clause BSD is considered open source by Debian and the FSF and has a similar requirement.


So "MIT with attribution" (but only for huge commercial use cases making tons of money off the product) is not open-weight? Do you consider CC BY photos on Wikipedia to be Image Available or GPL licensed software to be code-available too?

Tangent: I don't understand the contingent that gets upset about open LLMs not shipping with their full training regimes or source data. The software a company spent hundreds of millions of dollars creating, which you are now free to use and distribute with essentially no restrictions, is open source. It has weights in it, and a bunch of related software for actually running a model with those weights. How dare they!


We really need to stop diluting the meaning of open source

