wenc
Once GPT is tuned more heavily on Lean (the proof assistant) -- the way it is on Python -- I expect its usefulness for research-level math to increase.

I work in a field related to operations research (OR), and ChatGPT 4o has ingested enough of the OR literature that it's able to spit out very useful Mixed Integer Programming (MIP) formulations for many "problem shapes". For instance, I can give it a logic problem like "I need to put n items into k buckets based on a score, but I want to fill each bucket sequentially" and it actually spits out a very usable math formulation. I usually just need to tweak it a bit. It also warns against weak formulations where the logic might fail, which is tremendously useful for avoiding pitfalls. Compare this to the old way, which was to rack my brain over a weekend to figure out a water-tight formulation of a MIP optimization problem (often not straightforward for non-intuitive problems). GPT has saved me so much time in this corner of my world.
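
To make the "problem shape" concrete: here is a minimal hand-written sketch of the kind of formulation involved, in Python with PuLP (not GPT output; the item count, capacities, scores, and objective are all invented for illustration). The sequential-fill logic is the pair of linking constraints on the y variables:

    # Toy "fill buckets sequentially" MIP sketch (PuLP bundles the CBC solver).
    # All data below is made up purely for illustration.
    import pulp

    n_items, n_buckets, capacity = 10, 4, 3
    score = [7, 3, 9, 1, 5, 8, 2, 6, 4, 10]

    prob = pulp.LpProblem("sequential_buckets", pulp.LpMaximize)

    # x[i][b] = 1 if item i goes in bucket b; y[b] = 1 if bucket b is open.
    x = pulp.LpVariable.dicts("x", (range(n_items), range(n_buckets)), cat="Binary")
    y = pulp.LpVariable.dicts("y", range(n_buckets), cat="Binary")

    # Toy objective: steer high-scoring items toward earlier buckets.
    prob += pulp.lpSum(score[i] * (n_buckets - b) * x[i][b]
                       for i in range(n_items) for b in range(n_buckets))

    # Each item lands in exactly one bucket.
    for i in range(n_items):
        prob += pulp.lpSum(x[i][b] for b in range(n_buckets)) == 1

    # Capacity, and items may only enter open buckets.
    for b in range(n_buckets):
        prob += pulp.lpSum(x[i][b] for i in range(n_items)) <= capacity * y[b]

    # The "sequential" part: bucket b+1 may open only if bucket b is open.
    for b in range(n_buckets - 1):
        prob += y[b + 1] <= y[b]

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    for b in range(n_buckets):
        print(b, [i for i in range(n_items) if pulp.value(x[i][b]) > 0.5])

The y[b+1] <= y[b] trick is exactly the kind of linking constraint that is easy to get subtly wrong by hand, and where a model warning you about weak formulations earns its keep.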

Yes, you probably wouldn't be able to use ChatGPT well for this purpose unless you understood MIP optimization in the first place -- and you do need to break down the problem into smaller chunks so GPT can reason in steps -- but for someone who can and does, the $20/month I pay for ChatGPT more than pays for itself.

Aside: a lot of people who complain on HN that (paid/good -- only Sonnet 3.5 and GPT-4o are in this category) LLMs are useless to them probably (1) do not know how to use LLMs in a way that maximizes their strengths; (2) have expectations that are too high based on the hype, expecting one-shot magic bullets; or (3) work in a domain LLMs are really not good at. But many of the low-effort comments seem to mostly fall into (1) and (2) -- cynicism rather than cautious optimism.

Many of us who have discovered how to exploit LLMs in their areas of strength -- and know how to check for their mistakes -- often find them providing significant leverage in our work.

fnordpiglet
Rewind your mind to 2019 and imagine reading a post that said

“The experience seemed roughly on par with trying to advise a mediocre, but not completely incompetent, graduate student.”

with regard to interacting with the equivalent of Alexa. That's a remarkable difference in five years.

eigenvalue
The o1 model is really remarkable. I was able to get very significant speedups to my already highly optimized Rust code in my fast vector similarity project, all verified with careful benchmarking and validation of correctness.

Not only that, it also helped me reimagine and conceptualize a new measure of statistical dependency based on Jensen-Shannon divergence that works very well. And it came up with a super fast implementation of normalized mutual information, something I originally tried to include in the library but couldn't find an approach fast enough for large vectors (say, 15,000 dimensions and up).
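
For reference, normalized mutual information itself is a standard quantity. Here is a naive histogram-based Python sketch of it -- emphatically not the fast Rust implementation being described, and the bin count and the sqrt(H(X)H(Y)) normalization are assumptions on my part:

    # Naive histogram-based normalized mutual information (illustration only;
    # the fast Rust version discussed above is something else entirely).
    import numpy as np

    def normalized_mutual_info(x, y, bins=64):
        joint, _, _ = np.histogram2d(x, y, bins=bins)
        pxy = joint / joint.sum()
        px, py = pxy.sum(axis=1), pxy.sum(axis=0)

        nz = pxy > 0  # marginals are positive wherever the joint is
        outer = np.outer(px, py)
        mi = np.sum(pxy[nz] * np.log(pxy[nz] / outer[nz]))

        hx = -np.sum(px[px > 0] * np.log(px[px > 0]))
        hy = -np.sum(py[py > 0] * np.log(py[py > 0]))
        return mi / np.sqrt(hx * hy)  # ~0 for independent, ~1 for identical

    # e.g. with x = np.random.randn(20_000):
    #   normalized_mutual_info(x, x) is ~1.0; against an independent sample, near 0.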

While it wasn't able to give perfect Rust code that compiled on the very first try, it was able to fix all the bugs in one more try after I pasted in all the compiler warnings from VS Code. In contrast, GPT-4o would usually take dozens of tries to fix all the many Rust type errors, lifetime/borrowing errors, and so on that it would inevitably introduce. And Claude 3.5 Sonnet is just plain stupid when it comes to Rust for some reason.

I really have to say, this feels like a true game changer, especially when you have really challenging tasks that you would be hard pressed to find many humans capable of helping with (at least without shelling out $500k+/year in compensation).

And it's not just the performance optimization and relatively bug-free code -- it's the creative problem solving and synthesis of huge amounts of core mathematical and algorithmic knowledge plus contemporary research results, combined with a strong ability to understand what you're trying to accomplish and make it happen.

Here is the diff to the code file showing the changes:

https://github.com/Dicklesworthstone/fast_vector_similarity/...

abstractbill
My experience with o1 has been very different. I wouldn't even say it's performing at a "good undergrad" level for me.

For example, I asked a pretty simple question here and it got completely confused:

https://moorier.com/math-chat-1.png https://moorier.com/math-chat-2.png https://moorier.com/math-chat-3.png

(Full chat should be here: https://chatgpt.com/share/66e5d2dd-0b08-8011-89c8-f6895f3217...)

bitexploder
The novelty to me is that the experience is “roughly on par with trying to advise a mediocre, but not completely incompetent, graduate student” in so many subject areas! I have found great value in using LLMs to sort things out. In areas where I am very experienced they can be really helpful with tons of small chores. As Terence pointed out in his third experiment, if you break the problem down it does solid work filling in the smaller blanks. You need the conceptual understanding. Part of this is prompting skill. If you go into an area you don't know, you have to build the prompts up: dive into something small and specific and work outward when the answer is known; start specific and focused when working from the outside in. I have used this to cut through conceptual layers of very complex topics I have zero knowledge in, and then verify my understanding via experts on YT, research papers, and other trusted sources. It is an amazing tool.

jameshart
‘Able to make the same creative mathematical leaps as Terence Tao’ seems like a pretty high bar to be setting for AI.

This is like when you're being interviewed for a programming job and the interviewer explains some problem to you that it took their team months to figure out, and then they're disappointed you can't whiteboard out the solution they came up with in 40 minutes without access to Google.

kzz102
It's interesting that humans would also benefit from "chain of thought" type reasoning. In fact, I would argue all students studying math would greatly increase their competence if they were required to recall all relevant definitions and information before using them. We don't do this in practice (including teachers and mathematicians!) because recall is effortful, and we don't like to spend more effort than necessary to solve a problem. If recall fails, then we have to look up the information, which takes even more effort. This is why, in practice, there is a tremendous incentive to just "wing it".

AI has no emotional barrier to wasted effort, which makes it a better reasoner than its innate ability would suggest.

fsndz
Completely agree with Terence Tao. This is a real advancement. I've always believed that with the right data allowing the LLM to be trained to imitate reasoning, it's possible to improve its performance. However, this is still pattern matching, and I suspect that this approach may not be very effective for creating true generalization. As a result, once o1 becomes generally available, we will likely notice the persistent hallucinations and faulty reasoning, especially when the problem is sufficiently new or complex, beyond the "reasoning programs" or "reasoning patterns" the model learned during the reinforcement learning phase. https://www.lycee.ai/blog/openai-o1-release-agi-reasoning

perihelions
I'm so excited in anticipation of my near-term return to studying math as an independent curiosity hobby. It's going to be epically fun this time around with LLMs to lean on. Coincidentally, like Terence Tao, I've also been asking complex analysis queries* of LLMs, things I was trying to understand better while working through textbooks. Their ability to interpret open-ended math questions, and quickly find distant conceptual links that are helpful and relevant, astonishes me. Fields laureate Professor Tao (naturally) looks down on the current crop of mathematics LLMs -- "not completely incompetent graduate student..." -- but at my current ability level that just means looking up.

*(I remember a specific impressive example from 6 months ago: I asked if certain definitions could be relaxed to allow complex analysis on a non-orientable manifold, like a Klein bottle, something I had spent a lot of time puzzling over, and an LLM instantly figured out it would make the Cauchy-Riemann equations globally inconsistent. (In a sense the arbitrary sign convention in CR defines an orientation on a manifold: reversing manifold orientation is the same as swapping i with -i. I understand this now solely because an LLM suggested looking at it.) Of course, I'm sure this isn't original LLM thinking -- the math is certainly written down somewhere in its training material, in some highly specific postgraduate textbook I have no knowledge of. That's not relevant to me. For me, it's absolutely impossible to answer this type of question, where I have very little idea where to start, without either an LLM or a PhD-level domain specialist. There is no other tool that can make this kind of semantic-level search accessible to me. I'm very carefully thinking about how best to make use of such an incredibly powerful but alien tool...)

afro88
The o1 model is hit and miss for me. On one hand, it has solved the NYT Connections game [0] each day I've tried it [1]. Other models, including Claude Sonnet 3.5, cannot.

But on the other hand, it misses important details and hallucinates, just like GPT-4o. And it can need a lot of hand-holding and correction to get to the right answer -- so much so that sometimes you wonder if it would have been easier to just do it yourself. Only this time it's worse, because you're waiting 20-60 seconds for an answer.

I wonder if what it excels at is just the stuff that I don't need it for. I'm not in classic STEM, I'm in software engineering, and o1 isn't so much better that it justifies the wait time (yet).

One area I haven't explored is using it to plan implementation or architectural changes. I feel like it might be better for this, but need the right problems to throw at it.

[0] https://www.nytimes.com/games/connections

[1] https://chatgpt.com/share/66e40d64-6f70-8004-9fe5-83dd3653a5...

ak_111
He mentions that he posed to o1 the same challenge he posed to a previous GPT (which he also previously blogged about), so I am wondering how much o1 benefited from potentially "seeing" this discussion in its training set (which probably contains a fairly recent snapshot of the world wide web).

gcanyon
> The experience seemed roughly on par with trying to advise a mediocre, but not completely incompetent, graduate student.

Coming from Terence Tao that seems pretty remarkable to me?

ocular-rockular
I don't understand why this is news. This could have been said by any one particular contributor from HackerNews, but because it's from Terence Tao it hits the front page? I understand that the guy is a great mathematician, but why is his input on this any more valuable than the myriad of discussions about o1 from other professionals on here?

nybsjytm
Daniel Litt, an algebraic geometer on Twitter, said "Pretty impressed by o1-preview! Still not having much luck asking it to do any interesting math but it seems much more reliable with simple things; I can actually imagine it being a net time-saver at this point with some non-mathematical tasks."

Any other takes by mathematicians out there?

nmca
Note the selection effect in “a mediocre graduate student” (one that got to work with Terry Tao).

bbor
Does anyone here think this will change without a full cognitive apparatus? Aka “agents”, to use the modern term? I have my doubts, but I’m relatively uninformed about the cutting edge of pure ML itself.

Just off the top of my head, maybe an RLHF run performed by academic experts and geared towards “creative applications” could get us farther than we are? Given how much the original RLHF run cost even with underpaid workers in developing countries, that might be exorbitantly expensive, but it's worth a dream. Perhaps as a governmental or NGO-driven open source initiative...

Of course, a core problem here is defining “creativity” in stringent -- or, in Chomsky's words, “scientific” -- terms. RLHF dodged that a bit by leaning on the intuitive capabilities of its human critics. I'm constantly opining about how LLMs solved the frame problem, but perhaps it's better characterized as a partial solution for a relatively easy/basic environment: stories about the real world. The Abstract/Academic/Scientific Frame Problem might be another breakthrough away, yet...

s1mon
I'm not a mathematician; I never got much beyond AP Calc in high school (almost 40 years ago). But I am deeply fascinated by Bézier curves and geometric continuity. I've spent a lot of time digging up research papers and references about this and the related Computer Aided Geometric Design mathematics. Mostly I skim them for the illustrations and the more geometric relations. For several years I've been trying to understand how to make sure that a Bézier curve is G3 to an adjoining curve, given the tangent direction and the first and second curvature derivatives.

I've tried a variety of ways of asking various LLMs to help solve this. Finally, with access to ChatGPT o1-preview, I was able to get a good answer. The first answer was wrong, but with a little more prompting and clarification I got the answer I wanted, relating the positions of P0, P1, P2, and P3 so that a Bézier curve can be G3. This isn't something that is unknown, because many CAD programs can do this already, but I had not been able to find the answer I was looking for in a form that was useful to me.
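
For anyone wanting to sanity-check an answer like that numerically, here is a rough NumPy sketch (my own, with made-up control points; it verifies a candidate joint rather than deriving the control-point relation). G1/G2/G3 continuity require, respectively, the unit tangent, the signed curvature, and the arc-length derivative of curvature to match across the joint:

    # Numeric check of geometric continuity where two cubic Beziers join.
    # This verifies a candidate answer; it does not derive the relation.
    import numpy as np

    def derivs(P, t):
        """First and second derivatives of a cubic Bezier at parameter t."""
        P0, P1, P2, P3 = P
        d1 = 3*(1-t)**2*(P1-P0) + 6*(1-t)*t*(P2-P1) + 3*t**2*(P3-P2)
        d2 = 6*(1-t)*(P2 - 2*P1 + P0) + 6*t*(P3 - 2*P2 + P1)
        return d1, d2

    def kappa(P, t):
        """Signed curvature of a planar cubic Bezier at t."""
        d1, d2 = derivs(P, t)
        return (d1[0]*d2[1] - d1[1]*d2[0]) / np.linalg.norm(d1)**3

    def dkappa_ds(P, t, eps=1e-6):
        """Arc-length derivative of curvature, by central difference in t."""
        d1, _ = derivs(P, t)
        return (kappa(P, t + eps) - kappa(P, t - eps)) / (2 * eps * np.linalg.norm(d1))

    def unit(v):
        return v / np.linalg.norm(v)

    # Hypothetical control polygons; B continues A at (3,1). This pair is
    # only G1, so the G2/G3 numbers printed below will disagree.
    A = np.array([[0, 0], [1, 0], [2, 1], [3, 1]], float)
    B = np.array([[3, 1], [4, 1], [5, 2], [6, 2]], float)

    print("G1 tangents:  ", unit(derivs(A, 1)[0]), unit(derivs(B, 0)[0]))
    print("G2 curvatures:", kappa(A, 1), kappa(B, 0))
    print("G3 dkappa/ds: ", dkappa_ds(A, 1), dkappa_ds(B, 0))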

I don't really know where that puts o1-preview relative to a math grad student, but after spending tons of time over a couple years on this pet project, getting an answer from a chat bot was one of the more magical moments I've had with technology in a long time.

busyant
Well, one thing is clear.

Math grad students everywhere now have a benchmark to determine if Terry Tao considers them to be mediocre or incompetent.

vavooom
Most surprising thing about this article is discovering that 'mathstodon' exists and Terence is active on it!

itissid
One thing it's certainly doing is exploring the search space better, e.g.: https://x.com/sg3487/status/1835040593703010714

If you know the contours of the answer and can describe what you are looking for, it can quickly find it for you.

kldnav
Tao and Aaronson are optimistic about LLMs. What are they telling their students? That math and science degrees will soon have the same value as a degree in medieval dance theory?

If they are overly optimistic, perhaps it would be good to hear the opinions of Wiles and Perelman.

gary_0
Tao mentions grad students; I wonder how they feel reading this?

As LLMs continue to improve I feel like anyone making a living doing the "99% perspiration" part of intellectual labor is about to enter a world of hurt.

tambourine_man
Glad to see Tao using Mastodon instead of Twitter.

afian
As a previously "mediocre, but not completely incompetent, graduate student" at a top research university (whose famous advisor was understandably frustrated with him), I consider this a huge win!

sgt101
Is there a list of discoveries or significant works/constructions made by people collaborating with LLMs? I mean as opposed to specific deep networks like AlphaFold or GraphCast?

lalaithion
It needs a bigger context, but the moment someone can feed an entire GitHub repo into this thing and ask it to fix bugs... I think O2 may be the beginning of the end.

benreesman
Reading anything Terence Tao writes is thought-provoking, and I doubt I'm seeing anything others haven't.

There’s at least a “complexity” if not a “problem” in terms of judging models that to a first approximation have been trained on “everything”.

Have people tried putting these things up against serious mathematical problems that are well studied? With or without Lean hinting, has anyone gotten, like, the Shimura-Taniyama conjecture/proof out?

maxglute
>The experience seemed roughly on par with trying to advise a mediocre, but not completely incompetent, graduate student. However, this was an improvement over previous models, whose capability was closer to an actually incompetent graduate student.

Appreciate the no fucks given categorization of grad students.

artninja1988
>could not generate conceptual ideas of their own

is the most important part, imo. A big goal should be an AI system coming up with its own discoveries and ideas. It's really unclear how we can get from the current paradigm to it coming up with something like general relativity, as Einstein did. Does it require embodiment?

ghransa
I suspect, but am not certain, that if it had all of formalized mathematics in its context window it could likely extend the edges slightly further. Would be an interesting experiment regardless.

lupire
> GPT-o1, which performs an initial reasoning step before running the LLM.

Is that an accurate description? I thought it just runs the LLM for longer, and multiple times, and truncates the beginning of the output.

reverseblade2
Here's a little test I try on LLMs. So far only o1 and Microsoft Copilot (Bing Chat) have been able to solve it:

Find a, b, c distinct positive integers satisfying a^3 + b^3 = c^4. Hint: try dividing all sides by c^3, then giving values to (a/c) and (b/c).
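
For the curious, the hint leads straight to a family of solutions: dividing both sides by c^3 gives (a/c)^3 + (b/c)^3 = c, so taking a = m*c and b = n*c forces c = m^3 + n^3. A quick Python check of that construction (m = 1 would make a equal to c, so take m, n >= 2 for distinctness):

    # Construction from the hint: with a = m*c and b = n*c,
    # a^3 + b^3 = (m^3 + n^3) * c^3 = c^4 exactly when c = m^3 + n^3.
    # m = 1 would give a == c, so start at m = 2 to keep a, b, c distinct.
    for m, n in [(2, 3), (2, 5), (3, 4)]:
        c = m**3 + n**3
        a, b = m * c, n * c
        assert a**3 + b**3 == c**4
        print(f"{a}^3 + {b}^3 = {c}^4")  # e.g. 70^3 + 105^3 = 35^4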

d0mine
"with even the latest tools the effort put in to get the model to produce useful output is still some multiple (but not an enormous multiple now, say 2x to 5x) of the effort needed to properly prompt and verify the output. However, I see no reason to prevent this ratio from falling below 1x in a few years, which I think could be a tipping point for broader adoption of these tools in my field"

Given the log scale of compute required to improve performance, it is not guaranteed that the ratio can be improved that much in a few years.

nyc111
I checked the links and I think it's amazing, and it answers with LaTeX-formatted notation.

But I was curious, so I asked about something very simple, Euclid's first postulate, and I got this answer:

Euclid's Postulate 1: "Through any two points, there is exactly one straight line."

In fact Euclid's Postulate 1 is "To draw a straight line from any point to any point." http://aleph0.clarku.edu/~djoyce/java/elements/bookI/bookI.h...

I think the AI's answer is not correct; it may be some textbook interpretation, but I was expecting Euclid's exact wording.

Edit: Google's Gemini gives the exact wording of the postulate and then comments that this means you can draw one line between two points. I think this is better.

fizx
I'm curious how well an o1-like model would think, given minutes instead of seconds and the temperature set relatively high.

alexnewman
I tried giving it questions like whether 9.11 > 9.9, and it got basic stuff like that wrong more often than right.

2muchcoffeeman
What a burn

“The experience seemed roughly on par with trying to advise a mediocre, but not completely incompetent, graduate student.”

giardini
Anyone tried using GPT in conjunction with Doug Lenat's tools (AM or Eurisko)?

ninetyninenine
A specialized LLM could possibly meet his criteria already.

ein0p
Idk, I think the fact that it needs “hints” and “prodding” is a good thing, myself. Otherwise we wouldn't need humans to get those answers, would we? I want it to augment humans, not replace them.

lupire
The GPT Share links are 404 for me.

oglop
Ok lol. I mean, just feed it Serge Lang problems and see what it does, is all he's saying.

It performs way better than undergrads. Funny he didn't point that out, but only made some slight about it being a bad graduate student. Don't believe me? Open the book and ask away. It's amazing, even if it is a “mediocre graduate student” -- which is far better than a good graduate student or professor who gives you no help or time for all that money you forked over.

It's already worth the money; ignore this shitty write-up by someone who doesn't need its help.

jarbus
I wonder how long it took to produce each of the responses it gave.

darby_nine
JUST when you thought the chatbot was dead

nektro
lol

MrFots
Incompetent Graduate Students is the name of my new sketch group.

idunnoman1222
How did this guy not know how large language models work? It's a fancy compression algorithm for all written knowledge; how could it invent that which was not an input?