OpenAI Chief Research Officer Mark Chen: GPT-4.5 Is Now Live and Scaling Isn’t Dead
GPT-4.5, OpenAI’s latest large language model, debuts today with more heft, EQ, and performance than its predecessors.
OpenAI announced the debut of GPT-4.5 on Thursday. The new model is its largest yet, and a step-change improvement over its predecessor GPT-4, according to the company. It will go live for Pro users on Thursday, followed by Plus, Enterprise, Team, and Edu users next week.
For OpenAI, GPT-4.5 is an answer to those questioning whether foundational research labs have hit a wall scaling up these models.
“GPT-4.5 really is proof that we can continue the scaling paradigm,” Mark Chen, OpenAI’s chief research officer, told me in a conversation just hours before the release. “This is the point that lies at that next order of magnitude.”
In our conversation, taped for an episode of Big Technology Podcast today, Chen spoke about what the new model says about the AI scaling wall, how scaling traditional GPT models compares to building reasoning models, how important EQ is for AI models today, whether product or models matter more, and how OpenAI's talent bench looks after last year's departures.
You can read the Q&A below, edited for length and clarity, and listen to the full episode on Apple Podcasts, Spotify, or your podcast app of choice.
Alex Kantrowitz: The question most of us are asking today is why isn't this GPT-5? What is it going to take to get to GPT-5?
Mark Chen: Whenever we make these naming decisions, we try to keep with a sense of what the trends are. So when it comes to predictable scaling — like going from 3 to 3.5 — you can predict what an order of magnitude more training compute, plus efficiency improvements, will buy you. We find this model aligns with what 4.5 would be, and so we want to name it what it is.
But it seems like the expectations for GPT-5 are built up pretty high. Do you think it's going to be hard to meet those expectations whenever that GPT-5 model does come out?
I don't think so. And one of the fundamental reasons is because we now have two different axes on which we can scale. So GPT-4.5 is our latest scaling experiment along the axis of unsupervised learning. But there's also reasoning. And when you ask why there seems to be a somewhat bigger gap in release time between 4 and 4.5, it's because we've been largely focused on developing the reasoning paradigm as well.
Our research program is really an exploratory research program. We're looking into all avenues of how we can scale our models. And over the last one and a half, two years, we've really found a new, very exciting paradigm through reasoning, which we're also scaling. GPT-5 really could be the combination of a lot of these things coming together.
This is OpenAI’s largest model. Did you find the so-called “scaling wall,” and are you already seeing diminishing returns from scaling?
I have a different framing around scaling. When it comes to unsupervised learning, you want to put in more ingredients, like compute, algorithmic efficiencies, and more data. And GPT-4.5 really is proof that we can continue the scaling paradigm. And this paradigm is not the antithesis of reasoning.
You need knowledge to build reasoning on top of. A model can't go in blind and just learn reasoning from scratch. So we find these two paradigms to be fairly complementary, and we think they have feedback loops on each other.
GPT-4.5 is smart in different ways from the ways that reasoning models are smart. When you look at the model today, it has a lot more world knowledge. When we look at comparisons against GPT-4o, you see that for everyday use cases people prefer [4.5] by a margin of 60%. For productivity and knowledge work against GPT-4o, there's almost a 70% preference rate. So people are really responding to this model, and it's this knowledge that we can leverage for our reasoning models in the future.
What are some examples where you would use GPT-4.5 and prefer it over a reasoning model?
It's a different profile from a reasoning model. A larger model takes more time to process the query, but it's still giving you an immediate response back. So this is very similar to what a GPT-4 would have done for you.
Whereas with something like o1, you give a query and the model can think for several minutes. And I think these are fundamentally different trade-offs. You have a model that comes back to you immediately and doesn't do much thinking, versus a model that thinks for a while and then comes up with a better answer. But we find that there are areas, like creative writing, where this model outshines reasoning models.
This is the largest model that OpenAI has ever released. At this size, does adding similar amounts of compute and similar amounts of data get you the same returns that it used to? Or are we already starting to see the returns tail off?
No, we are seeing the same returns. And I do want to stress that GPT-4.5 is that next point on this unsupervised learning paradigm. And we're very rigorous about how we do this. We make projections, based on all the models we've trained before, of what performance to expect, and in this case, we put together the scaling machinery, and this is the point that lies at that next order of magnitude.
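Chen's description of projecting expected performance from earlier models is the classic scaling-law picture: loss tends to fall as a power law in training compute, which is a straight line in log-log space, so you can fit smaller runs and extrapolate to the next order of magnitude. A minimal, purely illustrative sketch — the data points and the simple power-law form here are assumptions for demonstration, not OpenAI's actual numbers or methodology:

```python
import math

# Hypothetical (compute in FLOPs, loss) pairs from smaller training runs.
# A power law loss = A * compute**(-alpha) is linear in log-log space.
runs = [(1e19, 3.10), (1e20, 2.60), (1e21, 2.18)]

xs = [math.log10(c) for c, _ in runs]
ys = [math.log10(loss) for _, loss in runs]

# Ordinary least squares for the slope and intercept in log-log space.
n = len(runs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
    (x - mx) ** 2 for x in xs
)
intercept = my - slope * mx

def predict_loss(compute: float) -> float:
    """Extrapolate the fitted power law to a new compute budget."""
    return 10 ** (intercept + slope * math.log10(compute))

# Project one more order of magnitude of compute.
print(round(predict_loss(1e22), 2))
```

If the fitted point at 10x compute lands on the projected line, the run "is that next point" on the curve; a large shortfall would be the diminishing returns the question asks about.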
So, what has it been like getting here? There have been some reports that you had to start and stop a couple of times to get this to work. Talk a little bit about the process, and maybe you can confirm or deny some of the things that we've heard.
It's interesting that this is attributed to this model, because all of our foundation models are experiments. Training a foundation model often does involve stopping at certain points, analyzing what's going on, and then restarting the run.
This isn't a characteristic of GPT-4.5; it's something that we've done with GPT-4 and with the o-series models. We want to go in, diagnose them in the middle, and if we want to make some interventions, we should make interventions. But I wouldn't characterize this as something that we do for GPT-4.5 that we don't do for other models.
With DeepSeek, we saw big optimizations in running models. What is OpenAI doing on that front? Were you able to run these large models more efficiently? And if so, how?
I often see the process of making a model efficient to serve as fairly decoupled from developing the core capability of the model. And we see a lot of work being done on the inference stack. That's something that DeepSeek did very well, and it's also something that we push on a lot. We care about serving these models at low cost to all users, and we push on that quite a bit. Whether it's GPT-4 or the reasoning models, we're always applying that pressure to be able to inference more cheaply, and we've done a good job of that over time. The costs have dropped many orders of magnitude since we first launched GPT-4.
I’m talking with you right now about an extremely large model, but a theme in the Big Technology Discord is that small and niche models will potentially be the future. So what do we get with the big models versus the niche models? And do you see them in competition or as complements?