Intelligence with Everyone: RL @ MiniMax, with Olive Song, from AIE NYC & Inference by Turing Post

Olive Song of MiniMax explains how the M-series open-weight models are trained with reinforcement learning, feedback loops, and environment perturbations, covering long-horizon agents, reward hacking, FP32 training, and debugging real-world LLM failures.

Show Notes

Olive Song from MiniMax shares how her team trains the M series frontier open-weight models using reinforcement learning, tight product feedback loops, and systematic environment perturbations. This crossover episode weaves together her AI Engineer Conference talk and an in-depth interview from the Inference podcast. Listeners will learn about interleaved thinking for long-horizon agentic tasks, fighting reward hacking, and why they moved RL training to FP32 precision. Olive also offers a candid look at debugging real-world LLM failures and how MiniMax uses AI agents to track the fast-moving AI landscape.

Use the Granola Recipe Nathan relies on to identify blind spots across conversations, AI research, and decisions: https://bit.ly/granolablindspot

LINKS:

Conference Talk (AI Engineer, Dec 2025) – https://www.youtube.com/watch?v=lY1iFbDPRlw
Interview (Turing Post, Jan 2026) – https://www.youtube.com/watch?v=GkUMqWeHn40

Sponsors:

Claude:

Claude is the AI collaborator that understands your entire workflow, from drafting and research to coding and complex problem-solving. Start tackling bigger problems with Claude and unlock Claude Pro’s full capabilities at https://claude.ai/tcr

Tasklet:

Tasklet is an AI agent that automates your work 24/7; just describe what you want in plain English and it gets the job done. Try it for free and use code COGREV for 50% off your first month at https://tasklet.ai

CHAPTERS:

(00:00) About the Episode

(04:15) MiniMax M2 presentation (Part 1)

(17:59) Sponsors: Claude | Tasklet

(21:22) MiniMax M2 presentation (Part 2)

(21:26) Research life and culture

(26:27) Alignment, safety and feedback

(32:01) Long-horizon coding agents

(35:57) Open models and evaluation

(43:29) M2.2 and researcher goals

(48:16) Continual learning and AGI

(52:58) Closing musical summary

(55:49) Outro

PRODUCED BY:

https://aipodcast.ing

SOCIAL LINKS:

Website: https://www.cognitiverevolution.ai

Twitter (Podcast): https://x.com/cogrev_podcast

Twitter (Nathan): https://x.com/labenz

LinkedIn: https://linkedin.com/in/nathanlabenz/

Youtube: https://youtube.com/@CognitiveRevolutionPodcast

Apple: https://podcasts.apple.com/de/podcast/the-cognitive-revolution-ai-builders-researchers-and/id1669813431

Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk


Full Transcript

(00:00) Nathan Labenz:

Hello, and welcome back to the Cognitive Revolution. The presenting sponsor of today's episode is Granola, the AI notepad that helps you get the doing done. Whether it's identifying to-do items after a call, turning a brainstorming session into a product spec, or looking back at multiple calls to identify cultural trends at your company, Granola takes your raw meeting notes and makes them awesome.

Right now, Granola is featuring AI recipes from AI thought leaders, including several past guests of this show. My own contribution is a Blind Spot Finder recipe that looks back at recent conversations and attempts to identify things that I am totally missing. This was immediately useful in the context of contingency planning for my son's cancer treatment. And the more data Granola collects as I continue to use it, the more valuable it becomes for suggesting AI topic areas that I really ought to explore. See the link in our show notes to try my Blind Spot Finder recipe and experience for yourself how Granola puts your meetings to work.

Now today, I'm excited to share a special combined crossover episode featuring Olive Song, a Senior Researcher specializing in reinforcement learning and model evaluation at the Chinese AI company MiniMax, creators of the M series of models, the most recent of which, M2.5, currently tops the OpenRouter usage leaderboard. To give you the most complete picture possible, we're combining two sources: first, a presentation Olive recently gave at the AI Engineer Conference in New York, where she had previously lived for six years, and second, an interview with Ksenia Se from her podcast, Inference by Turing Post.

Together, they provide an excellent overview of MiniMax's goals as a company, the capabilities they're prioritizing in their models, the techniques they're using to get there, and the day-to-day ups and downs of training frontier LLMs. Highlights include how MiniMax's strategy of building both models and user-facing applications in-house creates tight feedback loops that enable their cross-functional research and engineering teams to identify and address model weaknesses as quickly as possible. An overview of how interleaved thinking — which allows the model to take an action, get feedback from the environment, and pause to think again before continuing — improves performance on long-horizon agentic tasks. A description of the perturbation pipeline they use to systematically vary the model's training environment in order to encourage robust generalization. Olive's perspective on the constant battle she and teammates are fighting against reward hacking. A window into the tedious debugging that is sometimes required to diagnose training issues, and how they realized they needed to run reinforcement learning at FP32 precision. And finally, how the team at MiniMax is using AI agents to keep up with the daily flood of AI news.

While Olive recognizes that MiniMax's models, like all open source models in the world today, can't quite match the performance of top American models, I think there is still a lot of value in the details she shares about their approach to reinforcement learning and how they structure their team and work. And in any case, I always appreciate the opportunity to hear directly from Chinese AI researchers who, just like their American counterparts, are figuring things out step by step as they go, even as major questions about issues such as the governance of increasingly powerful open source models remain fundamentally unanswered.

With that, I want to thank Swyx, the creator of the AI Engineer event series, which I absolutely recommend attending if you can, and Ksenia, the creator of Turing Post, which has what I find to be some of the very best topic selection of any AI newsletter, for allowing me to create and post this combined episode. And I hope you enjoy this window into the development of some of the best open-weight models in the world with Olive Song of MiniMax.

(04:16) Olive Song:

Hi. Hi, everyone. I'm Olive. It's my great honor here today to present on our new model, MiniMax M2. I actually lived in New York City for six years, so it feels great to come back, but with a different role. I currently study reinforcement learning and model evaluation at MiniMax. Let me just get a quick sense of the room. Who here has heard of or tried MiniMax before? Oh, a couple. Yeah, not everybody, but I guess that's the value of me standing here today.

So we are a global company that works on both foundation models and applications. We develop multi-modality models, including text, vision language models, our video generation model, Hailuo, and speech generation, music generation, and more. We also have many applications, including agents and other tools, in-house. That's the specific thing that's different from other labs and companies — we both develop foundation models and applications. So we have researchers and developers sitting side by side working on things. Our difference is that we bring first-hand experience from our in-house developers into developing models that developers in the community really need.

And here I want to introduce MiniMax M2, which is an open-weight model — very small, with only 10 billion active parameters — that was designed specifically for coding and workplace agentic tasks. It's very cost-efficient. Let me just go over the benchmark performance because people care about it. We rank very high in both intelligence benchmarks and also agentic benchmarks. I think we're at the top of open source models. But numbers don't tell everything, because sometimes you get those super high-number models, you plug them into your environment, and they just don't perform. So we really care about the dynamics in the community. In our first week, we had the most downloads, and we climbed up to top 3 token usage on OpenRouter. We're very glad that people in the community are really loving our model in their development cycle.

So today, what I want to share is how we actually shaped these main model characteristics that made M2 so good for your coding experience. I'm going to present the training behind it that supports each one of them — from coding experience to long-horizon state tracking tasks, to robust generalization to different scaffolds, to multi-agent scalability.

So first, let's talk about code experience, which we supported with scaled environments and scaled experts. Developers need a model that can actually work in the languages they use and across the workflows they deal with every day. That means we need to utilize real data from the internet and scale the number of environments so that during training — for example, during reinforcement learning — the model can actually react to the environment, target verifiable coding goals, and learn from them. That's why we scaled both the number of environments and our infrastructure so that we can perform that training very efficiently. With data construction and reinforcement learning, we were able to train a model that is very strong and full-stack multilingual.

And what I want to mention here is that besides scaling environments, which everybody talks about, we actually scaled something called expert developers as reward models. As I mentioned, we have a ton of super expert developers in-house who can give us feedback on our model's performance. They participated closely in the model development and training cycle, including problem definition, bug fixing, repo refactoring, and more. They identify the model behaviors that developers enjoy, identify what's reliable and what developers would trust, and they give precise reward and evaluation to the model's behaviors and final deliverables — so that it is a model that developers really want to work with and that adds efficiency to their workflow. With that, we were able to lead in many languages in real use.

The second characteristic that MiniMax M2 has is it performs well in those long-horizon tasks — tasks that require interacting with complex environments, using multiple tools with reasoning. We supported that with the interleaved thinking pattern and reinforcement learning. So what is interleaved thinking? With a normal reasoning model that can use tools, it normally works like this: you have the tool information given to it, you have the system prompts, you have user prompts, and then the model thinks, calls tools — it can call a couple of tools at the same time — gets the tool response from the environment, performs a final round of thinking, and delivers final content.

But here's the truth. In the real world, environments are often noisy and dynamic. You can't just do this once. You can get tool errors. You can get unexpected results from the environment. So what we did is imagine how humans interact with the world. We look at something, we get feedback, we think about whether the feedback is good or not, and then we make other decisions. We did the same thing with our M2 model. Instead of just stopping after one round of tool calling, it actually thinks again and reacts to the environment to see if the information is enough to get what it needs. We call it interleaved thinking — people call it that because it interleaves thinking with tool calling. It can be tens to 100 turns of tool calling within just one user interaction turn.

This helps with adaptation to environment noise. The environment isn't stable all the time — something is suboptimal, and then the model can choose to use other tools or make other decisions. It can handle long-horizon tasks and automate your workflow using, for example, Gmail, Notion, and Terminal all at the same time. You just need to make one model call with minimal human intervention, and it can do it all by itself.
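
A minimal sketch of what such a loop can look like, assuming a generic chat-completion client and a local tool registry; the names and interfaces here are illustrative, not MiniMax's actual API:

```python
import json

# Illustrative only: a generic chat client and tool registry, not MiniMax's API.
def run_interleaved_agent(client, model, system_prompt, user_prompt, tools, tool_registry, max_turns=100):
    """One user turn, many tool turns: think, call tools, read results, think again."""
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    for _ in range(max_turns):
        response = client.chat(model=model, messages=messages, tools=tools)
        messages.append(response.message)  # keep the thinking and tool calls in context

        tool_calls = response.message.get("tool_calls") or []
        if not tool_calls:
            return response.message["content"]  # model decided it has enough information

        # Execute every requested call and feed the (possibly noisy) result back,
        # so the next round of thinking can react to what actually happened.
        for call in tool_calls:
            fn = tool_registry[call["function"]["name"]]
            try:
                result = fn(**json.loads(call["function"]["arguments"]))
            except Exception as exc:  # tool errors are part of the environment signal
                result = f"tool error: {exc}"
            messages.append({"role": "tool", "tool_call_id": call["id"], "content": str(result)})

    return "stopped after reaching the turn limit"
```

The important property is that the same context accumulates thinking, tool calls, and tool results across tens of turns, so each new thinking step can react to what the environment actually returned.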

And here's a cool illustration, because we're in New York City — I feel the vibe of trading and markets. You can see there were some perturbations in the stock market last week, and our model was able to stay stable. Just like I said, there's environment noise, there's new information, there's news, there are changes like new trading policies and such, but our model was able to perform pretty stably in these kinds of environments.

The third characteristic is our robust generalization to many agent scaffolds, which was supported by our perturbations in the data pipeline. We want our agent to generalize, but what is agent generalization? At first, we thought it was just tool scaling — train the model with enough tools, various tools, new tools we invent, and then it would just perform well on unseen tools. Well, that was kind of true at first. But then we soon realized that if we perturb the environment even a little — for example, change the agent scaffold — it doesn't generalize.

So what is agent generalization? We concluded that it's adaptation to perturbations across the model's entire operational space. That operational space includes tool information, system prompts, user prompts, chat templates, the environment, and tool responses — all of these can be different. So we designed and maintained perturbation pipelines for our data so that our model can generalize across a lot of agent scaffolds.
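
As a rough illustration of that idea, a perturbation pass over one training sample might vary each part of the operational space while keeping the verifiable goal fixed. This toy sketch is only an assumption about the general shape of such a pipeline, not MiniMax's implementation:

```python
import random

# Illustrative only: a toy perturbation pass over one training sample.
SCAFFOLD_PROMPTS = [
    "You are a coding agent. Use the provided tools to finish the task.",
    "You are an autonomous software engineer working inside a terminal.",
]

def perturb_sample(sample: dict, rng: random.Random) -> dict:
    """Vary the operational space (tool info, system prompt, user prompt, environment)
    while the task's verifiable goal stays the same."""
    s = dict(sample)

    # Tool information: shuffle tool order and lightly rewrite descriptions.
    tools = [dict(t) for t in s["tools"]]
    rng.shuffle(tools)
    for t in tools:
        t["description"] = rng.choice([t["description"], t["description"].lower()])
    s["tools"] = tools

    # System prompt: swap in a different agent-scaffold preamble.
    s["system_prompt"] = rng.choice(SCAFFOLD_PROMPTS)

    # User prompt: pick one of several paraphrases of the same goal.
    s["user_prompt"] = rng.choice(s["user_prompt_variants"])

    # Environment / tool responses: occasionally inject a fault the model must recover from.
    s["inject_tool_timeout"] = rng.random() < 0.2
    return s

example = {
    "tools": [{"name": "bash", "description": "Run a shell command."}],
    "user_prompt_variants": ["Fix the failing unit test.", "Make the test suite pass."],
}
print(perturb_sample(example, random.Random(0)))
```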

The fourth characteristic I want to mention is multi-agent scalability, which is very possible with M2 because it's very small and cost-effective. I have a couple of videos here. This is our own MiniMax agent app, powered by M2 — we actually have a QR code here, so if you want, you can scan and try it. You can see different copies of M2 working: it can do research, write up and analyze the results, put it in a report, put it in some kind of front-end illustration, and they can all work in parallel. Because it is so small and cost-effective, it can really support those long-running agentic tasks that require some kind of parallelism.

So what's next for MiniMax M2? From what I've introduced, we gathered environments, algorithms, data, expert values, model architecture, inference, evaluation — all of these to build a model that was fast, intelligent, could use tools, and generalizes. For M2.1 and M3 further down the road, we think about better coding, memory or context management, proactive AI for the workplace, and vertical experts. And because we have great audio generation and video generation models, maybe we can integrate them. But our mission is to bring all of these resources and values together to develop models for the community to use. We really need feedback from the community if possible, because we want to build this together. This is kind of a race that everyone needs to participate in, and we are committed to share it with the community. That's all the insights for today. We really hope you try the model — it's pretty good. And you can contact us and scan the QR code to try it. Basically, that's it. Thank you all for listening.

(17:35) Olive Song:

During reinforcement learning, the model tries its best to hack a lot of things. The current open models can achieve that level of understanding. It is a solvable problem, and we are working on it. Engineering is very, very, very important. I didn't know that during school.

(17:54) Nathan Labenz:

We'll continue our interview in a moment after a word from our sponsors. One of the best pieces of advice I can give to anyone who wants to stay on top of AI capabilities is to develop your own personal private benchmarks — challenging but familiar tasks that allow you to quickly evaluate new models. For me, drafting the intro essays for this podcast has long been such a test. I give models a PDF containing 50 intro essays that I previously wrote, plus a transcript of the current episode and a simple prompt. And wouldn't you know it, Claude has held the number one spot on my personal leaderboard for 99% of the days over the last couple of years, saving me countless hours.

But as you've probably heard, Claude is the AI for minds that don't stop at good enough. It's the collaborator that actually understands your entire workflow and thinks with you. Whether you're debugging code at midnight or strategizing your next business move, Claude extends your thinking to tackle the problems that matter. And with Claude Code, I'm now taking writing support to a whole new level. Claude has coded up its own tools to export, store, and index the last five years of my digital history from the podcast and from sources including Gmail, Slack, and iMessage. And the result is that I can now ask Claude to draft just about anything for me. For the recent live show, I gave it 20 names of possible guests and asked it to conduct research and write outlines of questions. Based on those, I asked it to draft a dozen personalized email invitations. And to promote the show, I asked it to draft a thread in my style featuring prominent tweets from the six guests that booked a slot.

I do rewrite Claude's drafts — not because they're bad, but because it's important to me to be able to fully stand behind everything I publish. But still, this process, which took just a couple of prompts once I had the initial setup complete, easily saved me a full day's worth of tedious information-gathering work and allowed me to focus on understanding our guests' recent contributions and preparing for a meaningful conversation. Truly amazing stuff. Are you ready to tackle bigger problems? Get started with Claude today at claude.ai/tcr. That's claude.ai/tcr. And check out Claude Pro, which includes access to all of the features mentioned in today's episode. Once more, that's claude.ai/tcr.

The worst thing about automation is how often it breaks. You build a structured workflow, carefully map every field from step to step, and it works in testing. But when real data hits or something unexpected happens, the whole thing fails. What started as a time saver is now a fire you have to put out. Tasklet is different. It's an AI agent that runs 24/7. Just describe what you want in plain English — send a daily briefing, triage support emails, or update your CRM — and whatever it is, Tasklet figures out how to make it happen. Tasklet connects to more than 3,000 business tools out of the box, plus any API or MCP server. It can even use a computer to handle anything that can't be done programmatically. Unlike ChatGPT, Tasklet actually does the work for you. And unlike traditional automation software, it just works — no flowcharts, no tedious setup, no knowledge silos where only one person understands how it works. Listen to my full interview with Tasklet founder and CEO Andrew Lee. Try Tasklet for free at tasklet.ai, and use code COGREV to get 50% off your first month of any paid plan. That's code COGREV at tasklet.ai.

(21:26) Ksenia Se:

Hello, everyone. Today I have the pleasure of talking to Olive Song, senior researcher at MiniMax. Recently they've been launching very interesting open-weight models specialized in different areas. Olive is currently working at MiniMax on the new version, MiniMax 2.2. Thank you for taking the time at 9 PM on Sunday night. Does everyone work like this at the company? I'm really impressed.

(21:53) Olive Song:

I think different people work on different schedules. We do have people who work even overnight, but they sleep during the day. We have a very flexible schedule — it goes with your experiments. For example, if the experiments run all day, the person can take a break. And if there's a lot of analysis to do, maybe because we're very curious about the results and very passionate, we can't really wait very long. So yeah, everyone has their own schedule.

(22:20) Ksenia Se:

That's telling about the success of the models. I think that influenced it. You specialize in reinforcement learning and model evaluation, as far as I understand, which are two of the least forgiving parts of model development. And you also have more constraints than big American AI labs. What does a good day look like for you, and what does a bad one look like?

(22:41) Olive Song:

I can share something about our recent weeks. There's not really a whole good day or a whole bad day. We were joking that during one day, we have good results in the morning and then sometimes bad results at night. We call it — we have ICU in the morning and then KTV at night. Typically, a good time would be receiving some good results, or even running into a new problem is a good time. For example, during reinforcement learning, we can see the model doing a lot of different things to achieve the results, and sometimes we just discover new model behaviors. That's really exciting, even if it might not be safe or expected. I call that a good time.

A bad time would be — well, it really isn't a bad time, except for the moment of finding out the bad results. That moment itself is bad. But then trying to figure out the problem and break it down — that's a pretty good time.

(23:41) Ksenia Se:

What were the recent model behaviors that you didn't expect?

(23:44) Olive Song:

During reinforcement learning, the model tries its best to hack a lot of things. For example, it uses bash a lot, and sometimes the behaviors aren't very safe, as our expert developers point out. Sometimes the expert developers have their own expectations of how the model works, but it doesn't go that way if we don't constrain it. So we do a lot of alignment to solve that issue.

(24:09) Ksenia Se:

You just launched MiniMax Her, and that went all over Twitter. How do you come up with those ideas? Because role-playing is sort of — is it an alignment question? Is it not? How do you do that?

(24:23) Olive Song:

Frankly speaking, I'm not the expert on that part. We have a whole team on role-playing and the Her stuff. I'm not an expert, but we do have a lot of discussions. We do believe that role-playing — accompanying humans, enabling human interactions — is very important in life with AI, and in how it would change our social life in the future. And it absolutely represents a capability that's very superior, because it's human-like. The model has emotions, it understands your emotions. It's not just working out exam problems. That's absolutely another side of AI capability.

(25:00) Ksenia Se:

What is it called — "Intelligence with Everyone," right? Is that the MiniMax tagline?

(25:04) Olive Song:

Yeah, intelligence with everyone.

(25:06) Ksenia Se:

Intelligence with everyone. What does it mean for you?

(25:09) Olive Song:

For me personally, I feel like it's more about how it changes my life and enables me to do more work, and how it can connect me better to different people. Before, I wouldn't be able to understand a lot of very professional problems — for example, very professional coding problems or optimization problems. And now I can do that with AI, so I can communicate with more people and exchange more ideas. That's one side. On the other side, it generally helps my daily life — my work, my daily routine, my self-care. It changes life for me, and I hope it changes life for everybody, obviously in a good way.

(25:48) Ksenia Se:

Can you tell me a little bit about how day-to-day work is organized in your lab? I remember from your talk at AI Engineer that it's very interconnected between developers and researchers. I'd love to hear more about that.

(25:59) Olive Song:

Absolutely. We sit together every day and share our experiment results. For example, during reinforcement learning experiments, we see some scores going up high. We look at the model's behaviors, and we look at them together with the developers in that area. We sit together, and then they'll spot the issue right away. And then we're able to come up with new ideas to fix it or build more data around it.

(26:27) Ksenia Se:

If we can go into details — like your current work on the current model, the current version — what are the biggest problems you're trying to solve compared to the previous version?

(26:38) Olive Song:

One important thing we're focused on right now and also in the future is human alignment, because we are focusing on coding models for 2.1, 2.2, and the M2 series. And what we realize is that for the model to become very productive in our daily work — productive and safe at the same time — we have to do a lot of alignment on it. The model can't just grow on its own and do some dangerous behaviors just to achieve the final goal. So for us, the important thing is how we define human alignment, how we define expert expectations, and how we actually train the model to be more aligned with those expectations.

(27:20) Ksenia Se:

So I want to go into some real details here, and you're the expert, so correct me if I'm wrong. But I saw there was recent interest in details like keeping the LM head in FP32 during reinforcement learning training. Why do small decisions like this end up mattering more than just a clever new algorithm?

(27:41) Olive Song:

It all comes down to getting closer to the theoretical algorithm. We have the theoretical reinforcement learning algorithm, but when we implement it, it can be a little bit off. That creates a small gap to the theoretical extreme of the algorithm. So our approach is to try to scale to the theoretical extreme. The precision part is one thing we found that prevents us from being close to that extreme, and that's how we solved it.

That was a very funny story, actually, when we discovered it. I talked about it when we published MiniMax M1. During our experiments, we found that the accuracy wasn't going up. We looked layer by layer — we looked at the log probabilities layer by layer — and found it. Theoretically, it had to work. And then there had to be some gap between the theoretical expectation and how we approached it. So we thought about the gap, analyzed it layer by layer, and eventually found it.
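
A minimal sketch of the two pieces discussed here: keeping the final projection in FP32, and checking how far the trainer's log-probabilities drift from the inference engine's for the same tokens. The module and function names are illustrative, not MiniMax's training code:

```python
import torch

# Illustrative only: names are made up for the sketch, not MiniMax's training code.
class FP32Head(torch.nn.Module):
    """Final projection kept in FP32 while the rest of the network can stay in BF16."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.lm_head = torch.nn.Linear(hidden_size, vocab_size, bias=False, dtype=torch.float32)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Upcast before the projection so the policy log-probs used in the RL
        # objective stay closer to the theoretical algorithm.
        logits = self.lm_head(hidden_states.float())
        return torch.log_softmax(logits, dim=-1)


def logprob_gap(trainer_logprobs: torch.Tensor, rollout_logprobs: torch.Tensor) -> float:
    """How far apart are the trainer's and the inference engine's log-probs for the
    same sampled tokens? Large gaps point to precision or implementation mismatches."""
    return (trainer_logprobs - rollout_logprobs).abs().max().item()
```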

(28:46) Ksenia Se: Is there anything like this happening now?

(28:48) Olive Song: Definitely, yes. Every single day, and in every different group. I can't actually disclose something we haven't reached a concrete conclusion on, because we want anything we say publicly to be very concrete and deeply understood. So if we have breakthroughs, we'll definitely publish them later. But I will say we encounter these problems every day, and we think from first principles — from the very fundamental part of the problem — and then approach it from there.

(29:18) Ksenia Se: The models that you launch are open weights. From your perspective and from the alignment perspective, what do builders actually gain from open weights, and what responsibility do they have to take on that you don't have to take responsibility for?

(29:33) Olive Song: I'm actually not an expert in building things with models. I feel like because it's open weight, people can have freer use with it. For example, they can deploy it themselves, or they can even fine-tune it and keep all the data on their own infrastructure — which is very safe from a privacy standpoint.

(29:54) Ksenia Se: But if we talk about alignment, how do you look at that from that perspective when the model is out there in the wild? Before you launch the model, before you publish it, what tells you that it's safe to publish?

(30:06) Olive Song: We have some internal benchmarks in terms of safety, and they cover different dimensions — something like sensitive safety, or alignment safety. We use those as our evaluation. Then, about one or two weeks before launching, we do scaled-up evaluations and scaled-up alignment work on the model, and that's how we assess whether it's safe. But once it's open weight in the wild, people can do things with it that we can't control. I don't know how we fully handle that. Frankly speaking, there are laws around that — there are regulations where people do agree on some moral standards.

(30:51) Ksenia Se: Do you follow any reinforcement learning failure modes that haven't shown up in benchmarks but then become obvious in real agentic use? How do you collect feedback for the next versions to improve the reinforcement learning process?

(31:06) Olive Song: We collect feedback on the model itself first. When we publish a model externally, many developers and users start using it, and we collect that feedback systematically. We analyze each problem — some are fundamental, some are things we just missed and can fix quickly. There are two parts: first, we do internal evaluation with our developers, they point out problems, and that's how we fix that part. But that's not enough — more feedback comes after we officially publish, and we collect that too. The way our group is organized, different people work on different capabilities of the general model. If we collect things we think we should improve, different people take their parts. They say, "I think I can solve this issue, and I'll address it in the next generation." That's how we collect feedback and improve the model.

(32:01) Ksenia Se: How did you initially decide not to build one general-use model for everything, and instead go more into specialization, like coding?

(32:10) Olive Song: I think we are approaching generalized models — it's just that we're putting more emphasis on coding right now. For example, our model can be plugged into any general agent scaffold, including our own agent product, for general-purpose use. We do work on researching, report writing, presentations, stuff like that — that's more general. Personally speaking, I feel like with coding you can kind of structure the whole world, or model a lot of things.

(32:10) Ksenia Se: Yeah, engineer it.

(32:10) Olive Song: Yeah, with engineering. So behind it, it's scaled-up humanity for me. It has a lot of intelligence in it and a lot of work to do. That's how we view this. But we do work on generalized things, and even more generalized things in later versions — for example, our model will be able to handle general workplace scenarios in the future, and that's not just coding.

(33:07) Ksenia Se: If we talk about coding and agentic use, it requires long horizon. How do you solve long-horizon tasks for agentic use?

(33:15) Olive Song: I think defining your goals well and defining the model behavior clearly is key. And we also require great infrastructure — extraordinary infrastructure — for reinforcement learning. The very important issue, besides the algorithms and things people have been working on for a long time, is how we define agents, how we define how an agent model would work. First, you need to define the task and the model's goal, especially in a long-horizon task. You need goals that are actually hard and diverse. The second part is that you need environments — great engineering environments, scaled-up environments, different diverse environments, not just coding but also workplace scenarios, different kinds of tools. That's great engineering. And then you need great infrastructure — outstanding RL infrastructure to let the model really roll out over a very long horizon with efficient GPU use and efficient training rollout. I feel like that's what's different about agentic reinforcement learning compared to before.

(34:24) Ksenia Se: Are you affected by GPU constraints? How do you solve the compute problem?

(34:29) Olive Song: We do have a team that works on utilizing compute as efficiently as possible. That's actually one of the RL scaling challenges — utilizing compute very efficiently. Their purpose is to minimize compute use while training more. Personally, I don't really have a GPU constraint because we have a great team working on maximizing utilization while stabilizing training as much as possible.

(34:56) Ksenia Se: But do you have problems you need to solve with your own expertise on how to use compute more efficiently, or is it just that team?

(35:02) Olive Song: We are actually the same team — we're the reinforcement learning team. We view this issue from different perspectives. It can be an implementation perspective, a data perspective, different perspectives. But our goal is the same.

(35:18) Ksenia Se: We're always looking forward to new solutions coming from Chinese labs because it's always mind-blowing.

(35:25) Olive Song: We are actually working on some new agentic reinforcement learning approaches, but they won't really come out with M2.2 — that's the next-generation model. We are still working on it. I'm not sure what I can share, so I'll share it later when we have concrete conclusions, as I said before. I can't really say something we haven't documented yet.

(35:47) Ksenia Se: Will it be available when the model is out?

(35:50) Olive Song: That depends on our timeline. I'm not very confident yet, but we are dedicatedly working on it.

(35:58) Ksenia Se: Yeah, a lot of constraints when talking to researchers. Well, if we talk about openness — this whole conversation I'm having with people right now this quarter is about open source. I wonder if you can talk about the company strategy: why did MiniMax decide to publish open weights? What are the benefits, and what are the downsides?

(36:20) Olive Song: For our team — the research team — we always wanted to go open source because the open source community is fantastic. I learned that from day one when I joined: the open source community is fantastic. So as researchers, we did want to join it. On the other hand, speaking of the cons, we are a company, and people care about whether this can make money, whether it's a viable business. The downside is that if the weights are open sourced, fewer people will use the API. But as a researcher, that's not really my focus, so I'm not very confident about the company strategy there. For the technical part, we just believe that we can build better models with the open source community.

(37:05) Ksenia Se: How much do you use open source tools yourself from other companies?

(37:09) Olive Song: A lot. For inference, we use — I'm not sure if I'm allowed to name specific ones — but we collaborate with both vLLM and SGLang, and they are open source code repositories.

(37:22) Ksenia Se: How do you look at the open source stack? Because when we talk about open source, sometimes it's perceived as one thing, but actually it's multi-layered. How do you look at it?

(37:31) Olive Song: For example, there are a lot of open source agent scaffolds — both coding agents and general agent scaffolds — that we use ourselves to test our models. We look at their logic. We look at their code to see how they design specific scaffolds and engines. Then we take what they did really well, and we reflect on how we think about the problem, how we structure it, whether we're on the same page. So we learn from each other.

(38:01) Ksenia Se: Do you think teams underestimate how much engineering discipline open models require compared to using closed APIs? It always requires a lot of setup — different compute, and you need engineering talent to do it, instead of just choosing a closed API, turning it on, and using it. Do you have any difficulties with that, or is the open source stack inside your company established and working?

(38:30) Olive Song: Personally, I don't have a problem with that. There are other open source models, and if they publish something, I'll just download it, deploy it on our machines, and work with it. But if there are personal developers out in the world without their own compute, I understand the problem — it will be easier for them to just connect to a model through, say, OpenRouter or similar services.

(38:55) Ksenia Se: Do you use other open models on OpenRouter yourself? Do you play with them?

(39:00) Olive Song: Yeah, I play with them. I play with them day one. If they release at midnight, I play with them at midnight.

(39:06) Ksenia Se: Are you taking notes?

(39:09) Olive Song: I don't actually take notes, but I do have my personal evaluation stack — a list of fun questions that I like to test with every single model to see how they work.

(39:18) Ksenia Se: Can you tell me about it? That's super interesting.

(39:20) Olive Song: Yeah. I've been collecting a bunch of questions since I entered the company, across different areas — logical reasoning, mathematics, proofs, report writing, agentic tasks, and a lot more. Some of them I just like to see how the model reacts to and how they approach it. Different models have different personalities when they approach these problems.

(39:46) Ksenia Se: That's true, and you always need to adjust to them. If we want to give a little guide to people who want to evaluate a model themselves — can you give me examples of questions? Like five questions you need to ask the model to understand how it works, if it works well.

(40:03) Olive Song: From the professional evaluation perspective, five questions isn't enough. If you want a very standard and very fair comparison among models, you need to make it a statistically confident test. There has to be a certain number of questions in each domain to see how the model performs, and usually you need to test multiple times because models are not perfectly stable. If you're testing for fun, use the fun questions. But if you're actually assessing the model's capabilities, you need proper test sets — that's what's fair across different models. Because some questions have answers that aren't single or definitive, and sometimes the environments aren't fixed either. So if you're doing professional evaluation, you have to make sure the evaluation is correct, it's diverse, and it's above a certain threshold so that the test is statistically confident.
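
As a concrete illustration of that point, one simple way to make a pass/fail evaluation statistically meaningful is to repeat each question several times and report a confidence interval rather than a single score. A toy sketch, not MiniMax's evaluation stack:

```python
import math

# Illustrative only: Wilson score interval for a repeated pass/fail evaluation.
def pass_rate_with_ci(results: list[bool], z: float = 1.96) -> tuple[float, float, float]:
    """results = one bool per run; returns (pass rate, CI lower bound, CI upper bound)."""
    n = len(results)
    p = sum(results) / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return p, center - half, center + half

# e.g. 3 runs each over 40 questions in one domain -> 120 trials
runs = [True] * 87 + [False] * 33
p, lo, hi = pass_rate_with_ci(runs)
print(f"pass rate {p:.2%}, 95% CI [{lo:.2%}, {hi:.2%}]")
```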

(41:08) Ksenia Se: You mentioned characters. How do you work with your model's character?

(41:12) Olive Song: I don't work on my model's characters. That's how I think of this issue: a general model should have all characters, or at least be able to perform all characters. It might have a default character, but if the user wants it to be a different character, it should be. If the model gets a character injected into the system prompt, it should follow that. That's how I view this.

(41:33) Ksenia Se: I find it hard to adjust to new models because they're so different in terms of character all the time. I just don't even understand why it happens.

(41:42) Olive Song: I think it has to be related to the data the model was trained on, the different patterns the model has been trained on, and also different teams might have their own constitution — in the system prompt, or as the model's default behavior.

(42:03) Ksenia Se: If you look at open models in production today — I don't know if it's a relevant question, but where do they fail first? Open model-specific issues like reasoning, tool use, state tracking, evaluation blind spots — there are all those risks for open models. Where does it break first?

(42:22) Olive Song: I think open models are not very good at adjusting to different environments. What I see right now is that people use Claude in different coding environments and think it performs well across all environments — different tool definitions and so on. But I don't feel like the current open models can achieve that level of accuracy or that level of understanding of different environments.

(42:48) Ksenia Se: Why? Where is the problem?

(42:50) Olive Song: I don't know exactly how Claude does it. But for me, I think it is a solvable problem, and we are working on it. We're improving it in M2.2, but it's still not as good as, say, Opus. For M2.5, it might be. We do have some systematic research going on in this area that has shown some results, but it's not a concrete conclusion yet, so I won't say more.

(43:14) Ksenia Se: I'm so curious. Do you think it's a compute problem, because they have this infinite amount they can just throw at it?

(43:20) Olive Song: I feel like compute is one side, but how we structure the problem and how we approach it is another side, and that's where we're more confident that we can solve it.

(43:30) Ksenia Se: What can you tell me about M2.2, if it's launched by the time this interview is out? Can you give me some overview?

(43:38) Olive Song: Better coding, obviously, and better multilingual coding, obviously, and more stable than before. It has better performance across the areas of M2.1 — more stabilized, longer horizons, and so on. We are testing it in different environments right now, and we believe it's better than before. Even environments we haven't seen before, even environments that are totally out of distribution — we see some very promising scores that are higher than M2.1.

(44:12) Ksenia Se: I wonder how you stay updated to everything that happens, which is super hard because the pace is just insane. You said when models come out you're playing with them. Do you read research papers? What other interests help you cross-pollinate with what you do? How do you stay up to date, and what inspires you?

(44:34) Olive Song: There are different articles, different blog posts going out every single day, a huge amount of information. How we deal with it is that we have an internal agent that tracks all the new articles, blogs, and papers, then dispatches them to different subject areas, summarizes them, and analyzes them for researchers. So we have this internal researcher agent, if I call it that, which does some filtering by itself and then delivers what's filtered to us — and we can improve the agent if we think it doesn't do well. That's how we filter out a lot of information first. Then we play with new code repositories using coding agents so that we can understand them more quickly. So we're keeping pace with all the improvements — with agents, and with models more broadly.
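
A toy sketch of the general shape of such a pipeline, with stub functions standing in for the feed fetchers and LLM calls; this is an illustration of the idea, not MiniMax's internal agent:

```python
# Illustrative only: stubs stand in for feed fetchers and LLM calls.
SUBJECT_AREAS = ["reinforcement learning", "agents", "evaluation", "infrastructure"]

def fetch_new_items():
    """Pull today's new papers, blog posts, and release notes from tracked feeds."""
    return [{"title": "Example paper", "url": "https://example.org", "text": "..."}]

def classify(item):
    """Stand-in for an LLM call that assigns the item to a subject area."""
    return "reinforcement learning"

def summarize(item):
    """Stand-in for an LLM call that writes a short digest for researchers."""
    return f"Summary of {item['title']} ({item['url']})"

def is_relevant(item, area):
    """Stand-in for the agent's own filtering, which researchers can tune over time."""
    return area in SUBJECT_AREAS

def daily_digest():
    digest = {area: [] for area in SUBJECT_AREAS}
    for item in fetch_new_items():
        area = classify(item)
        if is_relevant(item, area):
            digest[area].append(summarize(item))
    return digest

print(daily_digest())
```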

(45:30) Ksenia Se: That's fascinating. When you became a researcher, when you chose this path, what did you think you would be doing, and what are you actually doing? Is it close to what you thought?

(45:41) Olive Song: That's a really good question. When I joined the team, I thought I would be reading papers every day, because that's what I was doing during school — during a lab, you read papers, come up with ideas, implement ideas, run experiments, and if the results are good, you run at larger scale. I thought I was about to do that. But what I realized was that when you join the company and work for a couple of months, you already become pretty much at the top of the area or the industry, and you have to come up with something that's really new, or you encounter problems you just don't know how to solve. It's not like you can read a lot of papers and build up your thinking on them. It's more like you need to really understand the problems from the fundamental level, and think from the fundamentals to find the right solution. Another thing: engineering is very, very, very important. I didn't know that during school, because during school or in labs, things are more like toys compared to companies — not that scaled up. But when you really do scale up data, scale up compute, scale up people, you encounter engineering issues that you need to tackle very beautifully. Engineering is very important. Those were the two main things that were different from what I imagined.

(47:03) Ksenia Se: When you work on the model currently, is it mostly that you're solving problems you see immediately from your hands-on work? Or is it that the company says "we have to achieve, say, Opus results"? How do you set goals?

(47:20) Olive Song: We have a meta goal at the company level — for example, we want to improve AI's capabilities in improving productivity, because that's how people view AI's value. So we have the company's mission. As an individual researcher on the team, we have our own missions and we set our own goals within that.

(47:41) Ksenia Se: What is your goal currently?

(47:43) Olive Song: For the next generation, I really want the model to be working elegantly with experts — better collaboration with experts, with developers. That's my goal, but that might be two versions away. I think we're launching one version about every month to a month and a half. For longer-horizon goals, we're definitely working on them. But for the immediate goal along that path, that's like a three-month away thing. The better-collaboration goal is more like one to two months away.

(48:17) Ksenia Se: I wanted to ask a clarification question about interleaved learning. When you were talking at AI Engineer, you also said the model doesn't settle on one action — it's constantly in a loop of asking more questions and trying things. How do you look at it? Is it continual learning? Is it part of it? What do we need to solve to have the model continuously doing this learning over longer and longer horizons?

(48:42) Olive Song: That has some overlap with the defined concept of continual learning — both conceptually and technically, I think. But I don't feel like they are exactly the same. The things I talked about at the summit were not at the level of full continual learning; it's more like on the path toward that.

(49:02) Ksenia Se: How do you see it being solved? Any ideas?

(49:05) Olive Song: We do think it's a different problem definition — a different way of the model working with people — and we're working on that with our own defined questions. But if I had to say how we approach it, I would say we approach it through experiments. It's a very interesting question on continual learning, and still very exploratory. That's definitely where we're going, but it has different phases or stages. We might approach stage one first while exploring more stages later.

(49:45) Ksenia Se: And you haven't yet outlined the stages?

(49:48) Olive Song: We do have internal definitions that I didn't prepare to share today. I would say the first stage would be to be more stabilized in long-horizon tasks, beyond what I set up at the summit. And then the next thing would be optimization.

(50:03) Ksenia Se: Can you repeat that for context? People don't know what you said at the summit.

(50:06) Olive Song: So for example, we see a model in a new environment. It receives environment feedback. It needs to know what to explore and what environments to look at, because it's a partially observed environment. It needs to know which actions to take to receive better information, get better reactions, and then perform harder, more complex tasks in the environment. That's more of stage one, right? That's fairly basic — basically all agentic models can do that to some extent, maybe not perfectly, but to some extent. That's how we can actually solve it with our current algorithms. But we do see different patterns of how a model improves itself in an environment that we don't have a concrete conclusion on yet. Maybe in M2.5 we will. That would be a different definition from what I described — the model itself would be defining its own goal. That's something that would be different.

(50:54) Ksenia Se: Thank you so much. My last question is about AGI. Do you believe in AGI, and if yes, what does it look like to you?

(51:02) Olive Song: That's a very large question. People talk about AGI and ASI every day. Actually, when I was interviewing at MiniMax, when I was interviewing with our CEO, I said the same thing — because he asked me the same thing. What I said was that people talk about AGI, people have different definitions of AGI, but we can only know the definition of AGI when we achieve it. Or rather, it's still progressing so fast that the definition changes every day, and people have different views on it. But what I think is more important is that we actually work toward it — work toward our own vision of AGI. And as long as we figure it out, it becomes true. That's what I said during the interview, and that's still my view today: the definition will become true when it becomes true.

(51:48) Ksenia Se: When we see it, we'll know it's AGI.

(51:50) Olive Song: Yes, exactly.

(51:52) Ksenia Se: But we're not there yet.

(51:53) Olive Song: No. There can still be better AI intelligence, for sure.

(51:56) Ksenia Se: Thank you. One more last question: what was the book that influenced you the most? It can be a recent book or a book from your childhood.

(52:05) Olive Song: Let me just double-check the name. Something like The Art of Creativity, or something like that — I read it during undergrad, so it's been a long time. I don't remember the exact name.

(52:15) Ksenia Se: Yes, there is a book called The Art of Creativity. How did it influence you?

(52:18) Olive Song: It opened up how I think about my own mind, and how I view the world and problem solving. For me now, problem solving is more of a discovery. That's how I would summarize it in one phrase.

(52:30) Ksenia Se: Thank you so much. Thank you for your time. That was very interesting.

(52:34) Olive Song: Thank you for having me.

(53:58) [Song]: Interleave, thinking like the way we move through life. Look and learn, adapt and turn, cut through the noise like a knife. Fifty-two calls deep, one conversation wide. The environment is noisy, but the model holds the ride. Ten billion strong, but running light as we breeze. Cost-effective, multi-agents doing what you please. Intelligence with everyone. Guild of humanity under the sun. From the open source through the open road, Cognitive Revolution — let the story be told. Intelligence with everyone. Intelligence with everyone. I see you in the morning, KTV at night. Problem solving is discovery in a different light. Scaled to the theoretical extreme — push through, the definition becomes true when new work comes due. Engineering is everything, yeah. First principles, first principles — we build a future together. Intelligence with everyone. Intelligence with everyone. Intelligence with everyone.

(55:49) Nathan Labenz: If you're finding value in the show, we'd appreciate it if you take a moment to share it with friends, post online, write a review on Apple Podcasts or Spotify, or just leave us a comment on YouTube. Of course, we always welcome your feedback, guest and topic suggestions, and sponsorship inquiries either via our website, cognitiverevolution.ai, or by DMing me on your favorite social network. The Cognitive Revolution is part of the Turpentine network, a network of podcasts which is now part of a16z, where experts talk technology, business, economics, geopolitics, culture, and more. We're produced by AI Podcasting — if you're looking for podcast production help for everything from the moment you stop recording to the moment your audience starts listening, check them out and see my endorsement at aipodcast.ing. And thank you to everyone who listens for being part of the Cognitive Revolution.
