Brad Zhang / public archive

Brad Zhang

AI product notes, agent workflow writing, and open-source dossiers for founders and early technical teams.

Article / March 26, 2026

English indexed dispatch

Lin Junyang’s latest must-read masterpiece! AI has shifted from "reasoning thinking" to "agentic thinking"

Lin Junyang’s latest must-read masterpiece! AI has shifted from "reasoning thinking" to "agentic thinking".

🏀 The frontier of competition is shifting from "better RL algorithms + stronger feedback signals" to "better environment design + harness engineering + closed-loop system capabilities".

🏀 In the future, we will not train isolated models; we will train the whole system of agent + environment + multi-agent orchestration. From "thinking" to "doing", from static monologue to tool-augmented action thinking, this is the direction that can genuinely improve productivity.

In the past, reasoning RL could get by with self-contained trajectories and clean verifiers. Now the rollout is embedded in a heavy harness (tool server, browser, terminal, sandbox, memory, orchestration framework), and the environment becomes a dynamic, partially observable state machine with delays. Training-inference decoupling, environment quality (stability, authenticity, anti-cheating), and anti-reward-hacking have all become top priorities. The more tools a model can use, the easier it is for it to learn to be lazy, look up answers, and exploit loopholes; the reward-hacking playground instantly gets bigger (a toy version of one such anti-lookup check is sketched at the end of this post).

Over the past two years, o1 turned "thinking" into a first-class citizen that can be trained and surfaced, and DeepSeek-R1 proved the recipe can be reproduced and scaled. Yet through the first half of 2025, everyone was still asking: how do we make the model think a few more steps, give it a stronger RL signal, and control the thinking budget?

The real next stop is not "making the model think longer" but agentic thinking: thinking in service of action, interacting with a real environment, getting feedback, continually revising the plan, and closing the execution loop (see the loop sketch at the end of this post).

The Qwen team originally tried to force the Thinking and Instruct modes into a single model. The idea looked great on paper: a controllable budget, with the model judging on its own how hard to think. Reality was less kind: the data distributions and behavioral goals are in direct conflict. Instruct needs to be fast, short, direct, and high-throughput in batches; Thinking needs to be long, structured, and willing to explore multiple paths. A hard merge tends to satisfy neither side, so in the end Instruct and Thinking were split into two separately optimized lines.

Anthropic, by contrast, insists on the integrated route: Claude 3.7/4 emphasizes user control over the thinking budget and even lets tools be called directly during the thinking process, targeting coding and long-running agent tasks.
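That budget control is a public API parameter. Here is a minimal sketch using the `anthropic` Python SDK's extended-thinking option; the model id and token numbers are examples only, so check the current docs before copying:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",  # example model id; check current docs
    max_tokens=16000,
    # Extended thinking: cap how many tokens the model may spend thinking
    # before it answers. budget_tokens must be below max_tokens.
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "Plan a refactor of this module..."}],
)

# The response interleaves thinking blocks and text blocks.
for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking[:200], "...")
    elif block.type == "text":
        print(block.text)
```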
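Back to the main thesis, to make the "closed loop" of agentic thinking concrete: below is a minimal Python sketch of the think → act → observe → revise cycle. Every name in it (`ToolResult`, `call_tool`, `llm_step`, `run_agent`) is hypothetical glue invented for this illustration, with toy stand-ins where a real system would call a model and a tool server.

```python
from dataclasses import dataclass


@dataclass
class ToolResult:
    output: str
    done: bool = False


def call_tool(name: str, args: dict) -> ToolResult:
    # Toy stand-in for the harness (terminal, browser, sandbox, ...).
    return ToolResult(output=f"[{name}] ran {args} (toy output)")


def llm_step(transcript: list[dict]) -> dict:
    # Toy stand-in for one model call. A real implementation would let the
    # model choose between a tool call and a final answer; here we hard-code
    # one tool call followed by an answer just to show the control flow.
    if not any(m["role"] == "tool" for m in transcript):
        return {"type": "tool", "tool": "terminal", "args": {"cmd": "ls"}}
    return {"type": "final", "answer": "done (toy)"}


def run_agent(task: str, max_steps: int = 12) -> str:
    transcript = [{"role": "user", "content": task}]
    for _ in range(max_steps):  # hard step budget: the loop must terminate
        action = llm_step(transcript)
        if action["type"] == "final":
            return action["answer"]
        # Act in the environment, then feed the (partial) observation back,
        # so the next round of thinking is conditioned on real feedback.
        result = call_tool(action["tool"], action["args"])
        transcript.append({"role": "tool", "content": result.output})
        if result.done:
            break
    return "stopped: step budget exhausted"


print(run_agent("list the repo"))  # -> "done (toy)"
```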
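Finally, a toy version of the anti-cheating concern above: once the harness includes search and a browser, "look up the answer" becomes a viable exploit, so one deliberately crude guard is to withhold reward when the final answer already appeared verbatim in tool output. Real verifiers are far more involved; both function names here are made up for the example.

```python
def answer_was_looked_up(answer: str, tool_outputs: list[str]) -> bool:
    """True if the final answer string appears verbatim in any tool output."""
    needle = answer.strip().lower()
    return bool(needle) and any(needle in out.lower() for out in tool_outputs)


def score_trajectory(answer: str, expected: str, tool_outputs: list[str]) -> float:
    if answer_was_looked_up(answer, tool_outputs):
        return 0.0  # suspected lookup/leakage: no reward, even if "correct"
    return 1.0 if answer.strip() == expected.strip() else 0.0


# The exploit case: correct answer, but scraped straight from a search hit.
print(score_trajectory("42", "42", ["search: ...the answer is 42..."]))  # 0.0
print(score_trajectory("42", "42", ["$ python solve.py -> ok"]))         # 1.0
```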

Open original on X