SDAR: Self-Distilled Agentic Reinforcement Learning
Combines the main GRPO objective with a gated OPSD auxiliary objective, using the token-level teacher-student gap to control how strongly privileged context is distilled.
Read report
Paper Reading
Chinese reading reports for papers I want to keep close: method structure, experimental evidence, and reusable research takeaways.
A unified post-training recipe spanning reverse-perplexity curriculum SFT, RLVR, proof-level RL, and solve-verify-refine test-time scaling.
Read report
Pushes context learning onto real-life material such as group chats, meetings, notes, bills, and browsing trajectories, testing whether models can correctly use noisy context.
Read report
Teaches a multi-turn RL agent to periodically generate task-relevant summaries, so that it can train on and execute longer-horizon tool-calling tasks within a fixed context window.
Read report