SDAR: Self-Distilled Agentic Reinforcement Learning

Combines the main GRPO objective with a gated OPSD auxiliary objective, using the token-level teacher-student gap to control how strongly privileged context is distilled.

agentic RL, self-distillation, GRPO, OPSD

SU-01: Gold-Medal-Level Olympiad Reasoning

A unified post-training recipe that runs from reverse-perplexity curriculum SFT through RLVR and proof-level RL to solve-verify-refine test-time scaling.

olympiad reasoning, RLVR, proof reward, test-time scaling

CL-bench Life: Real-Life Context Learning

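Extends context learning to real-life material such as group chats, meetings, notes, bills, and browsing trajectories, testing whether models use noisy context correctly.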

benchmark, context learning, real-life context

SUPO: Summarization-based Context Management

Trains a multi-turn RL agent to periodically generate task-relevant summaries, so that longer-horizon tool-calling tasks can be trained and executed within a fixed context window.

multi-turn RL, context management, summarization, agent
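
The idea behind summarization-based context management can be illustrated with a minimal sketch. This is not the SUPO implementation: the function names, the whitespace token count, the toy summarizer, and the rule "summarize once the budget is exceeded" are all assumptions made for illustration. In the actual method the summary is produced by the agent itself and learned through RL.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    # Crude whitespace count (assumption); a real system would use the model tokenizer.
    return len(text.split())


def manage_context(
    history: List[str],
    new_turn: str,
    budget: int,
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Append new_turn; if the context exceeds budget, replace it with one summary."""
    candidate = history + [new_turn]
    if sum(count_tokens(t) for t in candidate) <= budget:
        return candidate
    # Over budget: collapse the whole working context into a single summary entry.
    return [summarize(candidate)]


def keep_last_words(turns: List[str], n: int = 5) -> str:
    # Toy stand-in for a learned summarizer (hypothetical): keep the last n words.
    words = " ".join(turns).split()
    return "SUMMARY: " + " ".join(words[-n:])


context: List[str] = []
for turn in [
    "call tool search now",
    "tool returned three results",
    "call tool open first",
]:
    context = manage_context(context, turn, budget=10, summarize=keep_last_words)
```

With a budget of 10 tokens, the first two turns fit as-is and the third triggers a summary, so the working context stays bounded however many tool calls follow.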