Paper Reading

2026-05-14

SDAR: Self-Distilled Agentic Reinforcement Learning

Combines the main GRPO objective with a gated OPSD auxiliary objective, using the token-level teacher-student gap to control how strongly privileged context is distilled.

agentic RL, self-distillation, GRPO, OPSD

2026-05-13

A unified post-training recipe that runs from reverse-perplexity curriculum SFT through RLVR and proof-level RL to solve-verify-refine test-time scaling.

olympiad reasoning, RLVR, proof reward, test-time scaling

Read report

2026-04

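Extends context learning to real-life material such as group chats, meetings, notes, bills, and browsing traces, testing whether models use noisy context correctly.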

benchmark, context learning, real-life context

Read report

2025-10-08

Trains a multi-turn RL agent to periodically generate task-relevant summaries, so that longer-horizon tool-calling tasks can be trained and executed within a fixed context window.

multi-turn RL, context management, summarization, agent

Read report