I believe AGI will arrive soon.

My recent work focuses on LLM agents that can reason over multi-step tasks, use tools, interact with task environments, and improve through evaluation, data feedback, and reinforcement learning.

ByteDance: LLM algorithms

Alibaba ATH: Intern, multimodal LLMs

Tencent ARC: Intern, document AI

News

Recent Signals

  1. SkillsBench became the fastest benchmark repository to reach 1k GitHub stars, with 1.1k stars within two months of release

  2. SkillsBench appeared in recent model-card and release discussions, including Qwen 3.6 Plus and HY-3

  3. SkillsBench now covers 86 tasks, 11 domains, 7,308 trajectories, and 40 indexed benchmarks

  4. Harbor v0.6.5 was released for agent evaluation, task environments, and RL-ready rollout workflows

  5. Started working on LLM algorithms at ByteDance

Selected Publications

Papers


Open Source

Projects

Fastest to 1k stars · Qwen model card

SkillsBench

1.1k stars · 86 tasks · 7,308 trajectories

A benchmark for testing whether agent skills actually work across tasks, models, and environments.

  • 105 domain experts from Stanford, CMU, Berkeley, Oxford, Amazon, ByteDance, and more
  • 11 domains and 40 indexed benchmarks
  • Referenced by leading model labs and recent agent-skill research

Agent evals · RL environments

Harbor

1.8k stars · 993 forks · 914 commits

An evaluation and environment framework for agent tasks, rollouts, and RL-oriented data generation.

  • Unified harness for agent benchmarks and task environments
  • Rollout generation for inspection, auditing, and training
  • Designed for repeatable evaluation and optimization workflows
