GitHub shipped an experimental feature called Rubber Duck for Copilot CLI that pairs a primary coding model with an independent reviewer from a different model family, which checks the primary's work before execution. On SWE-Bench Pro, the cross-model approach closes 74.7% of the performance gap between Claude Sonnet 4.6 and Opus 4.6.
What Happened
GitHub announced on April 6 that Copilot CLI now includes an experimental Rubber Duck feature. When a Claude model serves as the primary coding agent, GPT-5.4 acts as an independent reviewer, examining the agent's decisions at critical checkpoints. The feature activates automatically after planning phases, complex implementations, and test writing.
The name references the classic debugging technique of explaining code to an inanimate object to spot mistakes. GitHub's version replaces the rubber duck with a second AI model from a competing family, giving it genuine analytical capability.
Why It Matters
Single-model coding agents share blind spots. When the same architecture generates and reviews code, systematic biases go undetected. By pairing models from different families, Rubber Duck introduces genuine diversity of perspective into automated code review.
The benchmark results back this up. On SWE-Bench Pro, Claude Sonnet 4.6 paired with GPT-5.4 as Rubber Duck approached the resolution rate of Claude Opus 4.6 running solo. On difficult multi-file problems requiring 70 or more steps, the paired setup scored 3.8% higher than Sonnet alone.
This suggests that model diversity matters more than raw model size for complex engineering tasks. Getting a second opinion from a fundamentally different architecture catches errors that scaling within one family does not.
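The generate-then-review loop described above can be sketched in a few lines. This is a minimal illustration only: `primary_generate`, `reviewer_critique`, and `rubber_duck_step` are hypothetical stand-ins for the model calls, not actual Copilot CLI APIs.

```python
def primary_generate(task: str) -> str:
    """Stand-in for the primary coding agent (e.g. a Claude model)
    proposing a patch for the task."""
    return f"patch for: {task}"

def reviewer_critique(patch: str) -> list[str]:
    """Stand-in for the independent reviewer (e.g. a GPT model).
    Per the feature's design, it returns a short, focused list of
    high-value concerns rather than exhaustive feedback."""
    concerns = []
    if "test" not in patch:
        concerns.append("no tests cover the new code path")
    return concerns[:3]  # keep the list short and focused

def rubber_duck_step(task: str) -> tuple[str, list[str]]:
    """One checkpoint: generate with one model, review with another."""
    patch = primary_generate(task)
    concerns = reviewer_critique(patch)
    return patch, concerns

patch, concerns = rubber_duck_step("fix off-by-one in pagination")
print(concerns)
```

The key design point is that the reviewer sees only the primary's output, not its reasoning, so the critique comes from a genuinely independent perspective.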
Key Details
- Rubber Duck surfaces a short, focused list of high-value concerns rather than comprehensive feedback
- The feature identifies architectural flaws, edge cases, and logical errors the primary model might miss
- Available through the `/experimental` command for users with GPT-5.4 access
- Works through existing Copilot CLI infrastructure with no additional setup
- Users can also trigger it manually at any point during a session
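The trigger behavior, automatic after certain checkpoints plus an on-demand path, can be sketched as follows. `Checkpoint`, `COMPLEX_THRESHOLD`, and `should_review` are illustrative names assumed for this sketch, not Copilot CLI internals, and the complexity heuristic is invented since GitHub has not documented one.

```python
from enum import Enum, auto

class Checkpoint(Enum):
    PLANNING = auto()
    IMPLEMENTATION = auto()
    TEST_WRITING = auto()

# Hypothetical cutoff for "complex" implementations.
COMPLEX_THRESHOLD = 3

def should_review(checkpoint: Checkpoint, complexity: int = 0,
                  manual: bool = False) -> bool:
    """Decide whether to consult the reviewer model.

    Reviews fire automatically after planning and test writing, after
    implementations judged sufficiently complex, or whenever the user
    triggers one manually mid-session.
    """
    if manual:
        return True
    if checkpoint is Checkpoint.IMPLEMENTATION:
        return complexity >= COMPLEX_THRESHOLD
    return checkpoint in (Checkpoint.PLANNING, Checkpoint.TEST_WRITING)
```

Gating automatic reviews to a handful of checkpoints keeps the second model from interrupting every step while still catching work at the stages where errors compound.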
What to Do Next
Developers using Copilot CLI can enable the feature now via `/experimental`. The approach is worth testing on complex, multi-file refactors where single-model agents tend to accumulate subtle errors. GitHub is tracking feedback through a community discussion thread.
This follows GitHub's Copilot /fleet launch earlier this month, which runs multiple agents in parallel. Together, these features signal a shift from single-agent to multi-model workflows in AI coding tools.