Between February 5 and February 12, 2026, we shipped 18 pull requests on the AdaptiveTest platform. One engineer. Claude Code as the primary implementation tool. Here's what we built, how we built it, and what we learned.
The Stack
AdaptiveTest is a K-12 adaptive testing platform: FastAPI + Python backend, Next.js + TypeScript frontend, PostgreSQL database, deployed on Railway (backend) and Vercel (frontend). The AI services use Claude Haiku 4.5 (question generation) and Claude Sonnet 4.5 (learning recommendations).
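The post doesn't show the service wiring, but the two-model split is easy to picture. Here's one plausible sketch with the Anthropic Python SDK; the service keys, helper name, and model alias strings below are illustrative assumptions, not the actual AdaptiveTest code.

```python
# Illustrative sketch only: service names, helper, and model aliases are
# assumptions, not AdaptiveTest's actual wiring.
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

MODEL_BY_SERVICE = {
    "question_generation": "claude-haiku-4-5",        # high-volume, latency-sensitive
    "learning_recommendations": "claude-sonnet-4-5",  # deeper reasoning per request
}

def complete(service: str, prompt: str, max_tokens: int = 1024) -> str:
    """Send a prompt to whichever model is configured for the given AI service."""
    response = client.messages.create(
        model=MODEL_BY_SERVICE[service],
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```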
The Sprint
Here's what shipped, in order:
PR #1 — CI/CD Pipeline: GitHub Actions running tests, lint, and security checks on every push. The foundation that makes everything else possible.
PR #2 — OneRoster Validation: 38 tests validating the SIS (student information system) integration. Found and documented 4 critical security issues.
PR #3 — OneRoster Security: The big one. Fernet encryption for credentials, Clerk JWT + RBAC on all endpoints, atomic sync with rollback, 220 tests (the encryption pattern is sketched after this list). This was the PR that took the longest to review — the security implications required careful human attention.
PR #4 — Audit Logging + Data Deletion: FERPA compliance features. Full audit trail, cascading data deletion on request.
PRs #5-10 — LTI 1.3, AI proxy fixes, service wiring. Each PR focused on one integration point.
PRs #11-14 — Tier gating (34 endpoints; see the dependency sketch below), billing enforcement, Stripe webhook handling.
PRs #15-18 — Sentry integration, rate limiting (68 endpoints; also sketched below), production hardening.
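To make PR #3's credential handling concrete, here is a minimal sketch of the Fernet pattern from the `cryptography` library. The key-loading convention and function names are assumptions for illustration; the real implementation (key management, ORM integration, RBAC checks) is more involved.

```python
# Sketch: encrypting OneRoster credentials at rest with Fernet.
# The env var name and function names are illustrative, not the real code.
import os
from cryptography.fernet import Fernet

# Generated once with Fernet.generate_key() and stored as a deployment secret.
fernet = Fernet(os.environ["ONEROSTER_CREDENTIAL_KEY"])

def encrypt_credential(plaintext: str) -> bytes:
    """Encrypt an SIS client secret before it is written to the database."""
    return fernet.encrypt(plaintext.encode())

def decrypt_credential(token: bytes) -> str:
    """Decrypt a stored secret just-in-time for a sync request."""
    return fernet.decrypt(token).decode()
```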
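The tier gating from PRs #11-14 maps naturally onto a reusable FastAPI dependency. In this sketch the tier names, the User shape, and get_current_user are stand-ins; the real dependency presumably resolves the caller from the Clerk session and the billing record.

```python
# Sketch of per-endpoint tier gating as a reusable FastAPI dependency.
# Tier names, the User shape, and get_current_user are illustrative assumptions.
from dataclasses import dataclass
from fastapi import Depends, FastAPI, HTTPException, status

app = FastAPI()
TIER_RANK = {"free": 0, "school": 1, "district": 2}

@dataclass
class User:
    id: str
    tier: str

def get_current_user() -> User:
    # Stand-in for the real Clerk-backed auth dependency.
    return User(id="demo", tier="free")

def require_tier(minimum: str):
    """Build a dependency that rejects callers below the given billing tier."""
    def checker(user: User = Depends(get_current_user)) -> User:
        if TIER_RANK[user.tier] < TIER_RANK[minimum]:
            raise HTTPException(
                status_code=status.HTTP_402_PAYMENT_REQUIRED,
                detail=f"This feature requires the {minimum} tier.",
            )
        return user
    return checker

@app.get("/api/reports/district")
async def district_reports(user: User = Depends(require_tier("district"))):
    return {"ok": True}
```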
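Rate limiting (PRs #15-18) is similarly declarative in FastAPI. This sketch uses slowapi, which is an assumption; the post doesn't name the limiter the platform actually uses, and the limit shown is invented.

```python
# Sketch of per-endpoint rate limiting with slowapi (library choice and the
# specific limit are assumptions, not the platform's actual configuration).
from fastapi import FastAPI, Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app = FastAPI()
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.get("/api/questions")
@limiter.limit("60/minute")  # per-client cap; real limits vary by endpoint
async def list_questions(request: Request):
    return {"items": []}
```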
By the Numbers
18 PRs merged. 532 tests written. 83%+ code coverage across the platform. Zero production incidents during or after the sprint. 7 calendar days.
What the Human Did
Directed priorities. Wrote terminal briefs. Reviewed PRs — especially security-sensitive ones like the OneRoster encryption work. Made architecture decisions (which model for which AI service, how to structure billing tiers, where to put rate limits). Verified production deployments.
What the human did NOT do: write implementation code, write tests, debug CI failures, format PRs, or handle routine engineering work.
Lessons Learned
The terminal brief is everything. The quality of the agent's output is directly proportional to the quality of the spec it receives. Vague briefs produce vague code. Precise briefs with acceptance criteria, reference files, and explicit constraints produce production-quality PRs.
Review time is now the bottleneck. The agent can produce PRs faster than a human can review them. This is the Level 3 reality — and the motivation for Level 4, where the review process itself is automated.
Test coverage is non-negotiable. Every PR includes tests. The agent writes them alongside the implementation. This isn't a nice-to-have — it's how you maintain quality when code is being produced at this pace.
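For flavor, the tests are ordinary pytest, written in the same PR as the code they cover. Here's a self-contained example in the spirit of the credential-encryption sketch above; it's illustrative, not lifted from the repo.

```python
# Illustrative pytest example: a round-trip check on the Fernet pattern.
from cryptography.fernet import Fernet

def test_credential_round_trip():
    fernet = Fernet(Fernet.generate_key())
    secret = "oneroster-client-secret"
    token = fernet.encrypt(secret.encode())
    assert fernet.decrypt(token).decode() == secret
```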
The factory works. Not perfectly. Not autonomously. But demonstrably. 18 PRs in 7 days is not a demo — it's a production sprint with real security fixes, real billing integration, and real monitoring setup.
We're at Level 3. Level 4 is next.