New v2.0 / Live attribution + multi-agent review

Hire the engineers who think, not just the ones who type fast.

Gonfire watches every keystroke, every prompt, every decision. Real codebase. Real complexity. Zero interview theater.

interview-task.ts candidate · jamie@example.com
1 // session.id 9k2x · candidate connecting...
Multi-agent review 10 specialist agents
In production at Strawn, Quartzic, Foundryx, Pylant, Lumenco
The problem

The interview broke.
Nobody wants to admit it.

AI didn't just change how engineers code. It nuked the rituals you've been using to hire them.

01

Take-homes are dead.

Every candidate ships the same Cursor-generated solution. You have no idea who can actually engineer.

87% identical structure
02

On-sites test memory.

Whiteboard inversions are theater. The job is judgment, taste, and recovery. None of which fit a 45-minute slot.

screen → take-home → on-site → debrief · ~3 weeks · 8 hours of you
03

You see the output. Never the process.

The PR looks great. But did they design it, debug it, or did Claude do the whole thing in one prompt? You'll find out at month three.

// 73% AI-generated
How it works

One assessment.
Three lenses on every candidate.

Define it in five minutes. Send the link. Get a complete picture back, automatically.

Connect your repo. Define your rubric. Done.

Point Gonfire at any GitHub repo. Pick the area you want candidates to work in. Pick the signals that matter.

  • Real production code, not LeetCode
  • Custom rubric: judgment, taste, recovery, orchestration
  • One link. Sent to candidates instantly.
github.com/your-team/api-server
Task: refactor the rate limiter
Rubric: judgment, taste, recovery
Time budget: 90 minutes
Link generated · gonfire.io/a/9k2x
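
If you'd rather script that setup than click through it, it could look something like the sketch below. Hypothetical only: "@gonfire/sdk" and createAssessment are assumed names, not a published Gonfire API.

// Hypothetical sketch: "@gonfire/sdk" and createAssessment are assumed
// names, not a published Gonfire API.
import { createAssessment } from "@gonfire/sdk";

const assessment = await createAssessment({
  repo: "github.com/your-team/api-server", // mirrored read-only
  task: "Refactor the rate limiter",
  rubric: ["judgment", "taste", "recovery"], // your dimensions, your bar
  timeBudgetMinutes: 90,
});

console.log(assessment.link); // e.g. gonfire.io/a/9k2x, ready to send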

Candidates build in a real IDE. With AI. On your code.

No local setup. No toy problems. They get a browser-based VS Code with Claude Code attached, working against your actual codebase.

  • Browser IDE. No install, no environment hell.
  • Claude Code as their copilot, just like the job
  • Every keystroke, prompt, and AI response captured
[09:14] cloned api-server.git
[09:15] opened src/middleware/limiter.ts
[09:18] prompt: "explain the windowing logic"
[09:21] edited src/middleware/limiter.ts (+12 -8)
[09:24] prompt: "what edge cases am I missing?"
[09:31] ran tests · 14 passing
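
Under the hood, a captured session is just an ordered event stream. The shape below is an assumption inferred from the log above, not Gonfire's published schema:

type SessionEvent =
  | { at: string; kind: "clone"; repo: string }
  | { at: string; kind: "open"; path: string }
  | { at: string; kind: "prompt"; text: string } // message sent to Claude
  | { at: string; kind: "edit"; path: string; added: number; removed: number }
  | { at: string; kind: "test_run"; passing: number };

// The log above, as data:
const events: SessionEvent[] = [
  { at: "09:14", kind: "clone", repo: "api-server.git" },
  { at: "09:15", kind: "open", path: "src/middleware/limiter.ts" },
  { at: "09:18", kind: "prompt", text: "explain the windowing logic" },
  { at: "09:21", kind: "edit", path: "src/middleware/limiter.ts", added: 12, removed: 8 },
  { at: "09:24", kind: "prompt", text: "what edge cases am I missing?" },
  { at: "09:31", kind: "test_run", passing: 14 },
];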

Get a verdict, not just a score.

10 independent agents review against your rubric. Every line classified human / AI / AI-modified. Full session replay. Auto-generated debrief questions.

  • 10+ AI agents grading independently
  • Line-by-line human vs AI attribution
  • Click-through session replay (every keystroke + prompt)
  • Suggested debrief questions, ready to ask
Correctness 9.2 / 10
Code taste 8.7 / 10
Edge cases 6.4 / 10
AI orchestration 9.0 / 10
Recovery from bug 4.1 / 10
Verdict · STRONG, ask about error handling
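
Everything above lands in one structured report. Roughly modeled, with field names that are illustrative rather than Gonfire's actual schema:

interface AssessmentReport {
  scores: Record<string, number>; // one entry per rubric dimension, 0–10
  attribution: Array<{
    line: number;
    label: "human-original" | "ai-generated" | "ai-modified";
  }>;
  replayUrl: string;          // click-through session replay
  debriefQuestions: string[]; // auto-generated, ready to ask
  verdict: string;            // e.g. "STRONG, ask about error handling"
}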
Before / after

One assessment replaces
your entire interview loop.

Before 3–4 weeks · 8 hr of you
After under 4 hours · 17 min of you
Your current process

The 5-stage gauntlet

Recruiter screen 30 min
Take-home assignment ~6 hr / candidate
Technical phone screen 1 hr
On-site interviews ×4 4 hr
Debrief + decision 2 hr
Interviewer time ~8 hr
Calendar time 3–4 weeks
With Gonfire

One assessment. Done.

Send Gonfire link 2 min
Candidate builds on real repo 90 min, async
10 agents review submission ~5 min
Read verdict + replay 15 min
Interviewer time ~17 min
Calendar time under 4 hours
Attribution

Every line.
Classified.

Hover any line to see who wrote it. We track not just keystrokes but intent: human-original, AI-generated whole-cloth, or AI-suggested-then-edited.

93% Classification accuracy on a 5,000-submission labeled benchmark.
3 classes Human-original, AI-generated, AI-modified. Not a binary signal.
Per line Hover any row in a real submission to see the verdict and confidence.
1 import { RateLimiter } from './limiter' Human
2 Human
3 async function enforce(req, key) { AI gen
4 const limiter = new RateLimiter({ window: 60, max: 100 }) AI gen
5 const allowed = await limiter.check(key, req.ip) AI modified
6 if (!allowed) throw new RateLimitError(key) Human
7 return allowed AI gen
8 } Human
50% Human
37% AI generated
13% AI modified
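
That per-class breakdown falls straight out of the per-line labels. A minimal count-and-round sketch, with an Attribution type assumed to mirror the three classes above:

type Attribution = "human" | "ai-generated" | "ai-modified";

function breakdown(lines: Attribution[]): Record<Attribution, number> {
  const counts: Record<Attribution, number> = {
    "human": 0,
    "ai-generated": 0,
    "ai-modified": 0,
  };
  for (const label of lines) counts[label]++;
  const total = lines.length || 1;
  for (const key of Object.keys(counts) as Attribution[]) {
    counts[key] = Math.round((counts[key] / total) * 100); // to percent
  }
  return counts;
}

// e.g. a 10-line file with 5 human, 4 AI-generated, 1 AI-modified lines:
// → { human: 50, "ai-generated": 40, "ai-modified": 10 }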
Multi-agent review · session #9k2x · Δ 4.7s
A1 · Correctness · Tests, edge cases, type safety · 9.2
A2 · Code taste · Naming, structure, idioms · 8.7
A3 · Edge cases · Concurrency, nulls, overflow · 6.4
A4 · AI orchestration · Prompt quality, follow-ups · 9.0
A5 · Bug recovery · Reaction to broken state · 4.1
A6 · Test design · Coverage, intent, isolation · 8.4
A7 · Architecture · Boundaries, abstractions · 8.0
A8 · Performance · Big-O, allocations · 7.1
A9 · Security · Inputs, secrets, IDOR · 8.8
A10 · Documentation · PR quality, comments · 8.2
Aggregate verdict · STRONG HIRE · 7.8 / 10
Multi-agent

10 agents.
Independent verdicts.
Zero bias.

One human reviewer has a bad day. 10 specialist agents grade independently against your rubric, then aggregate. The result is a more reliable signal than any single interviewer can produce.

10 agents Each grading one rubric dimension. Disagreement is logged, not flattened.
~5 min From submit to aggregate verdict. Reviewer reads the verdict, not the code.
Custom rubric Bring your own dimensions. Calibrate against your existing strong-hire bar.
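
How ten verdicts become one number isn't published, but even the simplest aggregation illustrates the idea: average the scores, and keep the spread as an explicit disagreement signal instead of discarding it. A sketch only; Gonfire's real method may differ.

interface AgentVerdict {
  agent: string;     // e.g. "A5"
  dimension: string; // e.g. "Bug recovery"
  score: number;     // 0–10
}

// Sketch: unweighted mean, with standard deviation kept as the
// logged disagreement rather than hidden in the average.
function aggregate(verdicts: AgentVerdict[]) {
  const n = verdicts.length;
  const mean = verdicts.reduce((s, v) => s + v.score, 0) / n;
  const variance = verdicts.reduce((s, v) => s + (v.score - mean) ** 2, 0) / n;
  return {
    score: Number(mean.toFixed(1)),
    disagreement: Number(Math.sqrt(variance).toFixed(2)), // logged, not flattened
  };
}

With the ten scores from session #9k2x above (9.2, 8.7, 6.4, 9.0, 4.1, 8.4, 8.0, 7.1, 8.8, 8.2), an unweighted mean already lands on the 7.8 aggregate shown in the report.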
Session replay

Watch the build,
not just the output.

Every keystroke, every prompt, every Claude response, every test run. Scrub the timeline like a debugger. Spot the moment they figured it out, or the moment they gave up and pasted.

Every keystroke Including pastes, deletions, AI completions, and tab-accepts. Nothing dropped.
Every prompt Full Claude conversation log, ordered alongside the diff that resulted.
Full retention Replays stay accessible for as long as the role is open. Share them with the hiring committee.
$ git checkout -b ratelimit-fix
> opened src/middleware/limiter.ts
> prompt to claude: "explain the windowing logic"
> claude: The current implementation uses a fixed window: 60-second buckets, max 100 req. Edge case: a burst at second 59 + second 60 will allow up to 200 within 2s. Suggestion: switch to sliding window.
> editing limiter.ts
12:14 / 41:08
opened file · prompt → claude · edit (+12 -8) · test run · commit
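
Scrubbing that timeline is conceptually just a seek over the captured event stream. A toy sketch with an assumed minimal event shape:

type ReplayEvent = { at: string; kind: string }; // minimal assumed shape

// Replay state at time `at` = every event recorded up to that point.
// "HH:MM" timestamps compare correctly as strings.
function seek(events: ReplayEvent[], at: string): ReplayEvent[] {
  return events.filter((e) => e.at <= at);
}

// e.g. seek(events, "09:21") reconstructs the session just after the
// first edit to src/middleware/limiter.ts.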
vs the field

Why teams pick Gonfire.

Other platforms bolted AI on. We were built around it from day zero.

Capability | HackerRank | CodeSignal | Saffron | Gonfire
What candidates build on | LeetCode-style sandbox | Algorithmic tasks | Real codebase | Real codebase + your stack
AI tools available | Restricted copilot | Limited AI | Claude Code | Claude Code + Cursor compat
Line-by-line attribution | × | × | ✓ | ✓ with intent classification
Independent agents per review | × | Single model | 10+ | 10+ with custom rubric
Full session replay | Output only | Output only | ✓ | ✓ + AI conversation log
Auto-generated debrief questions | × | × | ✓ | ✓ tailored to weak signals
Time to first assessment | ~1 day | ~1 day | Demo first | 5 minutes · self-serve
Try it without booking a call | × | × | × | ✓ live demo, no Calendly
Engineering teams that ship
We cut our interview loop from three weeks to one afternoon. The line-by-line attribution caught two senior candidates who would have slipped through.
Sara Mendez, Head of Engineering
Strawn · 80-engineer infra team
The replay is the killer feature. Watching how a candidate prompts Claude tells me more about their judgment than any whiteboard ever did.
Daniel Kim, Staff Engineer
Helmstone · hiring committee
Pricing

Pay for signal,
not seats.

Every plan includes the full platform. You only pay for the assessments you run.

Starter
$149 / mo
For teams running their first AI-native loop.
  • 5 assessments / month
  • Multi-agent review · 10 agents
  • Line-level attribution
  • Full session replay
  • Email support
Start free trial
Growth
$449 / mo
  • 20 assessments / month
  • Bring-your-own-key Cursor / Copilot
Start free trial
Enterprise
Custom
For orgs with high volume or compliance needs.
  • Unlimited assessments
  • SSO + SCIM provisioning
  • Self-hosted runner option
  • SOC 2 + DPA
  • Dedicated account manager
Talk to us
FAQ

Things engineering leaders ask first.

How do you keep our codebase secure?
Candidates work in isolated, ephemeral sandboxes. Your repo is mirrored read-only, and the mirror is shredded at the end of each session. SOC 2 Type II; cloud-region pinning available on Enterprise.

Can candidates use AI during the assessment?
Yes, that's the point. Claude Code is built in. Bring-your-own-key Cursor / Copilot is supported on Growth and up. Every prompt and response is logged for review.

How accurate is the line-level attribution?
93% on our internal labeled benchmark of 5,000+ submissions. We classify lines as human-original, AI-generated, or AI-suggested-then-modified, using keystroke timing, paste detection, and prompt context.
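
As a loose illustration of those signals (not Gonfire's actual classifier, which is a trained model calibrated on the labeled benchmark), a rule-of-thumb version might look like:

type LineSignal = {
  pasted: boolean;            // paste detection
  msPerChar: number;          // keystroke timing for this line
  followsAiResponse: boolean; // appeared right after an AI response
  editedAfterInsert: boolean; // candidate reworked the inserted text
};

function classify(s: LineSignal): "human-original" | "ai-generated" | "ai-modified" {
  if ((s.pasted || s.followsAiResponse) && s.editedAfterInsert) return "ai-modified";
  // under ~5 ms per character is paste-speed, not typing-speed
  if (s.pasted || s.followsAiResponse || s.msPerChar < 5) return "ai-generated";
  return "human-original";
}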

Which languages and stacks do you support?
TypeScript, JavaScript, Python, Go, Rust, Ruby, Java, C++. Any framework that runs in a Linux container. We've tested Next.js, Rails, FastAPI, Spring, and dozens more. If your stack runs in Docker, it works.

What do candidates think of it?
4.7 / 5 average from 8,000+ post-assessment surveys. Most say it's the first interview that felt like the actual job, because it is.

Can I try it before signing up?
Yes. Click "Try the live demo" at the top. You'll see a real anonymized report with full attribution, agent verdicts, and replay. No email required.

Does it integrate with our ATS?
Native integrations with Greenhouse, Ashby, Lever, and Workday. Gonfire posts the report straight into the candidate scorecard. Webhooks + REST API available for everything else.
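
For anything custom, the webhook route is straightforward to consume. A hypothetical receiver; the payload field names are assumptions, not Gonfire's documented schema:

import express from "express";

const app = express();
app.use(express.json());

// Receive the finished report and forward it wherever you need it.
app.post("/webhooks/gonfire", (req, res) => {
  const { sessionId, verdict, aggregateScore, reportUrl } = req.body; // assumed fields
  console.log(`session ${sessionId}: ${verdict} (${aggregateScore}) -> ${reportUrl}`);
  res.sendStatus(200); // acknowledge so the delivery isn't retried
});

app.listen(3000);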

What happens when an agent flags suspected cheating?
You'll see it in the dashboard with their reasoning. We never auto-reject based on flags. They're context for your hiring team.

Get started

See how your next hire actually thinks.

Drop your email. We'll send a real anonymized report and a link to the live demo. No call required.

No credit card. No sales call. Unsubscribe in one click.