New v2.0 / Live attribution + multi-agent review

Hire the engineers who think, not just the ones who type fast.

Gonfire watches every keystroke, every prompt, every decision. Real codebase. Real complexity. Zero interview theater.

interview-task.ts candidate · jamie@example.com
1 // session.id 9k2x · candidate connecting...
Multi-agent review 10 specialist agents
In production at Strawn, Quartzic, Foundryx, Pylant, Lumenco
The problem

The interview broke.
Nobody wants to admit it.

AI didn't just change how engineers code. It nuked the rituals you've been using to hire them.

01

Take-homes are dead.

Every candidate ships the same Cursor-generated solution. You have no idea who can actually engineer.

87% identical structure
02

On-sites test memory.

Whiteboard inversions are theater. The job is judgment, taste, and recovery. None of which fit a 45-minute slot.

screen → take-home → on-site → debrief · ~3 weeks · 8 hours of you
03

You see the output. Never the process.

The PR looks great. But did they design it, debug it, or did Claude do the whole thing in one prompt? You'll find out at month three.

// 73% AI-generated
How it works

One assessment.
Three lenses on every candidate.

Define it in five minutes. Send the link. Get a complete picture back, automatically.

Connect your repo. Define your rubric. Done.

Point Gonfire at any GitHub repo. Pick the area you want candidates to work in. Pick the signals that matter.

  • Real production code, not LeetCode
  • Custom rubric: judgment, taste, recovery, orchestration
  • One link. Sent to candidates instantly.
github.com/your-team/api-server
Task: refactor the rate limiter
Rubric: judgment, taste, recovery
Time budget: 90 minutes
Link generated · gonfire.io/a/9k2x
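
If you'd rather script that setup than click through it, it could look something like the sketch below. Hypothetical only: "@gonfire/sdk" and createAssessment are assumed names, not a published Gonfire API.

// Hypothetical sketch: "@gonfire/sdk" and createAssessment are assumed
// names, not a published Gonfire API.
import { createAssessment } from "@gonfire/sdk";

const assessment = await createAssessment({
  repo: "github.com/your-team/api-server", // mirrored read-only
  task: "Refactor the rate limiter",
  rubric: ["judgment", "taste", "recovery"], // your dimensions, your bar
  timeBudgetMinutes: 90,
});

console.log(assessment.link); // e.g. gonfire.io/a/9k2x, ready to send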

Candidates build in a real IDE. With AI. On your code.

No local setup. No toy problems. They get a browser-based VS Code with Claude Code attached, working against your actual codebase.

  • Browser IDE. No install, no environment hell.
  • Claude Code as their copilot, just like the job
  • Every keystroke, prompt, and AI response captured
[09:14] cloned api-server.git
[09:15] opened src/middleware/limiter.ts
[09:18] prompt: "explain the windowing logic"
[09:21] edited src/middleware/limiter.ts (+12 -8)
[09:24] prompt: "what edge cases am I missing?"
[09:31] ran tests · 14 passing
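
Under the hood, a captured session is just an ordered event stream. The shape below is an assumption inferred from the log above, not Gonfire's published schema:

type SessionEvent =
  | { at: string; kind: "clone"; repo: string }
  | { at: string; kind: "open"; path: string }
  | { at: string; kind: "prompt"; text: string } // message sent to Claude
  | { at: string; kind: "edit"; path: string; added: number; removed: number }
  | { at: string; kind: "test_run"; passing: number };

// The log above, as data:
const events: SessionEvent[] = [
  { at: "09:14", kind: "clone", repo: "api-server.git" },
  { at: "09:15", kind: "open", path: "src/middleware/limiter.ts" },
  { at: "09:18", kind: "prompt", text: "explain the windowing logic" },
  { at: "09:21", kind: "edit", path: "src/middleware/limiter.ts", added: 12, removed: 8 },
  { at: "09:24", kind: "prompt", text: "what edge cases am I missing?" },
  { at: "09:31", kind: "test_run", passing: 14 },
];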

Get a verdict, not just a score.

10 independent agents review against your rubric. Every line classified human / AI / AI-modified. Full session replay. Auto-generated debrief questions.

  • 10+ AI agents grading independently
  • Line-by-line human vs AI attribution
  • Click-through session replay (every keystroke + prompt)
  • Suggested debrief questions, ready to ask
Correctness 9.2 / 10
Code taste 8.7 / 10
Edge cases 6.4 / 10
AI orchestration 9.0 / 10
Recovery from bug 4.1 / 10
Verdict · STRONG, ask about error handling
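
Everything above lands in one structured report. Roughly modeled, with field names that are illustrative rather than Gonfire's actual schema:

interface AssessmentReport {
  scores: Record<string, number>; // one entry per rubric dimension, 0–10
  attribution: Array<{
    line: number;
    label: "human-original" | "ai-generated" | "ai-modified";
  }>;
  replayUrl: string;          // click-through session replay
  debriefQuestions: string[]; // auto-generated, ready to ask
  verdict: string;            // e.g. "STRONG, ask about error handling"
}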
Before / after

One assessment replaces
your entire interview loop.

Before 3–4 weeks · 8 hr of you
After under 4 hours · 17 min of you
Your current process

The 5-stage gauntlet

Recruiter screen 30 min
Take-home assignment ~6 hr / candidate
Technical phone screen 1 hr
On-site interviews ×4 4 hr
Debrief + decision 2 hr
Interviewer time ~8 hr
Calendar time 3–4 weeks
With Gonfire

One assessment. Done.

Send Gonfire link 2 min
Candidate builds on real repo 90 min, async
10 agents review submission ~5 min
Read verdict + replay 15 min
Interviewer time ~17 min
Calendar time under 4 hours
Attribution

Every line.
Classified.

Hover any line to see who wrote it. We track not just keystrokes but intent: human-original, AI-generated whole-cloth, or AI-suggested-then-edited.

93% Classification accuracy on a 5,000-submission labeled benchmark.
3 classes Human-original, AI-generated, AI-modified. Not a binary signal.
Per line Hover any row in a real submission to see the verdict and confidence.
1 import { RateLimiter } from './limiter' Human
2 Human
3 async function enforce(req, key) { AI gen
4 const limiter = new RateLimiter({ window: 60, max: 100 }) AI gen
5 const allowed = await limiter.check(key, req.ip) AI modified
6 if (!allowed) throw new RateLimitError(key) Human
7 return allowed AI gen
8 } Human
50% Human
37% AI generated
13% AI modified
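
That per-class breakdown falls straight out of the per-line labels. A minimal count-and-round sketch, with an Attribution type assumed to mirror the three classes above:

type Attribution = "human" | "ai-generated" | "ai-modified";

function breakdown(lines: Attribution[]): Record<Attribution, number> {
  const counts: Record<Attribution, number> = {
    "human": 0,
    "ai-generated": 0,
    "ai-modified": 0,
  };
  for (const label of lines) counts[label]++;
  const total = lines.length || 1;
  for (const key of Object.keys(counts) as Attribution[]) {
    counts[key] = Math.round((counts[key] / total) * 100); // to percent
  }
  return counts;
}

// e.g. a 10-line file with 5 human, 4 AI-generated, 1 AI-modified lines:
// → { human: 50, "ai-generated": 40, "ai-modified": 10 }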
Multi-agent review · session #9k2x · Δ 4.7s
A1 · Correctness · Tests, edge cases, type safety · 9.2
A2 · Code taste · Naming, structure, idioms · 8.7
A3 · Edge cases · Concurrency, nulls, overflow · 6.4
A4 · AI orchestration · Prompt quality, follow-ups · 9.0
A5 · Bug recovery · Reaction to broken state · 4.1
A6 · Test design · Coverage, intent, isolation · 8.4
A7 · Architecture · Boundaries, abstractions · 8.0
A8 · Performance · Big-O, allocations · 7.1
A9 · Security · Inputs, secrets, IDOR · 8.8
A10 · Documentation · PR quality, comments · 8.2
Aggregate verdict · STRONG HIRE · 7.8 / 10
Multi-agent

10 agents.
Independent verdicts.
Zero bias.

One human reviewer has a bad day. 10 specialist agents grade independently against your rubric, then aggregate. The result is a more reliable signal than any single interviewer can produce.

10 agents Each grading one rubric dimension. Disagreement is logged, not flattened.
~5 min From submit to aggregate verdict. Reviewer reads the verdict, not the code.
Custom rubric Bring your own dimensions. Calibrate against your existing strong-hire bar.
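
How ten verdicts become one number isn't published, but even the simplest aggregation illustrates the idea: average the scores, and keep the spread as an explicit disagreement signal instead of discarding it. A sketch only; Gonfire's real method may differ.

interface AgentVerdict {
  agent: string;     // e.g. "A5"
  dimension: string; // e.g. "Bug recovery"
  score: number;     // 0–10
}

// Sketch: unweighted mean, with standard deviation kept as the
// logged disagreement rather than hidden in the average.
function aggregate(verdicts: AgentVerdict[]) {
  const n = verdicts.length;
  const mean = verdicts.reduce((s, v) => s + v.score, 0) / n;
  const variance = verdicts.reduce((s, v) => s + (v.score - mean) ** 2, 0) / n;
  return {
    score: Number(mean.toFixed(1)),
    disagreement: Number(Math.sqrt(variance).toFixed(2)), // logged, not flattened
  };
}

With the ten scores from session #9k2x above (9.2, 8.7, 6.4, 9.0, 4.1, 8.4, 8.0, 7.1, 8.8, 8.2), an unweighted mean already lands on the 7.8 aggregate shown in the report.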
Session replay

Watch the build,
not just the output.

Every keystroke, every prompt, every Claude response, every test run. Scrub the timeline like a debugger. Spot the moment they figured it out, or the moment they gave up and pasted.

Every keystroke Including pastes, deletions, AI completions, and tab-accepts. Nothing dropped.
Every prompt Full Claude conversation log, ordered alongside the diff that resulted.
Full retention Replays stay accessible for as long as the role is open. Share them with the hiring committee.
$ git checkout -b ratelimit-fix
> opened src/middleware/limiter.ts
> prompt to claude: "explain the windowing logic"
> claude: The current implementation uses a fixed window: 60-second buckets, max 100 req. Edge case: a burst at second 59 + second 60 will allow up to 200 within 2s. Suggestion: switch to sliding window.
> editing limiter.ts
12:14 / 41:08
opened file · prompt → claude · edit (+12 -8) · test run · commit
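
Scrubbing that timeline is conceptually just a seek over the captured event stream. A toy sketch with an assumed minimal event shape:

type ReplayEvent = { at: string; kind: string }; // minimal assumed shape

// Replay state at time `at` = every event recorded up to that point.
// "HH:MM" timestamps compare correctly as strings.
function seek(events: ReplayEvent[], at: string): ReplayEvent[] {
  return events.filter((e) => e.at <= at);
}

// e.g. seek(events, "09:21") reconstructs the session just after the
// first edit to src/middleware/limiter.ts.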
vs the field

Why teams pick Gonfire.

Other platforms bolted AI on. We were built around it from day zero.

Capability | HackerRank | CodeSignal | Saffron | Gonfire
What candidates build on | LeetCode-style sandbox | Algorithmic tasks | Real codebase | Real codebase + your stack
AI tools available | Restricted copilot | Limited AI | Claude Code | Claude Code + Cursor compat
Line-by-line attribution | × | × | ✓ | ✓ with intent classification
Independent agents per review | × | Single model | 10+ | 10+ with custom rubric
Full session replay | Output only | Output only | ✓ | ✓ + AI conversation log
Auto-generated debrief questions | × | × | ✓ | ✓ tailored to weak signals
Time to first assessment | ~1 day | ~1 day | Demo first | 5 minutes · self-serve
Try it without booking a call | × | × | × | ✓ live demo, no Calendly
Engineering teams that ship
We cut our interview loop from three weeks to one afternoon. The line-by-line attribution caught two senior candidates who would have slipped through.
Sara Mendez, Head of Engineering
Strawn · 80-engineer infra team
The replay is the killer feature. Watching how a candidate prompts Claude tells me more about their judgment than any whiteboard ever did.
Daniel Kim, Staff Engineer
Helmstone · hiring committee
Pricing

Pay for signal,
not seats.

Every plan includes the full platform. You only pay for the assessments you run.

Starter
$149 / mo
For teams running their first AI-native loop.
  • 5 assessments / month
  • Multi-agent review · 10 agents
  • Line-level attribution
  • Full session replay
  • Email support
Start free trial
Growth
$449 / mo
  • 20 assessments / month
  • Bring-your-own-key Cursor / Copilot
Start free trial
Enterprise
Custom
For orgs with high volume or compliance needs.
  • Unlimited assessments
  • SSO + SCIM provisioning
  • Self-hosted runner option
  • SOC 2 + DPA
  • Dedicated account manager
Talk to us
FAQ

Things engineering leaders ask first.

How do you keep our codebase secure?
Candidates work in isolated, ephemeral sandboxes. Your repo is mirrored read-only, and the mirror is shredded at the end of each session. SOC 2 Type II; cloud-region pinning available on Enterprise.

Can candidates use AI during the assessment?
Yes, that's the point. Claude Code is built in. Bring-your-own-key Cursor / Copilot is supported on Growth and up. Every prompt and response is logged for review.

How accurate is the line-level attribution?
93% on our internal labeled benchmark of 5,000+ submissions. We classify lines as human-original, AI-generated, or AI-suggested-then-modified, using keystroke timing, paste detection, and prompt context.
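
As a loose illustration of those signals (not Gonfire's actual classifier, which is a trained model calibrated on the labeled benchmark), a rule-of-thumb version might look like:

type LineSignal = {
  pasted: boolean;            // paste detection
  msPerChar: number;          // keystroke timing for this line
  followsAiResponse: boolean; // appeared right after an AI response
  editedAfterInsert: boolean; // candidate reworked the inserted text
};

function classify(s: LineSignal): "human-original" | "ai-generated" | "ai-modified" {
  if ((s.pasted || s.followsAiResponse) && s.editedAfterInsert) return "ai-modified";
  // under ~5 ms per character is paste-speed, not typing-speed
  if (s.pasted || s.followsAiResponse || s.msPerChar < 5) return "ai-generated";
  return "human-original";
}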

Which languages and stacks do you support?
TypeScript, JavaScript, Python, Go, Rust, Ruby, Java, C++. Any framework that runs in a Linux container. We've tested Next.js, Rails, FastAPI, Spring, and dozens more. If your stack runs in Docker, it works.

What do candidates think of it?
4.7 / 5 average from 8,000+ post-assessment surveys. Most say it's the first interview that felt like the actual job, because it is.

Can I try it before signing up?
Yes. Click "Try the live demo" at the top. You'll see a real anonymized report with full attribution, agent verdicts, and replay. No email required.

Does it integrate with our ATS?
Native integrations with Greenhouse, Ashby, Lever, and Workday. Gonfire posts the report straight into the candidate scorecard. Webhooks + REST API available for everything else.
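
For anything custom, the webhook route is straightforward to consume. A hypothetical receiver; the payload field names are assumptions, not Gonfire's documented schema:

import express from "express";

const app = express();
app.use(express.json());

// Receive the finished report and forward it wherever you need it.
app.post("/webhooks/gonfire", (req, res) => {
  const { sessionId, verdict, aggregateScore, reportUrl } = req.body; // assumed fields
  console.log(`session ${sessionId}: ${verdict} (${aggregateScore}) -> ${reportUrl}`);
  res.sendStatus(200); // acknowledge so the delivery isn't retried
});

app.listen(3000);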

What happens when an agent flags suspected cheating?
You'll see it in the dashboard with their reasoning. We never auto-reject based on flags. They're context for your hiring team.

Get started

See how your next hire actually thinks.

Drop your email. We'll send a real anonymized report and a link to the live demo. No call required.

No credit card. No sales call. Unsubscribe in one click.