Step 1 of 4

Data Capture

Identifying where AI models fail in real-world usage

Real-time capture

Every failed response is a training opportunity

Traditional AI training relies on static datasets. RawEval captures live user interactions where models actually fail, creating a continuous feedback loop that identifies exactly where AI needs improvement.

• Zero-latency capture (no slowdown for users)
• Automatic PII removal before storage
• Failure detection via user behavior signals
User Query

"Explain quantum tunneling in simple terms"

Model Response

Generic answer; the user spends 15 seconds editing it...

Captured for Evaluation

Queued for expert correction

How capture works

Dual-path execution

When a user submits a query, we create two parallel paths:

Path A: Fast response to the user (standard flow)
Path B: Silent copy to the evaluation buffer
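The dual-path flow can be sketched as follows. This is a minimal illustration, not RawEval's actual implementation; `generate_response` and `evaluation_buffer` are hypothetical names.

```python
import asyncio

# Path B target: a silent in-process buffer (illustrative stand-in for
# whatever queue or log the evaluation pipeline actually reads from).
evaluation_buffer: asyncio.Queue = asyncio.Queue()

async def generate_response(query: str) -> str:
    # Stand-in for the real model call.
    return f"answer to: {query}"

async def handle_query(query: str) -> str:
    # Path A: produce the user-facing response on the standard flow.
    response = await generate_response(query)
    # Path B: copy the interaction into the evaluation buffer.
    # put_nowait never blocks, so this adds no user-visible latency.
    evaluation_buffer.put_nowait({"query": query, "response": response})
    return response
```

Because Path B only enqueues a copy, the user's response is never delayed waiting on the evaluation pipeline.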

Failure detection

We detect model failures through user behavior signals:

• User spends >10s editing response
• Clicks "Wrong" or "Try again"
• Abandons without accepting answer
• Submits follow-up clarification
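The signals above can be combined into a simple predicate. The thresholds and field names below are assumptions for illustration, not RawEval's production values.

```python
# Returns True when any behavioral failure signal fires for an
# interaction event. Field names and the 10-second edit threshold
# are illustrative assumptions.
def looks_like_failure(event: dict) -> bool:
    return (
        event.get("edit_seconds", 0) > 10              # long manual edit
        or event.get("clicked_wrong", False)           # "Wrong" / "Try again"
        or event.get("abandoned", False)               # left without accepting
        or event.get("followup_clarification", False)  # re-asked for clarity
    )
```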

Priority queueing

Failed prompts are automatically prioritized based on:

• Domain complexity
• Confidence score of failure
• Enterprise client demand
• Model improvement potential
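One way to combine these factors is a weighted score over a max-priority heap. The weights and signal names here are hypothetical; the real prioritization model is internal to RawEval.

```python
import heapq

# Hypothetical weights over the four factors listed above,
# each signal assumed normalized to [0, 1].
WEIGHTS = {
    "domain_complexity": 0.3,
    "failure_confidence": 0.3,
    "client_demand": 0.2,
    "improvement_potential": 0.2,
}

def priority_score(signals: dict) -> float:
    return sum(WEIGHTS[k] * signals.get(k, 0.0) for k in WEIGHTS)

queue: list = []

def enqueue(prompt_id: str, signals: dict) -> None:
    # heapq is a min-heap, so push the negated score to pop
    # the highest-priority prompt first.
    heapq.heappush(queue, (-priority_score(signals), prompt_id))
```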

Technical implementation

What we capture

Original Prompt: Full user query with context
Model Response: Complete AI-generated answer
Web Context: RAG sources and citations
User Edits: Changes made by the user
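The four fields above map naturally onto a record type. The concrete schema is an assumption mirroring the list, not RawEval's actual data model.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class CaptureRecord:
    original_prompt: str                 # full user query with context
    model_response: str                  # complete AI-generated answer
    web_context: list[str] = field(default_factory=list)  # RAG sources, citations
    user_edits: str | None = None        # changes made by the user, if any
```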

Privacy & security

Automatic PII removal
Email addresses, phone numbers, and personal identifiers stripped before storage
End-to-end encryption
All captured data encrypted at rest and in transit
Zero user impact
Capture happens asynchronously with <5ms overhead
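The PII-stripping step can be illustrated with a small regex scrubber. This is a minimal sketch covering only emails and phone numbers; a real redaction pipeline handles far more patterns (names, addresses, account IDs, etc.).

```python
import re

# Illustrative pre-storage redaction patterns; not exhaustive.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\+?\d[\d\s().-]{7,}\d"), "[PHONE]"),
]

def scrub(text: str) -> str:
    # Replace each matched identifier with a placeholder before storage.
    for pattern, placeholder in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```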
• 47K+ queries captured daily
• 4.3% average failure rate
• <5ms capture overhead
• 100% privacy compliant

Why real-world capture matters

Traditional approach

  • ✗ Training on static benchmark datasets
  • ✗ No visibility into real user pain points
  • ✗ Weeks/months between failure and fix
  • ✗ Biased toward academic test cases

RawEval approach

  • ✓ Live capture of actual model failures
  • ✓ Identify problems as they happen
  • ✓ Hours from failure to training data
  • ✓ Reflects real-world use cases