Everything you need to know about RawEval.
From chatting with AI models and flagging failures, to expert evaluation and data delivery — here's how the platform works and how to get started.
How RawEval works — in three steps
Chat with AI models
Open the RawEval Chat and start a conversation with any supported model — GPT-4o, Claude, Gemini, and more. Chat naturally, test edge cases, or explore topics in your domain.
Open Chat →
Flag what goes wrong
When the AI gives an incorrect, incomplete, or harmful response, click the flag button. Add context about what went wrong — factual error, bias, hallucination, or format issue. Your flag enters the evaluation queue.
Experts verify & correct
Domain-verified experts review flagged interactions using structured rubrics. They score the response, write a corrected version, and tag the failure type. All with full provenance metadata.
Learn about experts →
Connect with every part of RawEval
Chat Platform
Multi-model AI chat with built-in failure flagging. Talk to all top frontier models from one interface.
Expert Workbench
Where verified domain experts evaluate flagged AI failures. Structured rubrics, correction workflows, and quality scoring.
Data Delivery
Audit-ready evaluation data delivered to AI labs and enterprises. Every data point carries full provenance — who evaluated it, their credentials, behavioral verification, and quality scores.
Programmatic access to evaluation data
For teams that want to integrate evaluation data directly into their ML pipeline. All endpoints require a Bearer token — API keys are scoped per organization.
Request
curl -X GET https://api.raweval.com/v1/delivery/batch_42a \
  -H "Authorization: Bearer reval_sk_live_..."
Response
{
"prompt_id": "fail_0x7f3ac9",
"model": "gpt-4o",
"domain": "geography",
"original_response": "Sydney is the capital...",
"corrected_response": "Canberra is the capital...",
"rubric_scores": {
"accuracy": 9.8,
"completeness": 9.1,
"format": 9.5
},
"annotator_tier": "T2",
"improvement_delta": 8.3,
"provenance": {
"expert_verified": true,
"behavioral_check": "passed",
"eu_ai_act_artifact": "audit_847.json"
}
}
Frequently asked questions
Do I need an API key to use RawEval?
No. You can use the Chat and Expert Workbench without an API key. API access is available on Pro and Enterprise plans for programmatic data delivery.
What AI models are supported?
We support all top frontier AI models and add new ones regularly. You can switch between models within the same conversation to compare responses.
How are experts verified?
Experts go through credential verification, a domain skills assessment with identity verification, and ongoing quality monitoring to ensure every evaluation meets our standards.
What data formats do you deliver?
RLHF preference pairs, SFT-ready corrected completions, rubric-scored evaluations, and raw JSONL with full provenance metadata. Custom formats available on Enterprise plans.
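As a rough sketch of how a delivered record maps onto the RLHF preference-pair format: the expert correction becomes the "chosen" completion and the flagged model output the "rejected" one. The field names below follow the sample response shown earlier; the `to_preference_pair` helper itself is hypothetical, not part of the RawEval API.

```python
import json

def to_preference_pair(record: dict) -> dict:
    """Map a rubric-scored evaluation record onto an RLHF preference
    pair: the expert correction is 'chosen', the flagged model output
    is 'rejected'. (Hypothetical helper; field names follow the
    sample delivery payload.)"""
    return {
        "prompt_id": record["prompt_id"],
        "chosen": record["corrected_response"],
        "rejected": record["original_response"],
        "meta": {
            "model": record["model"],
            "domain": record["domain"],
            "improvement_delta": record["improvement_delta"],
        },
    }

# One JSONL line as it might appear in a delivered batch
line = ('{"prompt_id": "fail_0x7f3ac9", "model": "gpt-4o", '
        '"domain": "geography", '
        '"original_response": "Sydney is the capital...", '
        '"corrected_response": "Canberra is the capital...", '
        '"improvement_delta": 8.3}')
pair = to_preference_pair(json.loads(line))
print(pair["chosen"])  # Canberra is the capital...
```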
How does the Expert Workbench work?
Experts receive task assignments matched to their domain. They evaluate flagged AI responses using structured rubrics, write corrections, and score responses across multiple dimensions. Everything is tracked with full provenance.
Can I integrate RawEval into my existing pipeline?
Yes. Use our REST API to submit prompts, retrieve evaluations, and download completed batches. Webhooks notify you when batches complete. Enterprise plans include custom integration support.
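The batch-retrieval step can be sketched in Python with only the standard library. The endpoint path and Bearer scheme are taken from the curl example earlier on this page; webhook handling and error handling are omitted, so treat this as a minimal starting point rather than a full client.

```python
import json
import urllib.request

API_BASE = "https://api.raweval.com/v1"

def fetch_batch(batch_id: str, api_key: str) -> dict:
    """Download a completed evaluation batch as parsed JSON.
    Mirrors the curl example: GET /v1/delivery/{batch_id} with a
    Bearer token. Minimal sketch; no retries or error handling."""
    req = urllib.request.Request(
        f"{API_BASE}/delivery/{batch_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# Build the request without sending it (a real key is required
# to actually fetch a batch):
req = urllib.request.Request(
    f"{API_BASE}/delivery/batch_42a",
    headers={"Authorization": "Bearer reval_sk_live_..."},
)
print(req.full_url)  # https://api.raweval.com/v1/delivery/batch_42a
```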
Ready to try it?
Chat with AI models, flag what goes wrong, and help build better AI — or bring evaluation data into your ML pipeline.