Backed by Y Combinator

Data & Evaluation
for Voice AI.

The richest multilingual voice dataset and the most rigorous evaluation framework - built from 2M+ live production calls.

Built on a foundation of scale.

The data, languages, and metrics powering Voice AI evaluation at every layer.

65,000+

Contributors building & evaluating Voice AI.

80+

Languages covered across global voice datasets.

40+

Automated eval metrics across every conversation.

3-Layer

Eval stack: automated, industry-specific, and human expert review.


How it works

From spec to delivery.

STEP 01

Scope & Design

We align on target languages, dialects, demographics, recording conditions, and use-case domain. Together we draft an evaluation rubric - what counts as a pass, what counts as a failure event, and how each turn should be scored.

+Language & dialect targeting
+Demographic & domain spec
+Custom eval rubric
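
As a rough illustration of what a rubric like this could look like once written down, here is a minimal Python sketch. The `Criterion` fields, the weights, and the thresholds are all assumptions made for this example; they are not Samora's actual rubric format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str         # hypothetical label, e.g. "intent_match"
    weight: float     # relative importance in the turn score
    min_score: float  # below this, the turn counts as a failure on this criterion


def score_turn(criteria, scores):
    """Return (weighted score, names of failed criteria) for one turn."""
    total_weight = sum(c.weight for c in criteria)
    weighted = sum(c.weight * scores[c.name] for c in criteria) / total_weight
    failed = [c.name for c in criteria if scores[c.name] < c.min_score]
    return weighted, failed


# Illustrative rubric with two made-up criteria.
rubric = [
    Criterion("intent_match", weight=2.0, min_score=0.8),
    Criterion("latency_ok", weight=1.0, min_score=0.5),
]

turn_score, failures = score_turn(
    rubric, {"intent_match": 0.5, "latency_ok": 1.0}
)
```

Keeping the per-criterion threshold separate from the weighted score means a turn can score reasonably overall and still be flagged for one specific failure.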


Multilingual voice data

Spin up language-specific data collection in days.

From contributor recruitment to verbatim transcription with dialect codes, every dataset is shaped to your model and your market.

+65,000+ contributors across 80+ countries
+Verbatim transcripts with dialect codes
+Full consent & provenance chain
+Custom demographic & annotation specs


Studio / Controlled

High-fidelity recordings in acoustically treated environments.

Call Center / Telephony

Real production telephony audio with codec realism.

Mobile & Outdoor

In-the-wild captures across devices and ambient conditions.

Synthetic / Edge Cases

Targeted adversarial and long-tail scenarios.

Evaluation

The most rigorous evaluation framework for voice AI.

A three-layer stack - automated, industry-tuned, and human-reviewed - applied to every conversation your agent has.

LAYER 01

WER · Diarization · Intent Match · Entity F1 · MOS · SNR · Prosody · Latency · Turn-taking

Turn-by-Turn AI Evaluation

Automated metrics applied to every turn across transcript, speech, and behavior.

Transcript · Speech · Behavioral
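
To make one of the transcript metrics concrete, here is a minimal Python sketch of word error rate (WER) for a single turn, using the standard definition: word-level edit distance divided by the number of reference words. The function name and the whitespace tokenization are assumptions for this example; a production pipeline would likely normalize casing, punctuation, and numerals first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("fifty" -> "fifteen") and one deletion ("today")
# against a four-word reference.
wer = word_error_rate("pay fifty dollars today", "pay fifteen dollars")
```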

LAYER 02

FDCPA Checks · Right Party Contact · Promise to Pay · Mini-Miranda · Screening Fidelity · Bias Surface · Drop-off · Scheduling Conv. · Resolution

Industry-Specific Evaluation

Domain rubrics built with operators in each vertical.

Debt Collection · Recruitment · Customer Support

LAYER 03

Root Cause · Severity · Recurrence · Recovery · Prompt Diffs · Tool Use · Knowledge Gaps · Routing · Prompt Injection

Human Expert Evaluation

Trained reviewers catch what models miss.

Failure Mode ID · Agent Improvement · Adversarial & Edge Cases

The difference

Failure Events, not pass/fail scores.

Call-level pass/fail loses the signal. Samora logs each failure as a structured event with turn, type, severity, and recovery - so you can fix the exact failure mode, not relitigate the entire call.

samora-logs

event_id: fe_28a91c3d
turn: 4 of 12
type: compliance_gap
severity: high
recovery: succeeded · +1 turn
action: re-collect: hi-IN, debt
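
The sample event above could be modeled roughly as follows. This is a minimal Python sketch, not Samora's schema: the `FailureEvent` class, its field names and types, and the second event in the list are assumptions made for illustration. Only the first event's values come from the sample log.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureEvent:
    event_id: str
    turn: int            # turn on which the failure occurred
    total_turns: int     # length of the call in turns
    type: str
    severity: str
    recovered: bool      # whether the agent recovered afterwards
    recovery_turns: int  # turns taken to recover (0 if not recovered)
    action: str          # recommended follow-up


events = [
    # Values taken from the sample log entry.
    FailureEvent("fe_28a91c3d", 4, 12, "compliance_gap", "high",
                 True, 1, "re-collect: hi-IN, debt"),
    # Hypothetical second event, invented purely to show aggregation.
    FailureEvent("fe_hypothetical", 9, 12, "latency_spike", "low",
                 False, 0, "none"),
]

# Because each failure is its own record, failures can be counted by type
# and filtered by severity instead of being collapsed into one call verdict.
by_type = Counter(e.type for e in events)
high_severity = [e for e in events if e.severity == "high"]
```

With records like these, a single call can contribute several events, and each one can be triaged on its own type and severity.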

Why Samora

2M+

Production calls

40+

Eval metrics

80+

Languages

Built ground-up for code-switching, dialects, and low-resource languages - not English with translations bolted on.

FAQ

Frequently asked questions.

We cover 80+ languages spanning major and low-resource locales, with dialect-level metadata (e.g. hi-IN, es-MX, en-NG). New locales spin up in days through our contributor network.

Most custom pipelines deliver first batches within 7–14 days from spec sign-off, scaling weekly thereafter.

Layer 01 is turn-by-turn AI scoring across transcript, speech, and behavior. Layer 02 applies industry-specific rubrics. Layer 03 is human expert review for failure modes and adversarial cases.

Failure Events are structured records of every individual failure (turn, type, severity, recovery, and recommended action) instead of a single call-level pass/fail.

No. Every engagement is on-demand and scoped to your model, locale, and domain. This is the only way to guarantee distributional fit.

Explicit informed consent is captured per contributor, with full provenance chains, PII redaction, and regional compliance (GDPR, sector-specific) by default.

Ready to scope your data & evaluation needs?

Tell us your model, your locales, and your edge cases. We'll come back with a pipeline and an evaluation plan in 48 hours.

Book a Call

Founded by alumni of Stanford & Microsoft
