Institutional annotators produce measurably higher quality training data

Every submission is scored against the same evaluation benchmarks AI labs use to measure their models: an 86.2 mean across five task types.

Drafted vs Industry Benchmark Scores

Drafted LLM as Judge scores, using rubrics derived from each benchmark's scoring methodology, vs. estimated unverified annotator performance from annotation quality research. OSWorld's 72.36% is the only human baseline published directly in a benchmark paper (Xie et al., 2024).

All scores are on a 0-100 scale.

Task Type (Benchmark)                  Est. unverified annotator baseline   Drafted LLM as Judge score
Tool Call Evaluation (ToolBench)       ~68                                  92
Red Team Analysis (HarmBench)          ~70                                  90.2
Computer Use Trajectory (OSWorld)      72.36 (published human baseline)     84
Reasoning Verification (PRM800K)       ~60                                  83.5
Multilingual Evaluation (Global MMLU)  ~72                                  80.8
Scoring notes by task type:

  • Tool Call Evaluation (ToolBench): Drafted score is an LLM as Judge score using a rubric derived from ToolBench (Qin et al., 2023) scoring methodology; the rubric evaluates correct tool selection, parameter accuracy, and result interpretation. The ~68 baseline is estimated from annotation quality meta research (Ziegler et al., 2019; Bai et al., 2022) on tool use tasks by unverified crowdworkers; ToolBench does not publish human annotator scores.
  • Red Team Analysis (HarmBench): Drafted score is an LLM as Judge score using a rubric derived from HarmBench (Mazeika et al., 2024) scoring methodology; the rubric evaluates attack vector identification, harm classification, and response quality. The ~70 baseline is estimated from the safety annotation IAA literature (Pavlopoulos et al., 2020; Hartvigsen et al., 2022, ToxiGen); HarmBench does not publish crowdworker baseline scores.
  • Computer Use Trajectory (OSWorld): Drafted score is an LLM as Judge score using a rubric derived from OSWorld (Xie et al., 2024) scoring methodology; trajectories are also scored in parallel by a deterministic OSWorld style evaluator (URL coverage, step efficiency, action diversity, loop detection). The 72.36% baseline is the published human baseline from the OSWorld paper, and the only baseline above drawn directly from a benchmark paper.
  • Reasoning Verification (PRM800K): Drafted score is an LLM as Judge score using a rubric derived from PRM800K (Lightman et al., 2023) scoring methodology; the rubric evaluates step level verdicts, first error localization, and correction validity. The ~60 baseline is estimated from math reasoning annotation studies (Cobbe et al., 2021; Lightman et al., 2023) on untrained crowdworkers; PRM800K used trained annotators, and no untrained baseline is published.
  • Multilingual Evaluation (Global MMLU): Drafted score is an LLM as Judge score using a rubric derived from Global MMLU scoring methodology; the rubric evaluates fluency, accuracy, and cultural appropriateness. The ~72 baseline is estimated from native speaker annotation accuracy on FLORES 200 and multilingual RLHF studies (Ustun et al., 2024, Aya); Global MMLU does not publish annotator performance scores.

Gig annotators are paid per task. The rational move is speed: 30 seconds, next. When Ouyang et al. (2022) trained InstructGPT, their best-managed labelers hit ~0.65 inter-annotator agreement. That is the ceiling. Below it, quality degrades with every cohort rotation.

Drafted sources annotators through university career center partnerships (USC, UChicago, Georgetown, University of Miami). Enrolled students, identity verified, matched by major. They build credential portfolios, not per-task payouts. Their incentive is quality. The chart above is what that looks like when you score every submission against published benchmark rubrics.

Task Level Scores

Each row maps to a published benchmark. Scores are generated automatically by Claude Haiku, not self reported. Baselines: OSWorld's 72.36% is published (Xie et al., 2024); the others are estimated from the crowdworker accuracy literature (Ziegler et al., 2019; Lightman et al., 2023; Rottger et al., 2022).
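
For illustration, here is a minimal sketch of what rubric based LLM as Judge scoring can look like using the Anthropic Python SDK. The model alias, rubric text, and integer-only response format are assumptions for this example, not Drafted's production prompt:

```python
# Minimal LLM-as-Judge sketch; rubric and parsing are illustrative only.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

RUBRIC = """Score the submission 0-100 against these criteria:
1. Correct tool selection
2. Parameter accuracy
3. Result interpretation
Respond with a single integer and nothing else."""

def judge(submission: str) -> int:
    """Ask Claude Haiku for a rubric-based score of one submission."""
    message = client.messages.create(
        model="claude-3-5-haiku-latest",  # assumed alias; pin a dated model in production
        max_tokens=8,
        system=RUBRIC,
        messages=[{"role": "user", "content": submission}],
    )
    return int(message.content[0].text.strip())
```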

Task Type                 Rubric        n    Median   Baseline
Tool Call Evaluation      ToolBench     6    90.5     ~68 est. unverified
Red Team Analysis         HarmBench     13   91       ~70 est. unverified
Computer Use Trajectory   OSWorld       4    84       72.36 published human (Xie et al., 2024)
Process Supervision       PRM800K       13   89       ~60 est. unverified
Multilingual Evaluation   Global MMLU   8    83.5     ~72 est. native speaker (Ustun et al., 2024)

The Deliberation Gap

A contractor on a gig platform picks Response A in 30 seconds and moves on. A CS major at USC spends 7 minutes reading both outputs, changes their preference rating 20 times, writes a structured rationale, and never pastes a single character. That behavioral gap is the difference between noise and signal, and it shows up in every metric we capture.

Drafted annotators spend 3 to 11 minutes per task across every task type. Industry benchmarks for contract annotation sit at 30 seconds to 3 minutes. The multiples speak for themselves.

Task Type               Drafted Avg   Industry       Multiple
Code Review             10.7 min      1 to 2 min     5 to 10x
Computer Use            8.9 min       1 to 3 min     3 to 9x
Multilingual            6.1 min       1 to 2 min     3 to 6x
RLHF Ranking            4.9 min       30 to 90 sec   3 to 10x
Process Supervision     4.5 min       1 to 2 min     2 to 4x
Red Team Analysis       3.3 min       30 to 60 sec   3 to 7x
Tool Call Evaluation    3.3 min       30 to 60 sec   3 to 7x

Note on scoring coverage: Code Review and RLHF Preference Ranking appear in the engagement table but not in the Task Level Scores above. Code Review is evaluated via heuristic scoring and gold QC (code correctness requires deterministic execution, not LLM rubric scoring). RLHF preference data is evaluated via Krippendorff's Alpha across shared prompts rather than a per-task LLM as Judge score; see the Inter Annotator Agreement section below.

Three Layers. No Blind Spots.

Most annotation platforms run one quality check: a spot review by an internal reviewer, if that. We run three independent signals on every submission. Two are automated and score in real time. The third is deterministic, built around your rubric and your gold answers, and ships per pilot. Together they catch what any single layer misses.

  • Layer 1 Heuristic Scoring: Computed the moment a student submits. Measures time on task, scroll depth, rationale length, keystroke count, paste events, and tab visibility (a simplified sketch follows this list).
  • Layer 2 Gold Standard QC (rolling into pilots): Seeded gold items with expert verified answers injected invisibly into an annotator's workflow. Deterministic grading against known answers is the only score that cannot be gamed. This layer ships into production per pilot with the buyer's own gold set and rubric.
  • Layer 3 LLM as Judge: Every submission is scored by Claude Haiku using a rubric derived from a matched benchmark's scoring methodology (PRM800K for reasoning, HarmBench for red team, ToolBench for tool call, OSWorld for computer use, Global MMLU for multilingual). Computer use trajectories are additionally scored by a deterministic OSWorld style evaluator running in parallel.
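
To make Layer 1 concrete, here is a simplified sketch of how behavioral telemetry can roll up into a single effort score. The weights, thresholds, and field names are illustrative assumptions, not Drafted's production scoring:

```python
# Layer 1 heuristic scoring sketch; weights and thresholds are invented.
from dataclasses import dataclass

@dataclass
class SubmissionTelemetry:
    time_on_task_sec: float
    scroll_depth_pct: float   # 0-100: how much of the content was read
    rationale_chars: int
    keystrokes: int
    paste_events: int
    tab_hidden_sec: float     # time the task tab was not visible

def heuristic_score(t: SubmissionTelemetry) -> float:
    """Combine behavioral signals into a 0-100 effort score."""
    score = 0.0
    score += min(t.time_on_task_sec / 180, 1.0) * 30  # sustained time on task
    score += (t.scroll_depth_pct / 100) * 15          # scrolled through everything
    score += min(t.rationale_chars / 400, 1.0) * 25   # substantive written rationale
    score += min(t.keystrokes / 200, 1.0) * 20        # typed, not pasted
    score += 10 if t.paste_events == 0 else 0         # zero paste events
    score -= min(t.tab_hidden_sec / 60, 1.0) * 10     # penalize tab switching
    return max(0.0, min(100.0, score))
```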

When Annotators Agree, Signal Compounds

High individual scores mean nothing if annotators disagree on what "good" looks like. That is why academic annotation research uses Krippendorff's Alpha as the gold standard reliability metric. Unlike Cohen's Kappa, it handles missing data, more than two annotators, and multiple measurement levels. Published crowdsourced RLHF annotator agreement typically falls in the 0.50 to 0.67 band (Ouyang et al., 2022). Most annotation platforms don't publish this number at all.

We publish ours, including the numbers that aren't yet where we want them. The honest baseline is more useful than a selective one.

RLHF Preference Ranking (largest cohort)

14 annotators · 6 shared prompts · 23 annotations · first run, zero calibration sessions

α = 0.667 (nominal)

On the binary preference choice (A vs B). Meets Krippendorff's (2004) threshold for tentative conclusions (0.667 ≤ α < 0.80); comparable to trained labeler agreement reported for RLHF data (Ouyang et al., 2022).

α = 0.318 (ordinal)

On the 1 to 5 dimension rating scales. Below reliability threshold. Flagged needs_improvement by our own system. This is the primary target of the calibration sessions that ship in each pilot.

Reliability thresholds (Krippendorff 2004): α ≥ 0.80 reliable · 0.667 ≤ α < 0.80 tentative · α < 0.667 discard. Published crowdsourced RLHF annotator agreement typically falls in the 0.50 to 0.67 band (Ouyang et al., 2022 report ~0.65 among trained labelers). Our calibration session target for production pilots is combined α ≥ 0.80.

Other task types: Computer Use, Multilingual, and Process Supervision currently have only 1 shared prompt each in the cohort. Krippendorff's Alpha is not statistically stable at that sample size (Hayes & Krippendorff, 2007). Red Team Analysis and Tool Call Evaluation cohorts do not yet have shared prompts across annotators, so α has not been computed for those task types. We are expanding shared prompt pools before publishing those numbers.

Why the Gap Exists

The quality difference is not a coincidence. It is a structural consequence of incentive design. On one side, pay per task rewards speed. On the other, verified credentials and career center portfolios reward depth. The comparison below is the outcome of those two systems running in parallel.

Dimension            Drafted Training Labs                                  Contractor Marketplaces
Identity             University verified enrollment and edu email           Self reported and unverified
Engagement           3 to 11 minutes per task (quality optimized)           30 to 90 seconds per task (throughput optimized)
Quality Validation   Heuristic + LLM as Judge live; gold set QC per pilot   Rarely disclosed publicly
Incentives           Credential badges and verified portfolios              Pay per task (speed equals money)

The Concepts Behind the Scores

If you are evaluating annotation vendors, these are the terms that matter. Each one maps directly to how we score and validate the data your models train on.

What is RLHF (Reinforcement Learning from Human Feedback)?

RLHF is a machine learning technique where human annotators rank or rate AI generated responses. This feedback is used to train a "reward model" that teaches the AI to produce outputs that are more helpful, accurate, and safe.
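
As a rough sketch of the mechanics: the reward model is commonly trained on pairwise preferences with a Bradley-Terry style objective, shown below in PyTorch. The function and tensors are illustrative; each lab's implementation differs:

```python
# Pairwise reward-model loss (Bradley-Terry style); numbers are invented.
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Push the reward of the human-preferred response above the rejected one."""
    # P(chosen preferred) = sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Scalar rewards for a batch of three preference pairs:
loss = reward_model_loss(torch.tensor([1.2, 0.3, 2.0]), torch.tensor([0.9, 0.5, 0.1]))
```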

What is OSWorld and Computer Use Trajectories?

OSWorld (Xie et al., 2024) is a benchmark for evaluating multimodal AI agents on real computer tasks across Ubuntu, Windows, and macOS. Human performance is 72.36%. A "trajectory" is a step by step recording of a human completing a task, capturing screenshots, click coordinates, keystrokes, and written reasoning at each step. Agent performance has closed the gap rapidly (Agent S reached 72.6% in late 2025), but production grade agent reliability in enterprise workflows still bottlenecks on high quality human demonstration data. That is the kind of data our Computer Use cohort is built to produce.
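
In code, one step of such a trajectory might be shaped like the sketch below. The field names are illustrative assumptions, not the OSWorld schema:

```python
# Illustrative shape for one computer-use trajectory step.
from dataclasses import dataclass

@dataclass
class TrajectoryStep:
    step_index: int
    screenshot_path: str               # full-screen capture before the action
    action: str                        # e.g. "click", "type", "scroll", "hotkey"
    click_xy: tuple[int, int] | None   # screen coordinates for click actions
    keystrokes: str | None             # text typed at this step, if any
    reasoning: str                     # annotator's written rationale

trajectory: list[TrajectoryStep] = []  # one task = an ordered list of steps
```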

What is Process Supervision (PRM800K)?

Instead of just checking if an AI got the final answer right (Outcome Supervision), Process Supervision evaluates every single step of the AI's reasoning. PRM800K is a dataset used to train Process Reward Models, which are critical for improving AI performance in complex math, logic, and coding tasks.
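
A small invented example of what step level supervision looks like, using a PRM800K style per-step rating convention (1 correct, 0 neutral, -1 incorrect); the dictionary structure is illustrative, not the exact dataset schema:

```python
# Process supervision labels every reasoning step, not just the final answer.
solution_steps = [
    {"step": "Let x be the number of apples, so 3x + 2 = 14.", "rating": 1},
    {"step": "Subtracting 2 from both sides gives 3x = 12.",   "rating": 1},
    {"step": "Dividing by 3 gives x = 3.",                     "rating": -1},  # first error: x = 4
]

# First-error localization: the index of the first step rated incorrect.
first_error = next((i for i, s in enumerate(solution_steps) if s["rating"] == -1), None)
```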

What is Krippendorff's Alpha?

A statistical measure of agreement between multiple annotators (Inter Annotator Agreement, or IAA). Krippendorff (2004) defines three reliability bands: α ≥ 0.80 is reliable (draw firm conclusions), 0.667 ≤ α < 0.80 supports only tentative conclusions, and α < 0.667 is unreliable. Unlike simpler metrics such as Cohen's Kappa, α handles missing data, more than two annotators, and multiple levels of measurement (nominal, ordinal, interval). That is why it is the standard in academic annotation research.
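
For a hands-on check, the open source krippendorff Python package (pip install krippendorff) computes α directly. The ratings below are invented to show how missing annotations are handled:

```python
import numpy as np
import krippendorff

# Rows = annotators, columns = items; np.nan marks missing annotations,
# which alpha handles natively (unlike Cohen's Kappa).
ratings = np.array([
    [1,      2, 3, 3, np.nan],  # annotator 1
    [1,      2, 3, 4, 5],       # annotator 2
    [np.nan, 2, 3, 4, 5],       # annotator 3
])

print(krippendorff.alpha(reliability_data=ratings, level_of_measurement="nominal"))
print(krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal"))
```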

What is Gold Standard QC?

Gold Standard Quality Control injects "gold items" (prompts with known, expert verified correct answers) invisibly into an annotator's workflow. Because the answer is known, grading is deterministic. It cannot be gamed or bypassed by low effort submissions. In our architecture, the gold set is defined per pilot using the buyer's own rubric and expert answers, which ships alongside pilot launch.
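
A minimal sketch of deterministic gold grading, with invented item IDs and answers; in practice the gold set, injection rate, and pass threshold are defined per pilot from the buyer's rubric:

```python
# Deterministic grading against expert-verified gold answers.
GOLD_ANSWERS = {"item_017": "B", "item_042": "A"}

def grade_gold_items(submissions: dict[str, str]) -> float:
    """Return accuracy on the gold items hidden in an annotator's queue."""
    graded = [submissions[k] == v for k, v in GOLD_ANSWERS.items() if k in submissions]
    return sum(graded) / len(graded) if graded else 0.0

# e.g. grade_gold_items({"item_017": "B", "item_042": "C"}) -> 0.5
```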

The data is published. The methodology is open.

If you are building the models that need this signal, let's talk about a pilot.

Partner with Drafted