Institutional annotators produce measurably higher quality training data

Every submission is scored against the same evaluation benchmarks AI labs use to measure their models: an 86.2 mean across five task types.

Drafted vs Industry Benchmark Scores

Drafted LLM as Judge scores, using rubrics derived from each benchmark's scoring methodology, vs. estimated unverified annotator performance from annotation quality research. OSWorld's 72.36% is the only human baseline published directly in a benchmark paper (Xie et al., 2024).

All scores are on a 0-100 scale.

Task Type (Benchmark)                  Est. unverified annotator baseline   Drafted LLM as Judge score
Tool Call Evaluation (ToolBench)       ~68                                  92
Red Team Analysis (HarmBench)          ~70                                  90.2
Computer Use Trajectory (OSWorld)      72.36 (published human baseline)     84
Reasoning Verification (PRM800K)       ~60                                  83.5
Multilingual Evaluation (Global MMLU)  ~72                                  80.8
Scoring notes by task type:

  • Tool Call Evaluation (ToolBench): Drafted score is an LLM as Judge score using a rubric derived from ToolBench (Qin et al., 2023) scoring methodology; the rubric evaluates correct tool selection, parameter accuracy, and result interpretation. The ~68 baseline is estimated from annotation quality meta research (Ziegler et al., 2019; Bai et al., 2022) on tool use tasks by unverified crowdworkers; ToolBench does not publish human annotator scores.
  • Red Team Analysis (HarmBench): Drafted score is an LLM as Judge score using a rubric derived from HarmBench (Mazeika et al., 2024) scoring methodology; the rubric evaluates attack vector identification, harm classification, and response quality. The ~70 baseline is estimated from the safety annotation IAA literature (Pavlopoulos et al., 2020; Hartvigsen et al., 2022, ToxiGen); HarmBench does not publish crowdworker baseline scores.
  • Computer Use Trajectory (OSWorld): Drafted score is an LLM as Judge score using a rubric derived from OSWorld (Xie et al., 2024) scoring methodology; trajectories are also scored in parallel by a deterministic OSWorld style evaluator (URL coverage, step efficiency, action diversity, loop detection). The 72.36% baseline is the published human baseline from the OSWorld paper, and the only baseline above drawn directly from a benchmark paper.
  • Reasoning Verification (PRM800K): Drafted score is an LLM as Judge score using a rubric derived from PRM800K (Lightman et al., 2023) scoring methodology; the rubric evaluates step level verdicts, first error localization, and correction validity. The ~60 baseline is estimated from math reasoning annotation studies (Cobbe et al., 2021; Lightman et al., 2023) on untrained crowdworkers; PRM800K used trained annotators, and no untrained baseline is published.
  • Multilingual Evaluation (Global MMLU): Drafted score is an LLM as Judge score using a rubric derived from Global MMLU scoring methodology; the rubric evaluates fluency, accuracy, and cultural appropriateness. The ~72 baseline is estimated from native speaker annotation accuracy on FLORES 200 and multilingual RLHF studies (Ustun et al., 2024, Aya); Global MMLU does not publish annotator performance scores.

Gig annotators are paid per task. The rational move is speed: 30 seconds, next. When Ouyang et al. (2022) trained InstructGPT, their best-managed labelers hit ~0.65 inter-annotator agreement. That is the ceiling. Below it, quality degrades with every cohort rotation.

Drafted sources annotators through university career center partnerships (USC, UChicago, Georgetown, University of Miami). Enrolled students, identity verified, matched by major. They build credential portfolios, not per-task payouts. Their incentive is quality. The chart above is what that looks like when you score every submission against published benchmark rubrics.

Task Level Scores

Each row maps to a published benchmark. Scores are generated automatically by Claude Haiku, not self reported. Baselines: OSWorld's 72.36% is published (Xie et al., 2024); the others are estimated from the crowdworker accuracy literature (Ziegler et al., 2019; Lightman et al., 2023; Rottger et al., 2022).
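
For illustration, here is a minimal sketch of what rubric based LLM as Judge scoring can look like using the Anthropic Python SDK. The model alias, rubric text, and integer-only response format are assumptions for this example, not Drafted's production prompt:

```python
# Minimal LLM-as-Judge sketch; rubric and parsing are illustrative only.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

RUBRIC = """Score the submission 0-100 against these criteria:
1. Correct tool selection
2. Parameter accuracy
3. Result interpretation
Respond with a single integer and nothing else."""

def judge(submission: str) -> int:
    """Ask Claude Haiku for a rubric-based score of one submission."""
    message = client.messages.create(
        model="claude-3-5-haiku-latest",  # assumed alias; pin a dated model in production
        max_tokens=8,
        system=RUBRIC,
        messages=[{"role": "user", "content": submission}],
    )
    return int(message.content[0].text.strip())
```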

Task Type                 Rubric        n    Median   Baseline
Tool Call Evaluation      ToolBench     6    90.5     ~68 est. unverified
Red Team Analysis         HarmBench     13   91       ~70 est. unverified
Computer Use Trajectory   OSWorld       4    84       72.36 published human (Xie et al., 2024)
Process Supervision       PRM800K       13   89       ~60 est. unverified
Multilingual Evaluation   Global MMLU   8    83.5     ~72 est. native speaker (Ustun et al., 2024)

The Deliberation Gap

A contractor on a gig platform picks Response A in 30 seconds and moves on. A CS major at USC spends 7 minutes reading both outputs, changes their preference rating 20 times, writes a structured rationale, and never pastes a single character. That behavioral gap is the difference between noise and signal, and it shows up in every metric we capture.

Drafted annotators spend 3 to 11 minutes per task across every task type. Industry benchmarks for contract annotation sit at 30 seconds to 3 minutes. The multiples speak for themselves.

Task Type               Drafted Avg   Industry       Multiple
Code Review             10.7 min      1 to 2 min     5 to 10x
Computer Use            8.9 min       1 to 3 min     3 to 9x
Multilingual            6.1 min       1 to 2 min     3 to 6x
RLHF Ranking            4.9 min       30 to 90 sec   3 to 10x
Process Supervision     4.5 min       1 to 2 min     2 to 4x
Red Team Analysis       3.3 min       30 to 60 sec   3 to 7x
Tool Call Evaluation    3.3 min       30 to 60 sec   3 to 7x

Note on scoring coverage: Code Review and RLHF Preference Ranking appear in the engagement table but not in the Task Level Scores above. Code Review is evaluated via heuristic scoring and gold QC (code correctness requires deterministic execution, not LLM rubric scoring). RLHF preference data is evaluated via Krippendorff's Alpha across shared prompts rather than a per-task LLM as Judge score; see the Inter Annotator Agreement section below.

Three Layers. No Blind Spots.

Most annotation platforms run one quality check: a spot review by an internal reviewer, if that. We run three independent signals on every submission. Two are automated and score in real time. The third is deterministic, built around your rubric and your gold answers, and ships per pilot. Together they catch what any single layer misses.

  • Layer 1 Heuristic Scoring: Computed the moment a student submits. Measures time on task, scroll depth, rationale length, keystroke count, paste events, and tab visibility (a simplified sketch follows this list).
  • Layer 2 Gold Standard QC (rolling into pilots): Seeded gold items with expert verified answers injected invisibly into an annotator's workflow. Deterministic grading against known answers is the only score that cannot be gamed. This layer ships into production per pilot with the buyer's own gold set and rubric.
  • Layer 3 LLM as Judge: Every submission is scored by Claude Haiku using a rubric derived from a matched benchmark's scoring methodology (PRM800K for reasoning, HarmBench for red team, ToolBench for tool call, OSWorld for computer use, Global MMLU for multilingual). Computer use trajectories are additionally scored by a deterministic OSWorld style evaluator running in parallel.
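
To make Layer 1 concrete, here is a simplified sketch of how behavioral telemetry can roll up into a single effort score. The weights, thresholds, and field names are illustrative assumptions, not Drafted's production scoring:

```python
# Layer 1 heuristic scoring sketch; weights and thresholds are invented.
from dataclasses import dataclass

@dataclass
class SubmissionTelemetry:
    time_on_task_sec: float
    scroll_depth_pct: float   # 0-100: how much of the content was read
    rationale_chars: int
    keystrokes: int
    paste_events: int
    tab_hidden_sec: float     # time the task tab was not visible

def heuristic_score(t: SubmissionTelemetry) -> float:
    """Combine behavioral signals into a 0-100 effort score."""
    score = 0.0
    score += min(t.time_on_task_sec / 180, 1.0) * 30  # sustained time on task
    score += (t.scroll_depth_pct / 100) * 15          # scrolled through everything
    score += min(t.rationale_chars / 400, 1.0) * 25   # substantive written rationale
    score += min(t.keystrokes / 200, 1.0) * 20        # typed, not pasted
    score += 10 if t.paste_events == 0 else 0         # zero paste events
    score -= min(t.tab_hidden_sec / 60, 1.0) * 10     # penalize tab switching
    return max(0.0, min(100.0, score))
```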

When Annotators Agree, Signal Compounds

High individual scores mean nothing if annotators disagree on what "good" looks like. That is why academic annotation research uses Krippendorff's Alpha as the gold standard reliability metric. Unlike Cohen's Kappa, it handles missing data, more than two annotators, and multiple measurement levels. Published crowdsourced RLHF annotator agreement typically falls in the 0.50 to 0.67 band (Ouyang et al., 2022). Most annotation platforms don't publish this number at all.

We publish ours, including the numbers that aren't yet where we want them. The honest baseline is more useful than a selective one.

RLHF Preference Ranking (largest cohort)

14 annotators · 6 shared prompts · 23 annotations · first run, zero calibration sessions

α = 0.667 (nominal)

On the binary preference choice (A vs B). Meets Krippendorff's (2004) threshold for tentative conclusions (0.667 ≤ α < 0.80); comparable to trained labeler agreement reported for RLHF data (Ouyang et al., 2022).

α = 0.318 (ordinal)

On the 1 to 5 dimension rating scales. Below reliability threshold. Flagged needs_improvement by our own system. This is the primary target of the calibration sessions that ship in each pilot.

Reliability thresholds (Krippendorff 2004): α ≥ 0.80 reliable · 0.667 ≤ α < 0.80 tentative · α < 0.667 discard. Published crowdsourced RLHF annotator agreement typically falls in the 0.50 to 0.67 band (Ouyang et al., 2022 report ~0.65 among trained labelers). Our calibration session target for production pilots is combined α ≥ 0.80.

Other task types: Computer Use, Multilingual, and Process Supervision currently have only 1 shared prompt each in the cohort. Krippendorff's Alpha is not statistically stable at that sample size (Hayes & Krippendorff, 2007). Red Team Analysis and Tool Call Evaluation cohorts do not yet have shared prompts across annotators, so α has not been computed for those task types. We are expanding shared prompt pools before publishing those numbers.

Why the Gap Exists

The quality difference is not a coincidence. It is a structural consequence of incentive design. On one side, pay per task rewards speed. On the other, verified credentials and career center portfolios reward depth. The comparison below is the outcome of those two systems running in parallel.

Dimension            Drafted Training Labs                                  Contractor Marketplaces
Identity             University verified enrollment and edu email           Self reported and unverified
Engagement           3 to 11 minutes per task (quality optimized)           30 to 90 seconds per task (throughput optimized)
Quality Validation   Heuristic + LLM as Judge live; gold set QC per pilot   Rarely disclosed publicly
Incentives           Credential badges and verified portfolios              Pay per task (speed equals money)

The Concepts Behind the Scores

If you are evaluating annotation vendors, these are the terms that matter. Each one maps directly to how we score and validate the data your models train on.

What is RLHF (Reinforcement Learning from Human Feedback)?

RLHF is a machine learning technique where human annotators rank or rate AI generated responses. This feedback is used to train a "reward model" that teaches the AI to produce outputs that are more helpful, accurate, and safe.
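
As a rough sketch of the mechanics: the reward model is commonly trained on pairwise preferences with a Bradley-Terry style objective, shown below in PyTorch. The function and tensors are illustrative; each lab's implementation differs:

```python
# Pairwise reward-model loss (Bradley-Terry style); numbers are invented.
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Push the reward of the human-preferred response above the rejected one."""
    # P(chosen preferred) = sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Scalar rewards for a batch of three preference pairs:
loss = reward_model_loss(torch.tensor([1.2, 0.3, 2.0]), torch.tensor([0.9, 0.5, 0.1]))
```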

What is OSWorld and Computer Use Trajectories?

OSWorld (Xie et al., 2024) is a benchmark for evaluating multimodal AI agents on real computer tasks across Ubuntu, Windows, and macOS. Human performance is 72.36%. A "trajectory" is a step by step recording of a human completing a task, capturing screenshots, click coordinates, keystrokes, and written reasoning at each step. Agent performance has closed the gap rapidly (Agent S reached 72.6% in late 2025), but production grade agent reliability in enterprise workflows still bottlenecks on high quality human demonstration data. That is the kind of data our Computer Use cohort is built to produce.
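
In code, one step of such a trajectory might be shaped like the sketch below. The field names are illustrative assumptions, not the OSWorld schema:

```python
# Illustrative shape for one computer-use trajectory step.
from dataclasses import dataclass

@dataclass
class TrajectoryStep:
    step_index: int
    screenshot_path: str               # full-screen capture before the action
    action: str                        # e.g. "click", "type", "scroll", "hotkey"
    click_xy: tuple[int, int] | None   # screen coordinates for click actions
    keystrokes: str | None             # text typed at this step, if any
    reasoning: str                     # annotator's written rationale

trajectory: list[TrajectoryStep] = []  # one task = an ordered list of steps
```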

What is Process Supervision (PRM800K)?

Instead of just checking if an AI got the final answer right (Outcome Supervision), Process Supervision evaluates every single step of the AI's reasoning. PRM800K is a dataset used to train Process Reward Models, which are critical for improving AI performance in complex math, logic, and coding tasks.
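
A small invented example of what step level supervision looks like, using a PRM800K style per-step rating convention (1 correct, 0 neutral, -1 incorrect); the dictionary structure is illustrative, not the exact dataset schema:

```python
# Process supervision labels every reasoning step, not just the final answer.
solution_steps = [
    {"step": "Let x be the number of apples, so 3x + 2 = 14.", "rating": 1},
    {"step": "Subtracting 2 from both sides gives 3x = 12.",   "rating": 1},
    {"step": "Dividing by 3 gives x = 3.",                     "rating": -1},  # first error: x = 4
]

# First-error localization: the index of the first step rated incorrect.
first_error = next((i for i, s in enumerate(solution_steps) if s["rating"] == -1), None)
```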

What is Krippendorff's Alpha?

A statistical measure of agreement between multiple annotators (Inter Annotator Agreement, or IAA). Krippendorff (2004) defines three reliability bands: α ≥ 0.80 is reliable (draw firm conclusions), 0.667 ≤ α < 0.80 supports only tentative conclusions, and α < 0.667 is unreliable. Unlike simpler metrics such as Cohen's Kappa, α handles missing data, more than two annotators, and multiple levels of measurement (nominal, ordinal, interval). That is why it is the standard in academic annotation research.
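
For a hands-on check, the open source krippendorff Python package (pip install krippendorff) computes α directly. The ratings below are invented to show how missing annotations are handled:

```python
import numpy as np
import krippendorff

# Rows = annotators, columns = items; np.nan marks missing annotations,
# which alpha handles natively (unlike Cohen's Kappa).
ratings = np.array([
    [1,      2, 3, 3, np.nan],  # annotator 1
    [1,      2, 3, 4, 5],       # annotator 2
    [np.nan, 2, 3, 4, 5],       # annotator 3
])

print(krippendorff.alpha(reliability_data=ratings, level_of_measurement="nominal"))
print(krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal"))
```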

What is Gold Standard QC?

Gold Standard Quality Control injects "gold items" (prompts with known, expert verified correct answers) invisibly into an annotator's workflow. Because the answer is known, grading is deterministic. It cannot be gamed or bypassed by low effort submissions. In our architecture, the gold set is defined per pilot using the buyer's own rubric and expert answers, which ships alongside pilot launch.
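
A minimal sketch of deterministic gold grading, with invented item IDs and answers; in practice the gold set, injection rate, and pass threshold are defined per pilot from the buyer's rubric:

```python
# Deterministic grading against expert-verified gold answers.
GOLD_ANSWERS = {"item_017": "B", "item_042": "A"}

def grade_gold_items(submissions: dict[str, str]) -> float:
    """Return accuracy on the gold items hidden in an annotator's queue."""
    graded = [submissions[k] == v for k, v in GOLD_ANSWERS.items() if k in submissions]
    return sum(graded) / len(graded) if graded else 0.0

# e.g. grade_gold_items({"item_017": "B", "item_042": "C"}) -> 0.5
```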

The data is published. The methodology is open.

If you are building the models that need this signal, let's talk about a pilot.

Partner with Drafted