Improving annotation quality with machine learning

Thursday, November 20, 2025, 10:00 AM, from InfoWorld
Data science and machine learning teams face a hidden productivity killer: annotation errors. Recent research from Apple analyzing production machine learning (ML) applications found annotation error rates averaging 10% across search relevance tasks. Even ImageNet, computer vision’s gold standard benchmark, contains a 6% error rate that MIT CSAIL discovered in 2024—errors that have skewed model rankings for years.

The impact extends beyond accuracy metrics. Computer vision teams spend too much of their time on data preparation and annotation, with quality issues creating development bottlenecks where engineers spend more time fixing errors than building models. Teams implementing manual quality control report five to seven review cycles before achieving production-ready data sets, with each cycle requiring coordination across annotators, domain experts, and engineers.

The financial implications follow the 1x10x100 rule: annotation errors cost $1 to fix at creation, $10 during testing, and $100 after deployment when factoring in operational disruptions and reputational damage.

Why current annotation tools fall short

Existing annotation platforms face a fundamental conflict of interest that makes quality management an afterthought rather than a core capability. Enterprise solutions typically operate on business models that incentivize volume—they profit by charging per annotation, not by delivering performant downstream models. This creates incentives to annotate ever-increasing amounts of data with little motivation to prevent errors that would reduce billable work. Their black-box operations provide minimal visibility into QA processes while demanding $50,000+ minimum engagements, making it impossible for teams to understand or improve their annotation quality systematically.

Open-source alternatives like Computer Vision Annotation Tool (CVAT) and Label Studio focus on labeling workflows but lack the sophisticated error detection capabilities needed for production systems. They provide basic consensus mechanisms—multiple annotators reviewing the same samples—but don’t offer prioritization of which samples actually need review or systematic analysis of error patterns.

These shortcomings show up in a telling statistic: 45% of companies now use four or more annotation tools simultaneously, cobbling together partial solutions that still leave quality gaps. The result is a costly, multi-step process where teams cycle through initial annotation, extensive manual QA, correction rounds, and re-validation. Each step adds weeks to development timelines because the underlying tools lack the intelligence to identify and prevent quality issues systematically.

Modern ML development demands annotation platforms that understand data, not just manage labeling workflows. Without this understanding, teams remain trapped in reactive quality control cycles that scale poorly and consume engineering resources that should be focused on model innovation.

A data-centric annotation solution

Voxel51’s flagship product, FiftyOne, fundamentally reimagines annotation quality management by treating it as a data understanding problem rather than a labeling workflow challenge. Unlike traditional platforms that focus on creating labels, FiftyOne helps teams work smarter by identifying which data actually needs annotation attention and where errors are most likely to occur.

Our data-centric approach represents a paradigm shift from reactive quality control to proactive data intelligence. Instead of blindly labeling entire data sets or reviewing random samples, the platform uses ML-powered analysis to prioritize high-impact data, automatically detect annotation errors, and focus human expertise where it matters most.

FiftyOne leverages machine learning to identify specific, actionable quality issues. This methodology recognizes that annotation errors aren’t random—they follow patterns driven by visual complexity, ambiguous edge cases, and systematic biases that can be detected and corrected algorithmically.

This intelligence transforms annotation from a cost center into a strategic capability. Rather than accepting 10% error rates as inevitable, teams can systematically drive down error rates while reducing the time and cost required to achieve production-quality data sets. FiftyOne is backed by an open-source community with three million installs and teams from Microsoft, Google, Bosch, Ford, Raytheon, Berkshire Grey, and more.

Automated error detection with mistakenness scoring

FiftyOne’s compute_mistakenness() capability identifies potential annotation errors by analyzing disagreement between ground truth labels and model predictions. This ML-powered approach ranks errors by likelihood and impact, transforming weeks of manual review into hours of targeted correction.

import fiftyone.brain as fob
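# Assumes `dataset` is a loaded fiftyone.Dataset whose samples already
# contain model predictions in a 'predictions' field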

# Automatically detect likely annotation errors
fob.compute_mistakenness(dataset, 'predictions', label_field='ground_truth')

The system generates several error indicators:

mistakenness: Likelihood that a label is incorrect (0-1 scale)

possible_missing: High-confidence predictions with no ground truth match

possible_spurious: Unmatched ground truth objects likely to be incorrect

from fiftyone import ViewField as F

# Show most likely annotation mistakes first
mistake_view = dataset.sort_by('mistakenness', reverse=True)

# Find highly suspicious labels (>95% error likelihood)
high_errors_view = dataset.filter_labels('ground_truth', F('mistakenness') > 0.95)

# Identify samples with missing annotations
missing_objects_view = dataset.match(F('possible_missing') > 0)

FiftyOne’s interactive interface enables immediate visual verification of flagged errors. Teams can quickly confirm whether detected issues represent actual annotation mistakes or model limitations, focusing human expertise on genuine problems rather than reviewing random samples.
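
To make that review loop concrete, the following is a minimal sketch rather than a prescribed workflow: it launches the App on the mistakenness-sorted view from the snippet above and tags whatever samples the reviewer selects, so they can be routed back for re-annotation. The confirmed_error tag name is an arbitrary choice.

import fiftyone as fo

# Launch the App on the view sorted by likely annotation mistakes
session = fo.launch_app(mistake_view)

# ... reviewer inspects and selects samples with genuine labeling errors ...

# Tag the reviewer's current selection for re-annotation
# ('confirmed_error' is an arbitrary tag name)
dataset.select(session.selected).tag_samples('confirmed_error')

# Retrieve the tagged samples later for correction
confirmed_view = dataset.match_tags('confirmed_error')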


This intelligent prioritization typically achieves significantly faster convergence to accurate labels compared to random sampling approaches, with customers like SafelyYou reporting a 77% reduction in images sent for manual verification.

Patch embedding-based pattern discovery

FiftyOne’s patch embedding visualization exposes quality issues invisible to traditional metrics. The platform’s similarity analysis projects samples into semantic space, revealing clusters of similar images with inconsistent annotations.

In other words, embeddings reveal groups of similar objects that should be labeled the same way but aren’t—consistency-driven error detection.

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

# Path to BDD100k dataset files
source_dir = '/path/to/bdd100k-dataset'

# Load dataset
dataset = foz.load_zoo_dataset('bdd100k', split='validation', source_dir=source_dir)

# Compute patch embeddings using pre-trained model
model = foz.load_zoo_model('mobilenet-v2-imagenet-torch')

gt_patches = dataset.to_patches('detections')
gt_patches.compute_patch_embeddings(
    model=model,
    patches_field='detections',
    embeddings_field='patch_embeddings',
)

# Generate embedding visualization
results = fob.compute_visualization(
    gt_patches, embeddings='patch_embeddings', brain_key='img_viz'
)

# Launch interactive visualization
session = fo.launch_app(gt_patches)

Clusters can be used to identify vendor-specific annotation errors invisible to statistical quality metrics—errors that only become apparent when visualizing the semantic similarity of misclassified samples.
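
When a suspicious cluster turns up in the Embeddings panel, the selection can be pulled back into Python for triage. The snippet below is a sketch that assumes a lasso selection has been made in the App session launched above; the suspect_cluster tag name is arbitrary.

# Pull the patches lasso-selected in the Embeddings panel into a view
# and tag them for review (assumes `session` and `gt_patches` from above;
# 'suspect_cluster' is an arbitrary tag name)
cluster_view = gt_patches.select(session.selected)
cluster_view.tag_samples('suspect_cluster')

# Inspect the label mix inside the cluster to spot inconsistencies
print(cluster_view.count_values('detections.label'))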


Similarity search for quality control

Once you find one problematic annotation, similarity search becomes a powerful tool to find all related errors. Click on a mislabeled sample and instantly retrieve the most similar images to check if they have the same systematic labeling problem.

FiftyOne’s similarity search transforms “find more like this” from manual tedium into instant discovery. Index your data set once, then instantly retrieve visually similar samples through point-and-click or programmatic queries.

import fiftyone as fo
import fiftyone.brain as fob
import fiftyone.zoo as foz

# Load dataset
dataset = foz.load_zoo_dataset('quickstart')

# Index images by similarity
fob.compute_similarity(
    dataset,
    model='clip-vit-base32-torch',
    brain_key='img_sim',
)

# Compute mistakenness so the dataset can be sorted by likely errors
fob.compute_mistakenness(dataset, 'predictions', label_field='ground_truth')

# Sort by most likely to contain annotation mistakes
mistake_view = dataset.sort_by('mistakenness', reverse=True)

# Query the most suspicious sample and find the 10 most similar images
query_id = mistake_view.first().id
similar_view = dataset.sort_by_similarity(query_id, k=10, brain_key='img_sim')

# Launch App to view similar samples and for point-and-click similarity search
session = fo.launch_app(dataset)

Key capabilities include instant visual search through the App interface, object-level similarity indexing for detection patches, and scalable back ends that switch from sklearn to Qdrant, Pinecone, or other vector databases for production.
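
Moving the index to a production vector database is a configuration change rather than a rewrite. The snippet below is a sketch assuming a Qdrant backend has been set up for FiftyOne; connection details are configured separately, and the brain key name is arbitrary.

import fiftyone.brain as fob

# Build the similarity index against a Qdrant backend instead of the
# default sklearn backend (assumes a running, configured Qdrant instance)
fob.compute_similarity(
    dataset,
    model='clip-vit-base32-torch',
    backend='qdrant',
    brain_key='qdrant_sim',
)

# Queries are identical regardless of the backend
similar_view = dataset.sort_by_similarity(query_id, k=10, brain_key='qdrant_sim')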

Remove problematic samples before they’re sent to annotators

FiftyOne’s Data Quality workflow scans data sets for visual issues that commonly lead to annotation mistakes. The built-in analyzer detects problematic samples—overly bright/dark images, excessive blur, extreme aspect ratios, and near-duplicates—that annotators often label inconsistently; a simple version of the brightness check is sketched after the list below.

How the Data Quality workflow prevents annotation errors:

Brightness/blur detection: Identifies low-quality images where annotators guess labels

Near-duplicate finder: Reveals inconsistent annotations across visually identical samples

Extreme aspect ratios: Flags distorted images that confuse annotators about object proportions

Interactive thresholds: Adjusts sensitivity to explore borderline cases where quality degrades
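
The same idea can be approximated in a few lines of plain Python. The sketch below is not the built-in Data Quality panel: it computes a simple mean-brightness score per image with PIL and tags the extremes so they can be held back from annotation. It assumes `dataset` is the collection awaiting labeling, and the field name, tag name, and thresholds are all illustrative.

import numpy as np
from PIL import Image

# Score each image by mean grayscale intensity and tag the extremes
# (thresholds, field name, and tag name are illustrative choices)
for sample in dataset.iter_samples(autosave=True, progress=True):
    gray = np.asarray(Image.open(sample.filepath).convert('L'), dtype=float) / 255.0
    sample['brightness'] = float(gray.mean())
    if sample['brightness'] < 0.1 or sample['brightness'] > 0.9:
        sample.tags.append('quality_issue')

# Exclude the flagged samples from the annotation queue
to_annotate = dataset.match_tags('quality_issue', bool=False)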


Teams like Berkshire Grey achieved 3x faster investigations by using the tagging system to quarantine problematic samples, preventing bad annotations from contaminating model training. This transforms quality control from reactive debugging into proactive prevention.

Works with existing annotation tools and pipelines

Rather than forcing teams to abandon existing annotation infrastructure, FiftyOne integrates seamlessly with leading annotation platforms, including CVAT, Labelbox, Label Studio, and V7 Darwin. The platform’s annotate() API uploads samples directly to these services while maintaining complete provenance tracking. After correction, load_annotations() imports updated labels back into FiftyOne for validation.
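
A typical round trip looks like the sketch below, which sends the 100 most suspicious samples to CVAT and pulls the corrected labels back for validation. It assumes CVAT credentials are already configured for FiftyOne; the annotation key fix_mistakes and the sample count are arbitrary choices.

# Send the most suspicious samples to CVAT for correction
# (assumes CVAT credentials are configured; 'fix_mistakes' is an arbitrary key)
review_view = dataset.sort_by('mistakenness', reverse=True).limit(100)
review_view.annotate(
    'fix_mistakes',
    backend='cvat',
    label_field='ground_truth',
    launch_editor=True,
)

# ... annotators correct the labels in CVAT ...

# Import the corrected labels back into FiftyOne for validation
dataset.load_annotations('fix_mistakes')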

This integration extends throughout the platform. FiftyOne works natively with PyTorch, TensorFlow, and Hugging Face, enabling quality assessment within existing ML pipelines. Moreover, FiftyOne’s plugins architecture enables rapid development of custom functionality tailored to specific workflows.
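
For instance, the model predictions that compute_mistakenness() compares against can be generated directly inside FiftyOne with a PyTorch model from the zoo; the model choice below is illustrative.

import fiftyone.zoo as foz

# Populate the 'predictions' field used by compute_mistakenness()
# with a detection model from the zoo (model choice is illustrative)
model = foz.load_zoo_model('faster-rcnn-resnet50-fpn-coco-torch')
dataset.apply_model(model, label_field='predictions')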

FiftyOne’s data-centric approach offers automated error detection that reduces quality assessment time by 80%, improves model accuracy by 15% to 30%, and delivers up to 50% operational efficiency gains. By emphasizing understanding and improving data set quality through ML-powered analysis, FiftyOne differentiates itself from traditional labeling platforms—all while maintaining an open-core foundation that ensures transparency and flexibility.

For engineering teams drowning in annotation quality issues, the solution isn’t better labeling tools—it’s better data understanding. FiftyOne transforms annotation quality from a manual bottleneck into an automated, intelligent process that scales with modern ML development needs.



New Tech Forum provides a venue for technology leaders—including vendors and other outside contributors—to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to doug_dineley@foundryco.com.
https://www.infoworld.com/article/4085132/improving-annotation-quality-with-machine-learning.html
