Title

Dataset Curation Pipeline

Role

Software Engineer

Year

2026

[Image: Dataset Curation Pipeline architecture preview]

A scalable pipeline for transforming large, noisy image collections into high-quality training datasets using modern vision embeddings and vector similarity search.

  • Python
  • PyTorch
  • DINOv2
  • PostgreSQL + pgvector
  • AWS S3
  • OpenCV / Pillow
  • NumPy

Overview
Dataset curation + targeted retrieval for large-scale ML training.

This pipeline processes millions of images to remove duplicates and near-duplicates, then selects a diverse subset of images suitable for model training.

Beyond filtering and sampling, it also supports targeted image retrieval: users can retrieve images semantically similar to a candidate image (or a folder of reference images), which is useful for building object-specific datasets, checking environment coverage, and mining failure cases.
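At its core, retrieval reduces to a nearest-neighbor search over embeddings. A minimal sketch of the idea, using plain NumPy and cosine similarity (the function name and toy vectors are illustrative, not the pipeline's actual API):

```python
import numpy as np

def top_k_similar(query: np.ndarray, embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k embeddings most cosine-similar to `query`."""
    # Normalize so the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ q
    # argsort ascending, take the last k, reverse for descending similarity.
    return np.argsort(sims)[-k:][::-1]

# Toy example: four 2-d "embeddings".
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.0])
print(top_k_similar(query, emb, k=2))  # → [0 1], nearest first
```

In production the brute-force dot product is replaced by an indexed query against pgvector, but the ranking semantics are the same.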

Problem Context
Real-world dataset quality bottlenecks at scale.

Large image datasets often contain near-duplicates, poor-quality frames, redundant samples, and class imbalance.

In robotics-style capture pipelines, millions of frames can be collected while only a fraction are training-worthy. Unfiltered training data wastes compute and can increase overfitting risk.

Manual curation does not scale when datasets reach hundreds of thousands or millions of images, so an automated and cloud-ready pipeline is essential.

System Flow
From raw dataset to training-ready outputs.

The core flow is: feature extraction with DINOv2 embeddings, near-duplicate filtering via vector similarity, diversity sampling, and semantic retrieval.

The output can either be a curated training dataset or a targeted set of similar images for analysis and downstream model iteration.

Core Tech
Embeddings + vector search + cloud storage.
  • Python for pipeline orchestration
  • PyTorch + DINOv2 for embedding generation
  • PostgreSQL + pgvector for nearest-neighbor search
  • AWS S3 for scalable dataset storage and server-side copy
  • OpenCV / Pillow for image IO and preprocessing
  • NumPy for vector operations

Design Decisions
Why vector indexing and pgvector.

Without indexing, duplicate checks become O(N) comparisons per image. With HNSW-based vector indexing, nearest-neighbor lookup is approximately O(log N).
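To make the baseline concrete, here is a sketch of the O(N²) greedy filter the index replaces, written in plain NumPy (the function name and threshold are illustrative):

```python
import numpy as np

def filter_near_duplicates(embeddings: np.ndarray, threshold: float = 0.98) -> list[int]:
    """Greedy baseline: keep an image only if its cosine similarity to every
    previously kept image is below `threshold`. Each image scans all kept
    images, so total cost grows as O(N^2); with an HNSW index, that scan
    becomes an approximate nearest-neighbor lookup instead."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(normed):
        if not kept or np.max(normed[kept] @ vec) < threshold:
            kept.append(i)
    return kept

emb = np.array([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]])
print(filter_near_duplicates(emb))  # → [0, 2]; index 1 is a near-duplicate of 0
```

With pgvector, the inner `np.max(... @ vec)` lookup maps to a single indexed nearest-neighbor query, which is what brings per-image cost down to roughly O(log N).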

Using pgvector directly in Postgres keeps metadata and embeddings in one system, simplifies infrastructure, and is often more cost-effective than hosting a separate vector database.

Scaling Notes
Performance and memory tradeoffs at 100K to 1M+ images.

Embedding extraction is the most expensive stage and is optimized with batch GPU inference.
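The batching pattern can be sketched as follows; `embed_batch` here is a stand-in for a DINOv2 forward pass on the GPU, not the pipeline's actual interface:

```python
import numpy as np

def extract_embeddings(images: list, embed_batch, batch_size: int = 256) -> np.ndarray:
    """Run the embedding model over fixed-size batches. Batching amortizes
    per-call overhead and bounds peak device memory, since only one batch
    of images is resident at a time."""
    chunks = []
    for start in range(0, len(images), batch_size):
        batch = images[start:start + batch_size]
        chunks.append(embed_batch(batch))
    return np.concatenate(chunks, axis=0)

# Toy stand-in model: "embed" each image as a 4-d vector of its mean pixel value.
def fake_embed(batch):
    return np.stack([np.full(4, np.mean(img)) for img in batch])

images = [np.ones((8, 8)) * i for i in range(10)]
print(extract_embeddings(images, fake_embed, batch_size=3).shape)  # → (10, 4)
```

The real stage swaps `fake_embed` for batched DINOv2 inference under `torch.no_grad()`, with `batch_size` tuned to GPU memory.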

Diversity sampling with k-center-style selection introduces memory pressure because the filtered embeddings must be loaded into memory at once. Planned improvements include capping the candidate pool and storing embeddings at reduced precision where possible.
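A minimal greedy k-center sketch makes the memory tradeoff visible: it holds the full (N, D) embedding matrix plus an (N,) distance array in memory (names and the toy data are illustrative):

```python
import numpy as np

def k_center_greedy(embeddings: np.ndarray, k: int, seed: int = 0) -> list[int]:
    """Greedy k-center selection: repeatedly pick the point farthest from the
    already-selected set, maximizing coverage of the embedding space.
    Requires the full (N, D) matrix plus an (N,) nearest-center distance
    array in memory -- the pressure point noted above."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(embeddings)))]
    # Distance from every point to its nearest selected center.
    dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return selected

# Two tight points near the origin plus two far outliers:
emb = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [0.0, 10.0]])
print(k_center_greedy(emb, k=3))  # both outliers are always selected
```

Capping the candidate pool bounds N before this stage runs, and storing embeddings in float16 halves the matrix footprint, which is where the planned improvements above come in.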