Title

Dataset Curation Pipeline

Role

Software Engineer

Year

2026

[Image: Dataset Curation Pipeline architecture preview]

A scalable pipeline for transforming large, noisy image collections into high-quality training datasets using modern vision embeddings and vector similarity search.

  • Python
  • PyTorch
  • DINOv2
  • PostgreSQL + pgvector
  • AWS S3
  • OpenCV / Pillow
  • NumPy

Overview
Dataset curation + targeted retrieval for large-scale ML training.

This pipeline processes millions of images to remove duplicates and near-duplicates, then selects a diverse subset of images suitable for model training.

Beyond filtering and sampling, it also supports targeted image retrieval: users can retrieve images semantically similar to a candidate image (or a folder of reference images), which is useful for building object-specific datasets, checking environment coverage, and mining failure cases.
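At its core, retrieval reduces to a nearest-neighbor search over embeddings. A minimal sketch of the idea, using plain NumPy and cosine similarity (the function name and toy vectors are illustrative, not the pipeline's actual API):

```python
import numpy as np

def top_k_similar(query: np.ndarray, embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k embeddings most cosine-similar to `query`."""
    # Normalize so the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ q
    # argsort ascending, take the last k, reverse for descending similarity.
    return np.argsort(sims)[-k:][::-1]

# Toy example: four 2-d "embeddings".
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
query = np.array([1.0, 0.0])
print(top_k_similar(query, emb, k=2))  # → [0 1], nearest first
```

In production the brute-force dot product is replaced by an indexed query against pgvector, but the ranking semantics are the same.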

Problem Context
Real-world dataset quality bottlenecks at scale.

Large image datasets often contain near-duplicates, poor-quality frames, redundant samples, and class imbalance.

In robotics-style capture pipelines, millions of frames can be collected while only a fraction are training-worthy. Unfiltered training data wastes compute and can increase overfitting risk.

Manual curation does not scale when datasets reach hundreds of thousands or millions of images, so an automated and cloud-ready pipeline is essential.

System Flow
From raw dataset to training-ready outputs.

The core flow is: feature extraction with DINOv2 embeddings, near-duplicate filtering via vector similarity, diversity sampling, and semantic retrieval.

The output can either be a curated training dataset or a targeted set of similar images for analysis and downstream model iteration.

Core Tech
Embeddings + vector search + cloud storage.
  • Python for pipeline orchestration
  • PyTorch + DINOv2 for embedding generation
  • PostgreSQL + pgvector for nearest-neighbor search
  • AWS S3 for scalable dataset storage and server-side copy
  • OpenCV / Pillow for image IO and preprocessing
  • NumPy for vector operations

Design Decisions
Why vector indexing and pgvector.

Without indexing, duplicate checks become O(N) comparisons per image. With HNSW-based vector indexing, nearest-neighbor lookup is approximately O(log N).
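To make the baseline concrete, here is a sketch of the O(N²) greedy filter the index replaces, written in plain NumPy (the function name and threshold are illustrative):

```python
import numpy as np

def filter_near_duplicates(embeddings: np.ndarray, threshold: float = 0.98) -> list[int]:
    """Greedy baseline: keep an image only if its cosine similarity to every
    previously kept image is below `threshold`. Each image scans all kept
    images, so total cost grows as O(N^2); with an HNSW index, that scan
    becomes an approximate nearest-neighbor lookup instead."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    kept: list[int] = []
    for i, vec in enumerate(normed):
        if not kept or np.max(normed[kept] @ vec) < threshold:
            kept.append(i)
    return kept

emb = np.array([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]])
print(filter_near_duplicates(emb))  # → [0, 2]; index 1 is a near-duplicate of 0
```

With pgvector, the inner `np.max(... @ vec)` lookup maps to a single indexed nearest-neighbor query, which is what brings per-image cost down to roughly O(log N).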

Using pgvector directly in Postgres keeps metadata and embeddings in one system, simplifies infrastructure, and is often more cost-effective than hosting a separate vector database.

Scaling Notes
Performance and memory tradeoffs at 100K to 1M+ images.

Embedding extraction is the most expensive stage and is optimized with batch GPU inference.
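The batching pattern can be sketched as follows; `embed_batch` here is a stand-in for a DINOv2 forward pass on the GPU, not the pipeline's actual interface:

```python
import numpy as np

def extract_embeddings(images: list, embed_batch, batch_size: int = 256) -> np.ndarray:
    """Run the embedding model over fixed-size batches. Batching amortizes
    per-call overhead and bounds peak device memory, since only one batch
    of images is resident at a time."""
    chunks = []
    for start in range(0, len(images), batch_size):
        batch = images[start:start + batch_size]
        chunks.append(embed_batch(batch))
    return np.concatenate(chunks, axis=0)

# Toy stand-in model: "embed" each image as a 4-d vector of its mean pixel value.
def fake_embed(batch):
    return np.stack([np.full(4, np.mean(img)) for img in batch])

images = [np.ones((8, 8)) * i for i in range(10)]
print(extract_embeddings(images, fake_embed, batch_size=3).shape)  # → (10, 4)
```

The real stage swaps `fake_embed` for batched DINOv2 inference under `torch.no_grad()`, with `batch_size` tuned to GPU memory.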

Diversity sampling with k-center-style selection introduces memory pressure because the filtered embeddings must be loaded into memory at once. Planned improvements include capping the candidate pool and storing embeddings at reduced precision where possible.
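A minimal greedy k-center sketch makes the memory tradeoff visible: it holds the full (N, D) embedding matrix plus an (N,) distance array in memory (names and the toy data are illustrative):

```python
import numpy as np

def k_center_greedy(embeddings: np.ndarray, k: int, seed: int = 0) -> list[int]:
    """Greedy k-center selection: repeatedly pick the point farthest from the
    already-selected set, maximizing coverage of the embedding space.
    Requires the full (N, D) matrix plus an (N,) nearest-center distance
    array in memory -- the pressure point noted above."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(embeddings)))]
    # Distance from every point to its nearest selected center.
    dists = np.linalg.norm(embeddings - embeddings[selected[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))
        selected.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[nxt], axis=1))
    return selected

# Two tight points near the origin plus two far outliers:
emb = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [0.0, 10.0]])
print(k_center_greedy(emb, k=3))  # both outliers are always selected
```

Capping the candidate pool bounds N before this stage runs, and storing embeddings in float16 halves the matrix footprint, which is where the planned improvements above come in.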