This pipeline processes millions of images to remove duplicates and near-duplicates, then selects a diverse subset suitable for model training.
Beyond filtering and sampling, it also supports targeted image retrieval: given a query image or a folder of query images, users can retrieve semantically similar images. This is useful for building object-specific datasets, broadening environment coverage, and mining failure cases.
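
The text above does not say how similarity is computed, so the following is only a minimal sketch of the three stages, assuming each image has already been mapped to an embedding vector by some model (not shown). All function names, the 0.95 cosine threshold, and the greedy strategies are illustrative assumptions, not the pipeline's actual implementation.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity of two unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def drop_near_duplicates(embeddings, threshold=0.95):
    """Greedy filter: keep an image only if it is below the similarity
    threshold against every image kept so far. Returns kept indices."""
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


def select_diverse(embeddings, candidates, k):
    """Farthest-point sampling: repeatedly add the candidate whose nearest
    already-chosen neighbor is farthest away (cosine distance)."""
    if not candidates or k <= 0:
        return []
    chosen = [candidates[0]]
    while len(chosen) < min(k, len(candidates)):
        remaining = [c for c in candidates if c not in chosen]
        best = max(
            remaining,
            key=lambda c: min(
                1.0 - cosine(embeddings[c], embeddings[s]) for s in chosen
            ),
        )
        chosen.append(best)
    return chosen


def retrieve(embeddings, query, k):
    """Return indices of the k embeddings most similar to the query."""
    ranked = sorted(
        range(len(embeddings)),
        key=lambda i: cosine(embeddings[i], query),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-D "embeddings" standing in for real image features (hypothetical data).
embeddings = [normalize(v) for v in [
    [1.0, 0.0],
    [0.999, 0.01],   # near-duplicate of the first vector
    [0.0, 1.0],
    [0.7, 0.7],
    [-1.0, 0.0],
]]

kept = drop_near_duplicates(embeddings)        # [0, 2, 3, 4]
subset = select_diverse(embeddings, kept, 3)   # [0, 4, 2]
hits = retrieve(embeddings, normalize([0.1, 1.0]), 2)  # [2, 3]
```

These brute-force loops compare every pair and would not scale to millions of images; at that size an approximate nearest-neighbor index would typically take their place. For a folder of query images, one option under the same assumptions is to run `retrieve` once per query embedding and merge the results.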
