Why Metrics Fail Without Preprocessing (And What We Did About It)

Andrew Kurtz
2025-07-12

Synthetic data only works if it's close enough to the real thing. At Bucket, we generate thousands of photorealistic defect images from STEP files, but we need to measure how close those renders are to actual factory photos.

So we built a visual similarity evaluation pipeline. It combines semantic embeddings (CLIP), structural comparison (SSIM), and feature matching (ORB + RANSAC + logistic regression) to compare synthetic and real part images. Each metric has strengths, but each also breaks down under real-world conditions. To address this, we developed a robust preprocessing pipeline.

What Are CLIP, SSIM, & ORB?

CLIP: Model that encodes images (or text) into a vector embedding. The embeddings of two images can be compared using cosine similarity to determine the semantic similarity between them.
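As a rough illustration, here's what that comparison might look like with the Hugging Face transformers CLIP implementation; the checkpoint name and image paths are placeholders, not our production setup.

```python
# Sketch: cosine similarity between CLIP embeddings of two images.
# The checkpoint and file paths are placeholders (assumptions), not our setup.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(path_a: str, path_b: str) -> float:
    images = [Image.open(path_a).convert("RGB"), Image.open(path_b).convert("RGB")]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        emb = model.get_image_features(**inputs)   # one embedding per image
    emb = emb / emb.norm(dim=-1, keepdim=True)     # L2-normalize
    return (emb[0] @ emb[1]).item()                # cosine similarity

print(clip_similarity("render.png", "factory_photo.jpg"))
```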

SSIM: Measures visual similarity between two images by comparing luminance, contrast, and structure (texture/visual layout) at corresponding points throughout the images.
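A minimal sketch using scikit-image's structural_similarity; the grayscale conversion and shared resolution here are illustrative assumptions, since in practice both images have already been standardized by the preprocessing described below.

```python
# Sketch: SSIM between two images resized to a common resolution.
# Paths and the 512x512 size are placeholder assumptions.
import cv2
from skimage.metrics import structural_similarity

def ssim_score(path_a: str, path_b: str, size=(512, 512)) -> float:
    a = cv2.cvtColor(cv2.imread(path_a), cv2.COLOR_BGR2GRAY)
    b = cv2.cvtColor(cv2.imread(path_b), cv2.COLOR_BGR2GRAY)
    a = cv2.resize(a, size)
    b = cv2.resize(b, size)
    return structural_similarity(a, b, data_range=255)

print(ssim_score("render.png", "factory_photo.jpg"))
```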

ORB: Algorithm to extract key points and their descriptors from an image. The key points from two images can be matched, filtered, and compared.
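Here's a rough sketch of that matching flow with OpenCV: ORB keypoints, a ratio test, and RANSAC inlier filtering. The thresholds are illustrative, and the logistic-regression step that turns match statistics into a final score is omitted.

```python
# Sketch: ORB keypoint matching with a ratio test and RANSAC filtering.
# Thresholds and paths are illustrative assumptions.
import cv2
import numpy as np

def orb_inlier_count(path_a: str, path_b: str) -> int:
    a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
    orb = cv2.ORB_create(nfeatures=2000)
    kp_a, des_a = orb.detectAndCompute(a, None)
    kp_b, des_b = orb.detectAndCompute(b, None)
    if des_a is None or des_b is None:
        return 0

    # Match binary descriptors with Hamming distance; keep unambiguous matches.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    matches = matcher.knnMatch(des_a, des_b, k=2)
    good = []
    for pair in matches:
        if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
            good.append(pair[0])
    if len(good) < 4:
        return 0

    # RANSAC homography rejects geometrically inconsistent matches.
    src = np.float32([kp_a[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
    _, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return int(mask.sum()) if mask is not None else 0

print(orb_inlier_count("render.png", "factory_photo.jpg"))
```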

When Metrics Fall Short

These three metrics all fluctuate based on things other than the part itself.

They’re highly sensitive to background clutter. CLIP encodes the background into the semantic meaning of the image. SSIM compares backgrounds just like it compares the part. ORB picks up background features as if they were part of the object, throwing off matches.

Orientation and framing matter too. If the part is slightly rotated or off-center, CLIP will treat it as a different object. SSIM expects pixel-aligned layouts. ORB’s descriptors depend on the patch size and resolution, so changes in scale or placement skew the results.

But here’s the thing: we're not building this for a lab. We can’t ask people, “no, that's wrong — move your part and retake the image.” We're building tools for messy, real-world conditions, and our comparisons have to hold up anyway.

To do that, we needed a way to ignore everything except the part.

The Pipeline

To ignore differences in the images beyond the part itself, we standardize every image with a preprocessing pipeline (a rough code sketch of the whole thing follows the list):

  1. Remove the background: We use a pretrained image segmentation model to separate the foreground from the background and set the background pixels to be transparent.
  2. Center and resize: A bounding box for the part can be derived from the foreground segmentation mask. This enables centering and resizing the part to a standard width or height within the image, depending on the aspect ratio.
  3. Down/upsample: Resize both images to the same resolution.
  4. Rotate: Rotate one part to minimize the mean squared error (a simple image similarity metric) between the images. With the background removed and both parts centered, the MSE is driven mainly by rotational alignment.
  5. Final resize: Resize the rotated part again to account for the size changing during rotation.
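To make the steps concrete, here's a rough sketch of the pipeline. The rembg segmentation model, the 512-pixel canvas, the 80% fill ratio, and the brute-force 1-degree rotation sweep are all illustrative assumptions, not our exact implementation.

```python
# Rough sketch of the preprocessing pipeline. rembg, the 512-px canvas, the
# 80% fill ratio, and the 1-degree rotation sweep are assumptions for
# illustration, not our exact implementation.
import numpy as np
from PIL import Image
from rembg import remove  # pretrained segmentation model for background removal

CANVAS = 512  # shared output resolution

def remove_background(path: str) -> np.ndarray:
    """Step 1: segment the part and make the background transparent (RGBA)."""
    return np.array(remove(Image.open(path).convert("RGB")))

def center_and_resize(rgba: np.ndarray, fill: float = 0.8) -> np.ndarray:
    """Steps 2-3: crop to the alpha bounding box, scale the part so its longer
    side fills a standard fraction of the canvas, and center it."""
    ys, xs = np.nonzero(rgba[..., 3] > 0)
    crop = rgba[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    h, w = crop.shape[:2]
    scale = fill * CANVAS / max(h, w)
    part = Image.fromarray(crop).resize(
        (max(1, round(w * scale)), max(1, round(h * scale))))
    canvas = Image.new("RGBA", (CANVAS, CANVAS), (0, 0, 0, 0))
    canvas.paste(part, ((CANVAS - part.width) // 2, (CANVAS - part.height) // 2))
    return np.array(canvas)

def mse(a: np.ndarray, b: np.ndarray) -> float:
    """MSE on alpha-premultiplied RGB, so transparent backgrounds compare
    as black in both images."""
    fa = a[..., :3].astype(np.float32) * (a[..., 3:].astype(np.float32) / 255)
    fb = b[..., :3].astype(np.float32) * (b[..., 3:].astype(np.float32) / 255)
    return float(np.mean((fa - fb) ** 2))

def align_rotation(fixed: np.ndarray, moving: np.ndarray) -> np.ndarray:
    """Steps 4-5: brute-force the rotation that minimizes MSE against `fixed`,
    re-centering and re-sizing each candidate since rotation changes the
    bounding box."""
    best, best_err = moving, mse(fixed, moving)
    for angle in range(1, 360):
        rotated = np.array(Image.fromarray(moving).rotate(angle, expand=True))
        candidate = center_and_resize(rotated)
        err = mse(fixed, candidate)
        if err < best_err:
            best, best_err = candidate, err
    return best

real = center_and_resize(remove_background("factory_photo.jpg"))
synth = align_rotation(real, center_and_resize(remove_background("render.png")))
# `real` and `synth` now feed directly into the CLIP / SSIM / ORB comparisons.
```

In practice the rotation sweep could be coarser (every few degrees) with a local refinement afterward; the sketch favors clarity over speed.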

Before and After

(image: before preprocessing)
(image: after preprocessing)

Conclusion

With this preprocessing pipeline, our metrics now reliably measure the similarity between parts across different image conditions. Higher-quality benchmarking enables easier iteration on rendering quality and lays the foundation for a tool simple enough for customers to use independently.