Building an Image Similarity Search Using CLIP and FAISS

I’ve always found it fascinating how Google Images can find visually similar images almost instantly. It feels magical: you upload one photo, and in seconds it makes sense of it and matches it against millions of others. This weekend, I wanted to explore a question: could I build a tiny version of that using only open-source tools?

Choosing the Tools

After some research, I decided to experiment with two open-source projects:

  • CLIP (by OpenAI): a model that converts images and text into embeddings, placing them into the same vector space where they can be meaningfully compared.

  • FAISS (by Meta): a library designed for fast similarity search over large collections of embeddings, helping you find the nearest matches efficiently.

Both seemed lightweight enough for a small project, yet powerful enough to give meaningful results.

What I Built

Here’s a short demo of the image similarity tool I created:

The tool lets you:

  • Upload images to create a searchable index
  • Search by uploading another image
  • Or even search by typing a text query

Behind the scenes, CLIP generates embeddings, and FAISS efficiently finds the nearest matches.

You can find the complete code here: GitHub Repository

A Quick Look at How It Works

  • When a user uploads an image, I convert it into a CLIP embedding.
  • I store these embeddings in FAISS for fast retrieval.
  • During search, the uploaded image or text query is embedded and compared against the stored vectors (see the sketch after this list).
  • The closest matches are shown back to the user.
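
To make those steps concrete, here’s a minimal sketch of the flow. It isn’t the exact code from the repo: it assumes the Hugging Face transformers CLIP checkpoint openai/clip-vit-base-patch32, a flat in-memory FAISS index, and a few hypothetical image filenames.

```python
# Minimal embed-and-search sketch (assumptions: Hugging Face CLIP checkpoint,
# flat in-memory FAISS index, hypothetical filenames).
import faiss
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_image(image_file) -> np.ndarray:
    """Turn an image (path or file-like object) into a unit-length CLIP embedding."""
    image = Image.open(image_file).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        features = model.get_image_features(**inputs)
    vec = features[0].numpy().astype("float32")
    return vec / np.linalg.norm(vec)

def embed_text(query: str) -> np.ndarray:
    """Turn a text query into a unit-length embedding in the same space."""
    inputs = processor(text=[query], return_tensors="pt", padding=True)
    with torch.no_grad():
        features = model.get_text_features(**inputs)
    vec = features[0].numpy().astype("float32")
    return vec / np.linalg.norm(vec)

# Index: with unit-length vectors, inner product equals cosine similarity.
paths = ["cat.jpg", "dog.jpg", "car.jpg"]            # hypothetical files
index = faiss.IndexFlatIP(512)                       # 512 = CLIP projection dim for this checkpoint
index.add(np.stack([embed_image(p) for p in paths]))

# Search: the same call works for an image query or a text query.
scores, ids = index.search(embed_text("a sleeping cat")[None, :], 3)
for score, i in zip(scores[0], ids[0]):
    print(f"{paths[i]}  similarity={score:.3f}")
```

Normalizing the embeddings and using an inner-product index means the scores come back as cosine similarities, which is what makes image-to-image and text-to-image comparisons interchangeable.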

Here’s the rough architecture I followed:

High-level architecture diagram

The app itself runs on Streamlit, keeping it simple and easy to experiment with.
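
For reference, the Streamlit layer can stay very small. The skeleton below is a simplified sketch, not the actual app: it assumes the embed_image, embed_text, index, and paths objects from the previous snippet live in a hypothetical local module called clip_index.

```python
# Simplified Streamlit skeleton (assumes a hypothetical clip_index module
# exposing embed_image, embed_text, index, and paths from the sketch above).
import streamlit as st

from clip_index import embed_image, embed_text, index, paths

st.title("Image similarity search")
index_tab, search_tab = st.tabs(["Build index", "Search"])

with index_tab:
    uploads = st.file_uploader("Add images", type=["jpg", "png"], accept_multiple_files=True)
    if "indexed" not in st.session_state:
        st.session_state["indexed"] = set()
    for f in uploads or []:
        # Guard against Streamlit reruns re-indexing the same upload.
        if f.name not in st.session_state["indexed"]:
            index.add(embed_image(f)[None, :])   # Image.open also accepts file-like objects
            paths.append(f.name)
            st.session_state["indexed"].add(f.name)
    st.write(f"{index.ntotal} images indexed")

with search_tab:
    query_image = st.file_uploader("Search by image", type=["jpg", "png"])
    query_text = st.text_input("...or search by text")
    query = None
    if query_image is not None:
        query = embed_image(query_image)
    elif query_text:
        query = embed_text(query_text)
    if query is not None and index.ntotal > 0:
        scores, ids = index.search(query[None, :], min(5, index.ntotal))
        for score, i in zip(scores[0], ids[0]):
            st.write(f"{paths[i]}  (similarity {score:.3f})")
```

The session-state guard is one way to keep Streamlit’s reruns from re-indexing the same upload, which comes up again in the limitations below.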

What I Learned

Working with CLIP was a great experience. It’s impressive how well it generalizes; being able to search images using text, without any retraining, felt almost like cheating.

However, I also ran into a few limitations:

  • FAISS always returns the top-k results, even if no truly close match exists. With a small index, this can surface completely unrelated images as “similar.” (A possible workaround is sketched just after this list.)
  • CLIP embeddings are semantic, not instance-specific. For example, two photos of the same cat in different poses don’t necessarily end up close together in embedding space. CLIP understands the “idea” of a cat, but not necessarily the identity of this cat.
  • Handling file uploads in Streamlit needed a bit of care to avoid processing the same file multiple times when switching tabs.
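
For the first point, one workaround I’d consider (not something the current code does) is to filter results by score: since the embeddings are normalized and the index uses inner product, the scores come back as cosine similarities, so weak matches below a cutoff can simply be dropped. A rough sketch, with an arbitrary threshold:

```python
# Rough sketch: drop weak matches instead of always showing the top k.
# The 0.25 cutoff is arbitrary and would need tuning on real data.
def search_with_threshold(index, query_vec, k=5, min_similarity=0.25):
    scores, ids = index.search(query_vec[None, :], k)
    return [
        (int(i), float(s))
        for s, i in zip(scores[0], ids[0])
        if i != -1 and s >= min_similarity   # FAISS pads missing neighbors with id -1
    ]
```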

These are all natural limitations of the tools — and understanding them was part of the fun.

Reflections

This project reminded me how much can be achieved with simple, well-built tools. I didn’t train a single model. I didn’t build a database from scratch. I just connected two smart pieces — CLIP and FAISS — and a lot of the magic simply happened. It also made me think: sometimes creativity isn’t about building everything yourself, but about knowing how to connect things meaningfully.

What’s Next?

This experiment opened up a few ideas I’d like to explore further:

  • Fine-tuning a model for better instance-specific matching
  • Experimenting with more advanced search methods (e.g., filtering matches by similarity score)
  • Building a small backend to make this tool scalable beyond a weekend project

There’s a lot more to do if I want to make this production-grade. But for now, I’m happy with where this experiment landed.

Thanks for reading!

If you have thoughts, ideas, or just want to geek out about embeddings, feel free to connect!
