Detecting similar images with pHash and pg_similarity

  17 Jul 2014

In Ruby, we can use the Phashion gem (a pHash wrapper) to do all the calculation, but that doesn’t seem like a scalable solution to me. So I moved the Hamming distance calculation into the database, letting it do the hard work and speed things up. Specifically, the solution here is the cool pg_similarity module, which offers many similarity algorithms to pick from, e.g. Hamming distance for this app. To use pg_similarity, we first have to download and install it. It’s a pretty basic Unix software installation.

git clone
cd pg_similarity
USE_PGXS=1 make
USE_PGXS=1 make install

For PostgreSQL 9.1+, use CREATE EXTENSION to install and load the module, rather than the older load-based approach.

CREATE EXTENSION pg_similarity;

Test that the module’s functions are available:

select hamming('11101', '10011');
(1 row)

(Note: the document in uses an older PostgreSQL version and its load extension command.)
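To get a feel for what the score means, here is a minimal pure-Ruby sketch of a normalized Hamming similarity. It assumes the normalized form, 1 - (differing bits / length), which is consistent with the 0.75 threshold and descending sort used later in this post. `hamming_similarity` is a hypothetical helper for illustration, not part of pg_similarity.

```ruby
# Hypothetical helper: normalized Hamming similarity between two bit strings.
# Assumption: similarity = 1 - (number of differing positions / length).
def hamming_similarity(a, b)
  raise ArgumentError, "bit strings must have equal length" unless a.length == b.length
  return 1.0 if a.empty?

  differing = a.each_char.zip(b.each_char).count { |x, y| x != y }
  1.0 - differing.to_f / a.length
end

# 3 of the 5 positions differ, so the similarity is roughly 0.4.
puts hamming_similarity('11101', '10011')
```

Under this assumption, identical fingerprints score 1.0 and completely different ones score 0.0, so a higher value means a closer match.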

Our database system is now ready to use, so let’s move to the application side. To illustrate this in a web application, we’ll use Rails 4 with Paperclip. The app is simple and straightforward; we just need these two functions:

1. Store images and save each image’s pHash fingerprint in the database.
2. Upload or paste an image to search.

Create rails app

rails new --database=postgresql simi

Add the paperclip gem into Gemfile

gem "paperclip", :git => "git://"

Save and run bundle install. Then generate the model and its migration:

rails g model Product title:string image_file_name:string image_content_type:string image_file_size:string

The generated migration file will look like this:

class CreateProducts < ActiveRecord::Migration
    def change
        create_table :products do |t|
            t.string :title
            t.string :image_file_name
            t.string :image_content_type
            t.string :image_file_size
        end
    end
end

Add the attachment to the model that should support file upload. What we have created is app/models/product.rb, so modify it to be:

class Product < ActiveRecord::Base
  has_attached_file :image, :styles => {  :medium => "300x300>", :thumb => "100x100>", :small => "150x150>" },
                    :url => "/system/:attachment/:id/:style/:basename.:extension",
                    :path => ":rails_root/public/system/:attachment/:id/:style/:basename.:extension"

  validates_attachment_content_type :image, :content_type => /\Aimage\/.*\Z/
end

Add a create method to the products controller. Simply run cat > app/controllers/products_controller.rb, paste the code below, and press Ctrl+D (or use rails g controller products if you like).

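class ProductsController < ApplicationController
    def create
      @product = Product.create( product_params )
    end

    private

    # Use strong_parameters for attribute whitelisting
    # Be sure to update your create() and update() controller methods.

    def product_params
      # Assumed attribute list: the title column and the Paperclip attachment.
      params.require(:product).permit(:title, :image)
    end
end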

At this point, given a route, a new action, and a form view for products, you should be able to upload a file at the /products/new URL.

We also want to save the image fingerprint in the products table. We add two new columns, image_fingerprint and image_fingerprint_bits, to store the fingerprint and its bit string. Add this in a migration:

add_column :products, :image_fingerprint, :string
add_column :products, :image_fingerprint_bits, 'bit varying(255)'

Also, add the following self.search_by_image method and the perceptual_hash callback to the Product model (the phashion gem must be in your Gemfile):

class Product < ActiveRecord::Base
  has_attached_file :image, :styles => {  :medium => "300x300>", :thumb => "100x100>", :small => "150x150>" },
                    :url => "/system/:attachment/:id/:style/:basename.:extension",
                    :path => ":rails_root/public/system/:attachment/:id/:style/:basename.:extension"

  validates_attachment_content_type :image, :content_type => %r{\A(image|(x-)?application)/(bmp|gif|jpeg|jpg|pjpeg|png|x-png)\z}
  before_save :perceptual_hash

  # Returns products whose hamming score against the fingerprint exceeds the threshold, best match first.
  def self.search_by_image(fingerprint, options={})
    threshold = (options[:threshold] || 0.75).to_f
    # unpack("B*") only emits "0" and "1" characters, so the interpolation below is safe.
    fingerprint_bits = fingerprint.to_s.unpack("B*")[0]
    conditions = <<-SQL
     length(image_fingerprint_bits) = length('#{fingerprint_bits}') AND hamming('#{fingerprint_bits}', image_fingerprint_bits)::numeric > #{threshold}
    SQL
    # Select all columns so the attachment URL can still be built from each result.
    select("products.*, hamming('#{fingerprint_bits}', image_fingerprint_bits) AS hamming").where(conditions).order('hamming DESC')
  end

  # Stores the pHash fingerprint and its bit string whenever a new image is queued for writing.
  def perceptual_hash
    return unless image?
    tempfile = image.queued_for_write[:original]
    unless tempfile.nil?
      self.image_fingerprint = Phashion.image_hash_for tempfile.path
      self.image_fingerprint_bits = self.image_fingerprint.to_s.unpack("B*")[0]
    end
  end
end

(Note: Paperclip automatically stores an MD5 hash for a record that has an attachment fingerprint attribute, image_fingerprint in this case. But we don’t want an MD5 hash, we want a perceptual hash, so we use this workaround.)
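It may help to see what the to_s.unpack("B*") step in the model actually produces. The sketch below uses a made-up fingerprint value of 12; the real value comes from Phashion and is assumed here to be an integer. Note that the bits are those of the decimal string’s ASCII bytes, not the binary form of the number itself.

```ruby
# A made-up fingerprint value for illustration only.
fingerprint = 12

# to_s gives the decimal string "12"; unpack("B*") then emits the bits of
# its ASCII bytes: "1" is 0x31 (00110001) and "2" is 0x32 (00110010).
bits = fingerprint.to_s.unpack("B*")[0]

puts bits         # 0011000100110010
puts bits.length  # 16, i.e. 8 bits per decimal digit
```

Since the bit string grows by 8 bits per decimal digit, fingerprints with different digit counts yield bit strings of different lengths, which is presumably why the query compares lengths before calling hamming.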

Next, we’ll create the ImageSearch controller. It’s basically an image file upload form: the uploaded image is the input to search against the images in the products table.

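class ImageSearchController < ApplicationController
  def index
  end

  def upload_and_search
    uploaded_io = params[:image]
    search_image_path = Rails.root.join('public', 'uploads', uploaded_io.original_filename)
    File.open(search_image_path, 'wb') do |file|
      file.write(uploaded_io.read)
    end
    # Assumption: the original right-hand side was lost; Phashion::Image wraps a
    # file path and exposes #fingerprint, which the next line relies on.
    @img = Phashion::Image.new(search_image_path.to_s)
    @products = Product.search_by_image(@img.fingerprint)
  end
end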
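Since the upload path is built from a client-supplied filename, a slightly more defensive way to write the file is sketched below in plain Ruby. `save_upload` is a hypothetical helper, not part of Rails or Paperclip, and the StringIO stands in for the uploaded file object.

```ruby
require 'fileutils'
require 'stringio'
require 'tmpdir'

# Hypothetical helper: write an uploaded IO into dir under a sanitized name.
def save_upload(io, original_filename, dir)
  # File.basename drops directory components such as "../../".
  safe_name = File.basename(original_filename)
  FileUtils.mkdir_p(dir)
  path = File.join(dir, safe_name)
  File.open(path, 'wb') { |file| file.write(io.read) }
  path
end

Dir.mktmpdir do |dir|
  path = save_upload(StringIO.new("fake image bytes"), "../../evil.png", dir)
  puts File.basename(path)        # evil.png
  puts File.dirname(path) == dir  # true
end
```

In the controller, the same idea would amount to passing File.basename(uploaded_io.original_filename) to Rails.root.join and making sure public/uploads exists.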


Edit config/routes.rb to handle these endpoints.

get 'image_search' => 'image_search#index'
post 'upload_and_search' => 'image_search#upload_and_search'

To let the user query with an image, we create app/views/image_search/index.html.erb with this simple form.

<h3>Search by Image</h3>

<p>Upload an image</p>

<%= form_tag({action: :upload_and_search}, multipart: true) do %>
  <%= file_field_tag 'image' %><br />
  <%= submit_tag 'Search' %>
<% end %>

Finally, add the search result page. We iterate over the @products result in app/views/image_search/upload_and_search.html.erb:

Search image fingerprint: <%= debug @img.fingerprint.inspect %>
<table>
    <% @products.each do |p| %>
        <tr>
           <td><%= p.title %></td>
           <td><%= p.hamming %></td>
           <td><%= image_tag p.image.url(:thumb) %></td>
        </tr>
    <% end %>
</table>
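The rows on this page come back already filtered and sorted by the database. As a rough in-memory illustration of what the query does (same-length guard, threshold, best match first), here is a plain-Ruby sketch. The product titles and bit strings are made up, `similarity` and `search` are hypothetical helpers, and the score again assumes the normalized form 1 - (differing bits / length).

```ruby
# Assumed normalized score: 1 - (differing positions / length).
def similarity(a, b)
  differing = a.each_char.zip(b.each_char).count { |x, y| x != y }
  1.0 - differing.to_f / a.length
end

# Mirrors the SQL: same-length guard, score above threshold, best match first.
def search(products, query_bits, threshold = 0.75)
  products
    .select  { |_, bits| bits.length == query_bits.length }
    .map     { |title, bits| [title, similarity(query_bits, bits)] }
    .select  { |_, score| score > threshold }
    .sort_by { |_, score| -score }
end

# Made-up stand-in for the products table: title => fingerprint bit string.
products = {
  "red shoe"       => "1111000011110000",
  "red shoe (jpg)" => "1111000011110001",
  "blue hat"       => "0000111100001111",
  "short hash"     => "11110000"
}

search(products, "1111000011110000").each do |title, score|
  puts "#{title}: #{score}"
end
# red shoe: 1.0
# red shoe (jpg): 0.9375
```

The "blue hat" entry is dropped by the threshold and "short hash" by the length guard, which matches the two parts of the WHERE condition.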

Let’s drink some beer 🍻

P.S. If you deploy the app on Heroku, I guess you might be able to use the fuzzystrmatch module with Levenshtein instead of pg_similarity with Hamming as I do here. But I’m not sure how accurate it is.
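On the accuracy question, one thing worth knowing is that the two measures can disagree sharply on bit strings, because Levenshtein allows insertions and deletions while Hamming only compares position by position. A minimal pure-Ruby sketch with made-up bit strings (both helpers are illustrative, not the database functions):

```ruby
# Number of positions at which two equal-length strings differ.
def hamming_distance(a, b)
  raise ArgumentError, "lengths differ" unless a.length == b.length
  a.each_char.zip(b.each_char).count { |x, y| x != y }
end

# Classic dynamic-programming Levenshtein distance (insert, delete, substitute).
def levenshtein_distance(a, b)
  prev = (0..b.length).to_a
  a.each_char.with_index(1) do |ca, i|
    curr = [i]
    b.each_char.with_index(1) do |cb, j|
      cost = ca == cb ? 0 : 1
      curr << [prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost].min
    end
    prev = curr
  end
  prev.last
end

a = "01010101"
b = "10101010"
puts hamming_distance(a, b)      # 8: every position differs
puts levenshtein_distance(a, b)  # 2: drop the leading 0, append a 0
```

So two fingerprints that Hamming treats as completely different can look nearly identical to Levenshtein, and rankings from the two approaches may not match.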
