This post detailing our approach to Image Retrieval is written by Mirjana Karanovic. Mirjana is our prolific AI developer and a member of the Vivify Ideas AI team, currently working on an array of innovative and easily integrable AI/ML tools. With an academic background in Industrial Engineering and Engineering Management and currently working on her PhD, Mirjana decided to explore the possibilities of AI implementation and we couldn’t be happier to have her on board.
In today’s world, a fingerprint can not only be traced to a person’s identity but also reveal their favorite locations, browser history and more. On the other side of the spectrum, AI-based medical diagnostics are becoming more accurate than humans and Facebook’s tagging algorithm can recognize and pull out your entire family from a pixelated Christmas photo.
The innovation doesn’t stop here – the constantly narrowing gap between the capabilities of humans and machines has led to a breakthrough in the field of image retrieval, allowing the machines to really “see”.
Computer Vision (CV) stands behind all of this, with its various subtasks and diverse applications.
CV-based systems have gotten much better in recent years, largely because of the increase in the size and complexity of datasets. Searching for relevant images in a large dataset is a challenging problem in CV. Traditional search engines retrieve images based on text such as captions and metadata, but this method can return a lot of irrelevant results, not to mention the effort, time, and money needed to annotate images with this text.
This is where content-based image retrieval (CBIR) comes in handy. CBIR is a specialized image recognition task: finding relevant images in a larger set of images based on their content. Images (their color, texture, shape, and higher-level features) are represented as feature vectors of numerical values.
CBIR in general is based on the following computer vision tasks:
- Image classification (grouping images into different categories),
- Segmentation (partitioning of an image into multiple regions, usually for further examination) and
- Object detection (locating the objects in an image).
There have been a lot of proposed solutions for these tasks, resulting in varying degrees of accuracy. Most of the current state-of-the-art solutions are based on Convolutional Neural Networks (CNN/ConvNet).
This is why we decided to combine a CBIR system with a CNN model to tackle this problem, which resulted in the creation of our search engine based on image vectorization, called ImaginAI.
The power of CNN
CNNs have enormous potential. Thanks to their architecture, they can be applied in healthcare, OCR, self-driving cars, search engines, recommender systems, social media, and many other areas. Every layer in a CNN extracts different features: the “lower” layers usually extract general features, while the “higher” layers identify more specific ones. The output of a layer, its feature map, can easily be accessed in order to gain some understanding of what features the CNN detects. Since feature maps are nothing more than a numerical representation of the image, they can also be used for defining the similarity between images.
The effectiveness of image retrieval depends on the metric used to compare image contents. Similarity is computed by applying that metric to the feature vectors of the query image(s) and the database images. Euclidean distance and cosine similarity are among the most widely used metrics. With a distance metric, a smaller value means higher similarity; with cosine similarity, a value closer to 1 does.
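The comparison described above can be sketched in a few lines of plain Python. This is only an illustration: the function names, the image ids and the three-dimensional toy vectors are made up for the example, and real feature vectors have thousands of dimensions.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors (smaller = more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (closer to 1 = more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def rank_by_cosine(query, database):
    """Return (image_id, similarity) pairs, most similar first."""
    scored = [(image_id, cosine_similarity(query, vector))
              for image_id, vector in database.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy "feature vectors" standing in for real CNN outputs.
database = {
    "red_shoe": [0.9, 0.1, 0.0],
    "red_boot": [0.8, 0.3, 0.1],
    "blue_bag": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.0]

ranking = rank_by_cosine(query, database)
for image_id, score in ranking:
    print(f"{image_id}: {score:.3f}")  # red_shoe ranks first, blue_bag last
```

Cosine similarity ignores vector length and compares direction only, which is one reason it is a common choice for comparing CNN features.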
Out-of-the-box models and Transfer learning can be helpful in extracting feature vectors.
Developing a neural network can be tedious, especially due to the time and computational power required for training. Sometimes it’s expensive and often simply isn’t possible with the given resources.
That is where the pre-trained models come in handy. Transfer learning means taking a model trained on one task (and one dataset) and reusing it as the starting point for a model on a second task with another dataset. Using such pre-trained models greatly reduces training time and often improves performance. Transfer learning works when the features learned on the first task are general rather than task-specific. The value of the pre-trained network lies in the weights it has learned for those features.
However, it is important to determine which part of the knowledge can be transferred. For example, the top layers of the pre-trained network are removed since they are trained for classification on a different domain. Also, some layers can be fine-tuned in order to achieve better performance.
The most popular pre-trained models used for transfer learning, such as AlexNet, VGGNet, GoogLeNet (Inception) and ResNet, are trained on the publicly available ImageNet dataset, one of the most widely used datasets for training image classifiers.
After summarizing pros and cons in terms of achieved results/accuracy, computational power and complexity of the extracted features, and some mini-experiments on prominent transfer learning models, we selected VGG16 as one of the main approaches for the ImaginAI system.
There are multiple variants of VGGNet (VGG11, VGG16, VGG19, etc.) which differ only in the total number of layers in the network. The architecture of VGG16 is shown in Figure 1.
There are 13 convolutional layers, 5 max pooling layers and 3 fully connected layers, which adds up to 21 layers, but only 16 of them (the convolutional and fully connected ones) have weights, hence the name. VGG16 has a total of about 138 million parameters. The convolution kernels are 3x3, and the max pooling kernels are 2x2 with a stride of 2.
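The layer and parameter counts above can be checked with some simple arithmetic. The sketch below assumes the standard VGG16 layout (the channel widths per block, "same" padding, a 224 x 224 x 3 input and a 1000-class output); the names `VGG16_CONFIG` and `count_vgg16` are made up for this example.

```python
# Output channels of each 3x3 convolution; "M" marks a 2x2 max pooling layer.
VGG16_CONFIG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                512, 512, 512, "M", 512, 512, 512, "M"]
FC_UNITS = [4096, 4096, 1000]  # the three fully connected layers


def count_vgg16(input_size=224, input_channels=3):
    """Return (conv layers, pooling layers, FC layers, total parameters)."""
    channels, size = input_channels, input_size
    conv_layers = pool_layers = params = 0
    for item in VGG16_CONFIG:
        if item == "M":
            pool_layers += 1
            size //= 2  # 2x2 pooling with stride 2 halves width and height
        else:
            conv_layers += 1
            params += 3 * 3 * channels * item + item  # weights + biases
            channels = item
    features = size * size * channels  # flattened input of the first FC layer
    for units in FC_UNITS:
        params += features * units + units  # weights + biases
        features = units
    return conv_layers, pool_layers, len(FC_UNITS), params


print(count_vgg16())  # (13, 5, 3, 138357544)
```

Roughly 124 million of those parameters sit in the three fully connected layers, while the 13 convolutional layers together hold under 15 million.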
Small fixed-size kernels are characteristic of this network. Why this concept? A stack of 3x3 filters covers the same receptive field as a single larger filter such as 5x5 or 7x7, so large filters are not needed. The advantage of the smaller filters is the lower number of parameters. (The total number of parameters is the sum of all weights and biases; for simplicity, we ignore the biases and count the weights of a single filter channel.) For example, 1 layer of 11x11 filters has 11 x 11 = 121 parameters. On the other hand, 5 layers of 3x3 filters (needed to cover the same area as 1 layer of 11x11) have only 5 x 3 x 3 = 45 parameters, which is a significant reduction! Fewer parameters generally mean faster convergence and less chance of overfitting.
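The arithmetic behind this comparison can be written out as a short sketch. It assumes stride-1 convolutions and, like the text, ignores biases and channel counts; the function names are made up for the example.

```python
def receptive_field(num_layers, kernel_size=3):
    """Width of the input area seen by one output value after stacking
    `num_layers` stride-1 convolutions with the given kernel size."""
    return 1 + num_layers * (kernel_size - 1)


def kernel_weights(num_layers, kernel_size):
    """Weights in a stack of square kernels (biases and channels ignored)."""
    return num_layers * kernel_size * kernel_size


for large_kernel in (5, 7, 11):
    # Each extra 3x3 layer widens the receptive field by 2 pixels.
    layers_needed = (large_kernel - 1) // 2
    assert receptive_field(layers_needed) == large_kernel
    print(f"{large_kernel}x{large_kernel}: "
          f"{kernel_weights(1, large_kernel)} weights in one layer vs "
          f"{kernel_weights(layers_needed, 3)} in {layers_needed} layers of 3x3")
# 5x5: 25 vs 18, 7x7: 49 vs 27, 11x11: 121 vs 45
```

The stacked version also applies a non-linearity after every 3x3 layer, so it is not an exact replica of the large filter, only an alternative with the same coverage.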
A pretrained VGG16 model is available in Keras, an open-source neural-network library, and can be loaded with a single call.
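A minimal sketch of loading the model might look like this. It assumes TensorFlow with its bundled Keras is installed; the helper names `load_vgg16` and `describe_layers` are made up for the example, and TensorFlow is imported inside the function only so that the snippet can be read and imported without it.

```python
def load_vgg16(weights="imagenet"):
    """Build VGG16 including its ImageNet classification layers.

    weights="imagenet" uses the pretrained weights;
    weights=None builds the same architecture with random weights.
    """
    # Imported lazily: an assumption of this sketch, not a Keras requirement.
    from tensorflow.keras.applications.vgg16 import VGG16

    return VGG16(weights=weights)


def describe_layers(model):
    """Return (layer name, layer type) for every layer in the model."""
    return [(layer.name, type(layer).__name__) for layer in model.layers]
```

Calling `load_vgg16().summary()` prints the layer table, and `model.get_layer("fc2")` returns a single layer by name.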
The first time the model is loaded, the weights are downloaded; on subsequent runs they are simply loaded from the local cache. Calling the summary() method of the VGG16 model displays the network layers with the name and type of each layer, its output shape and its number of parameters. Any of these layers can be accessed by name.
By default, the model expects input images of 224 x 224 pixels with 3 channels (e.g. RGB); this can be changed if needed. To use the model for transfer learning, additional arguments such as weights, pooling, and the number of classes can also be set.
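As an illustration of those arguments, the sketch below builds a transfer learning model on top of VGG16. It is one possible setup, not the ImaginAI implementation: the function name, the new output layer and the choice to fine-tune only the last convolutional block are assumptions made for the example. Note that Keras accepts a custom `classes` value only when `include_top=True` and `weights=None`, so this sketch attaches its own output layer instead.

```python
def build_transfer_model(num_classes, input_shape=(224, 224, 3),
                         weights="imagenet"):
    """VGG16 without its ImageNet classifier, plus a new softmax layer."""
    # Imported lazily so the snippet can be read without TensorFlow installed.
    from tensorflow.keras import Model, layers
    from tensorflow.keras.applications.vgg16 import VGG16

    # include_top=False drops the three fully connected ImageNet layers;
    # pooling="avg" averages the last feature maps into one 512-value vector.
    base = VGG16(weights=weights, include_top=False,
                 input_shape=input_shape, pooling="avg")

    # Freeze the pretrained layers, except the last convolutional block,
    # which stays trainable so it can be fine-tuned on the new data.
    for layer in base.layers:
        layer.trainable = layer.name.startswith("block5_conv")

    outputs = layers.Dense(num_classes, activation="softmax",
                           name="custom_predictions")(base.output)
    return Model(inputs=base.input, outputs=outputs)
```

For example, `build_transfer_model(num_classes=5)` returns a model that can be compiled and trained on a five-category dataset while most of the pretrained weights stay fixed.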
Besides classification, models from Keras Applications can be used for feature extraction. We tested VGG16 classification on a custom dataset and it delivered satisfactory results, which led us to conclude that its feature vectors could be useful in our case. After additional analysis, we concluded that the last fully connected layer before the classification output (fc2) produces feature vectors that, in combination with the cosine similarity measure, boost the performance of our system and its predictive capability.
Feature extraction works the same way for any layer: the model is cut at the chosen layer and that layer's output is read directly. The second convolutional layer, for example, outputs a stack of feature maps that can be displayed as images.
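A sketch of such an extractor is given below, assuming TensorFlow (with NumPy and Pillow) is installed. The function names are made up for the example, `block1_conv2` and `fc2` are the layer names Keras gives to the second convolutional layer and the last hidden fully connected layer, and `load_img` / `img_to_array` are taken from `tensorflow.keras.utils`, where recent TensorFlow versions provide them.

```python
def build_extractor(layer_name="fc2", weights="imagenet"):
    """Model that maps an image batch to the output of one VGG16 layer."""
    # Imported lazily so the snippet can be read without TensorFlow installed.
    from tensorflow.keras import Model
    from tensorflow.keras.applications.vgg16 import VGG16

    base = VGG16(weights=weights)
    return Model(inputs=base.input,
                 outputs=base.get_layer(layer_name).output)


def extract_features(extractor, image_path):
    """Load one image file and return the chosen layer's output for it."""
    import numpy as np
    from tensorflow.keras.applications.vgg16 import preprocess_input
    from tensorflow.keras.utils import img_to_array, load_img

    image = load_img(image_path, target_size=(224, 224))  # resize to VGG16 input
    batch = np.expand_dims(img_to_array(image), axis=0)   # shape (1, 224, 224, 3)
    batch = preprocess_input(batch)                       # ImageNet preprocessing
    return extractor.predict(batch)[0]                    # drop the batch axis
```

With `build_extractor("fc2")` the result for one image is a 4096-value vector suitable for similarity search; with `build_extractor("block1_conv2")` it is a 224 x 224 x 64 array whose 64 channels can be plotted as individual feature maps.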
Flaws of VGGNet
VGGNet is not perfect, nor is it applicable to every image classification problem. Its major drawbacks are listed below:
- Custom categories: VGGNet has been trained on, and supports, the classes of the ImageNet dataset. Using categories that are not part of that dataset requires retraining, which is computationally expensive;
- Image context: When creating feature vectors, the last layer of the network encodes the whole image, which includes both the object of interest and the background. This problem can be mitigated by combining the encodings of the higher (pooling) layers, but that won’t necessarily give good results. In order to improve results, the relevant part of the image must be properly selected and passed as input to VGGNet.
The field of CBIR has been explored extensively, but there is still room for improvement. There are numerous experiments on the topic, and their findings can be helpful since they save you time and other resources. Models pre-trained on large datasets are available and easily implemented, but knowledge and understanding of how they work is required in order to use them effectively. Furthermore, every problem is different, so each solution is specific. When choosing the architecture, it is important to verify that the network is capable of modeling the type of data that we want to work with.
Building a top-notch image retrieval system requires good data, domain knowledge and understanding of deep learning algorithms. It is known that deep learning models are only as good as the data that supports them. Hence, good quality data is vital. Moreover, better domain knowledge means better systems. What is also important is to be up-to-date with the latest methods and discoveries. The established system should be periodically upgraded and modernized.
ImaginAI is accordingly being revised and improved in order to achieve better results in a more efficient way. With ImaginAI you can build a recommender engine for your online store or a search engine to help customers find what they are looking for. This tool can even be used for tracking projects visually.
Possibilities are endless and imagination is what separates you from the crowd. Use your ImaginAItion.