Abstract
3D augmented reality (AR) has a photometric aspect of 3D rendering and a geometric aspect of camera tracking. In this paper, we will discuss the second aspect, which involves feature matching for stable 3D object insertion. We present the different types of image matching approaches, starting from handcrafted feature algorithms and machine learning methods, to recent deep learning approaches using various types of CNN architectures, and more modern end-to-end models. A comparison of these methods is performed according to criteria of real time and accuracy, to allow the choice of the most relevant methods for a 3D AR system.
Keywords
Introduction
Image matching consists in accurately locating similar regions of images acquired from different viewpoints. An example of matching is shown in Fig. 1.
The applications of matching are numerous: the creation of a 3D map of a scene from a set of images acquired from different viewpoints by Structure From Motion (SfM) methods [50] or from a video by SLAM methods [42], geo-localization on a map by comparing an image to the images of a prerecorded 3D map, the creation of panoramas by merging multiple images into a single image [7], object detection [61], or autonomous drone controls [55].
The application that interests us in our work is the 3D augmented reality (AR) [22, 21], which consists in inserting 3D synthetic images in videos of real scenes.
The 3D augmented reality has a photometric aspect of 3D rendering and a geometric aspect of camera tracking.
3D rendering
3D rendering is used to illuminate the 3D objects, in such a way that it is inserted in a coherent way with the real scene. In [23], we presented the different traditional 3D rendering techniques of rasterization and ray tracing, and deep learning methods of Generative Adversarial Networks GAN, Variational Autoencoders VAE and the recent discipline of neural rendering, which combine the physical aspect of rendering and deep learning. A comparison was made according to different criteria of photo-realism, controllability, computational and working time, accessibility, quality of details and applications.
Image matching example.
The geometrical aspect of augmented reality is the goal of this article. The purpose of this step is to track the camera movement in order to be able to insert the 3D objects in a coherent way with the geometry of the scene. This is achieved by matching the features of the successive images of the video. Features (or keypoints) are locally unique regions in the image, such as corners or blobs.
The traditional feature matching pipeline consists of three main steps: detection, description and matching of image features.
Before presenting in detail image matching based on handcrafted, machine learning and deep learning methods, we will start by discovering the different types of approaches and their evolution.
Traditional methods, despite the advent of deep learning, are still used because of their performance. In [40], a sliding window is used to calculate the intensity variation of a block in several directions. In the Harris algorithm [27], the problem is reformulated to take into account all directions. The detection can be improved by sub-pixel accuracy algorithms. Multi-scale detection is proposed in [33]. Affine transformation invariant detectors are proposed in [36] and [35]. SIFT [34] allows detection and description of features, it is the most popular algorithm. The SURF algorithm [4] allows to accelerate SIFT by approximations with filters. ASIFT is a variant of SIFT [41] allowing to make the descriptor invariant by affine transformation. There are several fast detectors and descriptors such as FAST [47], BRIEF [9], ORB [48] and BRISK [32], their disadvantage is that they are less accurate.
The use of machine learning for the detection, description, and matching of features started a little before the “big bang” of deep learning. In [30], PCA is used to reduce the dimensionality of the SIFT descriptor. The drawback of PCA is that we do not have information about the labels of the classes. To solve this problem, it is possible to reduce the dimensionality of the descriptor with a linear discriminant analysis (LDA) as in [8]. In [52], the reduction of dimensionality is reformulated into convex optimization for better performance.
The method proposed in [28] is one of the first to use convolutional neural networks (CNN) for descriptor learning and feature matching. The goal is to train a Siamese network to identify similar patches of features. The dataset used is obtained by geometric and photometric image deformations. The success of deep learning for image matching is due to several factors: the possibility of using pre-trained CNNs on large datasets such as ImageNet [13], the CNNs are able to learn different image textures in each layer and the descriptors of the features can be obtained by removing the last layers (Fully Connected, etc.). In [18] it was shown that a CNN can achieve better matching results than SIFT. In [69] deeper architectures are used to improve performance. However, it was shown in [3] that performance can be improved with a shallow CNN (TFeat) by using triplets of training samples: reference patch, a positive (similar) patch, and negative (different) patch. A simple 6-layer CNN (L2-Net) was also used in [56] but with complex objective functions (loss function). The HardNet model proposed in [39], shows that with a simple objective function, but a more advanced training triplet selection strategy, it outperforms L2-Net. The SOSNet model [57] improves the performance by optimizing the intra- and inter-class distances. The most recent methods propose new architectures and strategies for matching [16, 5], homography estimation [31, 70], dense estimation [60, 59], or new models based on transformers [54, 29].
In other “modern” approaches, rather than using the classical pipeline (detection, description, matching) they use “end-to-end” training, unifying several parts of the pipeline in a differentiable way. In [66], a LIFT model is trained to perform both detection, orientation estimation, and description of features. The LF-Net model [44] uses the same principle with a different architecture. There are more and more models using this type of approach: Superpoint [14], IMIPs [12], DELF [43], D2-Net [15], UR2KiD [65] or SuperGlue [49]. In [67] a model is used to improve the matching by detecting outliers and which, combined with LIFT, shows performance that far exceeds the popular RANSAC method [19]. These modern approaches perform very well for complex images (large geometric variations and illumination) but the classical training pipeline based methods perform better in other simpler cases.
In this article we will present these different kind of approaches and a comparison according to different criteria of real time, accuracy, accessibility and applications, to allow the choice of the most relevant methods for a 3D augmented reality system.
In Section 1, we will present the traditional methods. Then, in Section 2, we will discover the methods based on machine and deep learning. Finally, in Section 3, we present the results of the comparison.
Handcrafted methods
Harris detector
In [40], a sliding window is used to compute the intensity variation of a block
The function
In the Harris algorithm [27], the problem is reformulated to take into account all directions. After a Taylor development of the Eq. (1) around
The matrix in the right term, which we denote by
The parameter
The drawback of the Harris detector is the fixed size of the window
The scale space theory is used in [33] to allow multiscale detection. The Harris matrix is reformulated by:
with
and
The function
with:
The scale values used are:
with
The selection of the most representative scale of the feature is performed by detecting the local maxima on the scale dimension of the normalized Laplacian:
For better detection of blobs (similar connected regions), the Harris matrix can be replaced by the Hessian matrix normalized with respect to scale:
The trace of
Another method consists in using the determinant of
It is possible to improve the detection of features with sub-pixel accuracy.
We start by detecting the connected components in the measurement quality map
Indeed, if
if if
In practice
We note:
If
The algorithm is as follows:
We start by initializing We compute in the neighborhood of We compute the new We calculate the error
MSER (Maximally Stable Extremal Region or MSER) [35] is a covariant blob detector by affine transformation. The principle is to vary a threshold
with
with
The ellipse surrounding the detected region is used as the feature patch. Stable regions have the property of remaining connected after affine transformation and are therefore reliable for matching.
The FAST algorithm [47] allows fast detection of the corners, thanks to a thresholding on a circle of
We consider a region to be a Feature if at least
or:
with
If two features have close positions, we eliminate the one with the lowest value of
Feature detection and scale selection
The detection in SIFT [34] is performed by the LoG operator (seen previously), which is approximated by difference of Gaussian (DoG) by introducing the notion of octave. DoG with different
Feature localization
The problem with the result of the previous step is the lack of accuracy in locating the features. Indeed, because of the use of a discrete space of scales
Note that SIFT is a good blob detector, while Harris is a good corner detector.
Once we have obtained the position and scale of the feature, we compute a descriptor
We start by choosing the relevant pixels that will contribute to the calculation of the descriptor. We consider a support region, larger than the scale around the detected feature. Then, in order to be able to compare the descriptors of two features with different orientations, we detect in each support region the orientation of the dominant gradient, and we change the orientation so that the dominant gradient is horizontal. To detect the orientation of the dominant gradient, we first compute the magnitude
A region
Note that there are other methods, using similar principles to SIFT for the calculation of the descriptor, but with different patch shapes: GLOH [37], CHOG [10] and DAISY [58].
ASIFT [41] is a SIFT variant allowing to make the descriptor invariant by affine transformation. The principle is to apply for each feature patch different affine transformations, and during the matching, we compare it to all the patchs in the other image. The disadvantage is that the calculation is very expensive. Note that there are also affine variants of the multi-scale Harris detectors, but the ASIFT detector gives better results.
SURF
The SURF algorithm [4] accelerates SIFT by approximating the LoG operator with filters. First, the second derivatives of a Gaussian
with:
such that
Note that the summed area table algorithm is used for the fast computation of sums in the filtering operation.
The octave scale space theory is used, except that instead of decreasing the image size, the filter size is increased. The lowest level corresponds to a filter of size
In [4], the formula of the determinant of the hessian, used for scale detection is approximated by:
The orientation of the features is detected with Haar wavelets by calculating the sum of the horizontal and vertical wavelet responses, over a
For the descriptor, we consider a neighborhood of size
Note that there is an extended SURF version with
Note finally that the sign of the Laplacian allows to distinguish between black blobs on white background and the opposite. This is useful for matching only blobs of the same type.
The disadvantage of the SIFT and SURF descriptors is that they are stored with
There are solutions consisting in applying a PCA (Principal Component Analysis) or LDA (Linear Discriminant Analysis) in order to reduce the dimensionality of the vector, or to apply a LSH (Locality sensitive hashing) algorithm to convert the vector into binary. However, we still have to calculate the descriptor in float first.
The principle of the BRIEF descriptor [9] consists in selecting
Note that unlike SIFT and SURF, the BRIEF algorithm only allows the description of features. Detection can be done with another detector (Harris, SIFT, etc.), but the author recommends the CenSurE algorithm [1].
The ORB algorithm [48] is a combination of the FAST detector and the BRIEF descriptor with some modifications to allow fast, multiscale, and rotation-invariant detection. First, the detection is performed with FAST. Then, the best
The detection is performed at several scales using a pyramid of images at different resolutions.
The orientation is computed as the direction between the center of the corner and the centroid of the patch computed from the first order moment:
with:
Thus, the orientation of the Feature is:
The BRIEF descriptor is modified to rBRIEF (rotation-aware BRIEF) to take into account the orientation of the Feature. The rotation matrix, corresponding to the angle
According to the author, the performances of ORB are quite close to SIFT, better than SURF and faster than both.
PCA-SIFT
PCA-SIFT [30] is a variant of SIFT whose goal is to reduce the dimensionality of the descriptor by using a projection matrix estimated by training. This method has the advantage of being more accurate and faster than SIFT.
A pre-processing is first applied:
SIFT detection of feature patches Extraction of vertical ( Vectorization and concatenation (( Unit normalization to reduce luminance variations.
Then, the principle is to use a training set of
After normalization of
We apply the singular value decomposition:
We use the first
As shown in [30], compared to SIFT, PCA-SIFT improves the matching performance. The descriptor size is reduced from
In [52], a method for learning descriptors by convex optimization is proposed. The algorithm is as follows.
The detected feature patch
The orientation gradients of each pixel of the patch are computed as in SIFT on
Each gradient image is convolved with a set of kernels, called “pooling regions”, having different support and positions. The pooling regions used are defined by:
with
As a result of the convolutions, we obtain a descriptor
Not all the pooling regions are used for training. A selection strategy is defined.
The descriptor is separated into several elements corresponding to each region:
with
The regions relevant for matching are selected by training. The training set is separated into two sets
with
From these last two equations, we end up with a convex optimization problem that consists in finding relevant regions
such that:
and:
the regularization term
The optimization thus allows us to find the relevant regions
Finally, we compute the descriptor:
The descriptor is also normalized to make it invariant to luminance changes, with respect to the mean and variance of the patch gradients or with the quantile of the gradient.
For dimensionality reduction, we look for a projection matrix
with
after development:
with:
and
The problem is reformulated as before in convex optimization:
with
The optimization of the two objective functions Eqs (48) and (39) is performed by the RDA algorithm because of the size of the training samples and the use of sums. RDA is efficient for optimization problems of the form:
with
All the details of the updates of the parameters
In [28], two convolutional neural networks (CNN) architectures are proposed for descriptor training and for matching.
The dataset used is created by geometric transformations applied on a set of patches extracted by DoG (as in SIFT). The use of synthetic transformations has the advantage, compared to traditional methods, of controlling the amount of invariance of the transformations we want the model to learn.
In order to retain only reliable features for matching during training, a random homography is applied to each image. Given the known homography, we can deduce the position of each pixel in the second image. DoG detection is performed on the transformed image and a comparison is made with the position, scale and orientation of the features in the reference image. Position, scale and orientation tolerance thresholds are used. If the features are close, they are considered reliable for matching and are retained as training samples.
This model can be seen as a classifier where each element at its output corresponds to a class (stable feature number).
The CNN is trained thus with all the patches of the stable features with the corresponding labels.
During the test phase, we provide the model with a patch of a new point of view image and it assigns it the closest class (matching).
with
A similar architecture is used in [11] for person identification from face images.
The ground truth used for the evaluation uses the homographies of the dataset in [37]. The use of a model to do the training directly on the image to be processed allows to improve the performances, moreover we find a similarity with the ASIFT method. The disadvantage is that the number of features varies from one image to another, while the output size of the CNN is not directly adjustable. Architecture 2 has the advantage of offering a smaller descriptor size of 128. The disadvantage of both methods is the slow training and processing time.
In [18], a CNN is used for feature description based on the output of the hidden layers of a CNN trained for classification.
Two models are trained by supervised learning on the ImageNet dataset [13] and unsupervised learning (images without labels). The evaluation is performed on the dataset [38] and on a new dataset. The results show that the method outperforms SIFT and that unsupervised learning gives better results than supervised learning.
The supervised model uses a 5-layer CNN architecture, plus two fully connected layers (FC) and a softmax classification layer.
The unsupervised model is trained with
Note that both models have been trained for classification and the goal is to use the output of the hidden layers as a descriptor.
Feature detection is first performed by the MSER algorithm. Then, the descriptors are extracted from a hidden layer of the model. Finally, the matching is performed by comparing the Euclidean distances of the descriptors.
For evaluation, the matched patches are compared to the ground truth and are classified as true positive/negative (TP/TN) if they have an IOU (Intersection Over Union) of the MSER ellipses of at least
Note that this metric is generally used for localization problems.
The IOU allows us to calculate the number of TP and FP according to the chosen threshold.
We can therefore deduce the precision and recall values from TP, FP and FN (false negative):
The evaluation metric AP (Average Precision) is the area below the Precision-Recall curve. We then derive mAP (Mean Average Precision) by averaging the AP (over all classes and/or IOU thresholds).
These metrics are the most used for the evaluation of the matching methods.
Several hidden layers have been evaluated with this metric for the calculation of descriptors. The evaluation results show that in the case of unsupervised learning, the upper layers give better results. In the case of supervised learning, the results are similar whatever the layer used for the extraction of the descriptors, however, the size of the patches used has more influence on the results. The larger patch sizes are preferable in the case of CNNs.
CNNs give better matching results than SIFT but are slower.
In [69], a CNN is trained to learn a similarity function from patch pairs. A dataset is labeled in similar and different pairs. four architectures are proposed.
The first simple model uses a single CNN: the two patches are considered as a single two-channel image.
In the second Siamese model, two CNNs with the same weights are applied to the two patches (the two CNNs have shared branches). The two outputs are concatenated into a single vector of dimension
The third pseudo-Siamese model is similar to the previous one, with the difference that the two CNNs do not share the branches, which means that the weights of the 2 CNNs are different (implies more parameters).
The choice of the architecture depends on the speed of calculation and the accuracy.
For the matching, the FC layer can be replaced by a
Other models can be combined with these architectures to create very deep and more powerful models [53]. The interest is to replace layers with large filter sizes, with several
The fourth model “Central-Surround” allows to perform a multi-scale matching. Two Siamese models are used. The “Surround” model receives the subsampled patches at a lower resolution (from
The Spatial Pyramid Pooling (SPP) method can be used to avoid resizing the patch. This consists in adding to the output of the CNNs a max-pooling layer of dimension proportional to the patch, which allows to keep all the information of the patch.
The training of the different models is performed by optimizing the following objective function:
with
The evaluation results show better performance than convex optimization, SIFT and DAISY.
In the TFeat model [3], patch triplet samples are used instead of patch pairs for CNN descriptor training.
In the precedent case of patch pairs, we use samples pairs
with
The disadvantage of using pairs is that most negative pairs do not contribute to the gradient update during the optimization, since the distance is most often greater than
A solution [51] is to identify the “Hard Negative” pairs from their distances (close negative pairs) to use them mostly during the training. The disadvantage of this method is that it is time consuming.
In the case of triplets, we use sample sets
and
We use the objective function:
This is equivalent to adjusting the weights of a CNN by optimizing the objective function to obtain:
This results in a distance
A strategy to use the “Hard Negatives” is proposed to increase also the distance between
We define the “Hard Negative” of triplets by:
If
We thus obtain a new objective function:
This method gives better results than those using patch pairs. The processing time on GPU is
The L2-Net model [56] uses an architecture with only convolutional layers followed by Batch Normalization layers and a LRN (Local Response Normalization) layer at the output.
For training, the datasets Brown [6] and HPatches [2] are used.
In practice, when we do the matching, we end up with many more different patches than similar patches. Because of the large number of negative patches it is impossible to use all of them for training. The previous methods use positive and negative classes with equal numbers of samples, while to be closer to reality we need more negative samples.
The strategy to select the relevant negative patches in larger quantities is the following:
We consider a number Iteratively, we sequentially select For each point
We obtain a batch of the corresponding descriptors, each of dimension
We compute a distance matrix
with:
Indeed, the elements
Therefore,
The objective function used by L2-Net consists of 3 terms: a similarity term to separate positive and negative patches, a compactness term to decorrelate the dimensions of the descriptor and an intermediate term to take into account the hidden (intermediate) layers of the CNN.
The similarity error is written:
where
The
with
The compactness error is:
Minimizing this error means trying to obtain zero elements outside the diagonal of
Descriptors extraction time is about
The model [39] is inspired by SIFT. The principle is to find the nearest neighbor NN of the patch and to compare it to the second NN by a distance ratio. The architecture used is the same as L2-Net but without the compactness and intermediate error terms. The principle is based on a triplet selection strategy. First, a batch of
The descriptors
with
For each patch
For each patch
Then, the selection of the triplets with hard negatives is made as follows:
We therefore look for the CNN parameters that minimize the objective function:
The results obtained are better than SIFT or complex regularization methods such as L2-Net.
The idea of SOSNet [57] is that different pairs of positive patches should have similar distances in the descriptor space. A second order similarity regularization term is used.
Let us consider
where
Note that unlike HardNet the objective function is quadratic. This improves the performance.
The SOS measures the similarity between
The objective function SOS is:
The objective function of the model is therefore:
The evaluation results show that the SOSNet model performs better than TFeat, L2-Net and HardNet.
The LIFT model [66] is a deep network allowing the training of the whole pipeline of detection, orientation estimation and feature description. This is done in an “end-to-end” way in order to preserve differentiability, so that the different steps of the pipeline are optimized together and not independently as is the case with previous methods.
First, a 3D reconstruction of the dataset images [63] is performed by a Structure From Motion (SfM) algorithm based on SIFT [64]. Similar 3D points in different images are used as positive patches, and different 3D points are used as negative patches. Non-reconstructed 3D points are used as patches that do not correspond to features for detector training only.
Each step of detection, orientation estimation and feature description, is modeled by a CNN.
The training process is performed using a Siamese network.
We consider that two patches
It is not possible to train all parts of the network at the same time because each component will try to optimize the parameters differently since they do not use the same objective function (not the same goal).
So we start by training the descriptor alone. Then, it is combined with the orientation estimator to train the latter. Finally the result of the two training is used to train the detector.
The CNNs used for detection, orientation and description are similar to those in [62, 68, 51].
The descriptor is trained with an objective function, using a distance
with
The orientation estimator is optimized with an objective function that minimizes the distance between positive patch descriptors:
with
The detection model is trained to optimize two terms in the objective function:
The first term consists in maximizing the classification score (detector output) for the different pairs:
with
Qualitative and quantitative evaluation
The second term consists in minimizing the distance between the descriptors of similar patches:
Note the “end-to-end” aspect of the whole pipeline in this last term:
Since the trained model takes patches as input, in order to apply it on a whole image it is necessary to browse the image with a sliding window and call the model for each patch. This method is very time consuming. The solution is to separate the detection from the other modules. Indeed, the detection allows to obtain a features map and can be applied on a whole image. Then, only the detected patches are used in a batch for the estimation of the orientation and for the description.
Note also that the detection can be performed in multiscale by reducing the resolution of the input image.
The evaluation results of the matching show that LIFT provides better performances than SIFT, ORB, DAISY and other traditional methods.
In [26, 24], we created large datasets of synthetic feature patches [25], which we used to train various models of detection, description and matching. Indeed, we found that the training of previous presented methods is based on feature patches extracted either by traditional detectors or manually. We therefore proposed to generate synthetic feature patches in order to avoid training the models by the result of traditional methods, this allows more control on the invariance of the model to geometric and photometric transformations. A convolutional sliding window model was trained on the generated patches for fast feature detection. Then, a Siamese CNN model was trained on a dataset of patch pairs for the description and matching of the detected features.
Comparison
Table 1 present qualitative and quantitative comparisons of the different methods. The references of the evaluation results and the datasets used are specified in the table for each method.
The metrics mAP (presented in Section 3.4), accuracy, precision-recall and ROC curves are used for the quantitative evaluation of descriptor matching. The repeatability metric is used for the evaluation of detectors.
Most AR applications need fast algorithms and accurate tracking for a stable insertion result. Therefore, we made a comparison of the presented methods in terms of processing time and accuracy. For real time applications, some methods should be discarded because of slow processing time. For off-line applications, methods with higher accuracy should be preferred.
We note that most of the methods compare to SIFT because of its performance and also because of its use for the extraction of patches used for model training.
For deep learning methods, the types of training data have an important influence on the result. Most of the deep learning methods have been trained on wide baseline image pairs (large geometrical transformations between images), while in the case of augmented reality, most of the time we are dealing with video sequences with low variances between successive images. This implies the need to train these models with new more representative datasets, in order to optimize the performances.
Finally, it will be necessary to evaluate the methods directly on the target application using a single dataset, and this is one of the perspectives of our next works concerning matchmoving visual effects.
Conclusion and perspectives
In this article we have discovered traditional and deep learning algorithms for feature detection, description and matching. The traditional methods are still used today, because of their performance despite the appearance of learning-based methods that emerged in the last decade with the “big bang” of deep learning. The models used are based on CNN architectures and various simple or complex objective functions. Modern methods allow to train models of the whole pipeline in an end-to-end manner. The datasets and the selection strategies of the training data influence the results significantly. A comparison of the methods in terms of processing time and accuracy has been performed to allow us to choose the most appropriate methods for an AR application. Among our perspectives, the development of a complete AR system by combining and comparing different modules of matching and 3D rendering (previously studied in [23]).
