Hierarchical Scheme for Vehicle Make and Model Recognition

Abstract

Vehicle make and model recognition (VMMR) is a common requirement in intelligent transportation systems (ITS). It is challenging, however, because the differences between vehicle categories are subtle. In this paper, we propose a hierarchical scheme for VMMR. The scheme consists of three parts: (1) a feature extraction framework called weighted mask hierarchical bilinear pooling (WMHBP), based on hierarchical bilinear pooling (HBP), which generates a weighted mask to weaken the influence of invalid background regions while extracting features from discriminative regions to form a more robust feature descriptor; (2) a hierarchical loss function that learns the appearance differences between vehicle brands and enhances vehicle recognition accuracy; and (3) a collection of vehicle images gathered from the Internet and classified with hierarchical labels, which augments the training data to address insufficient data and low picture resolution and improves the model’s generalization ability and robustness. We evaluate the proposed framework for accuracy and real-time performance. On the Stanford Cars public dataset, it achieves a recognition accuracy of 95.1% at 107 frames per second (FPS), which demonstrates the effectiveness of the method and its suitability for ITS.

Fine-grained visual categorization (FGVC) ( 1 – 3 ) is a classification framework in which the input data is assigned to very fine class labels. With the rapid development of FGVC, the vehicle make and model recognition (VMMR) system ( 4 ) has been widely used in real-life scenarios, which has gained significant attention in the past decade. The make and model of a vehicle reflect the inherent attributes of the vehicle, such as carrying capacity, dimensions, crew, and so forth. The application of intelligent transportation systems (ITS) is inseparable from the type of vehicle. For example, the VMMR system can assist toll collectors in charging fees according to different vehicle types in the electronic toll collection system ( 5 ). The traffic management system uses the VMMR method to determine the vehicle type on the road, and to guide and control the traffic flow ( 6 ). The year, make, and model of a vehicle recognized using the VMMR system can be cross-checked with the license plate registry to screen for any fraud. As the primary technology in ITS, the VMMR system needs to achieve a very high accuracy to avoid providing wrong information for other applications in ITS. Meanwhile, the VMMR system also needs to maintain an excellent processing speed to ensure that vehicles speeding on the road will not be missed in the video.

In the past, vehicle information was obtained through license plate recognition (LPR) systems ( 7 – 9 ): the recognized plate was looked up in a license plate database, which stores the vehicle information corresponding to each plate, and the model of the vehicle and its relevant properties were thereby identified. Because access to the official vehicle information database is not open and the LPR system requires an excessively high image resolution, most LPR systems cannot meet the real-time requirements of processing dozens of images per second. The VMMR system is an essential complement to the LPR system and can help improve the robustness and reliability of the applications in ITS.

The FGVC dataset is an essential element of the recognition framework: its image resolution and background complexity largely determine the performance of VMMR. The Stanford Cars dataset ( 10 ) is widely used in VMMR and FGVC to make an objective and fair evaluation of the recognition performance. The shortcomings of the Stanford Cars dataset, such as low image resolution and a limited number of images, generally lower the recognition accuracy of methods that use it. We obtained vehicle images using web crawling technology and expanded the dataset to address these shortcomings and reach a higher level of performance. In contrast to other datasets, the pictures in the augmented dataset include models with different brands but highly similar appearances, as shown in Figure 1a. The dataset is named the MHV (multi-view high-resolution vehicle) dataset, as it contains images taken by different users, using different imaging devices, and from multiple view angles for various vehicles, which ensures a wide range of variations to accommodate various scenarios, as shown in Figure 1b.

Figure 1.

Typical exemplars of Stanford Cars dataset: (a) highly similar appearances and (b) complex environment.

The feature extraction network is an essential stage in image processing. Existing FGVC approaches ( 11 – 27 ) first localize the vehicle parts in the picture and then extract the discriminative features for classification. Most of the previous localization methods ( 2 , 15 , 16 ) are supervised algorithms and require a lot of auxiliary data. Those methods are not only computationally intensive but also labor-intensive. Meanwhile, images labeled with the appropriate vehicle manufacturer and model require professional knowledge in the auto industry. Also, a weakly supervised method ( 3 , 11 , 12 , 14 ) to generate discriminative regions not only maximizes the utilization of the semantic information in these regions but also avoids the problem of excessive reliance on labels. However, these methods usually have complex network structures and are not conducive to optimization. High-dimensional feature coding methods solve this problem; for example, the hierarchical bilinear pooling (HBP) ( 28 ) framework processes multiple discriminative regional features in a cross-layer interactive manner. The features of different layers complement and reinforce each other to enhance feature representation capability. Because of the cluttered background of some pictures in the dataset, the feature extraction process is interfered with by useless background. We propose an approach named weighted mask hierarchical bilinear pooling (WMHBP) for classification. This method generates a mask with weights to filter invalid features and inputs useful features into the HBP network. In addition, the hierarchical loss function is proposed to narrow the intra-class distance and expand the inter-class distance, and consequently to further learn the correspondence between brand, model and appearance.

Our main contributions can be summarized as follows:

We propose the WMHBP framework based on the HBP framework. The features generated by the last three blocks in the backbone each generate a mask and then aggregate the three masks. The mask can perform weight distribution according to the possibility of vehicles appearing in the area, filter the cluttered background to preserve the discriminative area, and consequently further avoid the loss of important information on the edge of the vehicle.

According to the critical characteristic of the vehicle hierarchy ( 29 , 30 ), we design a loss function with a hierarchical structure according to the divided labels. The loss function can enable the model to learn the corresponding relationship between brand, model, and vehicle appearance, and to improve the recognition accuracy of the vehicle brand and model.

We use web crawler technology to obtain many vehicle pictures and augment the training dataset to overcome the deficiencies of the existing dataset. FGVC models that are trained using the augmented dataset improve the recognition accuracy of the model. The experimental effect of data augmentation can be found in the Ablation Study of the Data Augmentation section. The proposed method achieves the best accuracy with the public dataset under the premise of ensuring real-time performance, which verifies the feasibility of the method’s implementation in ITS. The ablation experiment proves the effectiveness of each component in the model.

Literature Review

At present, the VMMR methods ( 17 – 28 , 31 – 36 ) favored by researchers mainly fall into three major directions: (1) attention mechanism; (2) high-dimensional feature coding; (3) vehicle-specific characteristic.

Attention Mechanism: Recently, attention mechanism has played an important role in the FGVC field ( 11 , 12 , 14 , 20 , 21 , 24 – 26 ). Fu et al. ( 25 ) proposed a recurrent attention convolutional neural network (RA-CNN). This method recursively learns the attention of the discriminative region and the feature representation based on the region in a mutually reinforcing manner and enlarges the local area of the discriminatory region to obtain more detailed semantic information. Sun et al. ( 14 ) proposed a method called multi-attention multi-class (MAMC), which uses a metric learning framework to establish connections between multiple attention regions. It enables the same attention region or the same category features to be aggregated more closely, and the features of different attention areas or different categories to be dispersed. Zheng et al. ( 26 ) proposed a network structure called trilinear attention sampling network (TA-SN), which can learn subtle features in hundreds of attention regions and distill these fine-grained features into a global one, to learn many local area features at a low computational cost. Although the method of VMMR based on attention mechanism has high recognition accuracy and strong interpretability, it usually needs to add an attention module, which leads to complicated network structure and additional calculations.

High-dimensional Feature Coding: High-dimensional feature coding ( 28 , 31 – 36 ) achieves powerful feature representation ability in FGVC with a simple structure. Bilinear CNN (BCNN) ( 31 ) uses a convolutional neural network (CNN) to extract features from pictures, then multiplies the features with their transpose through the Kronecker product to obtain the image descriptor. Bilinear models can not only be trained end to end but also provide a stronger feature representation than linear models. Li et al. ( 33 ) proposed factorized bilinear pooling (FBP), which uses the Hadamard product to fuse the features extracted by the CNN. This simplifies the pooling structure, reduces the model complexity, and improves computational efficiency. Yu et al. ( 28 ) proposed the HBP algorithm, which integrates multiple cross-layer bilinear features based on FBP and merges them to enhance the feature representation ability. Tan et al. ( 34 ) generated an independent slack mask according to the feature values output by each CNN layer and a threshold, then used a mask aggregated from the multiple independent masks to filter the features before inputting them into the HBP structure. These methods reduce the difficulty of designing a model through more discriminative feature representation, but the resulting high feature dimension limits the generalization ability.

Vehicle-specific Characteristic: This kind of method utilizes informative vehicle characteristics to design the recognition network structure for better recognition performance ( 17 – 19 , 22 , 23 , 27 , 37 ). Lu and Huang ( 37 ) divided the front appearance of a vehicle into seven fixed sub-regions and identified these sub-regions in two stages through a hierarchical structure-based recognition scheme. First, the method identified the brand of a vehicle through the vehicle’s logo, and then it recognized the other six sub-regions to identify the model. Chen et al. ( 27 ) proposed a feedback-enhanced multi-branch CNN in response to the needs of the real monitoring environment, using a multi-branch loss function to enhance the feedback to each branch. The accuracy of this method on its self-made dataset and the CompCars dataset ( 30 ) reached 94.9% and 91.0% respectively. Xiang et al. ( 23 ) proposed a network structure with a topology constraint, describing the relationship between regions and integrating the global topology relationship into the CNN. After the entire network is trained, the network that uses the global topological relationship can utilize features more effectively. This type of method is designed based on the characteristics of a vehicle, so researchers need professional vehicle-related knowledge as a prerequisite in such studies.

The quality of the dataset determines the upper limit of the performance of the methods in VMMR. Many researchers have proposed public and self-made VMMR datasets for testing and training. Yang et al. ( 30 ) published a public dataset containing web-derived and surveillance-derived parts. The dataset comprises 163 vehicle brands and 1,716 models, and includes 136,727 images of vehicles taken from different view angles. Tafazzoli et al. ( 38 ) proposed a dataset that contains 9,170 categories, including 291,752 images covering the vehicle models manufactured between 1950 and 2016. Krause et al. ( 10 ) proposed a multi-view dataset named the Stanford Cars dataset. The dataset contains two sub-datasets: an ultra-fine-grained small dataset of 10 BMW models and a relatively large dataset with 196 models. The large sub-dataset contains 16,185 images, and the small sub-dataset contains 512 pictures. The Stanford Cars dataset includes not only 2D labels but also 3D labels. The representation of three-dimensional objects can be used for geometric estimation, 3D model reconstruction, and 3D representation of vehicle components. This dataset has some problems: some pictures have low resolution, such as 120 × 96 or 142 × 94 pixels, the categories follow a slightly long-tailed distribution, and so forth. However, because of the authority of this dataset and its rich label information, researchers usually use it to measure the performance of their methods in VMMR.

Methods

The proposed method is set out in three sections to provide a better explanation. First, we describe the proposed network structure, WMHBP for feature extraction. Then, we introduce the implementation of the hierarchical loss function. Finally, we introduce the methodology of data augmentation.

Feature Extraction Model

Factorized Bilinear Pooling and Hierarchical Bilinear Pooling

Unlike BCNN, FBP ( 33 ) factorizes the pooling layer in bilinear CNN, which dramatically reduces the number of parameters in the model and mitigates overfitting. Given an input image, the CNN outputs a feature $X \in R^{h \times w \times c}$ , where h is the height, w is the width, and c denotes the channel number. The FBP model is defined as

\begin{matrix} z_{i} = x^{T} W_{i} x = x^{T} U_{i} V_{i}^{T} x = U_{i}^{T} x ⊙ V_{i}^{T} x \end{matrix}

(1)

\begin{matrix} O_{FBP} = P^{T} (U^{T} x ⊙ V^{T} x) \end{matrix}

(2)

where

$W = [W_{1}, W_{2}, \dots, W_{o}] \in R^{c \times c \times o}$ is the projection matrix that we need to train,

$x = X^{m, n}$ is the c-dimensional local descriptor with the spatial location $(m, n)$ ,

⊙ is the Hadamard product,

$U \in R^{c \times d}$ and $V \in R^{c \times d}$ are projection matrices that transform features into d-dimensional, and

$P \in R^{d \times o}$ is a classification matrix.
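The factorization in Equations 1 and 2 can be illustrated with a minimal NumPy sketch. The tensor sizes and random values below are illustrative assumptions, and sum pooling over spatial locations is also an assumption, since the pooling operator is not spelled out at this point in the text.

```python
import numpy as np

def fbp(X, U, V, P):
    """Factorized bilinear pooling of one feature map (sketch).

    X: (h, w, c) feature map; U, V: (c, d) projections; P: (d, o) classifier.
    """
    h, w, c = X.shape
    x = X.reshape(h * w, c)        # one c-dimensional descriptor per location
    joint = (x @ U) * (x @ V)      # U^T x ⊙ V^T x at every location, shape (h*w, d)
    pooled = joint.sum(axis=0)     # sum pooling over locations (assumption)
    return pooled @ P              # P^T (...), shape (o,)

# Illustrative sizes and random inputs (not taken from the paper).
rng = np.random.default_rng(0)
h, w, c, d, o = 4, 4, 8, 16, 5
X = rng.standard_normal((h, w, c))
U = rng.standard_normal((c, d))
V = rng.standard_normal((c, d))
P = rng.standard_normal((d, o))
out = fbp(X, U, V, P)

# Rank-one view of Equation 1 for a single column pair (u, v):
# x^T (u v^T) x equals (u^T x) * (v^T x).
x0 = X[0, 0]
lhs = x0 @ np.outer(U[:, 0], V[:, 0]) @ x0
rhs = (U[:, 0] @ x0) * (V[:, 0] @ x0)
```

The last three lines check the identity behind the factorization: the full c × c matrix never has to be formed, only two c-dimensional projections.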

In the HBP model, multiple features from the backbone network are combined to enhance feature expression. Based on the design concept of FBP, the features in the last layer of blocks are merged in pairs to strengthen the interaction and feature representation capabilities between the convolutional layers. The following equation can define the HBP model:

\begin{matrix} O_{HBP} = P^{T} concat (U^{T} x ⊙ V^{T} y, U^{T} x ⊙ S^{T} z, V^{T} y ⊙ S^{T} z, \dots) \end{matrix}

(3)

where

$x = X^{m, n}, y = Y^{m, n}, z = Z^{m, n}$ are the c-dimensional local descriptors with the spatial location $(m, n)$ ,

x, y, z represent different local feature descriptors,

⊙ is the Hadamard product,

$U \in R^{c \times d}$ , $V \in R^{c \times d}$ , $S \in R^{c \times d}$ are projection matrices that transform features into d-dimensional, and

$P \in R^{d \times o}$ is a classification matrix.

The HBP model first obtains three feature matrices of the same dimension from the last three blocks of the feature extraction network. Simultaneously, the feature matrix is converted into a higher-dimensional feature matrix through the convolution layer. The three feature matrices are multiplied by the Hadamard product in pairs to obtain three interactive feature matrices. Finally, the interactive feature matrices are pooled and connected into a matrix to obtain a feature descriptor with richer semantic information.
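The pairwise interaction described above can be sketched as follows. This is a simplified illustration, not the reference implementation: the projections are plain matrix products rather than convolution layers, sum pooling is assumed, and the classifier $P$ is omitted so that only the concatenated descriptor is shown.

```python
import numpy as np

def hbp_descriptor(X, Y, Z, U, V, S):
    """Cross-layer bilinear descriptor from three (h, w, c) feature maps (sketch)."""
    c = X.shape[-1]
    x = X.reshape(-1, c) @ U       # projected features of layer 1, shape (h*w, d)
    y = Y.reshape(-1, c) @ V       # projected features of layer 2
    z = Z.reshape(-1, c) @ S       # projected features of layer 3
    pairs = [x * y, x * z, y * z]  # Hadamard products of every layer pair
    # Pool each interaction over locations (sum pooling is an assumption),
    # then concatenate the three pooled vectors.
    return np.concatenate([p.sum(axis=0) for p in pairs])

rng = np.random.default_rng(0)
h, w, c, d = 4, 4, 8, 16           # illustrative sizes
X, Y, Z = (rng.standard_normal((h, w, c)) for _ in range(3))
U, V, S = (rng.standard_normal((c, d)) for _ in range(3))
descriptor = hbp_descriptor(X, Y, Z, U, V, S)
```

With three layers there are three pairs, so the concatenated descriptor has length 3d.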

Weighted Mask Hierarchical Bilinear Pooling (WMHBP)

In the proposed method, inspired by Tan et al. ( 34 ) and Wei et al. ( 39 ), the impact of useless background on the recognition performance of the model is reduced by extracting the discriminative features in the picture. We generate a weighted mask that reduces the loss of useful feature information while restraining feature information that is not useful for the current task. The locations at which the features of each layer respond to semantic information differ considerably, so it is essential to select a useful descriptor to generate a mask. The proposed method is called weighted mask hierarchical bilinear pooling (WMHBP). This method uses global average pooling to calculate the overall average of the feature maps, $\bar{a}$ , which is scaled by a constant β to give the threshold. Locations where the feature matrix A exceeds the threshold are assigned a value of 1, and all other locations are assigned a value of 0.1 rather than 0, so the mask is two-valued but weighted instead of strictly binary. The dimension of the mask is the same as that of the feature matrix A , and the weighted mask M is defined as

\begin{matrix} M (i, j) = \left\{ \begin{matrix} 1, & A_{(i, j)} > β \bar{a} \\ 0.1, & \text{otherwise} \end{matrix} \right. \end{matrix}

(4)

where $(i, j)$ denotes the spatial location $(i \in \{1, \dots, h\}, j \in \{1, \dots, w\}, A_{(i, j)} \in R)$ . β is a constant with a value of 0.5.
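A minimal sketch of Equation 4 on a toy 2 × 2 feature matrix may help. The input values are illustrative; the function and parameter names are hypothetical, with `beta` and `low` standing for β and the 0.1 weight in the equation.

```python
import numpy as np

def weighted_mask(A, beta=0.5, low=0.1):
    """Two-valued weighted mask following Equation 4 (sketch).

    A: (h, w) feature matrix. Locations above beta * mean(A) get weight 1,
    all other locations get the small weight `low` instead of 0.
    """
    threshold = beta * A.mean()
    return np.where(A > threshold, 1.0, low)

A = np.array([[0.0, 1.0],
              [2.0, 5.0]])         # toy input; mean is 2.0, threshold is 1.0
M = weighted_mask(A)
```

Here only the bottom row exceeds the threshold, so the top row is down-weighted to 0.1 rather than removed.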

We used the HBP model to extract three semantic features through the backbone network and used Equation 4 to generate masks $M_{1}, M_{2}, M_{3}$ of size $h \times w,$ respectively. Then we aggregated the three generated masks into $M_{agg}$ using

\begin{matrix} M_{agg} (i, j) = {(\frac{\sum_{k = 1}^{n} M_{k} (i, j)}{n})}^{θ} \end{matrix}

(5)

where $(i, j)$ denotes the spatial location $(i \in \{1, \dots, h\}, j \in \{1, \dots, w\}, θ \in Z)$ , n is the number of corresponding masks, and θ is the variable that controls the tightness of $M_{agg}$ . The larger θ is, the less likely an area is to be retained.
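The effect of θ in Equation 5 can be seen in a small sketch. The three toy masks below are illustrative: the first location is kept by all three masks, and the second by only one.

```python
import numpy as np

def aggregate_masks(masks, theta=2):
    """Average n masks element-wise and raise the result to the power theta (Equation 5)."""
    return np.stack(masks).mean(axis=0) ** theta

# Toy masks of size 1 x 2 (illustrative values).
M1 = np.array([[1.0, 0.1]])
M2 = np.array([[1.0, 1.0]])
M3 = np.array([[1.0, 0.1]])

loose = aggregate_masks([M1, M2, M3], theta=1)   # plain average
tight = aggregate_masks([M1, M2, M3], theta=2)   # squared average
```

The location on which all masks agree stays at 1 for any θ, while the contested location drops from 0.4 to 0.16 when θ goes from 1 to 2, which is the tightening behaviour described above.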

As shown in Figure 2, the backbone generates three features $A_{1}, A_{2}, A_{3} \in R^{h \times w \times c}$ respectively. We then fuse the features with $M_{agg}$ through the Hadamard product operator to obtain the filtered features $A_{1}^{'}, A_{2}^{'}, A_{3}^{'}$ . Finally, we input the features $A_{1}^{'}, A_{2}^{'}, A_{3}^{'}$ into HBP to get the final result. The following formula defines the WMHBP model:

\begin{matrix} O_{WMHBP} = P^{T} concat (U^{T} a_{1}^{'} ⊙ V^{T} a_{2}^{'}, U^{T} a_{1}^{'} ⊙ S^{T} a_{3}^{'}, V^{T} a_{2}^{'} ⊙ S^{T} a_{3}^{'}, \dots) \end{matrix}

(6)

where $a_{1}^{'}, a_{2}^{'}, a_{3}^{'}$ are c-dimensional local descriptors at the same spatial location as the features $A_{1}^{'}, A_{2}^{'}, A_{3}^{'}$ . The WMHBP method can provide a more reliable selection of discriminative regions so that feature interaction produces more effective results and achieves better recognition accuracy.
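The filtering step that precedes Equation 6 amounts to a broadcast Hadamard product between an h × w mask and an h × w × c feature. A minimal sketch, with illustrative values and a hypothetical helper name, is:

```python
import numpy as np

def apply_mask(A, M_agg):
    """Scale every channel of A (h, w, c) by the spatial mask M_agg (h, w)."""
    return A * M_agg[:, :, None]   # add a channel axis so the mask broadcasts

A1 = np.ones((2, 2, 3))                          # toy feature with 3 channels
M_agg = np.array([[1.0, 0.16],
                  [0.01, 1.0]])                  # toy aggregated mask
A1_filtered = apply_mask(A1, M_agg)
```

Because the same mask is applied to all three features, the filtered features stay spatially aligned before they interact in HBP.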

Figure 2.

Illustration of the weighted mask hierarchical bilinear pooling (WMHBP) framework.

Loss Function Based on Hierarchical Structure

Based on the traditional cross-entropy function, we propose the brand loss function and the series loss function to learn the appearance differences and commonalities between different models and different brands. To better match the structure of the hierarchical loss function, we added 48 brand labels to the original 196 model labels, giving labels such as 47.194.Volvo_240_Sedan_1993. These labels are sorted in ascending order. The dataset label structure is shown in Figure 3.
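A label of this kind can be split into its parts with a few lines of Python. The field order (brand index, then model index, then the class name) is an assumption inferred from the single example above, and `parse_label` is a hypothetical helper.

```python
def parse_label(label):
    """Split a hierarchical label into (brand index, model index, readable name).

    Assumes the layout "<brand id>.<model id>.<name with underscores>".
    """
    brand_id, model_id, name = label.split(".", 2)
    return int(brand_id), int(model_id), name.replace("_", " ")

brand, model, name = parse_label("47.194.Volvo_240_Sedan_1993")
```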

Figure 3.

The label format of the dataset.

The number of linear layer output units in the WMHBP model was increased from the number of models (in the Stanford Cars dataset this is 196) to the sum of the number of models and the number of brands (244). With the WMHBP model, a prediction matrix $M$ with the dimension of $batch size \times 244$ would be the output. We used the segmentation function $g_{k} (m, n)$ to cut the prediction matrix $M$ and label matrix $N$ into two parts $y, \hat{y}$ and $z, \hat{z}$ . When $g_{k} (m, n)$ is applied to the series loss function, $y, \hat{y} \in batch size \times 196$ ; when $g_{k} (m, n)$ is applied to the brand loss function, $z, \hat{z} \in batch size \times 48$ .

In summary, we can define the $L_{cls}$ loss function as

\begin{matrix} L_{cls} (m, n) = L_{Brand} + β L_{Series} = - \frac{1}{N} \sum_{k = 1}^{N} [z_{k} \log \hat{z_{k}} + β y_{k} \log \hat{y_{k}}] \end{matrix}

(7)

where

$L_{Brand} = - \log (P_{B} (z))$ , $L_{Series} = - \log (P_{S} (y))$ , and $P_{B}, P_{S}$ are the category probabilities output by the last Softmax layer,

$L_{cls}$ is the loss value formed by combining the brand loss value and the series loss value, which is used to optimize the parameters of the classification model,

β weights the series loss relative to the brand loss,

k is the index of training images, which ranges from 1 to N,

$y, \hat{y}$ represent the ground-truth label vector and the class probability vector of the model (series) respectively,

$z, \hat{z}$ represent the ground-truth label vector and the class probability vector of the brand respectively, and

$N$ is the sample number of the dataset.
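A minimal NumPy sketch of the combined loss is given below. Several details are assumptions rather than facts from the text: the first 196 output columns are taken to be the model scores and the last 48 the brand scores, a separate softmax is applied to each part, and the weight `beta` defaults to 1.0 because its value is not stated.

```python
import numpy as np

NUM_MODELS, NUM_BRANDS = 196, 48   # Stanford Cars: 196 models, 48 brands

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def hierarchical_loss(logits, model_ids, brand_ids, beta=1.0):
    """Brand cross-entropy plus beta times model (series) cross-entropy.

    logits: (batch, 244) network output; the column layout is an assumption.
    """
    model_logp = log_softmax(logits[:, :NUM_MODELS])
    brand_logp = log_softmax(logits[:, NUM_MODELS:])
    rows = np.arange(logits.shape[0])
    loss_series = -model_logp[rows, model_ids].mean()
    loss_brand = -brand_logp[rows, brand_ids].mean()
    return loss_brand + beta * loss_series

# With all-zero logits both softmaxes are uniform, so the loss is
# log(48) + beta * log(196) regardless of the labels.
uniform_loss = hierarchical_loss(
    np.zeros((2, NUM_MODELS + NUM_BRANDS)),
    model_ids=np.array([0, 194]),
    brand_ids=np.array([0, 47]),
)
```

A prediction that gets the brand right but the model wrong is penalized only by the series term, which is what lets the network exploit the brand and model hierarchy.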

Data Augmentation

We obtained pictures of vehicles from web pages using web crawler technology and constructed a VMMR dataset called the multi-view high-resolution vehicle (MHV) dataset. Each image usually contained only one car with a uniform resolution of 1027 × 768 pixels. After the pictures were collected, we classified all the images with image-level labels. The labels of the dataset take the same format as the labels of the Stanford Cars dataset. Each image label contains three kinds of information: brand, model, and year, such as “Audi A5 Coupe 2012” and “Dodge Journey SUV 2012.” Subsequently, we searched for the categories shared by the MHV dataset and the Stanford Cars dataset and extracted 5,545 images of these categories. Finally, these images were added to the Stanford Cars dataset to form an enhanced dataset.

Experiments

Handling Datasets

The datasets (Stanford Cars, MHV, Stanford Cars & MHV) are divided into training images and testing images with a ratio of around 50% to 50%. In the training phase, we flipped the original image horizontally to augment the dataset further and randomly cropped the image to 448 × 448 pixels. In the test phase, we cropped the image into 448 × 448 pixels by center cropping. The number of categories of the dataset and the detailed statistics of the dataset segmentation are shown in Table 1.

Table 1.

Statistics of the Related Datasets

Dataset	Categories	Training samples	Testing samples
Stanford Cars	196	8,144	8,041
MHV	196	5,545	5,540
Stanford Cars and MHV	196	13,689	13,581

Note: MHV = multi-view high-resolution vehicle.

Implementation

We used a backbone network pre-trained on the ImageNet ( 40 ) dataset to build our recognition network. Unlike other similar experiments ( 28 ), the proposed method requires higher computational efficiency and more effective inter-layer interaction, so we did not choose VGG ( 41 ) or Inception ( 42 ). Instead, we decided to use the ResNet network ( 43 ), which is a compromise between the number of parameters and the depth of the network, as the backbone network. In the training process, we followed the training method of ( 28 ) and divided the experiment into two steps. First, we fixed the parameters of ResNet and trained the fully connected layer and the high-dimensional projection layers. Second, we updated the parameters of all layers.

We trained the network using stochastic gradient descent (SGD) with a batch size of 16, momentum of 0.9, and weight decay of 10⁻⁵. The initial learning rate (LR) is 1.0 when training the fully connected layer and 0.01 when training the whole model; it is then decreased by a factor of 10 every 40 epochs. All experiments were implemented on a Titan V GPU server with the PyTorch environment ( 44 ) and TorchVision library.
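The step schedule can be written as a one-line function. Epochs are counted from 0 here, which is an assumption about the indexing convention.

```python
def learning_rate(epoch, base_lr, step=40, factor=0.1):
    """Step decay: multiply the base rate by `factor` once every `step` epochs."""
    return base_lr * factor ** (epoch // step)

# Second training stage (all layers), base rate 0.01.
lr_start = learning_rate(0, 0.01)
lr_epoch_39 = learning_rate(39, 0.01)
lr_epoch_40 = learning_rate(40, 0.01)
```

The rate stays at 0.01 through epoch 39 and drops to about 0.001 at epoch 40; the same function with a base rate of 1.0 covers the stage that trains only the fully connected layer.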

To find the best threshold coefficient $β$ for generating the single mask and the best value $θ$ to control the degree of combination of multiple masks, we chose the value of $θ$ from $\{1, 2, 3, 4\}$ and the value of $β$ from $[0.4, 1]$ . Initially we set $θ$ to 1 and $β$ to 0.4, and we increased the value of $β$ linearly by 0.2 each time. The final values of $θ$ and $β$ were 2 and 0.6 respectively. Because the VMMR datasets are similar, we set the values of $θ$ and $β$ to 2 and 0.6 respectively in all the subsequent experiments when these two parameters were required. Table 2 shows all the hyperparameters (LR-FC in the table represents the initial learning rate when adjusting the fully connected layer; LR-All represents the initial learning rate when adjusting all layer parameters).
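The candidate settings can be enumerated as below. This reads the 0.2 step as applying to β, since θ is integer-valued, and it assumes that every pair was a candidate; the text does not say whether the full grid was actually evaluated.

```python
from itertools import product

thetas = [1, 2, 3, 4]                                # candidate values of theta
betas = [round(0.4 + 0.2 * i, 1) for i in range(4)]  # 0.4, 0.6, 0.8, 1.0
grid = list(product(thetas, betas))                  # all (theta, beta) pairs

selected = (2, 0.6)                                  # values reported in the text
```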

Table 2.

Detailed Statistics of Hyperparameters in the Experiment

Hyperparameters	Batch size	Weight decay	Momentum	LR-FC	LR-all
Values	16	10⁻⁵	0.9	1.0	0.01

Note: LR-FC = initial learning rate when adjusting the fully connected layer; LR-all = initial learning rate when adjusting all layer parameters.

Ablation Study

We conducted several ablation experiments to study the influence of some important parameters and different components of the proposed method. The effects of each set of experiments are shown in Table 3.

Table 3.

Comparison of Classification Results of Various Components in the Proposed Method on Stanford Cars Dataset

Model	DA	WM	HL	Acc. (%)	Precision (%)	Recall (%)	F1-score (%)
WMHBP (ResNet34)	na	na	na	92.89	93.14	92.90	92.88
	√	na	na	94.76	95.00	94.77	94.76
	√	√	na	94.82	95.02	94.82	94.79
	√	√	√	95.13	95.32	95.13	95.09

Note: DA = data augmentation used in training; WM = weighted mask; HL = hierarchical loss; acc. = accuracy; WMHBP = weighted mask hierarchical bilinear pooling; na = not applicable.

The last row (bold in the original table) represents the results of using all the components proposed in this paper.

Ablation Study of the Data Augmentation

We conducted the experiments using the Stanford Cars & MHV dataset (augmented dataset) and the Stanford Cars dataset respectively for training to verify the impact of data augmentation on the accuracy. A comparison was done with multiple models, and the results are shown in Figure 4.

Figure 4.

Ablation contrast of different models using data augmentation tested on the Stanford Cars dataset.

In respect of the recognition accuracy, it can be seen from Figure 4 that there was an increase from 92.89% to 94.76% for HBP, an increase from 93.00% to 94.82% for WMHBP, and an increase from 93.30% to 95.13% for WMHBP with hierarchical loss when using the augmented dataset to train the various ResNet34-based networks. When ResNet18 is used as the backbone network, the recognition accuracy of HBP is increased from 91.99% to 94.19%, the recognition accuracy of WMHBP is increased from 92.23% to 94.27%, and the recognition accuracy of WMHBP with hierarchical loss is increased from 93.38% to 94.59%. Regardless of the recognition model adopted, data augmentation can improve the robustness of the model, enhance the sample quality, and reduce the model’s dependence on image resolution and other factors, and therefore improve the generalization performance of the model. The impact of data augmentation on the accuracy of vehicle recognition is significant.

It can be seen from Table 4 that, when the MHV dataset is tested using ResNet34 as the backbone network, the accuracy of WMHBP with the hierarchical loss function is 98.61%, which is 0.21 percentage points higher than the accuracy of WMHBP (98.40%) and 0.24 percentage points higher than that of HBP (98.37%). When ResNet18 is used as the backbone network, the accuracy of WMHBP with the hierarchical loss function is 97.01%, which is 0.15 percentage points higher than the accuracy of WMHBP (96.86%) and 0.53 percentage points higher than that of HBP (96.48%). The accuracy of the same method on the MHV dataset is higher than that on the Stanford Cars dataset, which supports the effectiveness of the recognition method and the applicability of the MHV dataset.

Table 4.

Comparison of Different Methods’ Classification Accuracy on the MHV Dataset

Backbone	Method	Accuracy (%)
ResNet18	HBP	96.48
	WMHBP	96.86
	WMHBP_HL	97.01
ResNet34	HBP	98.37
	WMHBP	98.40
	WMHBP_HL	98.61

Note: MHV = multi-view high-resolution vehicle; HBP = hierarchical bilinear pooling; WMHBP = weighted mask hierarchical bilinear pooling; WMHBP_HL = weighted mask hierarchical bilinear pooling with hierarchical loss.

Ablation Study for the WMHBP

In the experiments, the methods were trained on the Stanford Cars & MHV dataset and tested on the Stanford Cars dataset. The experimental results show that the accuracy of the HBP model using the weighted mask is 94.82%, which is 0.06 percentage points higher than the 94.76% of the HBP model without the weighted mask (see also Figure 4).

In addition, we combined three independent masks into a weighted mask. Each element in the mask means the probability of a vehicle appearing in the area. Since the size of the mask is much smaller than the size of the original image, we used bilinear interpolation for upsampling and smoothing the edges of the mask to fit the original image. The visual image is generated on the corresponding sample; the effect of mask fusion with the original picture is shown in Figure 5.
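The upsampling step can be sketched with a small hand-written bilinear interpolation. The corner-aligned sampling grid used here is an assumption; the exact interpolation settings behind Figure 5 are not specified.

```python
import numpy as np

def upsample_bilinear(mask, out_h, out_w):
    """Bilinearly resize a 2D mask to (out_h, out_w) with aligned corners."""
    in_h, in_w = mask.shape
    ys = np.linspace(0, in_h - 1, out_h)     # output rows in input coordinates
    xs = np.linspace(0, in_w - 1, out_w)     # output columns in input coordinates
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)        # clamp neighbours at the border
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]                  # vertical interpolation weights
    wx = (xs - x0)[None, :]                  # horizontal interpolation weights
    top = mask[y0][:, x0] * (1 - wx) + mask[y0][:, x1] * wx
    bottom = mask[y1][:, x0] * (1 - wx) + mask[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

small = np.array([[0.1, 1.0],
                  [0.1, 1.0]])               # toy 2 x 2 mask
large = upsample_bilinear(small, 2, 3)
```

The interpolated middle column (0.55) shows how the hard edge between the 0.1 and 1.0 regions is smoothed when the mask is enlarged.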

Figure 5.

Visualization of weighted mask based on the fusion of masks. The first column of each group of images is the input image, and the second column shows the heat map from the weighted mask.

The two adjacent cells in Figure 5 form a set of images: the left half is the original image in the dataset, and the right half is the picture produced by fusing the weighted mask with the original picture. The red area in the picture represents the reserved area, and the blue area is the area that needs to be eliminated. As we can see from Figure 5, the weights of the mask are assigned according to the possibility that a region is the foreground, which can be used to reduce or shield the input of invalid features into the feature processing step. Such a mask has a spatial correspondence with the feature. From the first column, A1, we can see that the vehicle body’s color is similar to the color of the shadow of the vehicle and is not easy to distinguish, so the feature of the lower part of the bumper is reduced rather than eliminated. The edge of the weighted mask fits closely with the image foreground, and discriminative areas such as the rearview mirror and headlamp of the vehicle are located at the edge of the image foreground. Therefore, generating each single mask sparsely allows the discriminative areas at the edge of the image foreground to be completely preserved.

The larger the values of β and θ in Equations 4 and 5 are, the sharper the outline of the weighted mask will be and the smaller the retained contiguous area will be; the smaller the values of β and θ, the more invalid areas will be input to the network. Therefore, choosing appropriate values of β and θ is particularly crucial for suppressing the input of cluttered background.

We also examined other choices for the value assigned in the “otherwise” case of Equation 4. This value, denoted γ, controls the degree of tolerance for the background area. We took $γ \in \{0.1, 0.2, 0.3, 0.4\}$ and performed extended experiments on the Stanford Cars & MHV dataset. The performance for the different values of γ is shown in Table 5. The best performance is obtained when γ is set to 0.1, although the value of γ has little effect on the final result.

Table 5.

Comparison of Classification Accuracies of Different Values of γ on the Stanford Cars and MHV Dataset

Backbone	γ	Accuracy (%)
ResNet34	0.1	96.28
	0.2	96.24
	0.3	96.23
	0.4	96.25

Note: MHV = multi-view high-resolution vehicle.

Ablation Study of the Hierarchical Loss

We applied the WMHBP model with different loss functions for testing on the Stanford Cars dataset and Stanford Cars & MHV dataset, and the accuracy is shown in Table 6.

Table 6.

Comparison of Classification Accuracy of Hierarchical Loss and Cross-Entropy Loss on the Stanford Cars and Stanford Cars & MHV Datasets

Backbone	Dataset	CE-M (Acc.) (%)	HL-M (Acc.) (%)	CE-B (Acc.) (%)	HL-B (Acc.) (%)
ResNet18	SC	94.27	94.59	97.31	97.84
ResNet18	SC & MHV	95.80	96.08	97.99	98.27
ResNet34	SC	94.82	95.13	98.73	98.87
ResNet34	SC & MHV	96.28	96.55	98.96	99.07

Note: Acc. = accuracy; CE = cross-entropy function; HL = hierarchical loss; M = classification accuracy of vehicle model; B = classification accuracy of vehicle brand; SC = Stanford Cars dataset; MHV = multi-view high-resolution vehicle dataset.

As shown in Table 6, with ResNet34 as the backbone on the Stanford Cars dataset, the recognition network trained with the hierarchical loss function achieved a model accuracy of 95.13% and a brand accuracy of 98.87%, which are 0.31 and 0.14 percentage points higher, respectively, than with the cross-entropy loss function. To make the comparison more convincing, we also analyzed the results on the Stanford Cars & MHV dataset: model accuracy was 96.55% with hierarchical loss against 96.28% with cross-entropy, a gain of 0.27 percentage points, and brand accuracy rose from 98.96% to 99.07%. With ResNet18 as the backbone on the Stanford Cars dataset, the hierarchical loss gave a model accuracy of 94.59% and a brand accuracy of 97.84%, which are 0.32 and 0.53 percentage points higher, respectively, than with cross-entropy. On the Stanford Cars & MHV dataset, WMHBP (ResNet18) reached 96.08% with hierarchical loss, 0.28 percentage points above cross-entropy, and brand accuracy rose from 97.99% to 98.27%. These results show that the hierarchical loss consistently improves recognition accuracy for both vehicle models and brands.
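The gains quoted above follow directly from Table 6. As a quick arithmetic check, the short Python sketch below recomputes the difference between hierarchical loss and cross-entropy loss for each row. The dictionary layout and the helper name `gains` are illustrative choices for this check and are not part of the original experiments.

```python
# Accuracy values (%) copied from Table 6.
# Keys: (backbone, dataset); values: (CE-M, HL-M, CE-B, HL-B).
TABLE_6 = {
    ("ResNet18", "SC"): (94.27, 94.59, 97.31, 97.84),
    ("ResNet18", "SC & MHV"): (95.80, 96.08, 97.99, 98.27),
    ("ResNet34", "SC"): (94.82, 95.13, 98.73, 98.87),
    ("ResNet34", "SC & MHV"): (96.28, 96.55, 98.96, 99.07),
}


def gains(row):
    """Return (model gain, brand gain) of HL over CE in percentage points."""
    ce_m, hl_m, ce_b, hl_b = row
    return round(hl_m - ce_m, 2), round(hl_b - ce_b, 2)


for (backbone, dataset), row in TABLE_6.items():
    model_gain, brand_gain = gains(row)
    print(f"{backbone:8s} {dataset:8s} model +{model_gain:.2f}  brand +{brand_gain:.2f}")
```

In every row both gains are positive, which is the basis for the claim that the hierarchical loss helps at both the model level and the brand level.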

Based on the ResNet34 results above, we selected the brand categories that are most easily confused and compiled their statistics into two confusion matrices, shown in Figure 6. The horizontal and vertical axes list the brands that are easily misidentified; the last index represents all other unselected categories, and the bottom right corner gives the total number of recognition errors among the selected categories. More images are correctly identified with hierarchical loss than without it, and the total number of brand recognition errors drops from 91 to 72. Taking the brand with serial number 9 as an example, hierarchical loss yields 877 correct classifications and 19 errors, whereas cross-entropy loss yields 872 correct classifications and 24 errors. These results demonstrate that the WMHBP model based on the hierarchical loss function can learn the differences in vehicle appearance between brands, reduce brand recognition errors, and improve the overall accuracy.
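To put the counts for brand 9 on the same scale as the accuracies in Table 6, they can be converted to a per-brand accuracy. The sketch below does this and also computes the relative reduction in total brand errors; the function name `accuracy` is a hypothetical helper, and only the counts themselves come from the text.

```python
def accuracy(correct, errors):
    """Fraction of correctly classified images, given correct and error counts."""
    return correct / (correct + errors)


# Counts reported for the brand with serial number 9.
hl_acc = accuracy(877, 19)  # with hierarchical loss
ce_acc = accuracy(872, 24)  # with cross-entropy loss

# Relative reduction in total brand recognition errors (91 -> 72).
error_reduction = (91 - 72) / 91

print(f"brand 9, hierarchical loss: {hl_acc:.2%}")
print(f"brand 9, cross-entropy:     {ce_acc:.2%}")
print(f"total brand errors reduced by {error_reduction:.1%}")
```

Both settings cover the same 896 images of brand 9, so the two accuracies (roughly 97.9% and 97.3%) are directly comparable.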

Figure 6.

Confusion matrix of typical examples of hierarchical loss and cross-entropy loss on the Stanford Cars dataset.

Comparison with State-of-the-Art Methods

We also compared the WMHBP with other state-of-the-art methods in VMMR. All methods are evaluated on accuracy and on real-time indicators. Some results come from reimplementation experiments (marked with * in the table). The results are shown in Table 7.

Table 7.

Comparison of Weighted Mask Hierarchical Bilinear Pooling (WMHBP) and Some State-of-the-Art Methods on Stanford Cars Dataset

Method	Backbone	Accuracy (%)
FCAN ( 21 )	ResNet50	89.1
FCAN(+BBox) ( 21 )	ResNet50	91.3
RA-CNN ( 25 )	VGG19	92.5
MA-CNN ( 24 )	VGG19	92.8
DT-RAM ( 20 )	ResNet50	93.1
TA-SN ( 26 )	ResNet50	93.8
*BCNN ( 31 )	VGGD+VGGM	91.3
KP ( 13 )	VGG16	92.4
HIHCA ( 35 )	VGG16	91.7
Boost-CNN ( 32 )	VGG16	92.1
*HBP-ResNet ( 28 )	ResNet34	92.2
*HBPASM ( 34 )	ResNet34	92.8
SWP ( 18 )	ResNet101	93.1
DCEL ( 19 )	DenseNet161	93.3
GTCN ( 23 )	DenseNet264	94.3
Proposed method	ResNet34	95.1

Note: FCAN = fully convolutional attention network; FCAN(+BBox) = fully convolutional attention network with bounding box; RA-CNN = recurrent attention convolutional neural network; MA-CNN = multi-attention convolutional neural network; DT-RAM = dynamic time recurrent attention model; TA-SN = trilinear attention sampling network; BCNN = bilinear convolutional neural network; KP = kernel pooling; HIHCA = higher-order integration of hierarchical convolutional activations; boost-CNN = boosted convolutional neural network; HBP-ResNet = hierarchical bilinear pooling using ResNet; HBPASM = hierarchical bilinear pooling with aggregated slack mask; SWP = spatially weighted pooling; DCEL = dual cross-entropy loss; GTCN = global topology constraint network.

* indicates reimplementation experimental results.

The WMHBP does not use additional information such as bounding box (BBox) and part annotations. The fine-grained recognition methods used in the comparison are divided into three categories: (1) methods based on the attention mechanism, namely, fully convolutional attention network (FCAN) ( 21 ), recurrent attention convolutional neural network (RA-CNN) ( 25 ), multi-attention convolutional neural network (MA-CNN) ( 24 ), dynamic time recurrent attention model (DT-RAM) ( 20 ), and trilinear attention sampling network (TA-SN) ( 26 ); (2) methods based on high-dimensional feature coding, namely, bilinear convolutional neural network (BCNN) ( 31 ), kernel pooling (KP) ( 13 ), higher-order integration of hierarchical convolutional activations (HIHCA) ( 35 ), boosted convolutional neural network (Boost-CNN) ( 32 ), HBP ( 28 ), and HBP with aggregated slack mask (HBPASM) ( 34 ); and (3) methods based on vehicle-specific characteristics, namely, spatially weighted pooling (SWP) ( 18 ), dual cross-entropy loss (DCEL) ( 19 ), and global topology constraint network (GTCN) ( 23 ).

The results show that the WMHBP achieved higher accuracy than the methods in all three categories. It outperforms TA-SN by 1.3 percentage points, HBPASM by 2.3 percentage points, and GTCN by 0.8 percentage points.
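The three margins quoted above are the gaps between the proposed method and the strongest competitor in each category. The sketch below reproduces them from the Table 7 values; the grouping dictionary and the helper `margins` are illustrative names introduced only for this check.

```python
# Accuracy (%) on Stanford Cars, copied from Table 7 and grouped by the
# three categories described in the text.
CATEGORIES = {
    "attention": {"FCAN": 89.1, "FCAN(+BBox)": 91.3, "RA-CNN": 92.5,
                  "MA-CNN": 92.8, "DT-RAM": 93.1, "TA-SN": 93.8},
    "high-dimensional coding": {"BCNN": 91.3, "KP": 92.4, "HIHCA": 91.7,
                                "Boost-CNN": 92.1, "HBP-ResNet": 92.2,
                                "HBPASM": 92.8},
    "vehicle-specific": {"SWP": 93.1, "DCEL": 93.3, "GTCN": 94.3},
}
PROPOSED = 95.1


def margins(categories, proposed):
    """Map each category to (best method, margin of the proposed method)."""
    result = {}
    for name, methods in categories.items():
        best = max(methods, key=methods.get)
        result[name] = (best, round(proposed - methods[best], 1))
    return result


for name, (best, margin) in margins(CATEGORIES, PROPOSED).items():
    print(f"{name:24s} best = {best:7s} margin = +{margin:.1f} points")
```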

To show the real-time performance of the WMHBP, we measured its throughput in frames per second (FPS). The results are shown in Table 8. WMHBP was compared with high-dimensional feature coding models, namely BCNN, HBP, and HBPASM. The WMHBP reaches 106.8 FPS, which is 1.0 FPS higher than BCNN, 1.2 FPS higher than HBPASM, and 1.9 FPS lower than HBP. The WMHBP can therefore meet real-time requirements when applied in an ITS. The parameter size of the WMHBP is 159.56 MB, which is 4.72 MB larger than that of the HBP and HBPASM methods and 131.1 MB smaller than that of the BCNN method. This modest parameter size is beneficial for deployment on embedded devices for traffic management. The WMHBP thus improves recognition accuracy while maintaining real-time recognition, which strengthens the ability of an ITS to process information in real time and supports other intelligent technologies.

Table 8.

Comparison of Real-Time Results on the Stanford Cars Dataset

Method	Accuracy (%)	FPS	Parameter size (MB)
BCNN ( 31 )	91.3	105.8	290.66
HBP-ResNet ( 28 )	92.2	108.7	154.84
HBPASM ( 34 )	92.8	105.6	154.84
Proposed (ResNet34)	95.1	106.8	159.56

Note: FPS = the number of frames that can be processed per second; BCNN = bilinear convolutional neural network; HBP-ResNet = hierarchical bilinear pooling using ResNet; HBPASM = hierarchical bilinear pooling with aggregated slack mask.
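The throughput figures in Table 8 can also be read as a per-frame processing time. The sketch below converts FPS to milliseconds per frame and lists the differences in FPS and parameter size relative to the proposed method. It assumes frames are processed one at a time, which the experimental description does not state explicitly; the helper `latency_ms` is a hypothetical name.

```python
# (accuracy %, FPS, parameter size MB) copied from Table 8.
TABLE_8 = {
    "BCNN": (91.3, 105.8, 290.66),
    "HBP-ResNet": (92.2, 108.7, 154.84),
    "HBPASM": (92.8, 105.6, 154.84),
    "Proposed (ResNet34)": (95.1, 106.8, 159.56),
}


def latency_ms(fps):
    """Average time per frame in milliseconds (assumes one frame at a time)."""
    return 1000.0 / fps


_, ref_fps, ref_size = TABLE_8["Proposed (ResNet34)"]
for method, (acc, fps, size) in TABLE_8.items():
    print(
        f"{method:20s} {latency_ms(fps):5.2f} ms/frame  "
        f"FPS diff {ref_fps - fps:+.1f}  size diff {ref_size - size:+.2f} MB"
    )
```

Under this assumption all four methods take between 9 and 10 ms per frame, so the accuracy gain of the proposed method comes at essentially no cost in speed.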

Visualization

We randomly selected images from the dataset and input them into the WMHBP model to obtain three interactive features, which were mapped to the original image according to the spatial order of the features to form a heat map. The effect of the model training can be observed through energy distribution in Figure 7.

Figure 7.

Visualization results for sample images from the Stanford Cars dataset. In each subfigure (a and b), the first column shows the original image, the second column shows the heat map of the weighted mask, and the last three columns show the sampled features of the last three layers. Because of the large number of images visualized, they are presented in two subfigures.
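As a rough illustration of how a feature tensor can be mapped back onto the image grid to form such a heat map, the sketch below sums activations over channels, normalizes the result, and upsamples it while preserving spatial order. The channel-sum aggregation, the min-max normalization, and the nearest-neighbour upsampling are all assumptions made for this example; the exact procedure used to produce Figure 7 is not specified here, and `heat_map` is a hypothetical helper.

```python
def heat_map(features, out_h, out_w):
    """Collapse a C x H x W feature tensor (nested lists) into an
    out_h x out_w map with values scaled to [0, 1]."""
    channels = len(features)
    h, w = len(features[0]), len(features[0][0])

    # 1. Sum activations over channels to get one energy value per location.
    energy = [
        [sum(features[c][i][j] for c in range(channels)) for j in range(w)]
        for i in range(h)
    ]

    # 2. Min-max normalise so that the strongest response maps to 1.
    flat = [v for row in energy for v in row]
    lo, hi = min(flat), max(flat)
    scale = (hi - lo) or 1.0  # avoid division by zero on a constant map
    energy = [[(v - lo) / scale for v in row] for row in energy]

    # 3. Nearest-neighbour upsampling keeps the spatial order of the cells.
    return [
        [energy[i * h // out_h][j * w // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]


# Toy example: 2 channels on a 2 x 2 grid, upsampled to 4 x 4.
toy = [
    [[0.0, 1.0], [2.0, 3.0]],
    [[0.0, 1.0], [0.0, 5.0]],
]
for row in heat_map(toy, 4, 4):
    print(" ".join(f"{v:.2f}" for v in row))
```

In the toy example the bottom right cell has the largest summed activation, so the bottom right quadrant of the upsampled map carries the highest energy, in the same way that headlamps or logos stand out in the real heat maps.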

The value range and the location of the response area in the heat map are the criteria for evaluating the feature representation ability. Figure 7 shows that the energy area in the third row of pictures is located at the headlamps of the vehicle, and the energy area in the fourth row is located at the logo and the rearview mirror. The energy location differs from image to image, and multiple features responding to different locations are finally combined into one feature, which often carries robust semantic information. In the last row of Figure 7a, the weighted mask failed to handle the cluttered background, so the discriminative area of the picture was judged incorrectly, resulting in a classification error. In the last row of Figure 7b, the weighted mask handles the cluttered background better, but the characteristics of the discriminative regions are not distinct enough, which also leads to a classification error.

Comparing the heat map with the weighted mask, we can find that the position of the feature response area and the mask area are related. The mask area usually does not contain the feature response area. The mask component and the feature extraction component can reinforce each other; the heat map gives a visual explanation. Therefore, the WMHBP framework can distinguish subtle local differences between very fine categories.

Conclusions

This paper proposes a VMMR algorithm based on a hierarchical scheme. We augmented the standard dataset by collecting vehicle pictures from the Internet and classifying them into the corresponding categories, which helped solve the problems of insufficient picture quantity and low picture resolution in the dataset. The HBP model combined with the weighted mask reduces the cluttered background input into the model and achieves better feature extraction. Finally, according to vehicle-specific characteristics, we split the original label into two to form a hierarchical label structure, and we modified the loss function to match these labels so that the model can better distinguish between vehicle brands. The experimental results show that the proposed method achieved the highest accuracy among the compared methods while maintaining a high operating speed. Therefore, the WMHBP method can be applied in ITS. The next step will be to optimize the network structure and unify multiple discriminative regions effectively to ensure high recognition accuracy.

Footnotes

Author Contributions

The authors confirm contribution to the paper as follows: study conception and design: Chaoqing Wang, Junlong Cheng; data collection: Chaoqing Wang; analysis and interpretation of results: Chaoqing Wang, Yurong Qian; draft manuscript preparation: Chaoqing Wang, Yuefei Wang. All authors reviewed the results and approved the final version of the manuscript.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The work is partially supported by the National Natural Science Foundation of China (61966035), the National Science Foundation of China under grant (U1803261), the Xinjiang Uygur Autonomous Region Innovation Team (XJEDU2017T002), and the Autonomous Region Graduate Innovation Project (XJ2019G072).

ORCID iD

Junlong Cheng

References

1. Parkhi O. M., Vedaldi, Zisserman, Jawahar C. V. Cats and Dogs. Proc., IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Providence, 2012, pp. 3498–3505.

2. Krause, Jin, Yang, Fei-Fei. Fine-Grained Recognition without Part Annotations. Proc., IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Providence, 2015, pp. 5546–5555.

3. Huang, Zhang, Tao. Webly-Supervised Fine-Grained Visual Categorization via Deep Domain Adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 40, No. 5, 2018, pp. 1100–1113.

4. Sochor, Špaňhel, Herout. BoxCars: Improving Fine-Grained Recognition of Vehicles Using 3-D Bounding Boxes in Traffic Surveillance. IEEE Transactions on Intelligent Transportation Systems, Vol. 20, No. 1, 2019, pp. 97–108.

5. Lee W. H., Tseng S. S., Wang C. H. Design and Implementation of Electronic Toll Collection System Based on Vehicle Positioning System Techniques. Computer Communications, Vol. 31, No. 12, 2008, pp. 2925–2933.

6. Chen, Liu, Wan, Qiao, Pei. An Edge Traffic Flow Detection Scheme Based on Deep Learning in an Intelligent Transportation System. IEEE Transactions on Intelligent Transportation Systems, Vol. 22, No. 3, 2021, pp. 1840–1852.

7. Xie, Ahmad, Jin, Liu, Zhang. A New CNN-Based Method for Multi-Directional Car License Plate Detection. IEEE Transactions on Intelligent Transportation Systems, Vol. 19, No. 2, 2018, pp. 507–517.

8. Laroca, Severo, Zanlorensi L. A., Oliveira L. S., Goncalves G. R., Schwartz W. R., Menotti. A Robust Real-Time Automatic License Plate Recognition Based on the YOLO Detector. Proc., International Joint Conference on Neural Network, Rio de Janeiro, Brazil, 2018, pp. 1–10.

9. Gou, Wang, Yao. Vehicle License Plate Recognition Based on Extremal Regions and Restricted Boltzmann Machines. IEEE Transactions on Intelligent Transportation Systems, Vol. 17, No. 4, 2016, pp. 1096–1107.

10. Krause, Stark, Deng, Fei-Fei. 3D Object Representations for Fine-Grained Categorization. Proc., International Conference on Computer Vision, Sydney, Australia, 2013, pp. 554–561.

11. Wen, Zhang, Zhao, Liu, Huang. Attention Convolutional Binary Neural Tree for Fine-Grained Visual Categorization. Proc., IEEE Conference on Computer Vision and Pattern Recognition, Seattle, USA, 2020, pp. 10468–10477.

12. Wang, Gao. Deep Attention-Based Spatially Recursive Networks for Fine-Grained Visual Recognition. IEEE Transactions on Systems Man and Cybernetics, Vol. 49, No. 5, 2019, pp. 1791–1802.

13. Cui, Zhou, Wang, Liu, Lin, Belongie. Kernel Pooling for Convolutional Neural Networks. Proc., IEEE Conference on Computer Vision and Pattern Recognition, Hawaii, USA, 2017, pp. 3049–3058.

14. Sun, Yuan, Zhou, Ding. Multi-Attention Multi-Class Constraint for Fine-Grained Image Recognition. Proc., European Conference on Computer Vision, Munich, Germany, 2018, pp. 834–850.

15. Zhang, Donahue, Girshick R. B., Darrell. Part-Based R-CNNs for Fine-Grained Category Detection. Proc., European Conference on Computer Vision, Zurich, Switzerland, 2014, pp. 834–849.

16. Huang, Tao, Zhang. Part-Stacked CNN for Fine-Grained Visual Categorization. Proc., IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016, pp. 1173–1182.

17. Sochor, Herout, Havel. BoxCars: 3D Boxes as CNN Input for Improved Fine-Grained Vehicle Recognition. Proc., IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016, pp. 3006–3015.

18. Wang, Shen. Deep CNNs with Spatially Weighted Pooling for Fine-Grained Car Recognition. IEEE Transactions on Intelligent Transportation Systems, Vol. 18, No. 11, 2017, pp. 3147–3156.

19. Chang, Cao. Dual Cross-Entropy Loss for Small-Sample Fine-Grained Vehicle Classification. IEEE Transactions on Vehicular Technology, Vol. 68, No. 5, 2019, pp. 4204–4212.

20. Yang, Liu, Zhou, Wen. Dynamic Computational Time for Visual Attention. Proc., International Conference on Computer Vision, Venice, Italy, 2017, pp. 1199–1209.

21. Liu, Xia, Wang, Yang, Zhou, Lin. Fully Convolutional Attention Networks for Fine-Grained Recognition, arXiv preprint arXiv:1603.06765, 2016.

22. Wang. Global Structure Graph Guided Fine-Grained Vehicle Recognition. Proc., International Conference on Acoustics, Speech, and Signal Processing, Barcelona, Spain, 2020, pp. 1913–1917.

23. Xiang, Huang. Global Topology Constraint Network for Fine-Grained Vehicle Recognition. IEEE Transactions on Intelligent Transportation Systems, Vol. 21, No. 7, 2020, pp. 2918–2929.

24. Zheng, Mei, Luo. Learning Multi-Attention Convolutional Neural Network for Fine-Grained Image Recognition. Proc., IEEE International Conference on Computer Vision, Venice, Italy, 2017, pp. 5219–5227.

25. Zheng, Mei. Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-Grained Image Recognition. Proc., IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Hawaii, 2017, pp. 4476–4484.

26. Zheng, Zha Z.-J., Luo. Looking for the Devil in the Details: Learning Trilinear Attention Sampling Network for Fine-Grained Image Recognition. Proc., IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, USA, 2019, pp. 5012–5021.

27. Chen, Ying, Lin, Liu, Li W. Multi-View Vehicle Type Recognition with Feedback-Enhancement Multi-Branch CNNs. IEEE Transactions on Circuits and Systems for Video Technology, Vol. 29, No. 9, 2019, pp. 2590–2599.

28. Zhao, Zheng, Zhang, You. Hierarchical Bilinear Pooling for Fine-Grained Visual Recognition. Proc., European Conference on Computer Vision, Munich, Germany, 2018, pp. 595–610.

29. Fang, Zhou. Fine-Grained Vehicle Model Recognition Using a Coarse-to-Fine Convolutional Neural Network Architecture. IEEE Transactions on Intelligent Transportation Systems, Vol. 18, No. 7, 2017, pp. 1782–1792.

30. Yang, Luo, Loy C. C., Tang. A Large-Scale Car Dataset for Fine-Grained Categorization and Verification. Proc., IEEE Conference on Computer Vision and Pattern Recognition, Boston, USA, 2015, pp. 3973–3981.

31. Lin T.-Y., RoyChowdhury, Maji. Bilinear CNN Models for Fine-Grained Visual Recognition. Proc., IEEE International Conference on Computer Vision, Santiago, Chile, 2015, pp. 1449–1457.

32. Moghimi, Belongie S. J., Saberian M. J., Yang, Vasconcelos, L.-J. Boosted Convolutional Neural Networks. Proc., British Machine Vision Conference, York, UK, 2016.

33. Wang, Liu, Hou. Factorized Bilinear Models for Image Recognition. Proc., International Conference on Computer Vision, Venice, Italy, 2017, pp. 2098–2106.

34. Tan, Wang, Zhou, Peng, Zheng. Fine-Grained Classification via Hierarchical Bilinear Pooling with Aggregated Slack Mask. IEEE Access, Vol. 7, 2019, pp. 117944–117953.

35. Cai, Zuo, Zhang. Higher-Order Integration of Hierarchical Convolutional Activations for Fine-Grained Visual Categorization. Proc., International Conference on Computer Vision, Venice, Italy, 2017, pp. 511–520.

36. Lin T.-Y., Maji. Improved Bilinear Pooling with CNNs. Proc., British Machine Vision Conference, London, UK, 2017.

37. Huang. A Hierarchical Scheme for Vehicle Make and Model Recognition from Frontal Images of Vehicles. IEEE Transactions on Intelligent Transportation Systems, Vol. 20, No. 5, 2019, pp. 1774–1786.

38. Tafazzoli, Frigui, Nishiyama. A Large and Diverse Dataset for Improved Vehicle Make and Model Recognition. Proc., IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, Hawaii, 2017, pp. 874–881.

39. Wei X.-S., Luo J.-H., Zhou Z.-H. Selective Convolutional Descriptor Aggregation for Fine-Grained Image Retrieval. IEEE Transactions on Image Processing, Vol. 26, No. 6, 2017, pp. 2868–2881.

40. Deng, Dong, Socher, L.-J., Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. Proc., IEEE Conference on Computer Vision and Pattern Recognition, Miami, USA, 2009, pp. 248–255.

41. Simonyan, Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. Proc., International Conference on Learning Representations, San Diego, USA, 2015.

42. Szegedy, Vanhoucke, Ioffe, Shlens, Wojna. Rethinking the Inception Architecture for Computer Vision. Proc., IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016, pp. 2818–2826.

43. Zhang, Ren, Sun. Deep Residual Learning for Image Recognition. Proc., IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, USA, 2016, pp. 770–778.

44. Paszke, Gross, Massa, Lerer, Bradbury, Chanan, Killeen, Lin, Gimelshein, Antiga, Desmaison. PyTorch: An Imperative Style, High-Performance Deep Learning Library. Advances in Neural Information Processing Systems, Vol. 32, 2019, pp. 8026–8037.