MGFPN: Enhancing multi-scale feature for object detection

Abstract

Feature pyramids are commonly applied to address the scale variation problem in object detection. One of the most representative feature pyramid designs is the Feature Pyramid Network (FPN), which is simple and efficient. However, the full power of multi-scale features may not be completely exploited in FPN because of defects in its design. In this paper, we first analyze the structural problems of FPN that prevent multi-scale features from being fully exploited, and then propose a new feature pyramid structure, named Mixed Group FPN (MGFPN), to mitigate these defects. Concretely, MGFPN strengthens feature utilization with two modules: Mixed Group Convolution (MGConv) and Contextual Attention (CA). MGConv reduces the spatial information loss of FPN in the feature generation stage, and CA narrows the semantic gaps between features with different receptive fields before lateral summation. Replacing FPN with MGFPN in FCOS improves detector performance by 0.7 to 1.2 Average Precision (AP) across several major backbones on the MS-COCO benchmark without adding many parameters, and the approach is easy to extend to other FPN-based models. The proposed MGFPN can therefore serve as a simple and strong alternative for many other FPN-based models.

Keywords

Object Detection, Feature Pyramids, FPN, Mixed Group Convolution, Contextual Attention

1 Introduction

One of the fundamental research topics in deep learning is object detection. With the extensive development of deep convolutional networks, remarkable progress has been achieved in object detection. Existing object detectors can be broadly categorized into two branches: anchor-based detectors and anchor-free detectors. Anchor-based detectors can be roughly divided into one-stage methods, including YOLOv3 [21], SSD [18] and RetinaNet [17], which directly predict objects from anchors, and two-stage methods such as Faster R-CNN [22] and Mask R-CNN [8], which use candidate proposals and refine predictions from the extracted region features. Anchor-free detectors also make predictions in two different ways. One way is to locate bounding boxes by predicting several pre-defined key points, as in CornerNet [15] and CenterNet [5]. The other is to use the center point of a bounding box and predict the four distances from that point to the box sides, as in FCOS [28] and YOLO [19].

However, despite the rapid development of detectors and feature extraction schemes, the scale variation problem across object instances remains unsolved. The most intuitive solution is to exploit a multi-scale image pyramid, but its large time and memory cost makes it impractical. Another solution is to use a feature pyramid as a substitute for the image pyramid at a lower computational cost. Among feature pyramid methods, FPN is the most representative one; it improves feature representation by combining features from shallow layers, which are more detailed, with features from deep layers, which are more semantic.

As is shown in Figure 1, the architecture design of FPN has the following intrinsic defects:

Fig. 1

The design defects in the feature pyramid network: 1) feature maps of different dimensions are reduced to the same low dimension; 2) a semantic gap exists between feature maps with different receptive fields before feature summation.

Semantic information loss after feature extraction. CNNs compute a hierarchy of representations that transition gradually from spatial to channel coding. Later layers carry more semantic information because they extract features from previous layers while reducing resolution, increasing the receptive field size of the units, and increasing the number of feature channels [10]. With a ResNet50 backbone, for example, the features that FPN extracts for prediction have 2048, 1024 and 512 channels. These features are reduced to 256 channels by a 1×1 convolution while their resolution stays unchanged, which implies a loss of semantic information.

Semantic gaps between features with different receptive fields. During feature fusion, shallow-layer features are combined with deep-layer features by simple summation, without considering the large semantic gaps between them [7]. Fusing these features directly is suboptimal, and the inconsistency in the summation degrades the power of the multi-scale feature representation.

In this paper, we rethink the FPN architecture and propose an alternative module to mitigate the two problems mentioned above. First, for the channel reduction stage of FPN, we propose a module named Mixed Group Convolution, which uses different kernel sizes to capture more spatial information and then reassembles the feature maps with a novel attention structure. Second, a Contextual Attention module is used to aggregate more features and bridge the gap between shallow-layer features and deep-layer features.

Without bells and whistles, FCOS based on Mixed Group FPN outperforms the original FCOS by 1.0 Average Precision (AP) with ResNet50 as the backbone. MGFPN can also be extended, with few modifications, to other anchor-based or anchor-free methods that contain an FPN module, which demonstrates its generality.

In summary, our main contributions are as follows:

Revealing the problems of the FPN architecture, namely channel reduction and the semantic gap caused by summation, which prevent multi-scale features from being fully exploited.

Proposing MGFPN, which inherits the merits of FPN and alleviates the problems mentioned above.

Evaluating MGFPN with various backbones and detectors on MS-COCO, where its competitive results demonstrate the generality of MGFPN.

The remainder of this paper is organized as follows. Section 2 introduces related work on the development of detectors, network architecture and attention mechanisms. Section 3 presents the construction of our method, including MGConv and CA. Section 4 presents experiments and analysis of our method. Finally, Section 5 concludes this research and discusses future work.

2 Related work

2.1 Deep object detectors

Object detectors fall into two categories: anchor-based detectors and anchor-free detectors. Anchor-based detectors mostly follow one of two paradigms, two-stage or one-stage. The advent of Faster R-CNN [22] established the dominant position of two-stage methods. Faster R-CNN improves on R-CNN by integrating a module named Region Proposal Network (RPN) [22] to detect objects. Subsequently, a number of methods were proposed to improve the performance of Faster R-CNN, including architecture adjustment [25, 31], context and attention mechanisms [11, 30], and feature fusion and enhancement [27]. In contrast to two-stage methods, one-stage detectors are computationally cheaper yet less accurate. SSD [18] exploits multi-scale features to make dense predictions based on a large number of anchors. Although dense prediction improves the recall of detectors, it makes them suffer from an imbalance between easy and hard samples. RetinaNet [17] then introduced a novel focal loss to address this imbalance problem.

Anchor-free detectors can also be divided into two groups. The first consists of keypoint-based methods, which predict pre-defined key points and then generate bounding boxes. CornerNet [15] predicts a pair of diagonal corner points to detect an object bounding box. CenterNet [5] additionally predicts the center point of bounding boxes on top of CornerNet to improve its performance. The second consists of center-based methods, which predict center points or center areas of bounding boxes to differentiate positive and negative samples, and then predict the distances from the positives to the four sides of the bounding boxes. YOLO [19] divides the image into grids, and the grid cell that contains the center of an object is defined as positive and used to predict the object. FCOS defines a point as positive if it lies inside a pre-defined bounding box and uses the point to predict four distances to perform detection.
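The four-distance formulation used by center-based methods can be illustrated with a minimal sketch. It assumes axis-aligned boxes given as (x0, y0, x1, y1) and treats a point as positive only when it lies strictly inside the box; the function name `ltrb_targets` is hypothetical and the sketch ignores any further sample-selection rules a real detector applies.

```python
def ltrb_targets(point, box):
    """Distances from a point to the left, top, right and bottom sides of a box.

    point: (x, y); box: (x0, y0, x1, y1) with x0 < x1 and y0 < y1.
    Returns None when the point is not strictly inside the box
    (treated as a negative sample in this sketch).
    """
    x, y = point
    x0, y0, x1, y1 = box
    left, top, right, bottom = x - x0, y - y0, x1 - x, y1 - y
    if min(left, top, right, bottom) <= 0:
        return None
    return (left, top, right, bottom)


inside = ltrb_targets((50, 40), (10, 20, 110, 100))   # point inside the box
outside = ltrb_targets((5, 40), (10, 20, 110, 100))   # point left of the box
```

Given the four distances and the point location, the box can be recovered exactly, which is why a single positive point is enough to regress a bounding box.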

2.2 Network architecture engineering

Network architecture engineering is one of the most active areas in deep learning for vision. The goal of network design is to balance computation and performance and to find the best solution under that balance. An intuitive way to extend a network is to deepen it, as in VGG [24]. FractalNet [14] introduces a network with binary paths, while GoogLeNet [25] introduces a multi-branch design in which each branch has a different receptive field and features are fused by concatenation. ResNet [9] proposes a shortcut mechanism to mitigate unstable training and fuses features by summation. ResNeXt [31] exploits the concept of cardinality, splitting and then merging features, which leads to better classification accuracy. DenseNet [12] iteratively concatenates the input features with the output features, so that the features of all layers have direct contact with the output layer.

2.3 Attention mechanism

The attention mechanism is a simple yet efficient way to enhance features. SE-Net [11] proposes a Squeeze-and-Excitation module to learn channel attention and achieves promising performance. Subsequently, SK-Net [16] uses multiple paths and large kernels to refine the channel attention. CBAM [30] exploits both average and max pooling to aggregate features. GE [10] explores spatial attention by using a depth-wise convolution. The core idea of the attention mechanism is to design a lightweight but functional module that benefits feature aggregation.

2.4 Group convolution

Group convolution, which splits the feature maps by channel into groups and convolves each group separately, was first used in AlexNet [13] to save GPU memory. Interleaved Group Convolutions [33] build on group convolution by dividing the input channels into several partitions and performing a regular convolution over each partition separately. Depth-wise convolution is the extreme case of group convolution in which every channel forms its own group. The number of groups affects the performance of group convolution and, in general, more groups lead to weaker model representation. Nevertheless, the reduced redundancy and parameter savings of group convolution show that it deserves further research.
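The channel partitioning behind group convolution can be made concrete with a small sketch. It assumes equal-sized, contiguous groups (the common convention); the helper name `group_channel_ranges` is hypothetical.

```python
def group_channel_ranges(channels, groups):
    """Split channel indices 0..channels-1 into `groups` equal, contiguous ranges."""
    if groups <= 0 or channels % groups != 0:
        raise ValueError("channels must be divisible by a positive number of groups")
    size = channels // groups
    return [range(g * size, (g + 1) * size) for g in range(groups)]


# Eight channels in two groups: each group is convolved on its own.
two_groups = [list(r) for r in group_channel_ranges(8, 2)]

# Depth-wise case: as many groups as channels, one channel per group.
depthwise = [list(r) for r in group_channel_ranges(8, 8)]
```

Because each output channel only sees the input channels of its own group, information is not mixed across groups unless a later layer does so.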

3 Proposed methods

The overall framework of MGFPN is shown in Figure 2. Following the setting of FCOS, features used to build the feature pyramid are denoted as C3, C4, C5. P3, P4, P5 are the features produced by feature pyramid. P6, P7 are the features extracted from P5, P6 respectively. The two components of MGFPN will be discussed in the following subsections.

Fig. 2

Overall pipeline of MGFPN. C3, C4 and C5 denote the features of the backbone network, and P3 to P7 are the feature levels used for the final prediction. H×W is the height and width of the features, and s (s = 8, 16, ..., 128) is the down-sampling ratio of the features at each level relative to the input image.
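To give a feel for the down-sampling ratios, the sketch below computes the spatial size of each pyramid level for a given input. It assumes that the five ratios map in order to P3 through P7 and that sizes are rounded up; the exact values in a real implementation depend on how the input is padded. The names `STRIDES` and `pyramid_sizes` are hypothetical.

```python
import math

# Down-sampling ratio s of each pyramid level relative to the input image.
STRIDES = {"P3": 8, "P4": 16, "P5": 32, "P6": 64, "P7": 128}


def pyramid_sizes(height, width, strides=STRIDES):
    """Return the (H, W) of every level, rounding up (an assumption)."""
    return {
        name: (math.ceil(height / s), math.ceil(width / s))
        for name, s in strides.items()
    }


sizes = pyramid_sizes(800, 1333)
```

For an 800×1333 input this gives 100×167 locations at P3 and only 7×11 at P7, so most prediction locations lie on the high-resolution levels.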

3.1 Mixed group convolution

The main idea of MGConv is to mix features with different receptive fields so that information from previous layers passes to the next layers in a comprehensive way. In the standard 1×1 convolution used by FPN, neurons are connected in a fully connected manner across channels while the channels are reduced, and the neurons connecting previous and later layers are spatially independent. For convolutions with larger kernels, the relation is cascaded (as shown in Figure 3); that is, the 1×1 convolution causes a loss of spatial information that larger kernels would retain. The proposed MGConv is designed to avoid this spatial information loss.

Fig. 3

a) 1×1 convolution. b) 3×3 convolution. The figure shows that a large kernel introduces more spatial information from previous layers into later layers. Our module contains multiple large kernels to obtain more spatial information.

MGConv is a computational unit built upon a transformation that maps an input $X \in \mathbb{R}^{H' \times W' \times C'}$ to feature maps $U \in \mathbb{R}^{H \times W \times C}$. In general, H' = H, W' = W and C = 256. As shown in Figure 4, MGConv contains two stages: Split and Fusion.

In the first stage, the input feature maps are convolved with kernels of four different sizes, so that the input is split into four parts with different receptive fields. Group convolution is used to reduce the heavy computation caused by large kernels, and each kernel size is given its own group number. The group number of each kernel is therefore a hyper-parameter that needs to be searched carefully; experiments on the group number are reported in the next section. This stage can be formulated as:

$\begin{matrix} F_{i}(x) = \mathrm{Conv}_{i} \cdot x, & i = 3, 5, 7, 9 \end{matrix}$ (1)

where $F_{i}(x) \in \mathbb{R}^{H \times W \times \frac{C}{4}}$, i denotes the size of the convolution kernel and $F_{i}(x)$ is the feature extracted by the kernel of size i.

In the second stage, a combined representation is obtained by concatenating these four feature maps and passing them through a light channel attention module modified from SENet [11]. The SE module performs feature recalibration. Features first pass through a squeeze operation, which produces a channel descriptor by aggregating feature maps across their spatial dimensions (H, W), allowing information from the global receptive field of the network to be used by all its layers. The aggregation is followed by an excitation operation, a simple self-gating mechanism that takes the embedding as input and produces a collection of per-channel modulation weights. The output of the SE block can be fed directly into subsequent layers of the network. The main goal of the attention module is to aggregate the feature maps that were split in the first stage:

$\begin{matrix} F(x) = \mathrm{Concat}(F_{i}(x)), & i = 3, 5, 7, 9 \\ F = \mathrm{Atten}(F(x)) \end{matrix}$ (2)

where $F(x) \in \mathbb{R}^{H \times W \times C}$ and $F \in \mathbb{R}^{H \times W \times C}$; F(x) is the concatenation of the $F_{i}(x)$ and F is the feature aggregated from F(x) by the attention module. By choosing the group numbers well, MGConv can achieve better performance without adding much computation.
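The Fusion stage can be sketched in plain Python to show the data flow of concatenation followed by squeeze-and-excitation. This is a minimal illustration of the standard SE recalibration, not the modified attention module used in MGConv, whose details are not given here. Feature maps are nested lists indexed as [channel][row][column], and the weight matrices `w1` and `w2` are hypothetical toy values.

```python
import math


def concat_channels(branches):
    """Stack the branch outputs along the channel axis."""
    return [channel for branch in branches for channel in branch]


def squeeze(feature):
    """Global average pooling: one descriptor value per channel."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature]


def excite(descriptor, w1, w2):
    """FC -> ReLU -> FC -> sigmoid: one gate in (0, 1) per channel."""
    hidden = [max(0.0, sum(w * d for w, d in zip(row, descriptor))) for row in w1]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def se_recalibrate(feature, w1, w2):
    """Scale every channel by its gate."""
    gates = excite(squeeze(feature), w1, w2)
    return [[[g * v for v in row] for row in ch] for ch, g in zip(feature, gates)]


# Two toy branches with one 2x2 channel each (stand-ins for the F_i(x)).
branch_a = [[[1.0, 1.0], [1.0, 1.0]]]
branch_b = [[[3.0, 3.0], [3.0, 3.0]]]
fused = concat_channels([branch_a, branch_b])

# Hypothetical weights: zero output weights give a gate of 0.5 for every channel.
w1 = [[1.0, 0.0]]
w2 = [[0.0], [0.0]]
recalibrated = se_recalibrate(fused, w1, w2)
```

With learned weights the gates differ per channel, which lets the module emphasize the branches whose receptive field is most useful for a given input.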

Fig. 4

The module structure of MGConv. The ellipsis indicates that the group numbers of the large-kernel group convolutions are not fixed. MGFPN enhances feature extraction by using multiple sources of spatial information.

Relation with other multi-path methods. The Inception module [25] was the first to exploit different kernel sizes along multiple paths to extract features. ResNeXt splits the previous features into small groups to aggregate features more carefully, and the split procedure helps regularize the network. MixNet [26] splits feature maps directly by channel to reduce the computation cost. The differences between these methods and ours are as follows:

1) The input and output handled by these methods have the same dimension, whereas our method focuses on dimension reduction. To the best of our knowledge, multi-path designs have seldom been applied in FPN.

2) These methods pay more attention to how features are split, whereas our method is concerned with fusing the split features.

3.2 Contextual attention

CA complements the summation between shallow-layer features and deep-layer features in MGFPN. CA is lightweight: it is composed simply of a group convolution and a standard 1×1 convolution (illustrated in Figure 5). A previous study [29] has shown that simple low-level feature sets can effectively encode context information for visual tasks and may be an effective alternative to iterative methods based on high-level semantic features.

Fig. 5

a) Module structure; b) Working flow of CA. CA extracts features from the original feature maps which bridge the gap between low detailed features and high semantic features before their summation.

By exploiting group convolution, the context features are obtained at a low computation cost, and the standard 1×1 convolution helps mitigate the sparse connectivity of group convolution. As shown in the ablation study in the next section, the number of groups also affects the performance of CA.

Relation with other contextual methods. Contextual information has proved important in semantic segmentation. DeepLab-v2 [2] uses dilated convolution to obtain context information, and PSPNet [34] uses pyramid pooling to extract hierarchical global context information. GE [10] aims to help convolutional networks exploit the contextual information contained in the field of feature responses computed by the network itself. Inspired by GE, we design a similar contextual attention structure that obtains the contextual information produced by the network itself. Unlike GE, in which the input and output dimensions are the same, the output dimension of our proposed CA module is several times smaller than the input dimension.

3.3 Loss function

The loss function is defined as:

$\begin{matrix} L(P_{x,y}, t_{x,y}) = \frac{1}{N_{pos}} \sum_{x,y} L_{cls}(P_{x,y}, c_{x,y}^{*}) + \frac{\lambda}{N_{pos}} \sum_{x,y} I_{c_{x,y}^{*} > 0} L_{reg}(t_{x,y}, t_{x,y}^{*}) \end{matrix}$ (3)

which follows the definition in [28]. $L_{cls}$ is the classification loss and $L_{reg}$ is the regression loss. $N_{pos}$ denotes the number of positive samples, and $\lambda$, set to 1 in this paper, is the balance weight for $L_{reg}$. The summation is calculated over all locations on the feature maps. $I$ is the indicator function, which is 1 if $c_{x,y}^{*} > 0$ and 0 otherwise.
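The structure of Eq. (3) can be sketched as follows, assuming the per-location classification and regression losses have already been computed and flattened into lists. The function name `detection_loss` is hypothetical, and clamping $N_{pos}$ to at least 1 is an added safeguard against division by zero, not something stated in the text.

```python
def detection_loss(cls_losses, reg_losses, labels, lam=1.0):
    """Combine per-location losses as in Eq. (3).

    cls_losses, reg_losses: one value per location on the feature maps.
    labels: ground-truth class per location, 0 meaning background.
    """
    # Number of positive locations; clamped to 1 (an assumption of this sketch).
    n_pos = max(1, sum(1 for c in labels if c > 0))
    # The classification term sums over all locations.
    cls_term = sum(cls_losses) / n_pos
    # The indicator keeps the regression loss of positive locations only.
    reg_term = lam * sum(r for r, c in zip(reg_losses, labels) if c > 0) / n_pos
    return cls_term + reg_term


total = detection_loss([0.2, 0.4, 0.6], [1.0, 5.0, 3.0], [1, 0, 2])
```

In this toy case the regression loss of the background location (5.0) is ignored, so the result is 1.2 / 2 + (1.0 + 3.0) / 2.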

4 Experiments

4.1 Dataset and evaluation metrics

All experiments are performed on the MS COCO detection dataset with 80 categories. It contains 115k images for training, 5k images for validation and 20k images for testing. The models are trained on the training set, and the ablation results are reported on the minival set, the same split used by the other baselines. The final results are also reported on the minival set. All reported results follow the standard COCO-style Average Precision (AP) metrics.

4.2 Implementation details

For the ablation study, ResNet-50 based FCOS [28] is used as our baseline network, and the hyper-parameter settings are the same as in FCOS. The other experiments, which compare against other baselines, follow all settings of those baselines. Specifically, our network is trained with stochastic gradient descent (SGD) for 90K iterations with an initial learning rate of 0.01 and a mini-batch of 8 images. The learning rate is reduced by a factor of 10 at iterations 60K and 80K. Weight decay and momentum are set to 0.0001 and 0.9, respectively. The backbones of our model are initialized with weights pre-trained on ImageNet [4], and the newly added layers are initialized as in [17]. The input images are resized so that their shorter side is 800 and their longer side is less than or equal to 1333. By default, the models are trained with 2 GPUs (4 images per GPU).
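The step schedule described above can be written as a small function. This is a sketch of the stated values only: it assumes the reduced rate takes effect at the listed iteration itself and does not model any warm-up phase, which the text does not mention. The constant and function names are hypothetical.

```python
BASE_LR = 0.01                   # initial learning rate
DECAY_STEPS = (60_000, 80_000)   # iterations at which the rate is divided by 10
MAX_ITERS = 90_000               # total number of training iterations


def learning_rate(iteration):
    """Learning rate used at a given training iteration."""
    if not 0 <= iteration < MAX_ITERS:
        raise ValueError("iteration outside the training schedule")
    drops = sum(1 for step in DECAY_STEPS if iteration >= step)
    return BASE_LR / (10 ** drops)


schedule = [learning_rate(i) for i in (0, 59_999, 60_000, 80_000, 89_999)]
```

The rate is therefore 0.01 for the first 60K iterations, 0.001 for the next 20K and 0.0001 for the final 10K.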

4.3 Main results

The evaluation of MGFPN on COCO minival set compared with other state-of-the-art one-stage and two-stage methods is conducted in this section. All results are shown in Table 1.

Table 1
Comparison with other baselines

Method	Backbone	AP	AP ₅₀	AP ₇₅	AP _S	AP _M	AP _L
one stage method:
YOLOv2 [20]	DarkNet-19 [20]	21.6	44.0	19.2	5.0	22.4	35.5
SSD513	ResNet-101-SSD	31.2	50.4	33.3	10.2	34.5	49.8
DSSD513 [6]	ResNet-101-DSSD	33.2	53.3	35.2	13.0	35.4	51.1
RetinaNet	ResNet-101-FPN	39.1	59.1	42.3	21.8	42.7	50.2
CornerNet	Hourglass-104	40.5	56.5	43.1	19.4	42.7	53.9
FSAF [35]	ResNeXt-64x4d-101-FPN	42.9	63.8	46.3	46.3	26.6	52.7
FCOS	ResNet-50-FPN	36.5	54.5	39.2	19.8	40.0	48.9
FCOS	ResNet-101-FPN	41.5	60.7	45.0	24.4	44.8	51.6
FCOS	MobileNet-V2-FPN [3]	32.5	53.4	34.4	18.6	34.6	45.8
FCOS	ResNeXt-32x8d-101-FPN	42.7	62.2	46.1	26.0	45.6	52.6
FCOS	ResNeXt-64x4d-101-FPN	43.2	62.8	46.6	26.5	46.2	53.3
two stage method:
Faster R-CNN	ResNet-50-FPN	36.2	59.1	39.0	18.2	39.0	48.2
Faster R-CNN	ResNet-101-FPN	38.9	60.9	42.3	22.4	42.4	48.3
Mask R-CNN	ResNet-50-FPN	34.4	56.3	36.6	18.6	37.2	44.5
Mask R-CNN	ResNet-101-FPN	36.3	58.5	38.7	19.2	39.3	47.4
ours:
FCOS	ResNet-50-MGFPN	37.5[+1.0]	56.5	40.0	21.9	41.6	48.5
FCOS	ResNet-101-MGFPN	42.4[+0.9]	64.4	46.3	24.6	45.7	54.0
FCOS	MobileNet-V2-MGFPN	33.2[+0.7]	53.2	35.8	15.0	37.5	47.4
FCOS	ResNeXt-32x8d-101-MGFPN	43.4[+0.7]	63.4	46.5	26.1	47.3	54.0
FCOS	ResNeXt-64x4d-101-MGFPN	44.4[+1.2]	64.4	46.7	26.3	47.4	54.2
Mask R-CNN	ResNet-50-MGFPN	35.6[+1.2]	58.1	38.8	20.0	39.6	46.8
Mask R-CNN	ResNet-101-MGFPN	37.1[+0.8]	58.9	40.1	22.3	40.9	47.4
improvements:
FCOS + GIOU [23]	ResNet-50-MGFPN	37.7[+1.2]	60.0	40.8	22.8	41.4	48.4
FCOS + ATSS [32]	ResNet-50-MGFPN	37.9[+1.4]	60.3	40.7	23.6	41.8	47.9

For one-stage methods, the base model FCOS is equipped with different backbones to demonstrate that MGFPN generalizes across backbones. By replacing FPN with MGFPN, FCOS with ResNet-50, ResNet-101, MobileNet-V2, ResNeXt-32x8d-101 and ResNeXt-64x4d-101 backbones improves by 1.0, 0.9, 0.7, 0.7 and 1.2 AP, respectively. The heatmap comparison between FPN and MGFPN (both with a ResNet-50 backbone) is shown in Figure 6.

Fig. 6

Comparison of the response fields of FPN and MGFPN. MGFPN shows more active regions than FPN on the illustrated images.

For two-stage methods, the main purpose of the experiments is to validate the effectiveness of MGFPN on a second type of detector, so fewer experiments are conducted than for one-stage methods. As shown in Table 1, Mask R-CNN is improved by 1.2 AP and 0.8 AP when using ResNet-50 and ResNet-101 as the backbone, respectively. These improvements suggest that MGFPN is a strong alternative module for other FPN-based methods.

Finally, some computationally cheap but effective modules are added to MGFPN-based FCOS to test whether MGFPN conflicts with these improvements. As can be seen in Table 1, FCOS is boosted by 1.2 AP and 1.4 AP when using GIOU and ATSS, respectively. These experiments verify the robustness of MGFPN.

4.4 Ablation study

In this section, extensive ablation experiments are conducted to analyze the effects of the individual components of our proposed method. All ablation studies are conducted on ResNet-50 FCOS, and the other experimental settings comply with those described above.

Ablation studies on group number of CA. The group number is an important hyper-parameter that balances computation cost against average precision. FLOPs (the number of floating-point operations) is a measure of the computation cost of a model.

$\begin{matrix} FLOPs = K_{h} \cdot K_{w} \cdot K_{c} \cdot K_{n} \end{matrix}$ (4)

In the formula above, K_h, K_w, K_c and K_n denote the height, width, number of channels and number of kernels, respectively. Specifically, K_c is equal to the number of channels of the input feature maps and K_n is equal to the number of channels of the output feature maps.
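As a worked instance of Eq. (4), consider a 3×3 kernel that maps 256 input channels to 256 output channels (illustrative sizes, not taken from the experiments):

$\begin{matrix} FLOPs = 3 \cdot 3 \cdot 256 \cdot 256 = 589824 \end{matrix}$

For a group convolution with g groups, each kernel only sees K_c / g input channels, so under the same counting convention the cost becomes

$\begin{matrix} FLOPs_{g} = K_{h} \cdot K_{w} \cdot \frac{K_{c}}{g} \cdot K_{n} \end{matrix}$

and with g = 32 the example above drops to 3 · 3 · 8 · 256 = 18432. The cost is inversely proportional to g, which is consistent with the trend in Table 2, where doubling the group number from 32 to 64 roughly halves the reported FLOPs.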

As shown in Table 2, every group number setting of CA improves on the baseline. This is because CA narrows the semantic gaps between the features after lateral connection and improves their semantic representation at the same time. It is worth noting that, with a suitable group number, CA introduces few extra parameters compared to the whole model. It is therefore worth adding to other FPN-based detection models.

Table 2

Ablation studies of CA on MS-COCO minival. “GN” is group number and “unfixed” means that the group number is the same as the dimension of input features

GN	FLOPs	AP	AP ₅₀	AP ₇₅	AP _S	AP _M	AP _L
none	0	36.5	54.5	39.2	19.8	40.0	48.9
32	2.9 · 10⁵	37.3[+0.8]	56.1	39.9	21.2	41.4	48.7
64	1.4 · 10⁵	37.2[+0.7]	55.9	40.0	21.3	41.2	47.9
256	3.6 · 10⁴	37.1[+0.6]	55.9	39.9	21.4	40.8	48.2
unfixed	7.7 · 10²	36.8[+0.3]	55.5	39.7	21.5	41.0	47.8

Ablation studies on group number of MGConv. Experimental results for four group number settings of MGConv are presented in Table 3. As mentioned in YOLOv4 [1], increasing the number of parameters gives a model greater capacity to detect multiple objects of different sizes in a single image, so more computation usually brings better detector performance. The experiments in Table 2 support this idea: there is only one parameter in Table 2, so the relation between computation and performance is simple and monotonic. In Table 3, however, there are four parameters, so experiments can be designed to keep the computation roughly equal by adjusting these four parameters and then observing their influence.

Table 3

Performance comparison on COCO minival for different group numbers of each convolution kernel. The baseline is FCOS-ResNet50-MGFPN. "Time" is the approximate training time in hours

CONV3	CONV5	CONV7	CONV9	FLOPs	AP	TIME
1	2	4	8	10.0 · 10⁷	37.3	104h
1	4	8	16	6.1 · 10⁶	37.5	92h
1	8	4	16	6.8 · 10⁶	37.4	92h
1	8	8	8	6.5 · 10⁶	37.4	92h

As Table 3 shows, the performance of the model is not closely related to the computation cost: when a model consumes more computing resources, the average precision may even decrease. This resembles the situation before ResNet [9], when adding more layers, and hence more parameters, could damage the performance of a model. Therefore, the number of groups in group convolution needs a more delicate design, or a better design is needed to overcome its inherent defects. It is worth noting that when the FLOPs of the models are close, their training times are roughly the same.
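The FLOPs column of Table 3 can be approximated by applying Eq. (4) to each MGConv branch. The sketch below is an assumption about how such a count could be made, not the authors' script: it takes K_c as the number of input channels divided by the group number, K_n as the C / 4 = 64 output channels of one branch, and sums over the C3, C4 and C5 inputs of ResNet-50 (512, 1024 and 2048 channels). All names are hypothetical.

```python
KERNEL_SIZES = (3, 5, 7, 9)              # one branch per kernel size
BACKBONE_CHANNELS = (512, 1024, 2048)    # C3, C4, C5 of ResNet-50
BRANCH_OUT_CHANNELS = 256 // 4           # each branch outputs C / 4 channels


def mgconv_cost(groups):
    """Eq. (4) summed over all branches and pyramid inputs.

    groups: group numbers for the 3x3, 5x5, 7x7 and 9x9 branches.
    """
    total = 0
    for c_in in BACKBONE_CHANNELS:
        for k, g in zip(KERNEL_SIZES, groups):
            # A grouped kernel only sees c_in / g input channels.
            total += k * k * (c_in // g) * BRANCH_OUT_CHANNELS
    return total


settings = [(1, 2, 4, 8), (1, 4, 8, 16), (1, 8, 4, 16), (1, 8, 8, 8)]
costs = {g: mgconv_cost(g) for g in settings}
```

Under these assumptions the last three settings evaluate to about 6.1 · 10⁶, 6.8 · 10⁶ and 6.5 · 10⁶, in line with Table 3. The first setting evaluates to about 1.0 · 10⁷, a factor of ten below the tabulated 10.0 · 10⁷, so either the assumptions do not hold for that row or its exponent is misprinted.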

Ablation studies on influence of each component. To analyze the influence of each component in MGFPN, MGConv and CA are applied to the model one at a time to validate their effectiveness. The improvement brought by combining the two components is also presented to demonstrate that they are complementary. The baseline for all ablation studies is FCOS with ResNet50.

As shown in Table 4, CA improves the baseline by 0.7 AP, because it narrows the semantic gaps between the features after the lateral connection in the FPN pipeline. MGConv improves the detection performance from 36.5 to 37.4 AP. After the spatial information loss is reduced, the detector improves on small objects in particular (AP_S rises from 19.8 to 21.4).

Table 4

Effect of each component. Results are reported on COCO minival. MGConv:Mixed Group Convolution. CA:Contextual Attention

MGConv	CA	AP	AP ₅₀	AP ₇₅	AP _S	AP _M	AP _L
		36.5	54.5	39.2	19.8	40.0	48.9
√		37.4[+0.9]	56.4	40.0	21.4	41.4	48.1
	√	37.2[+0.7]	55.9	40.0	21.3	41.2	47.9
√	√	37.5[+1.0]	56.5	40.0	21.9	41.6	48.5

5 Conclusion and future work

Some qualitative results are shown in Fig. 7. In this paper, the inherent problems of FPN, which prevent multi-scale features from being fully exploited, are analyzed. A new alternative FPN structure named MGFPN is proposed to further exploit the potential of multi-scale features. With two simple yet effective modules, MGConv and CA, MGFPN improves the baseline methods by 0.7 to 1.2 AP on the challenging MS-COCO dataset. Based on our study, future work on MGFPN may focus on the following aspects: 1) Computation. The whole structure may introduce too much computation when small group numbers are used. 2) Speed. Group convolution needs to be optimized at the CUDA (Compute Unified Device Architecture) level; otherwise the computing speed will be very slow.

Fig. 7

Part of the detection results on the COCO validation set. ResNet-50 is used as the backbone of our method. As the figure illustrates, MGFPN performs well on a wide range of objects, including crowded, occluded, highly overlapped, small and large objects.

Acknowledgments

This work is supported by the Natural Science Foundation of Guangdong Province No. 2018A030313318 and the Key-Area Research and Development Program of Guangdong Province No. 2019B111101001.

References

Bochkovskiy

, Wang

C.-Y.

and Liao

H.-Y.M.

, Yolov4: Optimal speed and accuracy of object detection, arXiv preprint arXiv:2004.10934, (2020).

Chen

L.-C.

, Papandreou

, Kokkinos

, Murphy

and Yuille

A.L.

, Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs, IEEE transactions on pattern analysis and machine intelligence 40(4) (2017), 834–848.

Chollet

, Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (2017), 1251–1258.

Deng

, Dong

, Socher

, Li

L.-J.

, Li

and Fei-Fei

, Imagenet: A large-scale hierarchical image database, In 2009 IEEE conference on computer vision and pattern recognition 248–255. Ieee, (2009).

Duan

, Bai

, Xie

, Qi

, Huang

and Tian

, Centernet: Keypoint triplets for object detection, In Proceedings of the IEEE International Conference on Computer Vision (2019), 6569–6578.

C.-Y.

, Liu

, Ranga

, Tyagi

and Berg

A.C.

, Dssd: Deconvolutional single shot detector, arXiv preprint arXiv:1701.06659, (2017).

Guo

, Fan

, Zhang

, Xiang

and Pan

, Augfpn: Improving multi-scale feature learning for object detection, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020), 12595–12604.

, Gkioxari

, Dollár

and Girshick

, Mask r-cnn. In Proceedings of the IEEE international conference on computer vision (2017), 2961–2969.

, Zhang

, Ren

and Sun

, Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (2016), 770–778.

10.

, Shen

, Albanie

, Sun

and Vedaldi

, Gather-excite: Exploiting feature context in convolutional neural networks, In Advances in neural information processing systems (2018), 9401–9411.

11.

, Shen

and Sun

, Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (2018), 7132–7141.

12.

Iandola

, Moskewicz

, Karayev

, Girshick

, Darrell

and Keutzer

, Densenet: Implementing efficient convnet descriptor pyramids, arXiv preprint arXiv:1404.1869, (2014).

13.

Krizhevsky

, Sutskever

and Hinton

G.E.

, Imagenet classification with deep convolutional neural networks, In Advances in neural information processing systems (2012), 1097–1105.

14. Larsson, Maire and Shakhnarovich, FractalNet: Ultra-deep neural networks without residuals, arXiv preprint arXiv:1605.07648 (2016).

15. Law and Deng, CornerNet: Detecting objects as paired keypoints, In Proceedings of the European Conference on Computer Vision (ECCV) (2018), 734–750.

16. Li, Wang, Hu and Yang, Selective kernel networks, In Proceedings of the IEEE conference on computer vision and pattern recognition (2019), 510–519.

17. Lin T.-Y., Goyal, Girshick, He and Dollár, Focal loss for dense object detection, In Proceedings of the IEEE international conference on computer vision (2017), 2980–2988.

18. Liu, Anguelov, Erhan, Szegedy, Reed, Fu C.-Y. and Berg A.C., SSD: Single shot multibox detector, In European conference on computer vision, Springer (2016), 21–37.

19. Redmon, Divvala, Girshick and Farhadi, You only look once: Unified, real-time object detection, In Proceedings of the IEEE conference on computer vision and pattern recognition (2016), 779–788.

20. Redmon and Farhadi, YOLO9000: Better, faster, stronger, In Proceedings of the IEEE conference on computer vision and pattern recognition (2017), 7263–7271.

21. Redmon and Farhadi, YOLOv3: An incremental improvement, arXiv preprint arXiv:1804.02767 (2018).

22. Ren, He, Girshick and Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, In Advances in neural information processing systems (2015), 91–99.

23. Rezatofighi, Tsoi, Gwak J.Y., Sadeghian, Reid and Savarese, Generalized intersection over union: A metric and a loss for bounding box regression, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019), 658–666.

24. Simonyan and Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556 (2014).

25. Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke and Rabinovich, Going deeper with convolutions, In Proceedings of the IEEE conference on computer vision and pattern recognition (2015), 1–9.

26. Tan and Le Q.V., MixConv: Mixed depthwise convolutional kernels, arXiv preprint arXiv:1907.09595 (2019).

27. Tan, Pang and Le Q.V., EfficientDet: Scalable and efficient object detection, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020), 10781–10790.

28. Tian, Shen, Chen and He, FCOS: Fully convolutional one-stage object detection, In Proceedings of the IEEE international conference on computer vision (2019), 9627–9636.

29. Wolf and Bileschi, A critical view of context, International Journal of Computer Vision 69(2) (2006), 251–261.

30. Woo, Park, Lee J.-Y. and Kweon I.S., CBAM: Convolutional block attention module, In Proceedings of the European conference on computer vision (ECCV) (2018), 3–19.

31. Xie, Girshick, Dollár, Tu and He, Aggregated residual transformations for deep neural networks, In Proceedings of the IEEE conference on computer vision and pattern recognition (2017), 1492–1500.

32. Zhang, Chi, Yao, Lei and Li S.Z., Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020), 9759–9768.

33. Zhang, Qi G.-J., Xiao and Wang, Interleaved group convolutions, In Proceedings of the IEEE international conference on computer vision (2017), 4373–4382.

34. Zhao, Shi, Qi, Wang and Jia, Pyramid scene parsing network, In Proceedings of the IEEE conference on computer vision and pattern recognition (2017), 2881–2890.

35. Zhu, He and Savvides, Feature selective anchor-free module for single-shot object detection, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019), 840–849.