Pointwise geometric and semantic learning network on 3D point clouds

Abstract

The geometric and semantic information of 3D point clouds significantly influence the analysis of 3D point cloud structures. However, semantic learning of 3D point clouds based on deep learning is challenging due to the naturally unordered data structure. In this work, we strive to impart machines with the knowledge of 3D object shapes, thereby enabling them to infer the high-level semantic information from the 3D model. Inspired by the vector of locally aggregated descriptors, we propose indirectly describing the high-level semantic information by associating each point’s low-level geometric descriptor with a few visual words. Based on this approach, we design an end-to-end network for 3D shape analysis that combines pointwise low-level geometric and high-level semantic information. The network includes a spatial transform and a uniform operation that make it invariant to input rotation and translation, respectively. Our network also employs pointwise feature extraction and pooling operations to solve the unordered point cloud problem. In a series of experiments with popular 3D shape analysis benchmarks, our network exhibits competitive performance on many important tasks, such as 3D object classification, 3D object part segmentation, semantic segmentation in scenes, and commercial 3D CAD model retrieval.

Keywords

3D point clouds; convolutional neural network; object classification; semantic segmentation; shape retrieval

1. Introduction

Automatic classification and segmentation of 3D models are tasks that have received substantial attention in recent years, not only because of their many applications, such as scene understanding, robot navigation, augmented reality and surface reconstruction, but also because they remain challenging. 3D point clouds [1, 2, 3] are an important type of geometric data structure for 3D models. Although neural networks and deep learning approaches have been successfully applied to other data analysis domains, such as structural system identification [4, 5, 6, 7, 8, 9], defect and crack detection [10, 11, 12, 13], traffic network management [14, 15], image processing [16, 17, 18], and earthquake magnitude prediction [9], it is nontrivial to apply them to analyze 3D point clouds. To address 3D point clouds using deep learning, most researchers first transform such data into a structural representation and then feed the representation into the neural network. At present, the methods to create structural representations from 3D models can be roughly divided into three types: (1) operate directly on a 3D point cloud dominated by handcrafted features [19, 20]; (2) convert a 3D point cloud into multiview images [21, 22, 23]; (3) convert a 3D point cloud into a 3D volumetric representation [24, 25, 26, 27].

Figure 1.

Illustration of the low-level geometric and high-level semantic features of the aircraft.

Among the existing works, three problems still exist when applying deep learning to 3D point cloud classification and segmentation: (1) A 3D point cloud is a low-resolution resampling of the geometric shape of the 3D space. No strict grid structure exists, which makes it difficult to directly learn the semantic information from 3D point clouds using deep learning. (2) A 3D point cloud, which is a set of points that are scattered in the 3D space, is much more complicated than images. The pose transformation and geometric distortion of a 3D point cloud make it more difficult to classify and segment. (3) A 3D point cloud is a set of points without a specific order. Thus, deep learning cannot learn an order-invariant function to canonicalize input point clouds.

The goal of this work is to enhance the performance in 3D shape analysis and make the 3D point cloud easier to classify and segment. To achieve this goal, we propose a novel network for 3D shape analysis that combines pointwise low-level geometric and high-level semantic information, named pointwise geometric and semantic learning network (PointwiseNet), whose intuition is shown in Fig. 1.

The low-level geometric features (point $p_{1}$ is located at the tail of the aircraft) and the high-level semantic features (the aircraft consists of 5 parts) of the aircraft can easily be identified by the human eye. The low-level geometric features, which are explicit expressions, can be described by many approaches [28, 29]. In contrast, the high-level semantic features, which are implicit expressions, are difficult to describe directly. The main technical challenge of our method is to capture the 3D model’s high-level semantic features. We were motivated by the successful feature detection algorithm NetVLAD [45], which stores the sum of the residuals for each visual word (cluster centre) of a 2D image, and by PointNetVLAD [30], which leverages the recent success of deep networks to solve point-cloud-based retrieval for place recognition. These visual words, which can be explicitly learned by a deep network from 3D point clouds, have a significant relationship with each low-level geometric feature. Therefore, the high-level semantic feature can be indirectly described by the relationships between a low-level geometric feature and its associated visual words.

Based on this idea, we develop two types of pointwise fixed-dimensional vectors, including low-level geometric and high-level semantic features. As shown in Fig. 2, the pointwise feature learning component in our proposed network is composed of three phases: a spatial transformer network (STN), the K-nearest neighbour (KNN) algorithm, and a vector of locally aggregated descriptors (VLAD). The STN module enables our network to be invariant to input rotation, while the KNN and VLAD modules are used to extract the pointwise low-level geometric and high-level semantic features, respectively.

To verify our claims and justify the design choices in PointwiseNet, we performed experiments on a number of benchmark datasets (ModelNet [24], ShapeNet [1], S3DIS [31], and 3D CAD model database [32]). The results demonstrate that the proposed method achieves state-of-the-art performance.

This work provides the following three main contributions:

• We present a novel method based on the VLAD mechanism to extract high-level semantic information from point clouds. The high-level semantic information is indirectly described by the relationship between each point’s low-level geometric descriptor and a few visual words; this relationship allows machines to infer the high-level semantic information of the 3D point cloud.

• We construct an end-to-end network for 3D shape analysis, dubbed PointwiseNet, which combines pointwise low-level geometric and high-level semantic features. The network has the ability to classify and segment 3D models, and it does not require any pretraining.

• Four different strategies, including pointwise feature extraction, spatial transform, uniform operation, and pooling operation, enable PointwiseNet to be rotation invariant, translation invariant, and robust to the disorder of point cloud data.

The remainder of this paper is organized as follows. Section 2 briefly reviews some related works concerning the use of deep learning with 3D data. Section 3 outlines the overall framework of our network for 3D point cloud classification/segmentation. Sections 4 and 5 present some implementation details, such as descriptions of the STN, KNN, and VLAD modules. The experimental results on several datasets are reported in Section 6, followed by our conclusions in Section 7.

2. Related work

Our learning framework is a unified framework based on deep learning that can be used to perform various 3D shape analyses. Recently, a large number of works have been published concerning 3D deep learning. With the efforts of the whole community over the past few years, significant progress has been made on some longstanding problems. These approaches can be grouped into four types according to the 3D shape representation used in each solution: (1) Handcrafted features; (2) Multi-view CNNs; (3) Volumetric CNNs; (4) Point CNNs. We review each of these approaches in turn.

2.1 Handcrafted features

These types of methods extract the low-level features of 3D models via feature descriptors and then feed them into a neural network. Many excellent feature descriptors exist, including heat kernel signatures (HKS) [33], 3D voxel grids [34], the spherical harmonic descriptor [28], and the light field descriptor [35]; however, the accuracy of these methods relies heavily on the choice of handcrafted features.

Qin et al. [19] presented a deep learning approach to automatically classify 3D CAD models according to a mechanical part catalogue, and it was the first work to successfully apply the deep learning technique to commercial 3D CAD model classification. They employed Zernike moments and Fourier descriptors to characterize 3D models as a set of multidimensional vectors and then extracted high-level features through deep neural networks. Bu et al. [20] combined HKS and the average geodesic distance into low-level features, which were then converted into middle-level features through the bag-of-features models. They utilized deep belief networks to learn high-level features from middle-level features and applied those to 3D model retrieval and recognition.

These methods use multiple descriptors to superimpose and extract low-level features with one descriptor and extract higher-level features with another descriptor; thus, they lose some information compared to the model itself.

2.2 Multi-view CNNs

Convolutional neural networks (CNNs) [36, 37, 16] and their recent improvements [38] have been successfully applied to a wide range of applications in computer vision. These methods work well because the pixels of 2D images are located at fixed positions in a strict grid framework. Because 3D point clouds have no regular format as do 2D images, most researchers typically transform 3D point cloud data into collections of images before feeding them to a deep network.

Recently, Su et al. [21] designed a multi-view 2D CNN (MVCNN) for 3D shape recognition that achieved promising results. First, the method projects the 3D model into an image; then, it extracts the features of the projected image using a 2D model pretrained on ImageNet. Then, view-point pooling is used to combine all the streams obtained from each view, and finally, the fused features are classified by another CNN network. Qi et al. [22] conducted comprehensive experiments to compare the recognition performance of 2D multi-view CNNs against those of 3D volumetric CNNs. Pang et al. [23] studied the challenging problem of detecting 3D objects in point clouds with discrete sampling, noisy scans, occlusions and cluttered scenes, and they designed a multiview CNN for object detection in point clouds.

The 2D descriptor is learned with a trainable neural network in multi-view CNNs rather than handcrafted features. Consequently, they do not need to store and read the handcrafted features from disk, leading to significant computational gains. Despite their state-of-the-art performances, these multi-view CNNs still cannot fully exploit the 3D geometric information in the 3D point cloud.

2.3 Volumetric CNNs

To retain the geometric information of the 3D point cloud, some volumetric CNNs have been proposed in recent years. Volumetric CNNs first convert raw point clouds into a 3D volumetric grid, which is represented as a binary probability distribution (if the voxel is in the 3D surface, its value is 1; otherwise, it is 0) and then feed it into 3D deep CNNs for classification or segmentation.

Wu et al. [24] introduced the ModelNet database and proposed learning deep volumetric representations of shapes using a deep belief network architecture for shape recognition and completion. Maturana et al. [25] proposed VoxNet, which was a pioneering effort that used 3D convolutional networks for object recognition. VoxNet can efficiently manage large amounts of point cloud data by integrating a volumetric occupancy grid representation with a supervised 3D CNN. Riegler et al. [26] proposed OctNet, a representation for deep learning from sparse 3D data. OctNet is a memory-efficient data structure, a hybrid grid-octree, which enables 3D convolutional networks that are both deep and have high resolution. Wang et al. [27] presented an octree-based CNN, named O-CNN, for 3D shape analysis. O-CNN represents the 3D shapes with octrees and performs 3D CNN operations only on the sparse octants occupied by the boundary surfaces of the 3D shapes.

Compared with multi-view CNNs, volumetric CNNs are able to better maintain the geometric information of the 3D model. Unfortunately, the performance of volumetric CNNs is largely limited by the resolution loss, and they have exponentially increasing computational costs. Additionally, while extending 2D CNNs to 3D appears to be natural, the data sparsity introduces significant challenges. Overall, the high computational complexity of volumetric CNNs and the data sparsity of point clouds prevent them from scaling up and sustaining sufficient spatial resolution to preserve the details.

2.4 Point CNNs

A 3D point cloud contains 3D coordinates of some sample points on the surface of a 3D model. Analysing point cloud data directly through CNNs will encounter three main problems: (1) Point clouds are unordered data; (2) Point clouds are typically sparse data; (3) Point clouds contain very limited information.

To solve the above problems, Qi et al. [39] proposed PointNet, which was one of the first network architectures for directly handling 3D point clouds. The main limitation of PointNet is that it cannot capture the low-level geometric information of the point cloud in a hierarchical manner. To address this problem, PointNet++ [40] was developed to build a pyramid-like feature aggregation scheme, but its point sampling and grouping strategy does not reveal the spatial distribution of the input point cloud. Although PointNet++ achieves competitive performance, network structures based on PointNet are rather complex. Subsequently, PointCNN [41] explored the idea of equivariance rather than invariance and exhibited performance competitive to that of PointNet. Taking a different direction, Klokov et al. [42] proposed a Kd-Net for 3D point cloud recognition. A Kd-tree was built on input point clouds, and hierarchical groupings were applied to model the local dependencies in points. Higher resolutions enlarge the Kd-tree in [42]; however, they also require extra computation.

Compared to volumetric CNNs, point CNNs have lower computational costs and provide better performances; however, some point CNNs focus on 3D object classification, while others focus on 3D scene segmentation. To the best of our knowledge, it is difficult to construct one general network to cover the above limitations. The purpose of this study is to combine both pointwise low-level geometric and high-level semantic information using deep learning techniques to learn more representative characteristics from 3D point clouds, as well as to explore a simpler network for 3D shape analysis.

3. The network architecture

Figure 2.

The flowchart of the proposed network. N is the number of points in the input point cloud; C is the number of neurons in the last fully connected layer of the classification network; M is the number of categories in part segmentation or semantic segmentation in scene tasks; FC stands for a fully connected layer, and the numbers reflect the layer sizes.

Figure 2 shows the network architecture of our PointwiseNet, which consists of three main components: pointwise feature learning, classification and segmentation.

3.1 Pointwise feature learning

The pointwise feature learning consists of three phases: STN, KNN, and VLAD. (1) The STN module is used to apply transformations such as rotation and translation (Section 4). (2) The KNN module is used to extract the pointwise low-level geometric information for each point of the 3D point cloud (Section 5.1). (3) The VLAD module is used to extract the pointwise high-level semantic information for each point of the 3D point cloud, which is indirectly described by the relationship of each point’s low-level geometric descriptor with a few visual words (Section 5.2).

3.2 Classification network

For the 3D object classification task, the complete classification network consists of pointwise feature learning and classification. The feature learning takes $N$ points as input, applies input and feature transformations, and then aggregates the pointwise features into a global feature vector by using a global pooling layer. After the global pooling layer in the first component, the 3D point cloud is represented as a 1,024-dimensional feature vector. To classify the point clouds, three fully connected layers are attached after the global feature vector. The final output is the $C$ scores for all the $C$ candidate classes.
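The role of the global pooling layer can be illustrated with a minimal NumPy sketch. The random weights, the array names, the assumed 64-dimensional pointwise input, and the single shared layer are placeholders chosen for illustration only, not the trained PointwiseNet layers. The sketch shows one property: a max taken over the point axis produces the same 1,024-dimensional global vector for any ordering of the input points.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D_in, D_global = 1024, 64, 1024             # D_in is an assumed pointwise feature size
point_features = rng.normal(size=(N, D_in))    # stand-in for learned pointwise features
W = rng.normal(size=(D_in, D_global)) * 0.01   # hypothetical shared weights

def global_feature(features: np.ndarray) -> np.ndarray:
    """Shared per-point linear map with ReLU, then max over the point axis."""
    per_point = np.maximum(features @ W, 0.0)  # (N, 1024)
    return per_point.max(axis=0)               # (1024,)

g_original = global_feature(point_features)
g_shuffled = global_feature(point_features[rng.permutation(N)])
```

Because the maximum is a symmetric function of its arguments, `g_original` and `g_shuffled` agree up to floating-point rounding, which is why the fully connected classifier that follows does not depend on point order.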

3.3 Segmentation network

For the 3D semantic segmentation task, the complete segmentation network consists of pointwise feature learning and segmentation. The feature learning process takes a single object for part region segmentation as input, while the segmentation component concatenates the three output vectors (low-level geometry vector, high-level semantic vector, and global feature vector) into a 1,536-dimensional feature vector. This vector is then input into the four fully connected layers to obtain the final classification result, which consists of $N\times M$ scores for each of the $N$ points and each of the $M$ semantic subcategories.

4. Spatial transform

In this section, we show how the STN module makes our model invariant to input rotation. The original 3D point cloud is represented as a set of 3D points $P=\{p_{1},\ldots,p_{N}|p_{n}\in\mathbb{R}^{3}\}$ , where each point is a vector of its $(x,y,z)$ coordinates. The spatial transform operation is defined as follows:

$\displaystyle\tilde{P}=PA_{\theta}$ (1)

where $\tilde{P}$ is the transformed 3D point cloud that has been aligned to a canonical space and $A_{\theta}$ is the transformation matrix. $A_{\theta}$ is represented as follows:

$\displaystyle A_{\theta}=\left[\begin{array}{ccc}\theta_{11}&\theta_{12}&\theta_{13}\\ \theta_{21}&\theta_{22}&\theta_{23}\\ \theta_{31}&\theta_{32}&\theta_{33}\end{array}\right]$ (2)

Jaderberg et al. [43] introduced the idea of a spatial transformer to align 2D images using deep networks; however, that effort was limited by its inability to be spatially invariant to the 3D point cloud. We need the representation learned from the input 3D point cloud to be spatially invariant, which can be achieved by extending the above research [43] to the 3D case. To achieve end-to-end training, we design a learnable module, namely, the STN module shown in Fig. 3, to obtain the transformation matrix.

Figure 3.

Illustration of the structure of the STN module.

The STN module is split into two parts: T-net and matrix multiply. T-net is a regressor network (including a number of hidden layers), which takes the original 3D point set as input and outputs the 9 parameters of the transformation matrix $A_{\theta}$ defined in Eq. (2). According to Eq. (1), the original 3D point set $P$ is multiplied with the transformation matrix $A_{\theta}$ to obtain the transformed 3D point cloud $\tilde{P}$ . The STN module is a dynamic mechanism that can actively spatially transform a 3D point set by producing an appropriate transformation for each input case.
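The matrix-multiply half of the STN module can be sketched in a few lines of NumPy. The T-net itself is not reproduced here: the 9 parameters below are a hand-picked, hypothetical regressor output (a 90-degree rotation about the z-axis), and the function name is an assumption made for illustration.

```python
import numpy as np

def spatial_transform(points: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """Apply Eq. (1): P_tilde = P @ A_theta, with A_theta built from 9 parameters."""
    A_theta = theta.reshape(3, 3)   # Eq. (2), filled row by row
    return points @ A_theta         # points are stored as rows (N, 3)

# Hypothetical T-net output: rotation by 90 degrees about the z-axis.
theta = np.array([0.0, 1.0, 0.0,
                  -1.0, 0.0, 0.0,
                  0.0, 0.0, 1.0])
P = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
P_tilde = spatial_transform(P, theta)
```

With points stored as row vectors, the point on the x-axis is mapped onto the y-axis, and all point norms are preserved because this particular $A_{\theta}$ is orthogonal. A learned $A_{\theta}$ is not guaranteed to be orthogonal unless it is constrained or regularized.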

5. Learning pointwise features with different levels

As shown in Fig. 2, the KNN and VLAD modules both have unique advantages: neither can replace the other because they extract features from different levels. Thus, the two sets of features (the low-level geometric features and high-level semantic features) represent a 3D point set from different viewpoints.

5.1 Low-level geometric feature

After performing a spatial transformation on the 3D point set, our model needs to be able to capture the low-level geometric feature from nearby points. We propose extracting the low-level geometric feature based on the pointwise KNN search method (as shown in Fig. 4), which mainly includes the following three steps: (1) KNN search; (2) unified coordinates; (3) feature transformation and fusion.

Figure 4.

The structure of the KNN module.

5.1.1 KNN search

After transformation by the STN module, the original 3D point set $P=\{p_{1},\ldots,p_{N}|p_{n}\in\mathbb{R}^{3}\}$ is converted into the spatially invariant 3D point set $\tilde{P}=\{\tilde{p}_{1},\ldots,\tilde{p}_{N}|\tilde{p}_{n}\in\mathbb{R}^{3}\}$. As shown in Fig. 4, for each point $\tilde{p}_{n}$ we search for its $K$ ($K=16$) nearest neighbours among the remaining points $\tilde{P}-\tilde{p}_{n}$. The point-to-point-set $k$NN search is defined as follows:

$\displaystyle\hat{p}_{n,k}=\textit{kNN}(\tilde{p}_{n}|(\tilde{P}-\tilde{p}_{n})),$ (3)

where $\hat{p}_{n,k}$ represents the k-th nearest neighbour of the point $\tilde{p}_{n}$ . Thus, the $K$ nearest neighbours set of the point $\tilde{p}_{n}$ can be represented as $\{\hat{p}_{n,1},\ldots,\hat{p}_{n,K}|\hat{p}_{n,k}\in\mathbb{R}^{3}\}$ .
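The search in Eq. (3) can be sketched with a brute-force NumPy implementation. This is an illustrative assumption rather than the search structure used in the original code: it builds the full $N\times N$ distance matrix, which is adequate for about a thousand points but would be replaced by a spatial index for larger clouds. The function and variable names are hypothetical.

```python
import numpy as np

def knn_indices(points: np.ndarray, k: int) -> np.ndarray:
    """For each point, return the indices of its k nearest neighbours (self excluded)."""
    diff = points[:, None, :] - points[None, :, :]   # (N, N, 3) pairwise offsets
    dist2 = np.einsum("ijk,ijk->ij", diff, diff)     # squared Euclidean distances
    np.fill_diagonal(dist2, np.inf)                  # remove each point from its own search set
    return np.argsort(dist2, axis=1)[:, :k]          # (N, k), nearest first

rng = np.random.default_rng(0)
P_tilde = rng.normal(size=(256, 3))                  # stand-in for the transformed point set
idx = knn_indices(P_tilde, k=16)
neighbours = P_tilde[idx]                            # (256, 16, 3)
```

Setting the diagonal to infinity implements the set difference $\tilde{P}-\tilde{p}_{n}$, so a point never appears in its own neighbour list.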

5.1.2 Unified coordinates

A single vector that contains both the coordinates of the point $\tilde{p}_{n}$ and those of its $K$ nearest neighbours is constructed in two steps. First, each neighbour in the set $\{\hat{p}_{n,1},\ldots,\hat{p}_{n,K}\}$ is normalized by subtracting the coordinates of its associated point $\tilde{p}_{n}$. Then, a 6-dimensional vector is obtained for each point-neighbour pair by concatenating the point $\tilde{p}_{n}$ with the normalized neighbour, which is represented as

$\displaystyle u_{n,k}=\tilde{p}_{n}\oplus(\hat{p}_{n,k}-\tilde{p}_{n}),$ (4)

where $n\in[1,N]$ , $k\in[1,K]$ , $u_{n,k}\in\mathbb{R}^{6}$ , and $\oplus$ is a concatenation operator.
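Eq. (4) translates directly into array operations. The sketch below uses a three-point toy cloud with hand-written neighbour indices ($K=2$) so that one entry can be checked by hand; the function name and the toy data are assumptions made for illustration.

```python
import numpy as np

def unified_coordinates(points: np.ndarray, neighbours: np.ndarray) -> np.ndarray:
    """Eq. (4): u[n, k] = p_n concatenated with (p_hat[n, k] - p_n)."""
    n_points, k, _ = neighbours.shape
    centre = np.broadcast_to(points[:, None, :], (n_points, k, 3))
    offset = neighbours - points[:, None, :]          # neighbour relative to its centre point
    return np.concatenate([centre, offset], axis=-1)  # (N, K, 6)

points = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0]])
# Hypothetical neighbour lists (K = 2), given as row indices into `points`.
idx = np.array([[1, 2],
                [0, 2],
                [0, 1]])
u = unified_coordinates(points, points[idx])
```

For example, `u[1, 0]` is the point $(1,0,0)$ followed by the offset of its first neighbour, $(0,0,0)-(1,0,0)=(-1,0,0)$. Keeping the absolute coordinates alongside the relative offsets preserves both where the point sits in the model and what its local neighbourhood looks like.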

5.1.3 Feature transform and fusion

The output from the above section forms an $N\times K\times 6$ tensor, where $N$ denotes the number of points in the 3D point set, $K$ denotes the number of neighbours for each point, and the last 6 dimensions are the coordinates of the point $\tilde{p}_{n}$ and the unified coordinates of the $k$ -th nearest neighbour point. Thus, each point is represented as a $K\times 6$ matrix. To obtain sufficient expressive power to transform each point feature into a higher-dimensional feature, a fully connected layer (illustrated in Fig. 4 (right)) is added to the KNN module. Given the input vector $u_{n,k}$ , $k\in[1,K]$ , the output from the fully connected layer can be formulated as

$\displaystyle\tilde{u}_{n,k}=f_{fc}\big(u_{n,k}\big),$ (5)

where $f_{fc}$ denotes the fully connected layer and $\tilde{u}_{n,k}$ is a 64-dimensional vector. After the fully connected layer, each point is represented as a $K\times 64$ matrix. Thus, the 3D point set is represented as an $N\times K\times 64$ tensor.

A pointwise local pooling layer is applied to generate the low-level geometric feature. Given the input matrix $\{\tilde{u}_{n,1},\ldots\tilde{u}_{n,K}|\tilde{u}_{n,k}\in\mathbb{R}^{64}\}$ , the output $v_{n}$ from the pointwise local pooling layer can be formulated as

$\displaystyle v_{n}=f_{\textit{pool}}(\tilde{u}_{n,1},\ldots,\tilde{u}_{n,K})$ (6)

where $f_{\textit{pool}}$ is the pointwise local pooling layer and $v_{n}$ is a 64-dimensional vector. According to our experiments, max-pooling performs better than does average-pooling.

After the pointwise local pooling layer, each point is represented as a 64-dimensional vector. Thus, the 3D point set is represented as an ( $N\times 64$ )-dimensional feature, denoted as $\{v_{1},\ldots,v_{N}|v_{n}\in\mathbb{R}^{64}\}$ . Here, the KNN module can be regarded as the component that learns to extract the pointwise low-level geometric feature from the input 3D point set. Moreover, the KNN module extracts low-level geometric features in a pointwise manner, which solves the problem of point cloud disorder while effectively improving the accuracy of per-point segmentation tasks.
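The tensor shapes described in this subsection can be traced with a short NumPy sketch of Eqs. (5) and (6). The weights are random placeholders, batch normalization is omitted, and the ReLU after the shared layer is an assumption based on the training configuration; none of this reproduces the trained KNN module.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 1024, 16

u = rng.normal(size=(N, K, 6))                    # stand-in for the N x K x 6 input tensor
W_fc = rng.uniform(-0.001, 0.001, size=(6, 64))   # hypothetical shared weights
b_fc = np.zeros(64)

# Eq. (5): the same fully connected layer is applied to every (point, neighbour) pair.
u_tilde = np.maximum(u @ W_fc + b_fc, 0.0)        # (N, K, 64)

# Eq. (6): pointwise local pooling over the K neighbours of each point.
v_max = u_tilde.max(axis=1)                       # (N, 64), max-pooling
v_avg = u_tilde.mean(axis=1)                      # (N, 64), average-pooling alternative
```

Both pooling variants reduce the neighbour axis and return one 64-dimensional vector per point; they differ only in whether the strongest response or the mean response over the neighbourhood is kept.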

Figure 5.

The structure of the VLAD module.

5.2 High-level semantic feature

In this section, we show how to leverage the VLAD mechanism [44] to extract the high-level semantic features from the 3D point set. VLAD is a popular descriptor pooling method for both instance-level retrieval and image classification. Arandjelovic et al. [45] proposed an end-to-end deep network named NetVLAD that stores the sum of residuals for each visual word (cluster centre) of a 2D image and performs image-based retrieval for place recognition. PointNetVLAD [30] builds on the success of PointNet [39] and NetVLAD [45] to perform 3D point-cloud-based retrieval for large-scale place recognition.

The pointwise high-level semantic feature (e.g., skeleton or part of the 3D model) is an implicit expression that is difficult to describe directly. Inspired by [45, 30], we can indirectly describe the high-level semantic feature by the relationship between each point’s low-level geometric descriptor and a few visual words. As shown in Fig. 5, the VLAD module mainly consists of the following two steps: (1) top- $K$ VLAD feature selection; (2) feature transformation and fusion.

5.2.1 Top- $K$ VLAD feature selection

We take $N$ low-level geometric feature descriptors $\{v_{1},\ldots,v_{N}|v_{n}\in\mathbb{R}^{D}\}$ as input for the VLAD module, where $D=128$. Meanwhile, $M$ visual words (“cluster centres”) are initialized, which are learnable parameters via backpropagation, denoted as $\{c_{1},\ldots,c_{M}|c_{m}\in\mathbb{R}^{D}\}$. Each point’s low-level geometric feature descriptor $v_{n}$ is assigned to each visual word $c_{m}$ and represented by a residual vector ($v_{n}-c_{m}$) that records the difference between the low-level geometric feature descriptor and the visual word. The relationship of the $n$-th low-level geometric feature descriptor $v_{n}$ to the $M$ visual words is denoted as $r$. The ($n$, $d$) element of $r$ is computed as follows:

$\displaystyle r_{n,d}=\sum_{m=1}^{M}a_{n}(c_{m})(v_{n,d}-c_{m,d}),d\in[1,D]$ (7)

where $a_{n}(c_{m})$ are the attention coefficients, and $c_{m,d}$ and $v_{n,d}$ are the $d$ -th dimension of the $m$ -th visual word and the $n$ -th low-level geometric feature descriptor, respectively. The attention coefficients $a_{n}(c_{m})$ are utilized as weights to reflect the importance of the $m$ -th visual word to the $n$ -th low-level geometric feature descriptor. $a_{n}(c_{m})$ is 1 when $c_{m}$ is the closest visual word to $v_{n}$ and 0 otherwise. This value represents the hard assignment $a_{n}(c_{m})$ of the visual word $c_{m}$ to the low-level geometric feature descriptor $v_{n}$ .

To make the VLAD module differentiable and the coefficients easily comparable across different visual words, we apply a soft-assignment of the low-level geometric feature descriptor to the visual words. Therefore, the computation of the coefficients can be formulated as follows:

$\displaystyle a_{n}(c_{m})=\frac{e^{w_{m}^{T}v_{n}+b_{m}}}{\sum_{m'=1}^{M}e^{w_{m'}^{T}v_{n}+b_{m'}}},$ (8)

where $\{w_{m}\}$ and $\{b_{m}\}$ denote the weight and bias terms, respectively. These parameters are learned during training. $a_{n}(c_{m})$ ranges between 0 and 1, where the highest weight is assigned to the closest visual word.
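The soft assignment of Eq. (8) is a softmax over the $M$ visual words, computed independently for every descriptor. The following NumPy sketch uses random placeholder weights and small illustrative sizes; the function name and the max-subtraction step (a standard numerical-stability device that does not change the result) are assumptions not stated in the text.

```python
import numpy as np

def soft_assignment(v: np.ndarray, w: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Eq. (8): attention coefficients a[n, m] of descriptor n over the M visual words."""
    logits = v @ w.T + b                                  # (N, M): w_m^T v_n + b_m
    logits = logits - logits.max(axis=1, keepdims=True)   # stability shift, softmax-invariant
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, M, D = 8, 5, 128
v = rng.normal(size=(N, D))          # stand-in descriptors
w = rng.normal(size=(M, D)) * 0.1    # hypothetical weights, one row per visual word
b = np.zeros(M)
a = soft_assignment(v, w, b)         # (N, M), each row sums to 1
```

Unlike the hard assignment, every coefficient is strictly positive and varies smoothly with $w_{m}$ and $b_{m}$, which is what allows gradients to flow through the assignment during training.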

Note that Eq. (7) is a weighted sum of residuals (the difference between the low-level geometric feature descriptor and the visual word) for each visual word. However, each point in the 3D model may have a significant relationship with only a few visual words. Therefore, it is necessary to consider only the influence of the visual words with higher attention scores on the high-level semantic feature. To accomplish this, we provide the following top-$K$ selection definition.

Definition 1 (Top-$K$ selection). Given a low-level geometric feature descriptor $v_{n}$ and a visual word set $C=\{c_{1},\ldots,c_{M}|c_{m}\in\mathbb{R}^{D}\}$, the top-$K$ selection returns a subset $\tilde{C}=\{\tilde{c}_{1},\ldots,\tilde{c}_{K}|\tilde{c}_{k}\in\mathbb{R}^{D}\}$, $\tilde{C}\subseteq C$, such that for any visual words $\tilde{c}\in\tilde{C}$ and $\hat{c}\in C-\tilde{C}$, $a_{n}(\tilde{c})\geqslant a_{n}(\hat{c})$.

Based on Definition 1, we designed a top- $K$ VLAD feature selection operation in the VLAD module. Finally, the $(n,d)$ element of $r$ is computed as follows:

$\displaystyle r_{n,d}=\sum_{k=1}^{\text{top-}K}\frac{e^{w_{k}^{T}v_{n}+b_{k}}}{\sum_{k'=1}^{\text{top-}K}e^{w_{k'}^{T}v_{n}+b_{k'}}}(v_{n,d}-c_{k,d})$ (9)

where $k\in[1,\text{top-}K]$ and $d\in[1,D]$ . On the one hand, the top- $K$ value controls the number of residual vectors, and on the other hand, it represents the overlap between the different visual words. Moreover, to improve the nonlinear transformation of the network, we use the shared FC layer, and finally, we aggregate the top- $K$ transformed features. Therefore, the output of the VLAD module is consistent with the input (although the vector dimensions are different). This approach not only accelerates the calculation but also improves model accuracy.

As shown in Fig. 5, during network initialization, the visual words are obtained by uniform initialization from [ $-$ 0.01, 0.01]. Then, during network training, the visual words are continuously adjusted by optimizing the loss function. The soft-assignment of the visual word $c_{m}$ to the low-level geometric feature descriptor $v_{n}$ can be regarded as a two-step process: (i) a convolution with a set of $M$ filters $\{w_{m}\}$ that have spatial support 1 $\times$ 1 and biases $\{b_{m}\}$ , producing the output $w_{m}^{T}v_{n}+b_{m}$ ; (ii) the convolution output is then passed through the softmax function to obtain the final soft assignment $a_{n}(c_{m})$ . The top- $K$ selection is performed according to Definition 1.
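The top-$K$ selection and Eq. (9) can be sketched in NumPy as follows. This is one possible reading of the procedure, not the released implementation: the function name is hypothetical, the weights are random placeholders, and the visual words use the uniform initialization from $[-0.01, 0.01]$ described above. Because the softmax is monotonic in its logits, ranking by $w_{m}^{T}v_{n}+b_{m}$ selects the same words as ranking by $a_{n}(c_{m})$. The function returns both the per-word weighted residuals (an $N\times K\times D$ tensor) and their sum over the selected words (the $r_{n,d}$ of Eq. (9)).

```python
import numpy as np

def topk_vlad(v, c, w, b, top_k=3):
    """Weighted residuals to the top_k visual words with the highest attention scores."""
    logits = v @ w.T + b                                       # (N, M)
    top_idx = np.argsort(-logits, axis=1)[:, :top_k]           # Definition 1: top-K words
    top_logits = np.take_along_axis(logits, top_idx, axis=1)   # (N, top_k)
    top_logits -= top_logits.max(axis=1, keepdims=True)        # stability shift
    a = np.exp(top_logits)
    a /= a.sum(axis=1, keepdims=True)                          # softmax over selected words only
    residuals = v[:, None, :] - c[top_idx]                     # (N, top_k, D): v_n - c_k
    weighted = a[:, :, None] * residuals                       # (N, top_k, D)
    return weighted, weighted.sum(axis=1)                      # per-word terms and Eq. (9) sum

rng = np.random.default_rng(0)
N, M, D = 8, 16, 128
v = rng.normal(size=(N, D))                # stand-in low-level descriptors
c = rng.uniform(-0.01, 0.01, size=(M, D))  # visual words, uniform initialization
w = rng.normal(size=(M, D)) * 0.1          # hypothetical 1x1 convolution weights
b = np.zeros(M)
per_word, r = topk_vlad(v, c, w, b, top_k=3)
```

A quick sanity check on the normalization: if all visual words are zero vectors, every residual equals $v_{n}$ and the renormalized weights sum to 1, so `r` reduces to `v`.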

5.2.2 Feature transform and fusion

The output from the above section forms an $N\times 3\times 128$ tensor, where $N$ denotes the number of points in the 3D point set and $3$ denotes the number of top- $K$ visual words, as explained in Definition 1. Thus, each point is represented as a $3\times 128$ matrix. As illustrated in Fig. 5, we added a fully connected layer in the VLAD module. After the fully connected layer, each point is represented as a $3\times 256$ matrix. Thus, the 3D point set is represented as an $N\times 3\times 256$ tensor.

The output tensor of the fully connected layer is then fed into a pointwise pooling layer, which aggregates the top-$K$ transformed features and generates the high-level semantic feature for each point. According to our experiments, max-pooling performs better than does average-pooling. After this pooling layer, each point is represented as a 256-dimensional vector. Here, the VLAD module can be regarded as the component that learns to extract an ($N\times 256$)-dimensional high-level semantic feature from the input 3D point set. Moreover, the VLAD module extracts high-level semantic features in a pointwise manner, which not only solves the problem of point cloud disorder but also effectively improves the accuracy of per-point segmentation tasks.

6. Experiments and results

In this section, we perform a set of architecture analyses to determine the important parameters of our network (Section 6.1). Then, we evaluate the performance of our PointwiseNet in four different applications, namely, 3D object classification (Section 6.2), 3D object part segmentation (Section 6.3), semantic segmentation (Section 6.4) and CAD model retrieval (Section 6.5).

To demonstrate the efficiency and efficacy of our model, we conducted all the experiments on a desktop machine equipped with an Intel Core I7-6300 CPU (3.4 GHz) and a GeForce 1080 GPU (16 GB memory). The training and testing program was implemented in TensorFlow, and the corresponding code is available from our community site.1

6.1 Architecture analysis

In this section, we focus on an ablation analysis to select the various architectures and conduct validation studies to determine the important setups. First, we introduce the datasets used in our experiments and some implementation details of our method (Section 6.1.1). Then, a hyper-parameter analysis is conducted to select the best network parameters (Section 6.1.2). After that, an ablation analysis is performed to evaluate the efficacy of each component (Section 6.1.3). Finally, we analyse different pooling operations (Section 6.1.4).

6.1.1 Dataset and implementation details

ModelNet [24] is a CAD model dataset that has served as a standard benchmark for 3D shape classification in recent years. ModelNet currently contains 127,915 3D CAD models from 662 categories. The 10-class and 40-class variants of ModelNet (ModelNet10 and ModelNet40), containing 4,899 and 12,311 models, respectively, are used for 3D shape classification. ModelNet10 is split into 3,991 models for training and 908 models for testing. ModelNet40 is split into 9,843 models for training and 2,468 models for testing. To obtain 3D point clouds, we uniformly sampled 1,024 points from each mesh by Poisson disk sampling using MeshLab [46] and normalized them into a unit sphere.
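The unit-sphere normalization can be sketched as follows. The text does not state how the centre is chosen, so centring on the centroid and scaling by the farthest point are assumptions, and `normalize_to_unit_sphere` is a hypothetical helper name.

```python
import numpy as np

def normalize_to_unit_sphere(points):
    """points: (N, 3) array. Returns a centred copy that fits inside the unit sphere."""
    # Assumption: the sphere is centred on the centroid of the sampled points.
    centered = points - points.mean(axis=0)
    # Scale so that the farthest point lies exactly on the sphere surface.
    radius = np.linalg.norm(centered, axis=1).max()
    if radius == 0.0:
        return centered  # degenerate cloud: all points coincide
    return centered / radius

cloud = np.random.default_rng(0).uniform(-5.0, 5.0, size=(1024, 3))
unit_cloud = normalize_to_unit_sphere(cloud)
```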

The network configuration of the classification model is set as follows. The loss function is $L(y,\hat{y})=-\sum_{n=1}^{N}y_{n}\log(\hat{y}_{n})$ , where $y$ is the ground truth label and $\hat{y}$ is the output score. All the weights are initialized from a uniform distribution over [ $-$ 0.001, 0.001]. Batch normalization and ReLU activation are applied to all fully connected layers. We optimize PointwiseNet using the Adam optimizer with a momentum of 0.9, a weight decay of 0.0005, and a batch size of 32. The dropout ratio is 0.8. The initial learning rate is 0.001, and it decreases by a factor of 0.7 after every 20 epochs. The optimization terminates after approximately 100 epochs.
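As an illustration of the loss and the learning-rate schedule, the sketch below evaluates the cross-entropy for one sample and the step-wise decay. It assumes a one-hot ground truth vector, predicted probabilities that sum to one, and a staircase schedule; the function names and the small `eps` guard are not taken from the released code.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot label vector and predicted probabilities."""
    # eps guards against log(0); it is an implementation assumption.
    return float(-np.sum(y_true * np.log(y_pred + eps)))

def learning_rate(epoch, base_lr=0.001, decay=0.7, step=20):
    """Staircase schedule: multiply the rate by `decay` after every `step` epochs."""
    return base_lr * decay ** (epoch // step)

label = np.array([0.0, 1.0, 0.0])   # ground truth class is index 1
scores = np.array([0.2, 0.7, 0.1])  # hypothetical softmax output
loss = cross_entropy(label, scores)
rates = [learning_rate(e) for e in (0, 20, 45)]
```

Under these assumptions the loss reduces to the negative log-probability assigned to the true class.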

6.1.2 Hyper-parameter

In this section, we consider several hyper-parameters: (1) the $k$ value in the KNN module; (2) the number of visual words $M$ in the VLAD module; (3) the top- $K$ value in the VLAD module. The number of visual words depends on the application scenario. For 3D object classification, ModelNet40 contains meshed CAD models from 40 categories, and the object categories are labelled with 3 to 5 parts. In the experiments, $M$ is set to 32. For 3D object part segmentation and semantic segmentation, because the number of object parts for each category is fixed, we assume that each category is a visual word; thus $M$ depends on the number of categories.

Therefore, we mainly analyse the influence of $k$ in the KNN module and top- $K$ in the VLAD module on the model’s performance. Specifically, we conducted numerous experiments on ModelNet40. The relationship between the value of $k$ and the classification accuracy is shown in Fig. 6 (symbol: square). As the value of $k$ increases, the classification accuracy also increases. However, when $k$ exceeds 6, the classification accuracy increases only very slowly. Based on these experimental results, we recommend setting $k$ to 6. The effect of the top- $K$ value is shown in Fig. 6 (symbol: triangle). When top- $K$ $=$ 3, our network achieves the best balance between network complexity and classification accuracy.
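For reference, the neighbour search controlled by $k$ can be sketched with a brute-force computation. This is only an illustration: the released code may use a different search structure, and excluding the query point from its own neighbour list is an assumption.

```python
import numpy as np

def knn_indices(points, k=6):
    """points: (N, 3). Returns an (N, k) array of neighbour indices, nearest first."""
    # Pairwise squared Euclidean distances, shape (N, N).
    diff = points[:, None, :] - points[None, :, :]
    dist2 = np.sum(diff * diff, axis=-1)
    # Assumption: a point is not counted as its own neighbour.
    np.fill_diagonal(dist2, np.inf)
    return np.argsort(dist2, axis=1)[:, :k]

cloud = np.random.default_rng(0).standard_normal((1024, 3))
neighbours = knn_indices(cloud, k=6)
```

The brute-force distance matrix is affordable for 1,024 points; larger clouds would call for a spatial index.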

Figure 6.

Quantitative comparisons of different settings in the KNN module and top- $K$ selection.

6.1.3 Ablation analysis

Figure 2 shows the network architecture of PointwiseNet, which is composed of three main modules: the STN module, the KNN module, and the VLAD module. To investigate the efficacy of each module, we conducted an ablation analysis on ModelNet40. A basic version that represents 3D object classification without the KNN and VLAD modules is shown in Fig. 7.

Figure 7.

Ablation analysis. The accuracy improves when the STN module is integrated with the KNN and VLAD modules.

We also compared different designs for the neural network architecture, and the results are reported in Table 1. The first row presents the results of a basic version with only the STN module. Unsurprisingly, the performance without the KNN and VLAD modules is poor. The second row presents the results of integrating the STN and KNN modules, and the third row presents the results of integrating the STN and VLAD modules. As the results show, adding the KNN module alone or the VLAD module alone improves the accuracy by 2.72 and 5.10 percentage points, respectively. When all three modules are integrated, as shown in the last row, PointwiseNet achieves a classification accuracy of 90.86%.

Table 1

Effects of KNN and VLAD modules for 3D object classification on ModelNet40

Basic version (STN module)	KNN module	VLAD module	Accuracy (overall, %)
✓			84.93
✓	✓		87.65
✓		✓	90.03
✓	✓	✓	90.86

We analyse the results of this experiment as follows: (1) Although the modules contribute differently to the classification accuracy, combining them improves the final accuracy further. (2) In the 3D object classification task, the high-level semantic feature is more important than the low-level geometric feature, which demonstrates that the VLAD module effectively extracts the high-level semantic feature. (3) PointwiseNet achieves state-of-the-art performance even with a much smaller input data size ( $1024\times 3$ points). This result again confirms the effectiveness of the low-level geometric feature and high-level semantic feature extracted by our network in the 3D model analysis.

6.1.4 Pooling operations

In this section, to study the influence of different pooling operations (max-pooling and average-pooling), we quantitatively evaluated the proposed approach on the ModelNet40 benchmark for 3D object classification. The pooling operations have three effects. First, they reduce the computational complexity caused by the high-dimensional feature vector. Second, they cause the global feature to retain more semantic information. Finally, they make our model invariant to input permutations.
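The permutation invariance of both pooling operations can be checked numerically. The sketch below uses random placeholder features rather than features produced by the network, and pools over the point axis.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal((1024, 256))    # placeholder per-point features
shuffled = features[rng.permutation(1024)]     # same points in a different order

# Pooling over the point axis gives the same result for any point ordering.
max_invariant = np.allclose(features.max(axis=0), shuffled.max(axis=0))
avg_invariant = np.allclose(features.mean(axis=0), shuffled.mean(axis=0))
```

Both operations are symmetric functions of their inputs, so the choice between them affects accuracy but not the invariance.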

As shown in Fig. 2, our model includes two pooling operations, namely, local pooling and global pooling implemented as either max-pooling or average-pooling. We compare the different pooling combination strategies in Table 2.

Table 2
Comparison of different pooling combination strategies on ModelNet40

Local pooling		Global pooling		Accuracy (overall)
Average	Max	Average	Max
✓		✓		88.43
	✓	✓		89.28
✓			✓	89.89
	✓		✓	90.86

As shown, in both the local pooling layer and global pooling layer, the use of average-pooling does not contribute to the accuracy improvement; instead, it degrades the performance. In contrast, the use of max-pooling in both layers contributes to the performance improvement. Based on these results, we adopt max-pooling to extract both local features and global features in this study.

6.2 3D object classification

Table 3
ModelNet shape classification. Comparison of the accuracy of the proposed model with the state-of-the-art models. Our network achieves better performance on ModelNet10, and it achieves state-of-the-art performance on ModelNet40 compared to other deep networks on 3D input. The top 2 ranked values are highlighted in bold and the first and second are shown in red and blue, respectively

			Accuracy
Network	Representation	Input	ModelNet10		ModelNet40
			(avg. class)	(overall)	(avg. class)	(overall)
MVCNN [21]	2D images	$80\times(164^{2})$	–	–	89.7	92.0
3DShapeNets [24]	3D volumetric grid	$30^{3}$	83.5	–	77.3	–
OctNet [26]	3D volumetric grid	$128^{3}$	90.1	90.9	83.8	86.5
VoxNet [25]	3D volumetric grid	$32^{3}$	92.0	–	83.0	–
O-CNN [27]	3D volumetric grid	$64^{3}$	–	–	–	90.6
RGCNN [47]	Points $+$ normal	$1024\times(3+3)$	–	–	87.3	90.5
PointNet $++$ [40]	Points $+$ normal	$5000\times(3+3)$	–	–	–	91.9
So-Net [48]	Points $+$ normal	$5000\times(3+3)$	95.5	95.7	90.8	93.4
ECC [49]	Points	$1000\times 3$	90.0	90.8	83.2	87.4
PointNet [39]	Points	$1024\times 3$	–	–	86.2	89.2
DeepSets [50]	Points	$5000\times 3$	–	–	–	90.0
PointNet $++$ [40]	Points	$1024\times 3$	–	–	–	90.7
Kd-Net (depth 10) [42]	Points	$2^{15}\times 3$	92.8	93.3	86.3	90.6
Kd-Net (depth 15) [42]	Points	$2^{15}\times 3$	93.5	94.0	88.5	91.8
So-Net [48]	Points	$2048\times 3$	93.9	94.1	87.3	90.9
PointwiseNet (1024)	Points	$1024\times 3$	94.1	94.5	89.1	91.3
PointwiseNet (2048)	Points	$2048\times 3$	94.7	95.0	89.1	91.6
PointwiseNet (5000)	Points	$5000\times 3$	94.8	95.1	90.0	92.7

In this section, we show the efficiency of PointwiseNet when applied to representation learning and feature extraction from 3D point clouds. We quantitatively evaluated the proposed approach on the ModelNet10 and ModelNet40 benchmarks (the same dataset as in Section 6.1) and compared it with several state-of-the-art methods (including MVCNN [21], 3DShapeNets [24], OctNet [26], VoxNet [25], O-CNN [27], RGCNN [47], PointNet $++$ [40], So-Net [48], ECC [49], PointNet [39], DeepSets [50], and Kd-Net [42]). The network configuration of the classification model is the same as that described in Section 6.1, except for the number of epochs; here, the optimization terminates after approximately 250 epochs. Our results are shown in Table 3.

From Table 3, PointwiseNet outperforms all the other methods that use only points as input on the ModelNet10 dataset. With an input of $1024\times 3$ points, its overall accuracy is 3.7 percentage points higher than that of ECC [49], which uses $1000\times 3$ points. PointwiseNet also uses a smaller input than So-Net [48] ( $2048\times 3$ points), yet it still outperforms it (94.5% versus 94.1% overall accuracy). Compared with the volumetric methods, PointwiseNet achieves better performance, with an overall accuracy 3.6 percentage points higher than that of OctNet [26], which uses $128^{3}$ voxels.

On the ModelNet40 dataset, PointwiseNet achieves state-of-the-art performance among methods based on 3D input (3D volumetric grid). There is only a small gap between PointwiseNet and MVCNN [21], which we believe arises because MVCNN is pretrained on ImageNet images from 1000 categories and fine-tuned on all 2D views of the 3D shapes in the training set. However, MVCNN requires multi-view images of the 3D data to be prepared, which makes it computationally more expensive. We conclude that PointwiseNet can achieve performance similar to that of the mature multi-view CNN. Compared with point-based methods, our results are better than all previous single-model results except for Kd-Net (depth 15) [42], but the input data size required by PointwiseNet ( $1024\times 3$ ) is only 1/32 of that required by Kd-Net (depth 15) ( $2^{15}\times 3$ ).

From Table 3, PointNet $++$ [40] and So-Net [48] significantly improve their ModelNet shape classification accuracy by increasing the input data size or adding more point features (such as normal vectors). Accordingly, the last two rows of Table 3 report the classification accuracy of PointwiseNet when the input data size is increased to 2048 and 5000 points, respectively. PointwiseNet also benefits from a larger input, and we believe that a similar improvement would occur when adding more point features. From the perspective of balancing performance and network complexity, we believe that PointwiseNet has more advantages.

6.3 3D object part segmentation

Part segmentation is a challenging task in the 3D object recognition domain, and it is defined as a per-point classification problem. We use the segmentation model discussed in Section 3 to predict the part label of each point in a 3D point cloud object (e.g., in an aircraft, each point can correspond to the body, wings, tail or engine). Similar to [39], we utilize the intersection over union (IoU) of each category as the evaluation metric. The IoU of each shape is averaged over the IoU of each part that occurs in this shape. The mean IoU of each category is obtained by averaging the IoUs of all the shapes in the category. The overall mean IoU can then be calculated by averaging the IoUs of all categories.
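The averaging scheme for the IoU metric can be sketched as follows. The function names and the data layout (a dictionary mapping each category to a list of prediction and ground truth label arrays) are assumptions for illustration, as is the choice to treat a part as occurring in a shape when it appears in the ground truth.

```python
import numpy as np

def shape_iou(pred, target, part_ids):
    """Average IoU over the parts that occur in one shape (per-point label arrays)."""
    ious = []
    for part in part_ids:
        in_target = target == part
        if not in_target.any():
            continue  # assumption: a part "occurs" if it appears in the ground truth
        in_pred = pred == part
        intersection = np.logical_and(in_pred, in_target).sum()
        union = np.logical_or(in_pred, in_target).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))

def mean_iou(shapes_by_category, parts_by_category):
    """Per-category mean IoU, then the overall mean over categories."""
    per_category = {
        cat: float(np.mean([shape_iou(p, t, parts_by_category[cat]) for p, t in shapes]))
        for cat, shapes in shapes_by_category.items()
    }
    overall = float(np.mean(list(per_category.values())))
    return per_category, overall

# Toy example: one category with two parts and a single four-point shape.
pred = np.array([0, 1, 1, 1])
target = np.array([0, 0, 1, 1])
per_category, overall = mean_iou({"plane": [(pred, target)]}, {"plane": [0, 1]})
```

In the toy example, part 0 has an IoU of 1/2 and part 1 an IoU of 2/3, so the shape IoU is their average.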

6.3.1 Dataset

ShapeNet [1] was released in May 2015, and the repository has already been widely used by hundreds of groups in academia and industry. ShapeNet is used for 3D reconstruction, 3D shape analysis and synthesis, 3D printing, and scanning data analysis. In this study, we adopted ShapeNet to evaluate our architecture for part segmentation. ShapeNet contains 16,881 shapes represented as separate point clouds from 16 categories with per-point annotations (with 50 parts in total). In this dataset, both the categories and the parts within the categories are highly imbalanced, which poses a challenge for all methods, including ours. To convert a CAD model into a point cloud, we use the same strategy as that discussed in Section 6.2 to uniformly sample 2,048 points for each object.

Table 4
Part segmentation results on the ShapeNet-core dataset. Intersection over union (IoU) is reported as the evaluation metric. The top 3 ranked values are highlighted in bold and the first, second and third places are shown in red, blue and green, respectively

	Mean	Plane	Bag	Cap	Car	Chair	Earphone	Guitar	Knife	Lamp	Laptop	Motor	Mug	Pistol	Rocket	Skate	Table
# dataset		2690	76	55	898	3758	69	787	392	1547	451	202	184	283	66	152	5271
# test set		341	14	11	158	704	14	159	80	286	83	51	38	44	12	31	848
Wu et al. [51]	–	63.2	–	–	–	73.5	–	–	–	74.4	–	–	–	–	–	–	74.8
3DCNN [39]	79.4	75.1	72.8	73.3	70.0	87.2	63.5	88.4	79.6	74.4	93.9	58.7	91.8	76.4	51.2	65.3	77.1
ShapeNet [1]	81.4	81.0	78.4	77.7	75.7	87.6	61.9	92.0	85.4	82.5	95.7	70.6	91.9	85.9	53.1	69.8	75.3
Kd-Net [42]	82.3	80.1	74.6	74.3	70.3	88.6	73.5	90.2	87.2	81.0	94.9	57.4	86.7	78.1	51.8	69.9	80.3
PointNet [39]	83.7	83.4	78.7	82.5	74.9	89.6	73.0	91.5	85.9	80.8	95.3	65.2	93.0	81.2	57.9	72.8	80.6
RGCNN [47]	84.3	80.2	82.8	92.6	75.3	89.2	73.7	91.3	88.4	83.3	96.0	63.9	95.7	60.9	44.6	72.9	80.4
3DmFVNet [52]	84.3	82.0	84.3	86.0	76.9	89.9	73.9	90.8	85.7	82.6	95.2	66.0	94.0	82.6	51.5	73.5	81.8
So-Net [48]	84.6	81.9	83.5	84.8	78.1	90.8	72.2	90.1	83.6	82.3	95.2	69.3	94.2	80.0	51.6	72.1	82.6
PointNet $++$ [40]	85.1	82.4	79.0	87.7	77.3	90.8	71.8	91.0	85.9	83.7	95.3	71.6	94.1	81.3	58.7	76.4	82.6
Ours	85.1	82.9	80.7	87.8	76.6	90.8	79.2	91.0	86.6	83.3	95.3	71.9	94.4	80.9	62.0	75.1	82.5

Figure 8.

Results of part segmentation on the validation data of the ShapeNet part dataset.

6.3.2 Implementation details

The network configuration of the segmentation model is set as follows. The $k$ value of the nearest neighbour search in the KNN module is set to 6. The number of visual words in the VLAD module is 16, and the top- $K$ setting is 3. Batch normalization and ReLU activation are applied to all the fully connected layers. We optimize PointwiseNet using the Adam optimizer with a momentum of 0.9, a weight decay of 0.0005, and a batch size of 32. The dropout ratio is 0.8. The initial learning rate is 0.001, which decreases by a factor of 0.5 after every 20 epochs. The optimization terminates after approximately 200 epochs.

6.3.3 Results

We compare our network with Wu et al. [51], 3DCNN [39], ShapeNet [1], Kd-Net [42], PointNet [39], RGCNN [47], 3DmFVNet [52], So-Net [48], and PointNet $++$ [40]. We report both the IoUs (%) of each category and the overall mean IoU (%) in Table 4.

From the experimental results, PointwiseNet matches the overall mean IoU of PointNet $++$ [40] (85.1%), which is higher than that of the other networks. PointwiseNet also performs better than PointNet $++$ on certain categories (plane, cap, earphone, knife, motor, mug and rocket). However, relative to the best results in Table 4, PointwiseNet does not perform well on categories such as bag, car, guitar and pistol.

Compared with PointNet $++$ , PointwiseNet has the following two advantages: (1) PointwiseNet does not augment the input points with additional normal information. (2) PointNet $++$ is a more complex architecture; PointwiseNet is simpler, requires fewer parameters and has a lower computational cost.

Some segmentation results from PointwiseNet are visualized in Fig. 8. The examples are plane, bag, cap, car, chair, earphone, guitar, knife, lamp, laptop, motor, mug, pistol, rocket, skate and table. Due to space limitations, it is impossible to show all the examples; thus, we randomly selected a model from each category for visual comparison. For each group of objects, the leftmost one is the ground truth, the middle one was predicted by PointNet [39], and the right one was predicted by PointwiseNet. The details of the segmentation results in Fig. 8 show that PointwiseNet segments more accurately than PointNet. For example, in the second case of the first line (bag), PointNet incorrectly splits part of the tape while PointwiseNet does not. Moreover, in the second case of the fifth line (lamp), the segmentation result of PointwiseNet is closer to the ground truth.

6.4 Semantic segmentation in scenes

To validate the suitability of PointwiseNet for large-scale point cloud analyses, we also conducted experiments on a semantic scene labeling task.

6.4.1 Dataset

S3DIS [31] contains 3D scans from Matterport scanners in 6 areas, including 271 rooms. Each point in the scan is annotated with one of the semantic labels from 13 categories (chair, table, floor, wall and so forth plus clutter). Each point is represented by a 9-dimensional vector of XYZ, RGB and normalized location within the room (from 0 to 1). To perform scene segmentation, the scene is divided into blocks of one square meter (measured on the floor), and each block is sampled to 4096 points and fed into the network. The predictions for all the blocks are then assembled to obtain the prediction of the entire scene.
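The block preparation can be sketched as below. The helper name `split_into_blocks`, the grid origin at the minimum XY coordinate, and sampling with replacement when a block holds fewer than 4096 points are assumptions; the text does not specify these details.

```python
import numpy as np

def split_into_blocks(points, block_size=1.0, num_samples=4096, seed=0):
    """points: (N, C) array whose first two columns are X and Y in meters.

    Returns a dict mapping a block's (i, j) grid cell to a (num_samples, C) array.
    """
    rng = np.random.default_rng(seed)
    xy_min = points[:, :2].min(axis=0)
    # Index of the floor cell that each point falls into.
    cell = np.floor((points[:, :2] - xy_min) / block_size).astype(np.int64)
    blocks = {}
    for key in np.unique(cell, axis=0):
        idx = np.flatnonzero(np.all(cell == key, axis=1))
        # Assumption: sample with replacement when the block is too sparse.
        chosen = rng.choice(idx, size=num_samples, replace=idx.size < num_samples)
        blocks[tuple(key.tolist())] = points[chosen]
    return blocks
```

Per-block predictions can then be written back to the original points through the sampled indices to assemble the scene-level result.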

6.4.2 Implementation details

Because semantic segmentation is similar to part segmentation, our network can easily be extended to semantic scene segmentation, where point labels become semantic object classes rather than object part labels. The input to the semantic segmentation network is the 9-dimensional vector of XYZ, RGB and normalized room location described above. We remove the STN and KNN modules, preserving only the VLAD module. With this modification, our network is able to predict per-point semantic object classes by relying on both local and global features. The number of visual words for the VLAD module is set to 13, and the top- $K$ is set to 3. Batch normalization and ReLU activation are applied to all fully connected layers. We optimize PointwiseNet using the Adam optimizer with a momentum of 0.9, a weight decay of 0.0005, and a batch size of 20. The initial learning rate is 0.001. The optimization stops after approximately 50 epochs.

To ensure a fair comparison, we conducted 6-fold cross-validation in our experiment following PointNet [39]. Specifically, the dataset is divided into 6 splits: 5 are used for training, and one is used for testing, resulting in 6 models. Finally, we calculated the average IoU and the overall segmentation accuracy over the 6 models.
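The cross-validation protocol can be sketched as follows, assuming that the 6 splits coincide with the 6 areas of S3DIS, as in the PointNet protocol; the function name `area_folds` is hypothetical.

```python
def area_folds(num_areas=6):
    """Return (training_areas, test_area) pairs, holding out one area per fold."""
    areas = list(range(1, num_areas + 1))
    return [([a for a in areas if a != held_out], held_out) for held_out in areas]

folds = area_folds()
for train_areas, test_area in folds:
    # One model is trained per fold; metrics are averaged over the 6 models.
    print(f"train on areas {train_areas}, test on area {test_area}")
```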

6.4.3 Results

The results show the feasibility of our PointwiseNet for semantic segmentation. From Table 5, PointwiseNet outperforms PointNet [39], SegCloud [53], RSNet [54] and A-SCN [55], achieving 54.56% on the average-class metric and an overall accuracy of 83.36%.

Table 5
Comparison of scene segmentation on the S3DIS dataset. The evaluation metric is the average IoU over 13 classes (structural and furniture elements plus clutter) and the classification accuracy is calculated on points

Network	Accuracy
	(avg. class)	(overall)
PointNet [39]	47.71	78.62
SegCloud [53]	48.92	–
RSNet [54]	51.93	–
A-SCN [55]	52.72	81.59
PointwiseNet	54.56	83.36

Figure 9.

Qualitative results for semantic segmentation. From left to right: original input scenes; ground truth point cloud segmentation; PointNet [39] segmentation results and PointwiseNet segmentation results.

The visualization of semantic parsing is shown in Fig. 9. We selected 5 room scenes (from top to bottom: conference room #1, office #1, office #3, lounge #1, and lobby #1) from the evaluation dataset for display. The first column is the input point cloud, with the walls and ceiling hidden for clarity. The second, third, and last columns are the ground truth segmentation, the prediction from PointNet [39], and the prediction from PointwiseNet, respectively, where the points belonging to different semantic regions are coloured differently (chairs in red, tables in purple, bookcase in green, floors in blue, clutter in black, beam in yellow, board in grey, and doors in khaki). A comparison of the segmentation results with the ground truth segmentation shows that PointwiseNet correctly recognizes the objects in the indoor scenes.

As shown in Fig. 9, the segmentation accuracy of PointwiseNet is substantially better than that of PointNet [39]. For example, in the lower right corner of the third scene, the segmentation results of PointwiseNet are closer to the ground truth than those of PointNet. Moreover, in the lower left corner of the fifth scene, PointNet recognizes the trash can as a chair while our method does not.

6.5 Commercial 3D CAD model retrieval

Currently, 3D CAD models are widely available, but a method for retrieving 3D CAD models is essential for managing and analysing such models. The key to 3D CAD model retrieval is to generate a compact and informative feature for each 3D CAD model and then use the feature to retrieve the most similar 3D CAD models. Given a query 3D CAD model and a 3D CAD model library, the similarity between the query and the candidates can be computed from their feature vector distances. The retrieval set of a query 3D CAD model is constructed by collecting all the 3D CAD models with the same label and then sorting them by the feature vector distance between the query 3D CAD model and the retrieved 3D CAD model.
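Ranking candidates by feature distance can be sketched as follows, assuming the Euclidean (L2) distance between feature vectors. The function name `retrieve` and the toy two-dimensional features are illustrative only; real features would come from the trained network.

```python
import numpy as np

def retrieve(query_feature, library_features, top_n=5):
    """Return indices and distances of the top_n library models closest to the query."""
    # L2 distance between the query and every library feature vector.
    distances = np.linalg.norm(library_features - query_feature, axis=1)
    order = np.argsort(distances)[:top_n]
    return order, distances[order]

library = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.1, 0.0]])  # toy features
query = np.array([0.0, 0.0])
indices, dists = retrieve(query, library, top_n=3)
```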

Table 6
Comparison of our approach and the state-of-the-art approaches

Network	Input feature	Classifier	Avg. correct rate
Ip et al. [56]	Enhanced shape distribution	kNN	72.30
Ip and Regli [57]	Curvatures	SVM	75.33
Hou et al. [58]	MGP geometric ratios, and principal moments	SVM	88.24
Qin et al. [19]	Fourier descriptor	Deep neural network	97.29
Qin et al. [19]	Modified light field descriptor	Deep neural network	98.64
PointwiseNet	–	Deep neural network	99.32

Figure 10.

The retrieval results. Leftmost column: queries. Right five columns: retrieved models from the 3D CAD model database.

6.5.1 Dataset

To evaluate the performance of PointwiseNet, a commercial database was used as the test dataset. The commercial 3D CAD model database [32] was generated from several mechanical manufacturing enterprises for CAD model retrieval tasks over several recent years. The models were designed by experienced engineers using mainstream commercial CAD toolkits, such as SolidWorks, Pro/Engineer, CATIA, and UG NX. The database includes a total of 7,464 models in 28 generic categories: gears, screws, nuts, springs, wheels, keys, bearing houses, flanges, washers, and so forth. The mechanical part catalogue is used as a reference for selecting those categories. The entire model dataset is divided into 5,990 samples for training, 737 samples for validation, and 737 samples for testing. We first converted the STEP-based model to an STL-based model, and then converted the STL-based model to a point cloud using the same strategy as discussed in Section 6.2.

6.5.2 Implementation details

We trained our network (with the same architecture as our classification network) as the feature extractor and retrieved the nearest neighbours based on the L2 distance. We set the $k$ value of the nearest neighbour search in the KNN module to 6, the number of visual words in the VLAD module to 16, and the top- $K$ value to 3. Batch normalization and ReLU activation were applied to all the fully connected layers. We optimized PointwiseNet using the Adam optimizer with a momentum of 0.9, a weight decay of 0.0005, and a batch size of 32. The initial learning rate was 0.001, which was decreased by a factor of 0.7 after every 20 epochs. The optimization terminated after approximately 200 epochs.

6.5.3 Results

We tested PointwiseNet on the commercial 3D CAD model database and compared the results with those of the state-of-the-art methods. The quantitative results are presented in Table 6. The second column shows the features used to describe the 3D models. PointwiseNet directly consumes the point cloud, and it requires no handcrafted features. The third column shows the classifiers, and the average correct rates of each approach are compared in the rightmost column. As shown in the table, PointwiseNet performs better than all the other approaches. The advantage of PointwiseNet is that there is no need to extract handcrafted features from the commercial 3D CAD model. Instead, we need only convert the 3D CAD model into point cloud data to achieve the commercial 3D CAD model retrieval task. Therefore, PointwiseNet not only avoids the cost of storing handcrafted features but also effectively improves the retrieval precision.

Some retrieval results of PointwiseNet on the commercial 3D CAD models are visualized in Fig. 10. The first column shows the query shapes. For each query in the test set, a retrieval list (columns 2–6) is returned, which is ordered by feature similarity. PointwiseNet achieved 100% recognition accuracy for 25 of the 28 categories in the test dataset. Among the 737 samples in the test dataset, PointwiseNet incorrectly recognized only 5 models, which corresponds to the average correct rate of 99.32% in Table 6. These results show that PointwiseNet achieves the highest commercial 3D CAD model retrieval performance among all the compared methods.

7. Conclusion

In this manuscript, we present a simple end-to-end network for 3D shape analysis named PointwiseNet. PointwiseNet combines pointwise low-level geometric and high-level semantic features with the help of three modules: STN, KNN, and VLAD. The STN module makes the network invariant to input rotation. The KNN and VLAD modules extract the low-level geometric information and high-level semantic information for each point of the 3D point cloud, respectively. To impart PointwiseNet with translation invariance and fidelity to the 3D input cloud, the KNN module performs the uniform operation and a concatenation operation. Furthermore, we present the VLAD mechanism to extract high-level semantic information, which is indirectly described by the relationship of each point’s low-level geometric descriptor to a few visual words. Moreover, PointwiseNet solves the disorder problem of point cloud data based on pointwise features and pooling operations.

Overall, the proposed PointwiseNet is simple, effective, end-to-end, requires fewer parameters, and is robust to input noise because it learns semantic information from 3D point clouds. We conducted extensive experiments on a number of benchmark datasets (ModelNet, ShapeNet, S3DIS and the commercial 3D CAD model database), and the results show that PointwiseNet achieves state-of-the-art performance.

Inspired by the recently popularized graph neural networks, in future work, we plan to consider building a deeper network composed of VLAD layers and learning the semantic relationships between distinct points by utilizing graph neural networks.

Footnotes

https://github.com/djzgroup/PointwiseNet.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants 61702350 and 61472289 and in part by the Open Project Program of the State Key Laboratory of Digital Manufacturing Equipment and Technology, HUST, under Grant DMETKF2017016.

References

Kim

Ceylan

Shen

Yan

, et al. A scalable active framework for region annotation in 3d shape collections. ACM Transactions on Graphics (TOG). 2016; 35(6): 210.

Zhang

Han

. Quantitative optimization of interoperability during feature-based data exchange. Integrated Computer-Aided Engineering. 2016; 23(1): 31–50.

Zhang

Han

Zou

Chen

. An efficient approach to directly compute the exact Hausdorff distance for 3D point sets. Integrated Computer-Aided Engineering. 2017; 24(3): 261–277.

Kang

Cha

. Autonomous UAVs for structural health monitoring using deep learning and an ultrasonic beacon system with geo-tagging. Computer-Aided Civil and Infrastructure Engineering. 2018; 33(10): 885–902.

Rafiei

Khushefati

Demirboga

Adeli

. Supervised deep restricted boltzmann machine for estimation of concrete. ACI Materials Journal. 2017; 114(2).

Rafiei

Adeli

. A novel unsupervised deep learning model for global and local health condition assessment of structures. Engineering Structures. 2018; 156: 598–607.

Adeli

. Neural networks in civil engineering: 1989–2000. Computer-Aided Civil and Infrastructure Engineering. 2001; 16(2): 126–142.

Adeli

Jiang

. Dynamic fuzzy wavelet neural network model for structural system identification. Journal of Structural Engineering. 2006; 132(1): 102–111.

Adeli

Panakkat

. A probabilistic neural network for earthquake magnitude prediction. Neural Networks. 2009; 22(7): 1018–1024.

10.

Yang

Luo

Huang

Yang

. Automatic pixel-level crack detection and measurement using fully convolutional network. Computer-Aided Civil and Infrastructure Engineering. 2018; 33(12): 1090–1109.

11.

Cha

Choi

Suh

Mahmoudkhani

Büyüköztürk

. Autonomous structural visual inspection using region-based deep learning for detecting multiple damage types. Computer-Aided Civil and Infrastructure Engineering. 2018; 33(9): 731–747.

12.

Xue

. A fast detection method via region-based fully convolutional neural networks for shield tunnel lining defects. Computer-Aided Civil and Infrastructure Engineering. 2018; 33(8): 638–654.

13.

Gao

Mosalam

. Deep transfer learning for image-based structural damage recognition. Computer-Aided Civil and Infrastructure Engineering. 2018; 33(9): 748–768.

14.

Nabian

Meidani

. Deep learning for accelerated seismic reliability analysis of transportation networks. Computer-Aided Civil and Infrastructure Engineering. 2018; 33(6): 443–458.

15.

Hashemi

Abdelghany

. End-to-end deep learning methodology for real-time traffic network management. Computer-Aided Civil and Infrastructure Engineering. 2018; 33(10): 849–863.

16.

Zhang

Ren

Sun

. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. pp. 770–778.

17.

Greff

Srivastava

Koutník

Steunebrink

Schmidhuber

. LSTM: a search space odyssey. IEEE Transactions on Neural Networks and Learning Systems. 2017; 28(10): 2222–2232.

18.

Wang

Yuan

. Simultaneously discovering and localizing common objects in wild images. IEEE Transactions on Image Processing. 2018; 27(9): 4503–4515.

19.

Qin

Lu-Ye

Gao

Yang

Chen

. A deep learning approach to the classification of 3D CAD models. Journal of Zhejiang Universityence C. 2014; 15(2): 91–106.

20. Liu, Han. Learning high-level feature by deep belief networks for 3-D model retrieval and recognition. IEEE Transactions on Multimedia. 2014; 16(8): 2154–2167.

21. Maji, Kalogerakis, Learned-Miller. Multi-view convolutional neural networks for 3d shape recognition. In: Proceedings of the IEEE International Conference on Computer Vision; 2015. pp. 945–953.

22. Nießner, Dai, Yan, Guibas. Volumetric and multi-view cnns for object classification on 3d data. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. pp. 5648–5656.

23. Pang, Neumann. 3d point cloud object detection with multi-view convolutional neural network. In: 2016 23rd International Conference on Pattern Recognition (ICPR). IEEE; 2016. pp. 585–590.

24. Song, Khosla, Zhang, Tang, et al. 3d shapenets: A deep representation for volumetric shapes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2015. pp. 1912–1920.

25. Maturana, Scherer. VoxNet: A 3D Convolutional Neural Network for real-time object recognition. In: IEEE/RSJ International Conference on Intelligent Robots and Systems; 2015. pp. 922–928.

26. Riegler, Osman Ulusoy, Geiger. Octnet: Learning deep 3d representations at high resolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017. pp. 3577–3586.

27. Wang, Liu, Guo, Sun, Tong. O-cnn: octree-based convolutional neural networks for 3d shape analysis. ACM Transactions on Graphics (TOG). 2017; 36(4): 72.

28. Kazhdan, Funkhouser, Rusinkiewicz. Rotation invariant spherical harmonic representation of 3D shape descriptors. In: Symposium on Geometry Processing. Vol. 6; 2003. pp. 156–164.

29. Connor, Kumar. Fast construction of k-nearest neighbor graphs for point clouds. IEEE Transactions on Visualization and Computer Graphics. 2010; 16(4): 599–608.

30. Angelina Uy, Hee Lee. Pointnetvlad: Deep point cloud based retrieval for large-scale place recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018. pp. 4470–4479.

31. Armeni, Sener, Zamir, Jiang, Brilakis, Fischer, et al. 3d semantic parsing of large-scale indoor spaces. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. pp. 1534–1543.

32. Qin, Gao, Yang, Bai, Zhao. A sketch-based semantic retrieval approach for 3D CAD models. Applied Mathematics-A Journal of Chinese Universities. 2017; 32(1): 27–52.

33. Sun, Ovsjanikov, Guibas. A concise and provably informative multi-scale signature based on heat diffusion. In: Computer Graphics Forum. Vol. 28. Wiley Online Library; 2009. pp. 1383–1392.

34. Knopp, Prasad, Willems, Timofte, Van Gool. Hough transform and 3D SURF for robust three dimensional classification. In: European Conference on Computer Vision. Springer; 2010. pp. 589–602.

35. Chen, Tian, Shen, Ouhyoung. On visual similarity based 3D model retrieval. In: Computer Graphics Forum. Vol. 22. Wiley Online Library; 2003. pp. 223–232.

36. Krizhevsky, Sutskever, Hinton. Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems; 2012. pp. 1097–1105.

37. Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, et al. Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2015. pp. 1–9.

38. Xie, Girshick, Dollár. Aggregated residual transformations for deep neural networks. In: Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on. IEEE; 2017. pp. 5987–5995.

39. Charles, Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition; 2017. pp. 77–85.

40. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In: Advances in Neural Information Processing Systems; 2017. pp. 5099–5108.

41. Sun, Chen. PointCNN: Convolution On X-Transformed Points. In: Advances in Neural Information Processing Systems; 2018. pp. 828–838.

42. Klokov, Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. In: Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE; 2017. pp. 863–872.

43. Jaderberg, Simonyan, Zisserman, et al. Spatial transformer networks. In: Advances in Neural Information Processing Systems; 2015. pp. 2017–2025.

44. Jégou, Douze, Schmid, Pérez. Aggregating local descriptors into a compact image representation. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE; 2010. pp. 3304–3311.

45. Arandjelovic, Gronat, Torii, Pajdla, Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2016. pp. 5297–5307.

46. Cignoni, Callieri, Corsini, Dellepiane, Ganovelli, Ranzuglia. Meshlab: an open-source mesh processing tool. In: Eurographics Italian Chapter Conference. Vol. 2008; 2008. pp. 129–136.

47. Zheng, Guo. RGCNN: Regularized Graph CNN for Point Cloud Segmentation. In: Proceedings of the 26th ACM International Conference on Multimedia. MM ’18. New York, NY, USA: ACM; 2018. pp. 746–754. Available from: http://doi.acm.org/10.1145/3240508.3240621.

48. Chen, Hee Lee. So-net: Self-organizing network for point cloud analysis. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018. pp. 9397–9406.

49. Simonovsky, Komodakis. Dynamic edge-conditioned filters in convolutional neural networks on graphs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2017. pp. 3693–3702.

50. Zaheer, Kottur, Ravanbakhsh, Poczos, Salakhutdinov, Smola. Deep Sets. In: Guyon, Luxburg, Bengio, Wallach, Fergus, Vishwanathan, et al., editors. Advances in Neural Information Processing Systems 30. Curran Associates, Inc.; 2017. pp. 3391–3401. Available from: http://papers.nips.cc/paper/6931-deep-sets.pdf.

51. Shou, Wang, Liu. Interactive shape co-segmentation via label propagation. Computers & Graphics. 2014; 38: 248–254.

52. Ben-Shabat, Lindenbaum, Fischer. 3DmFV: three-dimensional point cloud classification in real-time using convolutional neural networks. IEEE Robotics and Automation Letters. 2018; 3(4): 3145–3152.

53. Tchapmi, Choy, Armeni, Gwak, Savarese. Segcloud: Semantic segmentation of 3d point clouds. In: 2017 International Conference on 3D Vision (3DV). IEEE; 2017. pp. 537–547.

54. Huang, Wang, Neumann. Recurrent slice networks for 3d segmentation of point clouds. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018. pp. 2626–2635.

55. Xie, Liu, Chen. Attentional ShapeContextNet for Point Cloud Recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; 2018. pp. 4606–4615.

56. Regli, Sieger, Shokoufandeh. Automated learning of model classifications. In: Proceedings of the Eighth ACM Symposium on Solid Modeling and Applications. ACM; 2003. pp. 322–327.

57. Yiu Ip, Regli. Content-based classification of CAD models with supervised learning. Computer-Aided Design and Applications. 2005; 2(5): 609–617.

58. Hou, Lou, Ramani. SVM-based semantic clustering and retrieval of a 3D model database. Computer-Aided Design and Applications. 2005; 2(1-4): 155–164.

59. Russakovsky, Deng, Krause, Satheesh, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision. 2015; 115(3): 211–252.

60. Qin, Gao, Yang, Chen. A deep learning approach to the classification of 3D CAD models. Journal of Zhejiang University SCIENCE C. 2014; 15(2): 91–106.

61. Eigen, Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In: Proceedings of the IEEE International Conference on Computer Vision; 2015. pp. 2650–2658.

62. Gong, Wang, Guo, Lazebnik. Multi-scale orderless pooling of deep convolutional activation features. In: European Conference on Computer Vision. Springer; 2014. pp. 392–407.

63. Csurka, Dance, Fan, Willamowski, Bray. Visual categorization with bags of keypoints. In: Workshop on Statistical Learning in Computer Vision, ECCV. Vol. 1. Prague; 2004. pp. 1–2.

64. Engelcke, Rao, Wang, Tong, Posner. Vote3deep: Fast object detection in 3d point clouds using efficient convolutional neural networks. In: 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE; 2017. pp. 1355–1361.

65. Pirk, Guibas. Fpnn: Field probing neural networks for 3d data. In: Advances in Neural Information Processing Systems; 2016. pp. 307–315.

66. Lenc, Vedaldi. Understanding image representations by measuring their equivariance and equivalence. In: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE; 2015. pp. 991–999.

