SS3D: A Semi-Supervised Learning Approach for Accurate Three-Dimensional Shape Reconstruction

Abstract

The challenge of reconstructing three-dimensional (3D) models from images lies in how to infer a complete 3D structure with detailed geometry from two-dimensional (2D) images. Single-view reconstruction requires only one image to reason about the 3D shape, which gives it significant application potential. However, most existing methods rely on fully supervised learning, which demands large amounts of labeled data. Semi-supervised learning strategies reduce this dependence on annotated data and offer a more efficient solution for single-view 3D reconstruction. We propose a semi-supervised single-view 3D point cloud reconstruction framework that employs a teacher-student paradigm to leverage limited labeled data together with abundant unlabeled samples. To address the modality heterogeneity between 2D images and 3D point clouds, a Heterogeneous Feature Attention Mechanism is designed to align cross-modal features and embed image cues into 3D spatial structures, preserving both geometric and appearance information. Moreover, a Self-Attention Decoder captures global dependencies and salient regions, enabling fine-grained structural recovery. Our model achieves an L1-Chamfer distance (CD) of 5.60 $\times 10^{-2}$ on the ShapeNet dataset and an L1-CD of 6.29 $\times 10^{-2}$ on the Pix3D dataset. Ablation studies further confirm the effectiveness of our approach.

Keywords

three-dimensional point cloud reconstruction; semi-supervision; cross-modal fusion; single-view reconstruction

1. Introduction

Single-view three-dimensional (3D) reconstruction is an important research topic in the field of computer vision, aiming to infer and reconstruct the corresponding 3D structure from a single two-dimensional (2D) image. This task holds significant application value in various practical domains such as augmented reality, robotic navigation, and medical imaging (Gerbaud et al., 2024; Liu et al., 2024; Wu et al., 2024). In recent years, the development of deep learning has greatly advanced progress in single-view 3D reconstruction. Triplane Meets Gaussian Splatting (Zou et al., 2024) achieves high-quality and efficient 3D reconstruction by integrating explicit point clouds with implicit triplane features, while leveraging Transformer and differentiable Gaussian point cloud rendering. Gamba (Shen et al., 2025) proposes an end-to-end Mamba-based efficient Gaussian point cloud generation method, which significantly improves reconstruction speed while maintaining accuracy.

Despite the remarkable progress that single-view 3D reconstruction has achieved in recent years under the impetus of deep learning (Ding et al., 2025a, 2025b; Jin et al., 2024; Melas-Kyriazi et al., 2023; Szymanowicz et al., 2024), this task still faces numerous challenges. First, there exists a highly complex and nonlinear mapping relationship between 2D images and 3D structures, which makes it difficult for models to capture complete and stable geometric correspondences during the learning process (Jin et al., 2024). Second, the spatial information carried by a single image is inherently limited, lacking both depth and multi-view constraints, which often leads to uncertainty and ambiguity in the reconstructed 3D results (Szymanowicz et al., 2024). This issue becomes particularly pronounced when dealing with fine structural details or occluded scenes. Third, acquiring large-scale and high-quality 3D annotated data is extremely costly, requiring not only complex acquisition devices and labor-intensive annotation processes but also being constrained by the diversity and precision of the data, which severely restricts the broader adoption and application of supervised learning methods (Melas-Kyriazi et al., 2023). In summary, how to alleviate modality discrepancies, enhance geometric consistency, and improve the representation of fine details under limited annotation conditions remains a central challenge in single-view 3D reconstruction research.

In the field of 2D image classification, semi-supervised learning (SSL) has already demonstrated excellent performance by combining a small amount of labeled data with a large quantity of unlabeled data (Vanyan & Khachatrian, 2021). This approach effectively alleviates the reliance on large-scale annotated datasets by mining the latent information within unlabeled data, thereby reducing annotation costs while maintaining high accuracy. It provides new opportunities for single-view 3D reconstruction, reducing the dependence on labeled data while still achieving high precision. We introduce the SSL paradigm into point cloud-based single-view 3D reconstruction, thereby improving the reconstruction quality of models in 3D reconstruction tasks.

Our main contributions are summarized as follows:

We propose a single-view semi-supervised point cloud reconstruction method. With a teacher–student architecture, it effectively leverages limited labeled data together with large-scale unlabeled data, improving the structural consistency and stability of single-view reconstruction;

In this method, we design a heterogeneous feature attention mechanism for 2D–3D feature alignment and a decoder architecture integrated with self-attention to model cross-modal interactions and global dependencies, thereby enhancing 3D structure recovery capability;

The L1 Chamfer distance (CD) value on the ShapeNet dataset is 5.60 $\times 10^{- 2}$ , and the L1-CD value on the Pix3D dataset is 6.29 $\times 10^{- 2}$ , indicating that our method can generate high-quality 3D point clouds and validating the effectiveness of SSL for single-view reconstruction.

2. Related Work

2.1. Deep Learning for 3D Reconstruction

3D reconstruction is an important research direction in computer vision, with the goal of reconstructing the 3D structure of objects from 2D images. Existing approaches can be broadly divided into two categories: Explicit representations and implicit representations.

Explicit representation methods mainly include three forms: Voxels, meshes, and point clouds. In voxel-based methods, 3D-R2N2 (Choy et al., 2016) employs a 2D convolutional network to encode 2D images into low-dimensional embeddings, processes the embeddings with 3D LSTM, and then decodes them into voxel grids through a 3D convolutional network. CIGNet (Gao et al., 2023) integrates category priors and intrinsic geometric relationships, utilizing reconstruction and refinement modules to generate progressively detailed 3D reconstruction results. However, the memory and computational cost of voxel-based methods grow cubically with resolution, which severely limits the achievable output resolution (Tiong et al., 2022; Xie et al., 2019, 2020a; Yagubbayli et al., 2021). In mesh-based methods, Pixel2Mesh (Wang et al., 2018) employs a mesh deformation network to capture semantic information from 2D images and gradually deform an initial ellipsoid into the target shape. Std-Net (Mao et al., 2021) combines an autoencoder with a topology-adaptive graph convolutional network, enabling the reconstruction of diverse objects with complex topologies. However, mesh-based approaches often struggle to represent internal or irregular structures (Wen et al., 2022a; Yang et al., 2023; Zhang et al., 2024). In point cloud-based methods, PSGN (Fan et al., 2017) effectively addresses the permutation invariance problem of point clouds by adopting CD and Earth Mover’s Distance as loss functions. Part-Wise AtlasNet (Yu et al., 2022) employs a generator–discriminator framework and introduces adversarial loss to enhance global semantic consistency. Such point cloud-based methods usually offer the advantages of low memory consumption and strong detail representation (Afifi et al., 2020).

Implicit representation methods, on the other hand, establish continuous 3D representations through neural radiance fields (NeRFs) and differentiable rendering. NeRF (Barron et al., 2021; Hong et al., 2023; Lin et al., 2023; Metzer et al., 2023; Mildenhall et al., 2021; Yu et al., 2021) achieves high-quality 3D reconstruction by minimizing the discrepancy between synthesized images and real images. Recent studies (Jun & Nichol, 2023; Nichol et al., 2022; Poole et al., 2023) further integrate diffusion models for single-view 3D model generation. Zero-1-to-3 (Liu et al., 2023b) leverages a stable diffusion model to perform novel view synthesis conditioned on relative camera poses. One-2-3-45 (Liu et al., 2023a), an improved version of Zero-1-to-3, extracts 2D features from multi-view images and constructs a 3D cost volume with camera poses to infer geometric structures. Wonder3D (Long et al., 2024) fuses image and text embeddings with camera parameters to produce consistent multi-view representations and employs a novel fusion algorithm to reconstruct high-fidelity textured meshes.

Explicit methods rely heavily on large-scale annotated 3D data, making it difficult to scale them to complex and diverse scenarios. Implicit methods often adopt self-supervision or weak supervision, but they usually require a large number of multi-view images along with camera pose information, which leads to poor performance under conditions with limited viewpoints. These shortcomings highlight that achieving high-quality single-view 3D reconstruction with limited annotations and restricted viewpoints remains an urgent challenge to be addressed.

2.2. Semi-Supervised Deep Learning

The core challenge of SSL lies in how to effectively leverage large amounts of unlabeled data for training under the condition of limited labeled data. Existing methods can generally be divided into two categories: Entropy minimization (Grandvalet & Bengio, 2004; Lee, 2013; Pham et al., 2021; Zoph et al., 2020) and consistency regularization (Berthelot et al., 2019a, 2019b; Gong et al., 2021; Miyato et al., 2018; Xie et al., 2020b). Entropy minimization methods originate from self-training, with the core idea of assigning pseudo-labels to unlabeled data and combining these pseudo-labels with manually annotated data for further training. Consistency regularization assumes that the predictions of unlabeled data should remain unchanged under different perturbations. Therefore, data augmentation is often introduced to expand the training distribution, thereby improving the generalization ability and robustness of the algorithm. Common augmentation techniques include random flipping, geometric transformations, and image contrast adjustments. Meanwhile, more sophisticated augmentation strategies also exist, such as Cutout (DeVries & Taylor, 2017), which achieves effective perturbation by randomly masking local regions.

In the task of single-view 3D reconstruction, there has already been preliminary exploration of the feasibility of semi-supervised paradigms. The study in Yang et al. (2018) was the first to achieve 3D reconstruction with limited annotated data, and it introduced additional camera poses to mitigate issues of pose invariance and viewpoint consistency. Subsequently, Semi-Supervised Soft Rasterizer (Laradji et al., 2021) adopted a Siamese network structure and incorporated image silhouettes as part of the unsupervised loss, thereby improving reconstruction performance. SSP3D (Xing et al., 2022) proposed a semi-supervised 3D reconstruction framework that relies solely on single-view images. By introducing explicit shape priors, a shape discriminator, and a prototype shape prior module, it achieved voxel-based 3D generation.

Although the aforementioned methods have validated, to varying degrees, the potential of SSL for 3D reconstruction, most existing research has focused on voxel or mesh representations, while semi-supervised 3D reconstruction based on point cloud representations is still in its early exploratory stage. Given the advantages of point cloud methods in memory efficiency and geometric detail representation, this direction holds significant research value and application potential.

3. Method

As shown in Figure 1, the SS3D framework consists of two training phases: A pretraining phase and a semi-supervised teacher–student phase. In the first phase, the model is trained on a labeled dataset $D_{L} = \{(x_{i}^{l}, y_{i}^{l})\}$ to learn the mapping between images and point clouds, thereby obtaining an initialized model. In the second phase, the student model is trained in a semi-supervised manner under the guidance of the pretrained teacher model. Both the teacher network and the student network share the same architecture as the model in the pretraining phase. The teacher model generates pseudo-labels from weakly augmented unlabeled data $D_{U} = \{x_{i}^{u}\}$ and provides them as supervisory signals to guide the training of the student model. The parameters of the student model are initialized with the pretrained weights of the teacher model and are trained using strong augmentations and feature perturbations. Meanwhile, the teacher model is continuously updated through an exponential moving average (EMA) of the student parameters, enabling effective utilization of unlabeled samples and progressively enhancing the overall performance of 3D reconstruction.

Figure 1.

Illustration of the SS3D framework. It consists of two phases: In the pretraining phase, a spherical shell point cloud with 1024 points is used to train the network, achieving effective alignment between images and point clouds and obtaining an initialized model. In the semi-supervised teacher–student phase, the teacher and student networks share the same architecture as the model in the pretraining stage. The teacher network generates pseudo-labels based on weak augmentation, while the student network learns through strong augmentation and feature perturbation. The teacher weights are continuously updated through EMA, thereby improving the accuracy and robustness of 3D reconstruction.

Next, we first introduce the pretraining phase (Section 3.1), then explain the semi-supervised teacher–student framework (Section 3.2), and finally describe the design and optimization of the loss functions (Section 3.3).

3.1. Pretraining Phase

As shown in Figure 1, SS3D is composed of four main modules: an image encoder, an attribute flow encoder, a shape-matching transformer, and a self-attention decoder. Recent studies have shown that hybrid deep learning models that combine structures such as CNNs and Transformers exhibit strong feature representation capability and robustness in complex tasks. For example, Kumar et al. (2025) employed a hybrid model integrating ResNet-50 and a Transformer in an Internet of Things security scenario, demonstrating the effectiveness of this type of architecture for feature extraction. Therefore, we use a pretrained ResNet-50 in the image encoder to extract visual features, adopt PointTransform in the attribute flow encoder to extract point cloud features, and establish cross-modal associations between image features and point cloud features. Building on this, the shape-matching transformer further integrates geometric and semantic information, progressively deforming and refining the initial point cloud to obtain a geometric representation that better aligns with the target structure. Finally, the self-attention decoder models global dependencies on top of the fused features, thereby generating high-precision 3D point cloud outputs.

Since single-view input lacks depth information, 3D reconstruction often suffers from the problem of different 3D structures sharing the same 2D projection. To alleviate this issue, it is common to introduce an initial point cloud prior to feature extraction. Afifi et al. (2020) proposed using a uniformly distributed spherical point cloud as the initial shape to improve the accuracy of point cloud generation. Therefore, we employ a spherical shell point cloud consisting of 1024 points as the initial input to more comprehensively cover the object surface, thereby enhancing the capacity to capture geometric features. The initial point cloud is composed of uniformly sampled points on a spherical shell; it contains neither the geometric nor semantic information of a specific object, but serves as a structurally neutral geometric initialization. A self-attention-based point cloud encoding module is directly applied to model inter-point geometric relationships, so that, guided by image features, geometric deformation and structural refinement are progressively carried out.
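The structurally neutral initialization described above can be illustrated with a short NumPy sketch. The sampling scheme (normalizing Gaussian samples), the unit radius, and the function name `sample_sphere_shell` are assumptions made for illustration; the text only states that 1024 points are uniformly sampled on a spherical shell.

```python
import numpy as np

def sample_sphere_shell(num_points=1024, radius=1.0, seed=0):
    """Draw points uniformly on the surface of a sphere (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Normalizing isotropic Gaussian samples yields directions that are
    # uniformly distributed on the unit sphere.
    directions = rng.standard_normal((num_points, 3))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # The radius is an assumed value; the source does not specify it.
    return radius * directions

init_points = sample_sphere_shell()
```

Every point lies at the same distance from the origin, so the cloud carries no object-specific geometry before image features guide its deformation.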

Attribute Flow Encoder. The attribute flow encoder is designed to achieve cross-modal fusion between image features and point cloud features, providing both geometric and semantic support for subsequent shape matching and decoding. It receives image features from the image encoder, as well as geometric features obtained by feeding the initial point cloud into a PointTransform-based encoder. While the image features contain rich appearance and semantic information, the point cloud features provide geometric priors of the object, and their combination establishes the foundation for cross-modal fusion. Through a heterogeneous feature attention mechanism, correspondences between image and point cloud features are established, enabling modality alignment and information interaction. Subsequently, the Geometry Module extracts geometric information from the fused features and generates shape adjustment parameters $\{γ, β\}$ to guide shape matching. At the same time, the Semantic Module learns high-level semantic features $S$, which provide semantic constraints and global contextual support for the decoding process.

Heterogeneous Feature Attention. Cross-modal interactions are employed to establish associations between images and point clouds, thereby improving the quality of the reconstructed 3D models. Unlike traditional cross-attention mechanisms with one-way feature input, heterogeneous feature attention adopts a symmetric dual-branch design so that point cloud features and image features are updated bidirectionally. The query $Q$, key $K$, and value $V$ do not come from the same modality-encoding space; instead, they are drawn from the 2D semantic space and the 3D geometric space, respectively. Specifically, the point cloud features $H_{p}$ obtained by feeding the initial point cloud into a PointTransform-based encoder and the image features $H_{i}$ extracted by a ResNet-based image encoder are linearly projected into the query $Q$, key $K$, and value $V$, as shown in Equations (1), (2), and (3). Here, $H_{p} = x_{pc}$ and $H_{i} = x_{img}$.

\begin{aligned} Q_{p}, Q_{i} & = W_{p}^{Q} H_{p}, W_{i}^{Q} H_{i} \end{aligned}

(1)

\begin{aligned} K_{p}, K_{i} & = W_{p}^{K} H_{p}, W_{i}^{K} H_{i} \end{aligned}

(2)

\begin{aligned} V_{p}, V_{i} & = W_{p}^{V} H_{p}, W_{i}^{V} H_{i} \end{aligned}

(3)

Subsequently, the dot product between the queries ( $Q$ ) and keys ( $K$ ) of the point cloud and image features is computed, scaled by $\sqrt{d}$, and normalized with the softmax function to obtain the attention weights. This step enables the model to assess the correlations between the two modalities, thereby facilitating the effective fusion of feature sequences. Using the attention weights, the values ( $V$ ) of each modality are weighted and fused, achieving cross-modal information propagation. As shown in equations (4) and (5), $Δ H_{p}$ represents the information propagated from the image to the point cloud, while $Δ H_{i}$ represents the information propagated from the point cloud to the image.

\begin{aligned} Δ H_{p} & = softmax (\frac{Q_{p} K_{i}^{T}}{\sqrt{d}}) V_{i} \end{aligned}

(4)

\begin{aligned} Δ H_{i} & = softmax (\frac{Q_{i} K_{p}^{T}}{\sqrt{d}}) V_{p} \end{aligned}

(5)

Equations (1)–(5) describe the process of the heterogeneous feature attention mechanism. To further enhance the expressive power of the model, we adopt a multi-head attention mechanism (set to 8 heads), where each attention head independently learns relationships in different subspaces. The concatenated results are then fused through a linear mapping, yielding richer cross-modal representations. Finally, the features of the current modality are updated using the information propagated from the other modality, as shown in equations (6) and (7).

\begin{aligned} Δ H_{p\_cross} & = LayerNorm (H_{p} + Δ H_{p}) \end{aligned}

(6)

\begin{aligned} Δ H_{i\_cross} & = LayerNorm (H_{i} + Δ H_{i}) \end{aligned}

(7)

After the heterogeneous feature attention, we introduce a feedforward network composed of a single linear layer and a ReLU activation function to enhance the nonlinear expressive capability of the features. By applying residual connections and normalization, the cross-modally updated features are combined with the feedforward outputs, thereby producing more precise and stable representations, as shown in equations (8) and (9).

\begin{aligned} H_{p\_final} & = LayerNorm (Δ H_{p\_cross} + Feedforward (Δ H_{p\_cross})) \end{aligned}

(8)

\begin{aligned} H_{i\_final} & = LayerNorm (Δ H_{i\_cross} + Feedforward (Δ H_{i\_cross})) \end{aligned}

(9)
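To make the data flow of the heterogeneous feature attention concrete, the following is a minimal single-head NumPy sketch. It rests on several assumptions: values are taken from the attended (other) modality, which is what the description of image-to-point and point-to-image propagation implies; layer normalization has no learnable scale or shift; weights are random; and the feature width `d = 16` and token counts are hypothetical. The code uses the row-vector convention `H @ W` instead of the column-vector form $W H$ used in the equations, and it is not the multi-head implementation described in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # LayerNorm without learnable scale/shift (a simplification).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def cross_attend(h_query, h_other, w_q, w_k, w_v):
    # Queries come from one modality; keys and values from the other.
    q, k, v = h_query @ w_q, h_other @ w_k, h_other @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights

def feed_forward(x, w, b):
    # Single linear layer followed by ReLU.
    return np.maximum(x @ w + b, 0.0)

rng = np.random.default_rng(0)
d = 16                                   # hypothetical feature width
h_p = rng.standard_normal((64, d))       # point cloud features H_p
h_i = rng.standard_normal((49, d))       # image features H_i
w = {name: rng.standard_normal((d, d)) / np.sqrt(d)
     for name in ("pq", "pk", "pv", "iq", "ik", "iv", "fp", "fi")}
b = np.zeros(d)

# Image -> point cloud and point cloud -> image propagation.
delta_p, attn_p = cross_attend(h_p, h_i, w["pq"], w["ik"], w["iv"])
delta_i, attn_i = cross_attend(h_i, h_p, w["iq"], w["pk"], w["pv"])

# Residual + LayerNorm, then feedforward + residual + LayerNorm.
h_p_cross = layer_norm(h_p + delta_p)
h_i_cross = layer_norm(h_i + delta_i)
h_p_final = layer_norm(h_p_cross + feed_forward(h_p_cross, w["fp"], b))
h_i_final = layer_norm(h_i_cross + feed_forward(h_i_cross, w["fi"], b))
```

Each modality keeps its own sequence length: the point branch stays at 64 tokens and the image branch at 49, while the attention matrices have shapes 64 × 49 and 49 × 64.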

In the Geometry Module, the features $H_{p\_final}$ and $H_{i\_final}$ obtained from the heterogeneous feature attention layer are first concatenated and then fed into a multilayer perceptron (MLP). The features output by the MLP are reshaped to generate the parameters $\{γ, β\}$ for adaptive instance normalization (AdaIN), as shown in equation (10):

\begin{aligned} \{γ, β\} = MLP (H_{p\_final} : H_{i\_final}) \end{aligned}

(10)

where “:” denotes feature concatenation. The resulting $\{γ, β\}$ are ultimately fed into the shape-matching transformer to guide the reconstruction of the 3D model.

In the Semantic Module, the extracted image features are first compressed into an attribute encoding $z$ through an MLP. Following the idea of EigenGAN (He et al., 2021), the deeper semantic properties of the $j$-th dimension $z_{j}$ of the attribute encoding $z$ are explored through the orthogonal basis of the semantic subspace $U = \{u_{j}\}$, where $u_{j} \in \mathbb{R}^{N \times C}$ and $C$ denotes the number of feature dimensions. The specific formulation is as follows:

\begin{aligned} {\hat{z}}_{j} = ℓ_{j} z_{j} u_{j} \end{aligned}

(11)

where $ℓ_{j}$ is a learnable weight that indicates the importance of $z_{j}$ within its corresponding semantic subspace. The results across all dimensions are then aggregated in a weighted manner to obtain the semantic feature $S$:

\begin{aligned} S = \sum_{j} {\hat{z}}_{j} + b \end{aligned}

(12)

In this equation, $b$ represents the bias term. The final semantic feature $S$ provides global semantic constraints for the subsequent decoding process.
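Equations (11) and (12) amount to a weighted sum of basis elements. The sketch below illustrates this with NumPy, assuming the basis is stored as an array of shape (J, N, C) and that the importance weights and bias are plain arrays; the function name `semantic_feature` and the toy values are hypothetical.

```python
import numpy as np

def semantic_feature(z, basis, importance, bias):
    """Compute S = sum_j l_j * z_j * u_j + b.

    z:          (J,) attribute encoding
    basis:      (J, N, C) stacked basis elements u_j
    importance: (J,) learnable weights l_j
    bias:       scalar or array broadcastable to (N, C)
    """
    coeff = importance * z                       # l_j * z_j for each j
    # Contract over j: sum_j coeff[j] * basis[j] gives an (N, C) array.
    return np.tensordot(coeff, basis, axes=1) + bias

# Toy example with J = 2 basis elements of shape (3, 4).
basis = np.stack([np.ones((3, 4)), 2.0 * np.ones((3, 4))])
z = np.array([1.0, 2.0])
importance = np.array([0.5, 1.0])
S = semantic_feature(z, basis, importance, bias=0.1)
```

In this toy case each entry of `S` equals 0.5 · 1 · 1 + 1.0 · 2 · 2 + 0.1 = 4.6.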

Shape-Matching Transformer. A spherical-shell point cloud containing 1024 points is used as the input, and the corresponding point features are output to guide the progressive deformation and reconstruction of the point cloud. We introduce Point-transformer as the core unit of the shape-matching transformer. Point-transformer can dynamically model relationships between points within local neighborhoods, thereby enhancing the flexibility and accuracy of point cloud feature representations. For clarity, we denote the point features output by Point-transformer as $Q = \{q_{k}\}$.

The shape-matching transformer takes the geometric features $\{γ, β\}$ and semantic features $S$ as inputs. First, AdaIN is applied to geometrically modulate the point cloud features:

\begin{aligned} {\hat{q}}_{k} = γ_{k} \cdot \frac{q_{k} - μ (q_{k})}{σ (q_{k})} + β_{k} \end{aligned}

(13)

where $k$ denotes the $k$-th point, and $q_{k}$ represents the point cloud feature. $μ (q_{k})$ and $σ (q_{k})$ denote the mean and standard deviation of this point along the feature dimensions, respectively.
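The AdaIN modulation of equation (13) can be sketched in a few lines of NumPy. The small constant `eps` is added here for numerical stability and is not part of the equation; the feature width and the constant values chosen for `gamma` and `beta` are hypothetical stand-ins for the output of the Geometry Module.

```python
import numpy as np

def adain(q, gamma, beta, eps=1e-5):
    """Normalize each point feature along its channels, then scale and shift."""
    mu = q.mean(axis=-1, keepdims=True)      # per-point mean over features
    sigma = q.std(axis=-1, keepdims=True)    # per-point std over features
    return gamma * (q - mu) / (sigma + eps) + beta

rng = np.random.default_rng(0)
q = rng.standard_normal((1024, 32))          # point features q_k
gamma = np.full((1024, 1), 2.0)              # hypothetical scale per point
beta = np.full((1024, 1), 3.0)               # hypothetical shift per point
q_hat = adain(q, gamma, beta)
```

After modulation, each point feature has mean `beta` and standard deviation close to `gamma` along the feature axis, which is how the geometric parameters steer the point features.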

Subsequently, the geometrically modulated point features are fused with the semantic feature $S$ through an MLP layer to obtain the point cloud feature $P_{k}$ :

\begin{aligned} P_{k} = {\hat{q}}_{k} + ϕ (S ∣ θ_{S}) \end{aligned}

(14)

where $ϕ$ denotes the MLP layer, and $θ_{S}$ represents the parameter weights of the MLP layer used to generate $P_{k}$, which maps the semantic feature into a space consistent with the point features. The final point cloud features are then processed by the self-attention decoder to generate high-quality 3D point cloud representations.

Self-Attention Decoder. Traditional MLP decoders for point clouds usually rely on local feature mappings, making it difficult to effectively model global contextual relationships, which in turn limits their performance on complex geometric structures. Moreover, due to the absence of a feature selection mechanism, MLPs often lack adaptability across different semantic regions, leading to insufficient detail recovery. In contrast, the self-attention mechanism can compute global correlations between points and adaptively adjust the distribution of feature weights, thereby capturing long-range dependencies and strengthening the modeling of critical geometric and semantic regions. Although self-attention was initially overlooked in early studies due to its computational complexity and risk of overfitting, its advantages have since been validated with the advent of more efficient attention variants and carefully designed architectures. Therefore, we introduce self-attention modules into the decoder as a replacement for the traditional MLP structure, enabling global feature aggregation and context-aware point cloud decoding, ultimately producing more refined and consistent 3D reconstruction results.

Specifically, we adopt a multi-head self-attention mechanism to capture global context and inter-feature relationships, combined with layer normalization to enhance the generalization ability of the model. In this module, the point cloud feature vector $P_{k}$ is first projected into query ( $Q$ ), key ( $K$ ), and value ( $V$ ) vectors:

\begin{aligned} \begin{cases} Q = W_{Q} \cdot P_{k} \\ K = W_{K} \cdot P_{k} \\ V = W_{V} \cdot P_{k} \end{cases} \end{aligned}

(15)

where $W_{Q}$, $W_{K}$, and $W_{V}$ are learnable projection matrices. Next, the dot-product similarity between the queries and keys is computed, followed by scaling and Softmax normalization to obtain the attention weights:

\begin{aligned} Attention (Q, K, V) = softmax (\frac{Q \cdot K^{T}}{\sqrt{d_{k}}}) \cdot V \end{aligned}

(16)

In this equation, $d_{k}$ denotes the dimensionality of the key vectors, which is used to scale the dot-product similarity. Finally, the point cloud displacements obtained from the self-attention mechanism are added to the initial point cloud $P_{o}$ , generating the final reconstructed point cloud $P$ :

\begin{aligned} P = P_{o} + MLP (Attention (Q, K, V)) \end{aligned}

(17)
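A single-head NumPy sketch of equations (15)–(17) is given below. It is illustrative only: the paper uses multi-head attention with layer normalization, whereas this version uses one head, random weights, and a single linear map `w_out` standing in for the MLP. All sizes and names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decode_points(p_feat, p_init, w_q, w_k, w_v, w_out):
    """Predict per-point displacements with self-attention and add them to p_init."""
    q, k, v = p_feat @ w_q, p_feat @ w_k, p_feat @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))   # (N, N) global weights
    context = attn @ v                               # attended point features
    offsets = context @ w_out                        # linear stand-in for the MLP
    return p_init + offsets

rng = np.random.default_rng(0)
n, d = 128, 16                                       # hypothetical sizes
p_feat = rng.standard_normal((n, d))                 # point features P_k
p_init = rng.standard_normal((n, 3))                 # initial point cloud P_o
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
w_out = 0.01 * rng.standard_normal((d, 3))
points = decode_points(p_feat, p_init, w_q, w_k, w_v, w_out)
```

Because the decoder predicts displacements, a zero output layer leaves the initial cloud unchanged, so the network only has to learn how far each point must move.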

3.2. Semi-supervised Teacher–student Framework

In the second stage, based on the pretrained teacher model, the student model performs SSL under the teacher’s guidance. As shown in Figure 2, we introduce perturbations simultaneously at both the image level and the feature level to expand the perturbation space and enhance the granularity of the supervisory signals. Compared with applying perturbations only at the image level, this approach provides the student model with richer and more effective supervisory signals, helping it better understand and learn complex patterns within the data.

Figure 2.

Feature perturbation method. “FP” denotes feature perturbation, the green line represents the unsupervised loss, and the orange line represents the supervised loss.

In the semi-supervised teacher–student framework, after the teacher model converges during the pretraining phase, it generates pseudo-labels for unlabeled samples to supervise the training of the student model. The student model is initialized with the weights of the teacher model and is designed with three training branches, using the unlabeled dataset $D_{U}$ and fused point clouds to train the student network. First, the input image is weakly augmented to obtain a weakly perturbed image $X_{w}$. Then, $X_{w}$ is fed into the image encoder to generate the weak feature $e_{w}$, while feature perturbation produces the perturbed feature $e_{fp}$, as shown in equation (18).

\begin{aligned} \begin{cases} e_{w} = g (X_{w}) \\ e_{fp} = FP (e_{w}) \end{cases} \end{aligned}

(18)

where $g$ denotes the image encoder, and FP represents feature perturbation, such as the addition of uniform noise. Subsequently, the two sets of features are further processed by the network to obtain the pseudo-label $P_{w}$ and the feature-perturbed point cloud $P_{fp}$.
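As a small illustration of the feature perturbation FP, the sketch below adds uniform noise to an encoded feature. The noise range `scale = 0.1` and the feature shape are hypothetical; the text only mentions uniform noise as one possible perturbation.

```python
import numpy as np

def feature_perturbation(features, scale=0.1, rng=None):
    """Add element-wise uniform noise in [-scale, scale] (assumed form of FP)."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.uniform(-scale, scale, size=features.shape)
    return features + noise

rng = np.random.default_rng(0)
e_w = rng.standard_normal((49, 16))          # stand-in for g(X_w)
e_fp = feature_perturbation(e_w, scale=0.1, rng=rng)
```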

At the same time, the input image undergoes two different strong data augmentations, producing strongly perturbed images $X_{s 1}$ and $X_{s 2}$ . After being processed by the network, the corresponding perturbed point clouds $P_{s 1}$ and $P_{s 2}$ are obtained. The point clouds generated from the three types of perturbations, together with the pseudo-label, are jointly used to train the student model. To improve the quality of the pseudo-labels, we adopt an EMA strategy to dynamically update the teacher model parameters throughout the training process. The update rule is given by:

\begin{aligned} Ψ_{tea} = α Ψ_{tea} + (1 - α) Ψ_{stu} \end{aligned}

(19)

where $α$ is the momentum coefficient, which is gradually increased to 1 through cosine scheduling to ensure stability during training. With this joint perturbation and dynamic updating mechanism, the model can fully exploit the limited labeled data along with large amounts of unlabeled data, thereby significantly enhancing the robustness and generalization ability of 3D reconstruction.
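The EMA update of equation (19) and a cosine momentum schedule can be sketched as follows. The exact schedule is not given in the text, so the half-cosine form below is an assumption, as is the use of 0.9996 (the momentum value reported in the experimental setup) as the starting point. Parameters are represented as a plain dictionary of NumPy arrays for illustration.

```python
import math
import numpy as np

def cosine_momentum(step, total_steps, base=0.9996, final=1.0):
    """Increase the momentum from `base` to `final` along a half cosine (assumed form)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return final - (final - base) * (math.cos(math.pi * progress) + 1.0) / 2.0

def ema_update(teacher, student, alpha):
    """In-place update: teacher = alpha * teacher + (1 - alpha) * student."""
    for name in teacher:
        teacher[name] = alpha * teacher[name] + (1.0 - alpha) * student[name]

teacher = {"w": np.array([1.0, 1.0])}
student = {"w": np.array([0.0, 2.0])}
ema_update(teacher, student, alpha=0.9)
```

As the momentum approaches 1, the teacher changes more slowly, which keeps the pseudo-labels stable late in training.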

3.3. Loss Function Design and Optimization

The training process is divided into two phases. First, in the pretraining phase, we train the teacher model on the labeled dataset $D_{L}$ by jointly using reconstruction loss and regularization loss. In the 3D reconstruction network, both the reconstruction predictions and the ground truth are represented in the form of point clouds, which are used to compute the reconstruction loss. We adopt the L1 Chamfer loss as the reconstruction loss function:

\begin{aligned} L_{CD} (P_{r}, P_{gt}) & = \frac{1}{2 N} \sum_{p_{r} \in P_{r}} min_{p_{gt} \in P_{gt}} {‖ p_{r} - p_{gt} ‖}_{2} + \frac{1}{2 N} \sum_{p_{gt} \in P_{gt}} min_{p_{r} \in P_{r}} {‖ p_{gt} - p_{r} ‖}_{2} \end{aligned}

(20)

where, $P_{r}$ and $P_{gt}$ denote the predicted point cloud and the ground-truth point cloud, respectively, while $N$ represents the number of points in the point cloud.
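For reference, equation (20) can be computed with a brute-force pairwise distance matrix. This NumPy sketch assumes both clouds fit in memory as an N × M matrix; when the two clouds contain different numbers of points, it averages each direction over its own cloud, which coincides with equation (20) when both contain $N$ points.

```python
import numpy as np

def chamfer_l1(pred, gt):
    """Symmetric L1 Chamfer distance between point clouds of shape (N, 3) and (M, 3)."""
    diff = pred[:, None, :] - gt[None, :, :]       # (N, M, 3) pairwise differences
    dist = np.linalg.norm(diff, axis=-1)           # (N, M) Euclidean distances
    pred_to_gt = dist.min(axis=1).mean()           # nearest ground-truth point per prediction
    gt_to_pred = dist.min(axis=0).mean()           # nearest prediction per ground-truth point
    return 0.5 * (pred_to_gt + gt_to_pred)

pred = np.array([[0.0, 0.0, 0.0]])
gt = np.array([[3.0, 4.0, 0.0]])
distance = chamfer_l1(pred, gt)
```

In the one-point example the two clouds are 5 units apart, so both directional terms equal 5 and the distance is 5.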

To ensure the orthogonality of $U_{i}$ , we further introduce a regularization loss:

\begin{aligned} L_{orth} = \sum_{i = 1}^{3} {‖ U_{i} U_{i}^{⊤} - I_{n_{basis}} ‖}_{F}^{2} . \end{aligned}

(21)

where $U_{i} \in \mathbb{R}^{n_{basis} \times d}$ denotes the basis-matrix parameters of the subspace layer, $‖ \cdot ‖_{F}$ denotes the Frobenius norm, and $I_{n_{basis}}$ is the $n_{basis} \times n_{basis}$ identity matrix. This regularization term is used to constrain the row vectors of $U_{i}$ to be approximately orthonormal ($U_{i} U_{i}^{⊤} \approx I$).
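The orthogonality regularizer of equation (21) is straightforward to express in NumPy. The sketch below accepts any list of basis matrices; the toy matrices used in the example are hypothetical.

```python
import numpy as np

def orthogonality_loss(bases):
    """Sum over bases of the squared Frobenius norm of U U^T - I."""
    total = 0.0
    for u in bases:
        gram = u @ u.T                               # (n_basis, n_basis)
        total += np.sum((gram - np.eye(u.shape[0])) ** 2)
    return total

orthonormal = np.eye(2, 4)                           # rows are orthonormal
loss_zero = orthogonality_loss([orthonormal] * 3)
loss_scaled = orthogonality_loss([2.0 * orthonormal])
```

Orthonormal rows give zero loss, while scaling the rows by 2 gives $U U^{⊤} = 4I$ and a loss of $2 \times 3^{2} = 18$.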

The total training loss in the pretraining phase is expressed as:

\begin{aligned} L_{s} = L_{CD} + ξ L_{orth} \end{aligned}

(22)

where $ξ$ is the balancing factor, set to 100.

In the second phase, the student model is optimized using supervised and unsupervised losses. The teacher model does not participate in backpropagation or gradient updates; however, its parameters are iteratively updated via an EMA of the student model parameters to generate more stable pseudo-label supervision signals. For supervised data, we use the loss function shown in Equation (20). For unlabeled data, the following loss function is used:

\begin{aligned} L_{un} & = λ L_{CD} (P_{w}, P_{fp}) + \frac{v}{2} \left( L_{CD} (P_{w}, P_{s 1}) + L_{CD} (P_{w}, P_{s 2}) \right) \end{aligned}

(23)

where $P_{w}$, $P_{fp}$, $P_{s 1}$, and $P_{s 2}$ denote the weakly perturbed point cloud, the feature-perturbed point cloud, and the two strongly perturbed point clouds, respectively; $λ$ and $v$ are weighting coefficients, both set to 0.5.
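The weighting in Equation (23) can be sketched as follows. The compact `chamfer` helper restates Equation (20), and the shifted toy clouds stand in for the perturbed predictions; both are assumptions for illustration rather than the authors' implementation.

```python
import math


def chamfer(a, b):
    """Compact L1 Chamfer distance (Equation (20)) for equal-sized point lists."""
    def nearest(p, cloud):
        return min(math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q))) for q in cloud)
    return (sum(nearest(p, b) for p in a) + sum(nearest(q, a) for q in b)) / (2 * len(a))


def unsupervised_loss(p_w, p_fp, p_s1, p_s2, lam=0.5, v=0.5):
    """Equation (23): consistency between the weak branch and the perturbed branches."""
    feature_term = lam * chamfer(p_w, p_fp)
    strong_term = 0.5 * v * (chamfer(p_w, p_s1) + chamfer(p_w, p_s2))
    return feature_term + strong_term


def shift_x(cloud, dx):
    """Toy perturbation (assumed): translate every point along x."""
    return [(x + dx, y, z) for x, y, z in cloud]


p_w = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]  # weak-branch output used as pseudo-label
l_un = unsupervised_loss(p_w, shift_x(p_w, 0.1), shift_x(p_w, 0.2), shift_x(p_w, 0.4))
```

Because $λ = v = 0.5$, the feature-perturbed branch carries as much weight as the two strongly perturbed branches combined.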

Finally, the loss function of the student model is defined as follows:

\begin{aligned} L = L_{un} + L_{sup} \end{aligned}

(24)

Through joint training with both supervised and unsupervised losses, the model can effectively leverage labeled and unlabeled data, thereby significantly improving the accuracy of 3D model reconstruction.
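The EMA teacher update used in the second phase can be illustrated with a minimal sketch. Parameters are stored here in plain dictionaries, and the decay value 0.5 is chosen only to make the effect visible; the function name `ema_update` and these values are assumptions, not details from the paper.

```python
def ema_update(teacher, student, alpha):
    """In place: teacher <- alpha * teacher + (1 - alpha) * student."""
    for name, value in student.items():
        teacher[name] = alpha * teacher[name] + (1.0 - alpha) * value
    return teacher


# Toy parameters (assumed values); the student is held fixed for clarity.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 0.0, "b": 1.0}
for _ in range(3):
    ema_update(teacher, student, alpha=0.5)
```

A decay close to 1 makes the teacher change slowly, which is what keeps its pseudo-labels stable while the student is still noisy.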

4. Experiments

4.1. Datasets and Experimental Setup

Datasets. In our experiments, we use the ShapeNet (Chang et al., 2015) and Pix3D (Sun et al., 2018) datasets. For ShapeNet, we select nine categories and randomly split the training set into labeled and unlabeled data according to a 20% labeling ratio. Pix3D is a publicly available dataset that provides precise alignment between real images and 3D models. From Pix3D, we select eight categories and randomly choose 10% of the training set as labeled data, with the remaining samples treated as unlabeled data. Model performance is evaluated using the L1-CD defined in Equation (20).
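A random labeled/unlabeled split of this kind might look like the sketch below. The function name, the fixed seed, and the integer sample IDs are assumptions; the paper does not state whether the split is drawn per category or over the whole training set.

```python
import random


def split_labeled(sample_ids, ratio, seed=0):
    """Randomly mark a fraction `ratio` of the training samples as labeled."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # seeded for a reproducible split
    n_labeled = round(len(ids) * ratio)
    return ids[:n_labeled], ids[n_labeled:]


# Hypothetical training set of 1000 samples with a 20% labeling ratio.
labeled, unlabeled = split_labeled(range(1000), ratio=0.20)
```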

The experimental environment is configured as follows: on the software side, we use Python 3.7.9, the PyTorch deep learning framework, and CUDA 11.7; the operating system is Ubuntu 20.04. The hardware setup includes an Intel(R) Gold 6134 CPU @ 3.20GHz $\times$ 32 processor, 24 GB of memory, and an NVIDIA GeForce RTX 3090 GPU. Model training is conducted in two phases, with batch sizes set to 24 and 14, respectively. The learning rate decays gradually from $1 \times 10^{- 3}$ to $2 \times 10^{- 4}$ . The optimizer is AdamW (Loshchilov & Hutter, 2017), with the momentum parameter $α$ set to 0.9996. The number of heads in the multi-head attention mechanism is 8, and the point cloud size is 2048. Both the pretraining and semi-supervised teacher–student framework phases are trained for 400 epochs.
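The text gives only the endpoints of the learning-rate decay, not its shape. Purely as an assumption, the sketch below uses an exponential interpolation over the 400 epochs; a linear or cosine schedule would satisfy the same endpoints.

```python
def lr_at_epoch(epoch, total_epochs=400, lr_start=1e-3, lr_end=2e-4):
    """Assumed exponential decay from lr_start (first epoch) to lr_end (last epoch)."""
    t = epoch / (total_epochs - 1)  # progress in [0, 1]
    return lr_start * (lr_end / lr_start) ** t


schedule = [lr_at_epoch(e) for e in range(400)]
```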

4.2. Comparative Experiments

We evaluate reconstruction accuracy on the ShapeNet dataset using the L1-CD metric. To ensure both relevance and fairness in evaluation, we make appropriate adjustments to the comparison methods so that they better align with the task requirements of this study. The comparison methods include 3DAttriFlow (Wen et al., 2022b), Ccd-3dr (Di et al., 2023), InversionGAN (Li et al., 2024), and RGB2point (Lee & Benes, 2025).

As shown in Table 1, our method outperforms existing mainstream approaches on the L1-CD metric, demonstrating higher 3D reconstruction accuracy. Our method achieves an average L1-CD value of 5.60 $\times 10^{- 2}$ , which is lower than those of all other methods. Compared with 3DAttriFlow, Ccd-3dr, InversionGAN, and RGB2point, our method reduces the average error by 20.90%, 16.30%, 11.53%, and 7.60%, respectively. This indicates that our approach is able to reconstruct 3D structures and details more accurately.

Table 1.
Comparative Results of 3D Reconstruction on the ShapeNet Dataset with 20% Labeled Data, Evaluated by L1-CD.

Method	Airplane	Bench	Cabinet	Video	Speaker	Rifle	Table	Phone	Vessel	Avg
3DAttriFlow (Wen et al., 2022b)	4.58	6.74	8.76	7.83	10.29	4.50	8.79	6.21	6.04	7.08
Ccd-3dr (Di et al., 2023)	4.07	6.31	8.36	7.42	9.98	4.19	8.43	5.85	5.60	6.69
InversionGAN (Li et al., 2024)	3.72	5.92	8.06	7.03	9.77	3.70	8.02	5.51	5.27	6.33
RGB2point (Lee & Benes, 2025)	3.47	5.61	7.99	6.78	9.47	3.21	7.78	5.19	5.03	6.06
Ours	3.00	5.23	7.56	6.35	9.02	2.72	7.23	4.75	4.62	5.60

Results are reported as $\times 10^{2}$ (lower is better). CD: Chamfer distance; 3D: three-dimensional.
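The relative error reductions quoted for Table 1 can be recomputed from the Avg column. The short sketch below does this; the helper name `reduction_pct` is an assumption, and the results agree with the reported percentages to within rounding of the two-decimal table entries.

```python
def reduction_pct(baseline, ours):
    """Relative error reduction of `ours` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - ours) / baseline


# Avg column of Table 1 (L1-CD multiplied by 10^2).
avg = {"3DAttriFlow": 7.08, "Ccd-3dr": 6.69, "InversionGAN": 6.33, "RGB2point": 6.06}
ours = 5.60
reductions = {name: reduction_pct(value, ours) for name, value in avg.items()}
```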

A further analysis of the results across different categories shows that SS3D achieves lower errors in every category, with particularly noticeable improvements in categories with complex structures (such as Airplane and Vessel). This further demonstrates that SS3D exhibits stronger generalization ability and accuracy in fine-grained reconstruction across different categories.

Qualitative comparison on ShapeNet. As shown in Figure 3, we present a visual comparison of our method with four approaches: 3DAttriFlow (Wen et al., 2022b), Ccd-3dr (Di et al., 2023), InversionGAN (Li et al., 2024), and RGB2point (Lee & Benes, 2025) on the ShapeNet dataset. In terms of generating high-quality 3D models from a single image, our method is able to more finely restore object shapes and preserve richer geometric details compared with these four methods, thereby significantly improving both the visual quality and the accuracy of the reconstruction results.

Figure 3.

Visualization results on the ShapeNet dataset with 20% labeled data.

The SSL strategy adopted in SS3D shows significant advantages in effectively leveraging unlabeled data. Compared with fully supervised methods that rely entirely on labeled data, the semi-supervised approach better captures the latent structure and distribution characteristics of the data, thereby improving the model’s generalization ability and robustness. At the same time, we introduce a heterogeneous feature attention fusion mechanism, which effectively alleviates the modality mismatch between images and point clouds, further enhancing the model’s geometric representation ability and stability.

As shown in Table 2, we evaluate the 3D reconstruction performance of our model on the ShapeNet dataset using 1%, 10%, and 20% of labeled data. The results indicate a consistent downward trend in L1-CD as the proportion of labeled data increases. The lowest value of 5.60 $\times 10^{- 2}$ is achieved with 20% labeled data. Even with only 10% labeled data, our method still achieves a low error of 6.10 $\times 10^{- 2}$ , fully validating the effectiveness of the semi-supervised strategy in 3D reconstruction tasks.

Table 2.

Comparative Results of 3D Reconstruction on the ShapeNet Dataset under Different Proportions of Labeled Data, Evaluated by L1-CD.

Method	1%	10%	20%
Ours	7.09	6.10	5.60

Results are reported as $\times 10^{2}$ (lower is better). CD: Chamfer distance; 3D: three-dimensional.

As shown in Figure 4, we conduct a visual analysis of the reconstruction results under different proportions of labeled data. The results demonstrate that as the amount of labeled data increases, the point clouds generated by the model gradually exhibit more complete and detailed geometric structures. This observation is consistent with the quantitative results in Table 2, where the L1-CD decreases monotonically as the labeling ratio increases.

Figure 4.

Visualization results on the ShapeNet dataset under different proportions of labeled data.

Specifically, when the labeling ratio is only 1%, the model can roughly recover the overall contour of the input image, but the point cloud is relatively sparse, and local details exhibit omissions and distortions. For example, the backrest of a bench appears uneven, the surface structure of a cabinet is incomplete, the shape of table legs is blurred, and the edges of airplane wings are irregular. When the labeling ratio is increased to 10%, the density and geometric integrity of the point cloud improve significantly: the corners of the cabinet become clearer, the connections between table legs are more reasonable, and the wing structures of airplanes become more stable. With a further increase to 20% labeled data, the reconstruction results align closely with the ground truth in both geometric structure and point cloud distribution. For instance, the bench backrest is straighter, the cabinet edges are well defined, the connections between tabletops and legs are precise, and the overall airplane structure is complete, showing strong detail recovery and global consistency. Finally, although the performance improvement for certain categories is relatively small, this also reflects the robustness and consistency of SS3D in handling different types of 3D shapes. In contrast, some supervised methods may perform well on specific categories but poorly on others, leading to greater fluctuations in overall performance.

As shown in Figure 5, we present visualization results of the ShapeNet dataset from different viewing angles. The generated 3D models exhibit high completeness in overall geometric structures, with the main contours of objects clearly represented across all perspectives. For example, the wings, fuselage, and tail fins of airplanes are consistently reconstructed from different viewpoints; furniture objects such as tables and bookshelves not only preserve reasonable overall frameworks but also achieve good restoration of local details; the outlines of vessels and phones are also clear and complete, demonstrating the model’s adaptability to diverse categories. In addition, the point clouds are uniformly distributed overall, without noticeable collapse or large voids. In summary, these results indicate that our method captures macro-level geometric shapes while maintaining strong cross-view consistency and robustness.

Figure 5.

Visualization results of three-dimensional (3D) reconstruction on the ShapeNet dataset from different viewpoints.

The quantitative comparison results on Pix3D are shown in Table 3. We compare our method with Ccd-3dr (Di et al., 2023), InversionGAN (Li et al., 2024), and RGB2point (Lee & Benes, 2025) on the Pix3D dataset, using the L1-CD metric as the evaluation criterion.

Table 3.

Comparative Results of 3D Reconstruction on the Pix3D Dataset with 10% Labeled Data, Evaluated by L1-CD.

Method	Bed	Bookcase	Desk	Misc	Sofa	Table	Tool	Wardrobe	Avg
Ccd-3dr (Di et al., 2023)	7.04	6.23	7.39	13.28	5.24	8.26	8.17	4.26	7.48
InversionGAN (Li et al., 2024)	6.49	5.62	6.49	12.61	4.57	7.73	7.69	3.84	6.88
RGB2point (Lee & Benes, 2025)	6.26	5.31	6.37	12.38	4.39	7.40	7.33	3.52	6.62
Ours	5.80	4.88	6.10	12.16	4.01	7.15	7.03	3.23	6.29

Results are reported as $\times 10^{2}$ (lower is better). CD: Chamfer distance; 3D: three-dimensional.

As shown in Table 3, our method outperforms existing mainstream approaches on the Pix3D dataset in terms of the L1-CD metric, achieving higher 3D reconstruction accuracy. Specifically, our method attains an L1-CD value of 6.29 $\times 10^{- 2}$ , which is lower than those of the other methods. Compared with Ccd-3dr, InversionGAN, and RGB2point, the average error is reduced by 15.90%, 8.58%, and 4.98%, respectively. This demonstrates that our method can also accurately reconstruct the 3D structures and details of objects in real-world scenarios.

Qualitative comparison on Pix3D. As shown in Figure 6, we present visual comparison results between our method and Ccd-3dr (Di et al., 2023), InversionGAN (Li et al., 2024), and RGB2point (Lee & Benes, 2025) on the Pix3D dataset.

Figure 6.

Visualization results on the Pix3D dataset with 10% labeled data.

From Figure 6, it can be observed that although Ccd-3dr and InversionGAN are able to generate the rough shapes of objects, their point cloud distributions are noticeably sparse, with severe losses of local details, such as incomplete table legs, blurred sofa armrests, and indistinct geometric boundaries. RGB2point shows some improvement in maintaining overall shapes, but it still suffers from fractures and collapses in local structure reconstruction, particularly in slender parts such as table legs. In contrast, our method consistently demonstrates clearer geometric structures and more uniform point cloud distributions across reconstruction results of different furniture categories. It effectively restores both object contours and local details, with overall results highly consistent with the ground truth.

In summary, SS3D outperforms existing methods on both the ShapeNet and Pix3D datasets. Even under conditions with limited labeled data, the model achieves lower L1-CD through cross-modal fusion enabled by the heterogeneous feature attention mechanism and the semi-supervised teacher–student framework. Moreover, it demonstrates higher accuracy in maintaining global structures and restoring local details. The experimental results show that SS3D possesses excellent robustness and generalization ability across different categories and data scenarios, fully validating its effectiveness and advantages in single-view 3D reconstruction tasks.

4.3. Ablation Experiments

In this section, we verify the effectiveness of the heterogeneous feature attention mechanism, the semi-supervised teacher–student framework, and the self-attention decoder.

Impact of the heterogeneous feature attention mechanism. To evaluate the effectiveness of the heterogeneous feature attention mechanism in 3D model generation, we design a comparative experiment with an experimental group (using the mechanism) and a control group (without the mechanism). Both models are trained under identical conditions, and L1-CD is used as the evaluation metric for comparison.

As shown in Table 4, when the heterogeneous feature attention mechanism is not introduced, the average L1-CD values on the ShapeNet and Pix3D datasets are 6.20 and 6.88, respectively. After incorporating the heterogeneous feature attention mechanism, the L1-CD values decrease to 5.60 and 6.29, respectively, resulting in significant improvements in reconstruction accuracy. This indicates that the heterogeneous feature attention mechanism enables more effective fusion of features from different modalities, thereby enhancing the precision of 3D reconstruction.

Table 4.
Comparative Results of Single-view 3D Reconstruction on the ShapeNet and Pix3D Datasets with and without the Heterogeneous Feature Attention Mechanism, Evaluated by L1-CD.

Method	ShapeNet	Pix3D
w/o Heterogeneous feature attention mechanism	6.20	6.88
w/ Heterogeneous feature attention mechanism	5.60	6.29

Results are reported as $\times 10^{2}$ (lower is better). CD: Chamfer distance; 3D: three-dimensional.

As shown in Figure 7, we compare visualization results on the ShapeNet dataset under conditions with and without the heterogeneous feature attention mechanism. From the figure, it is evident that incorporating the heterogeneous feature attention mechanism allows the model to achieve superior performance in both overall structural reconstruction and detail restoration.

Figure 7.

Visualization results on the ShapeNet dataset with and without the heterogeneous feature attention mechanism.

Specifically, for complex geometric structures such as airplanes and chairs, the model without the heterogeneous feature attention mechanism produces point clouds with blurred edges and local missing regions, such as sparse and unclear contours around the wings and chair backs. In contrast, after introducing the heterogeneous feature attention mechanism, the point cloud distribution becomes more uniform, geometric boundaries are clearer, and the overall shape remains highly consistent with the ground truth. For regular objects such as videos, the model with the heterogeneous feature attention mechanism more accurately restores planar structures and straight edges, better preserving geometric regularity and consistency compared with the model without the mechanism. For categories with rich local details, such as pistols and tables, the mechanism likewise enhances the detail-level performance of point clouds, resulting in clearer and more accurate geometric structures.

In summary, the results in Table 4 and Figure 7 corroborate each other. The heterogeneous feature attention mechanism effectively enhances the model’s feature fusion capability, achieving complementary and reinforced information exchange, thereby improving geometric completeness and detail accuracy in 3D reconstruction. This mechanism not only alleviates the limitations caused by insufficient single-modal feature representation but also strengthens the model’s reconstruction ability for complex structures and fine local details.

Impact of the semi-supervised teacher–student framework. To verify the effectiveness of the semi-supervised teacher–student framework in 3D model reconstruction, we design a comparative experiment with an experimental group that uses the framework and a control group that does not. The experimental group is trained with 20% labeled data on ShapeNet and 10% labeled data on Pix3D, while the control group does not employ the teacher–student framework. Under otherwise identical training conditions, both models are evaluated using L1-CD as the metric to measure the accuracy of the generated results.

As shown in Table 5, on the ShapeNet dataset, the control group without the semi-supervised teacher–student framework achieves an average L1-CD value of 5.77. After introducing the semi-supervised teacher–student framework, the value decreases to 5.60, representing an error reduction of about 2.95%. On the Pix3D dataset, the L1-CD value decreases from 6.94 to 6.29, corresponding to an error reduction of approximately 9.36%. These results demonstrate the effectiveness of the semi-supervised teacher–student framework in improving 3D reconstruction quality.

Table 5.

Comparative Results of Single-view 3D Reconstruction on the ShapeNet and Pix3D Datasets with and without the Semi-supervised Teacher–student Framework, Evaluated by L1-CD.

Method	ShapeNet	Pix3D
w/o Semi-supervised	5.77	6.94
w/ Semi-supervised	5.60	6.29

Results are reported as $\times 10^{2}$ (lower is better). CD: Chamfer distance; 3D: three-dimensional.

As shown in Figure 8, we compare visualization results on the ShapeNet dataset under conditions with and without the semi-supervised teacher–student framework. From the overall effect, it is evident that after introducing the teacher–student framework, the generated point clouds show significant improvements in both global structural consistency and local geometric detail.

Figure 8.

Visualization results on the ShapeNet dataset with and without the semi-supervised teacher–student framework.

Specifically, for complex categories such as airplanes and chairs, the model without the semi-supervised teacher–student framework produces point clouds with blurred geometric structures and incomplete local details, such as sparse distributions and unclear edges around airplane wings, or incomplete backrest structures in chairs. In contrast, after introducing the semi-supervised teacher–student framework, the 3D model point clouds become more uniformly distributed, with clearer edge contours, and the overall shape is highly consistent with the ground truth. For objects such as cabinets and videos, the reconstruction results without the framework show sparse and scattered point clouds, whereas incorporating the teacher–student framework produces outputs with better planar structures and straighter edges. For objects like tables, which include curved surfaces and numerous details, the teacher–student framework also improves the capture of corners and curved shapes, making both the overall structure and local details more closely aligned with the real point clouds.

The experiments demonstrate that the semi-supervised teacher–student framework can fully leverage unlabeled data. By generating pseudo-labels for unlabeled samples through the teacher network and guiding the student network under conditions of strong augmentation and feature perturbation, the model is able to reconstruct 3D models that are highly consistent with the ground-truth point clouds, even under limited labeled data conditions.

Impact of decoder type. To verify the effectiveness of the self-attention decoder, we compare the performance of an MLP decoder and a self-attention decoder on the ShapeNet and Pix3D datasets.

As shown in Table 6, on the ShapeNet dataset, the average L1-CD value using the MLP decoder is 5.92, while with the self-attention decoder it decreases to 5.60, representing an error reduction of about 5.41%. On the Pix3D dataset, the average L1-CD value with the MLP decoder is 6.55, whereas with the self-attention decoder it drops to 6.29, a reduction of approximately 3.97%. These results indicate that the self-attention decoder outperforms the MLP decoder on both datasets, capturing geometric details in complex scenes more effectively. The MLP decoder relies on layer-by-layer nonlinear mappings, with its expressive power primarily reflected in the combination of local features, making it difficult to sufficiently model the global dependencies among different points in 3D shapes. In contrast, the self-attention decoder dynamically models long-range dependencies in point clouds through the attention mechanism, enabling the network to more accurately reconstruct complex 3D structures and thereby achieve consistent improvements in overall reconstruction quality.

Table 6.

Comparative Results of Single-view 3D Object Reconstruction on the ShapeNet and Pix3D Datasets using Different Decoders, Evaluated by L1-CD.

Method	ShapeNet	Pix3D
MLP decoder	5.92	6.55
Self-attention decoder	5.60	6.29

Results are reported as $\times 10^{2}$ (lower is better). CD: Chamfer distance; 3D: three-dimensional.

As shown in Figure 9, when using the MLP decoder, the generated results show clear deficiencies in detail representation. For example, the point cloud distribution in the wing regions of airplanes is sparse, with incomplete edges; the outer contours of videos appear relatively rough; and models of other categories generally exhibit problems such as missing local details and blurred boundaries. In contrast, the results produced with the self-attention decoder are closer to the ground truth in both overall structure restoration and local detail performance. They preserve geometric boundaries more clearly and better capture the semantic consistency of objects. This demonstrates that the self-attention decoder can effectively model the global dependencies of point clouds, thereby improving the structural completeness and detail accuracy of 3D reconstruction.

Figure 9.

Visualization results on the ShapeNet dataset using different decoders.

The experimental results show that the decoder architecture has a critical impact on the performance of 3D reconstruction. Compared with the traditional MLP decoder, the self-attention decoder achieves superior performance in 3D model reconstruction tasks. This finding provides a valuable reference for the design of decoders in future 3D reconstruction models.
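To illustrate why self-attention can model dependencies between distant points, the sketch below implements single-head scaled dot-product self-attention over per-point feature vectors. It is a simplification under stated assumptions: the projections are identity maps and there is a single head, whereas the paper's decoder uses 8 heads with learned weights whose exact layout is not reproduced here.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(feats):
    """Single-head scaled dot-product self-attention with identity projections."""
    d = len(feats[0])
    out = []
    for q in feats:
        # Every point scores every other point, regardless of spatial distance.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in feats]
        weights = softmax(scores)
        # Output feature: attention-weighted average of all point features.
        out.append([sum(w * v[c] for w, v in zip(weights, feats)) for c in range(d)])
    return out


uniform = self_attention([[1.0, 2.0]] * 3)           # identical inputs
mixed = self_attention([[1.0, 0.0], [0.0, 1.0]])     # two orthogonal features
```

Each output feature is a weighted average over all points, so information from any point can reach any other in a single layer, in contrast to a point-wise MLP that processes each feature independently.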

5. Efficiency Analysis

As shown in Table 7, we compare 3DAttriFlow, Ccd-3dr, RGB2point, and our proposed SS3D in terms of parameter count, inference time, GPU memory footprint, and computational cost (floating-point operations, FLOPs) under a batch size of 1. The results indicate that, compared with 3DAttriFlow, our method uses more parameters and slightly more GPU memory but achieves shorter inference time and fewer FLOPs, demonstrating higher generation efficiency. Compared with Ccd-3dr and RGB2point, our method shows clear advantages in parameter count, inference speed, memory usage, and FLOPs. These findings suggest that our method attains good computational efficiency while maintaining strong modeling capability, achieving a favorable balance between performance and resource consumption.

6. Limitation Analysis

Although our model achieves strong performance in cross-modal feature fusion and semi-supervised 3D reconstruction, certain limitations remain. As shown in Figure 10, in complex geometric regions such as airplane tail fins and vessel hull edges, the reconstruction results still exhibit blurred details compared with ground-truth point clouds. This indicates that the model has slight shortcomings in capturing fine details and restoring edge structures. In addition, as shown in Table 7, compared with 3DAttriFlow, SS3D increases the number of parameters by 44.2% and the GPU memory footprint by 5.2%, which imposes certain limitations on its deployment in resource-constrained or real-time application scenarios.
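The overheads relative to 3DAttriFlow can be checked directly from the Table 7 entries. The inference-time and FLOPs savings in the second line are derived here from the same table and are not quoted in the text:

```latex
\begin{aligned}
\frac{74.71 - 51.82}{51.82} & \approx 44.2\% \ \text{(parameters)}, &
\frac{1.63 - 1.55}{1.55} & \approx 5.2\% \ \text{(GPU memory)}, \\
\frac{11.64 - 8.14}{11.64} & \approx 30.1\% \ \text{(inference time)}, &
\frac{18.92 - 12.65}{18.92} & \approx 33.1\% \ \text{(FLOPs)}.
\end{aligned}
```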

Table 7.
Model Efficiency Comparison: the Number of Parameters, Inference time, GPU Memory usage, and Floating-Point Operations (FLOPs).

Metric	3DAttriFlow (Wen et al., 2022b)	Ccd-3dr (Di et al., 2023)	RGB2point (Lee & Benes, 2025)	Ours
Parameters (M)	51.82	84.56	89.16	74.71
Inference time (s)	11.64	18.73	15.23	8.14
GPU memory (GB)	1.55	2.25	1.93	1.63
FLOPs (G)	18.92	64.00 (per step)	24.22	12.65

Figure 10.

Visualization of limitation analysis results on the ShapeNet dataset.

Therefore, in future work, we plan to explore how to further improve the reconstruction accuracy of the model in complex components and edge regions, thereby enhancing the overall reconstruction quality. We also plan to design a more lightweight cross-modal fusion structure to improve the efficiency and scalability of the model.

7. Conclusion

We propose a single-view 3D reconstruction method that integrates SSL with a heterogeneous feature attention mechanism, aiming to enhance the model’s reconstruction ability under conditions of limited image information and background interference. Specifically, the heterogeneous feature attention mechanism enables cross-modal fusion of image and point cloud features, strengthening the complementarity of multi-source information. The semantic module and geometric module respectively reinforce global semantic consistency and local geometric structure representation, while AdaIN aligns multi-scale features. Finally, the self-attention decoder effectively captures long-range dependencies, generating high-quality 3D point clouds. Meanwhile, we introduce a semi-supervised teacher–student framework in which the teacher network generates pseudo-labels through weak augmentation, and the student network is optimized under strong augmentation and feature perturbation. An EMA mechanism is employed to steadily update the teacher network weights, improving the reliability of pseudo-labels and further enhancing the stability of overall training. Experimental results show that our proposed method can generate 3D point cloud models with clear structures and rich details, significantly outperforming existing mainstream methods in performance. Overall, the framework of our study provides new insights and directions for future research on cross-modal fusion and semi-supervised 3D reconstruction.

Footnotes

Acknowledgments

This research was funded by the project "Research on 3D Object Detection Technology Based on Narrative Representation" (grant No. JJKH20250525KJ).

ORCID iDs

Linxuan Li

Cheng Han

Yang Ding

Shan Jiang

Bo Li

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

Afifi

A. J.

Magnusson

Soomro

T. A.

Hellwich

(2020). Pixel2point: 3D object reconstruction from a single image using cnn and initial sphere. IEEE Access, 9, 110–121.

Barron

J. T.

Mildenhall

Tancik

Hedman

Martin-Brualla

Srinivasan

P. P.

(2021). Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 5855–5864).

Berthelot

Carlini

Cubuk

E. D.

Kurakin

Sohn

Zhang

Raffel

(2019a). Remixmatch: Semi-supervised learning with distribution alignment and augmentation anchoring. arXiv preprint arXiv:1911.09785.

Berthelot

Carlini

Goodfellow

Papernot

Oliver

Raffel

C. A.

(2019b). Mixmatch: A holistic approach to semi-supervised learning. Advances in Neural Information Processing Systems, 32, 1–10.

Chang

A. X.

Funkhouser

Guibas

Hanrahan

Huang

Savarese

Savva

Song

Xiao

(2015). Shapenet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012.

Choy

C. B.

Gwak

Chen

Savarese

(2016). 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. In European conference on computer vision (pp. 628–644). Springer.

DeVries

Taylor

G. W.

(2017). Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552.

Zhang

Wang

Zhai

Zhang

Manhardt

Busam

Tombari

(2023). Ccd-3dr: Consistent conditioning in diffusion for single-image 3D reconstruction. arXiv preprint arXiv:2308.07837.

Ding

Yang

Han

Zhang

Liu

(2025a). SketchFormer3D: Generating 3D shapes from sketches with implicit SDF priors via diffusion models. Expert Systems with Applications, 298, 129931.

10.

Ding

Yang

Zhang

(2025b). DGGR-Net: Single-image 3D reconstruction from complex backgrounds via graph-based refinement and difference-guided fusion. Journal of King Saud University Computer and Information Sciences, 37(8), 222.

11.

Fan

Guibas

L. J.

(2017). A point set generation network for 3D object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 605–613).

12.

Gao

Kong

Wang

Yin

(2023). CIGNet: Category-and-intrinsic-geometry guided network for 3D coarse-to-fine reconstruction. Neurocomputing, 554, 126607.

13.

Gerbaud

Cavalier

Horna

Zrour

Naudin

Guillevin

Meseure

(2024). Topological 3D reconstruction of multiple anatomical structures from volumetric medical data. Computers & Graphics, 121, 103947.

14.

Gong

Wang

Liu

(2021). Alphamatch: Improving consistency for semi-supervised learning with alpha-divergence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 13683–13692).

15.

Grandvalet

Bengio

(2004). Semi-supervised learning by entropy minimization. Advances in Neural Information Processing Systems, 17, 1–7.

16.

Kan

Shan

(2021). EigenGAN: Layer-wise eigen-learning for GANs. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 14408–14417).

17.

Hong

Zhang

Zhou

Liu

Sunkavalli

Bui

Tan

(2023). LRM: Large reconstruction model for single image to 3D. arXiv preprint arXiv:2311.04400.

18.

Jin

Kulkarni

Fouhey

D. F.

(2024). 3DFIRES: Few image 3D reconstruction for scenes with hidden surfaces. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9742–9751).

19.

Jun

Nichol

(2023). Shap-E: Generating conditional 3D implicit functions. arXiv preprint arXiv:2305.02463.

20.

Kumar

Pawar

P. P.

Addula

S. R.

Meesala

M. K.

Oni

Cheema

Q. N.

Haq

A. U.

Sajja

G. S.

(2025). AI-powered security for IoT ecosystems: A hybrid deep learning approach to anomaly detection. Journal of Cybersecurity and Privacy, 5(4), 90.

21.

Laradji

Rodríguez

Vazquez

Nowrouzezahrai

(2021). Ssr: Semi-supervised soft rasterizer for single-view 2D to 3D reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1427–1436).

22.

Lee

D. H.

(2013). Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML (Vol. 3, p. 896). Atlanta.

23. Lee, J. J., Benes (2025). RGB2point: 3D point cloud generation from single RGB images. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (pp. 2952–2962). IEEE.

24. Guo, Sheng (2024). Self-supervised single-view 3D point cloud reconstruction through GAN inversion. The Journal of Supercomputing, 80(14), 21365–21393.

25. Lin, K. E., Lin, Y. C., Lai, W. S., Lin, T. Y., Shih, Y. C., Ramamoorthi (2023). Vision transformer for NeRF-based view synthesis from a single input image. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 806–815).

26. Liu, Jin, Chen, Varma, T. M. (2023a). One-2-3-45: Any single image to 3D mesh in 45 seconds without per-shape optimization. Advances in Neural Information Processing Systems, 36, 22226–22246.

27. Liu, Van Hoorick, Tokmakov, Zakharov, Vondrick (2023b). Zero-1-to-3: Zero-shot one image to 3D object. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 9298–9309).

28. Liu, Ran, Yuan, Zheng (2024). 3D face reconstruction from a single image based on hybrid-level contextual information with weak supervision. Computers & Graphics, 118, 80–89.

29. Long, Guo, Y. C., Lin, Liu, Dou, Liu, Zhang, S. H., Habermann, Theobalt, Wang (2024). Wonder3D: Single image to 3D using cross-domain diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 9970–9980).

30. Loshchilov, Hutter (2017). Fixing weight decay regularization in Adam. CoRR.

31. Mao, Dai, Liu, Yang, Gao, Liu, Y. J. (2021). STD-Net: Structure-preserving and topology-adaptive deformation network for single-view 3D reconstruction. IEEE Transactions on Visualization and Computer Graphics, 29(3), 1785–1798.

32. Melas-Kyriazi, Rupprecht, Vedaldi (2023). PC2: Projection-conditioned point cloud diffusion for single-image 3D reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12923–12932).

33. Metzer, Richardson, Patashnik, Giryes, Cohen-Or (2023). Latent-NeRF for shape-guided generation of 3D shapes and textures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 12663–12673).

34. Mildenhall, Srinivasan, P. P., Tancik, Barron, J. T., Ramamoorthi (2021). NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1), 99–106.

35. Miyato, Maeda, S. I., Koyama, Ishii (2018). Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8), 1979–1993.

36. Nichol, Jun, Dhariwal, Mishkin, Chen (2022). Point-E: A system for generating 3D point clouds from complex prompts. arXiv preprint arXiv:2212.08751.

37. Pham, Dai, Xie, Le, Q. V. (2021). Meta pseudo labels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 11557–11568).

38. Poole, Jain, Barron, J. T., Mildenhall (2023). DreamFusion: Text-to-3D using 2D diffusion. In ICLR.

39. Shen, Zhou, Zhang, Yan, Wang (2025). Gamba: Marry Gaussian Splatting with Mamba for single-view 3D reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1–16.

40. Sun, Zhang, Xue, Tenenbaum, J. B., Freeman, W. T. (2018). Pix3D: Dataset and methods for single-image 3D shape modeling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2974–2983).

41. Szymanowicz, Rupprecht, Vedaldi (2024). Splatter image: Ultra-fast single-view 3D reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10208–10217).

42. Tiong, L. C. O., Sigmund, Teoh, A. B. J. (2022). 3D-C2FT: Coarse-to-fine transformer for multi-view 3D reconstruction. In Proceedings of the Asian Conference on Computer Vision (pp. 1438–1454).

43. Vanyan, Khachatrian (2021). Deep semi-supervised image classification algorithms: A survey. Journal of Universal Computer Science (JUCS), 27(12), 1–17.

44. Wang, Zhang, Liu, Jiang, Y. G. (2018). Pixel2Mesh: Generating 3D mesh models from single RGB images. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 52–67).

45. Wen, Zhang, Cao, Xue (2022a). Pixel2Mesh++: 3D mesh generation and refinement from multi-view images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2), 2166–2180.

46. Wen, Zhou, Liu, Y. S., Dong, Han (2022b). 3D shape reconstruction from 2D images with disentangled attribute flow. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 3803–3813).

47. Mildenhall, Henzler, Park, Gao, Watson, Srinivasan, P. P., Verbin, Barron, J. T., Poole (2024). ReconFusion: 3D reconstruction with diffusion priors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 21551–21561).

48. Xie, Yao, Sun, Zhou, Zhang (2019). Pix2Vox: Context-aware 3D reconstruction from single and multi-view images. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 2690–2698).

49. Xie, Yao, Zhang, Zhou, Sun (2020a). Pix2Vox++: Multi-scale context-aware 3D object reconstruction from single and multiple images. International Journal of Computer Vision, 128(12), 2919–2935.

50. Xie, Dai, Hovy, Luong (2020b). Unsupervised data augmentation for consistency training. Advances in Neural Information Processing Systems, 33, 6256–6268.

51. Xing, Jiang, Y. G. (2022). Semi-supervised single-view 3D reconstruction via prototype shape priors. In European Conference on Computer Vision (pp. 535–551). Springer.

52. Yagubbayli, Wang, Tonioni, Tombari (2021). LegoFormer: Transformers for block-by-block multi-view 3D reconstruction. arXiv preprint arXiv:2106.12102.

53. Yang, Cui, Belongie, Hariharan (2018). Learning single-view 3D reconstruction with limited pose supervision. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 86–101).

54. Yang, Lin, Zhou (2023). Single-view 3D mesh reconstruction for seen and unseen categories. IEEE Transactions on Image Processing, 32, 3746–3758.

55. Tancik, Kanazawa (2021). pixelNeRF: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4578–4587).

56. Yang, Wei (2022). Part-wise AtlasNet for 3D point cloud reconstruction from a single image. Knowledge-Based Systems, 242, 108395.

57. Zhang, Jiang, Zhu, Tai, Wang, Zhang (2024). T-Pixel2Mesh: Combining global and local transformer for 3D mesh generation from a single image. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 2435–2439). IEEE.

58. Zoph, Ghiasi, Lin, T. Y., Cui, Liu, Cubuk, E. D. (2020). Rethinking pre-training and self-training. Advances in Neural Information Processing Systems, 33, 3833–3845.

59. Zou, Z. X., Guo, Y. C., Liang, Cao, Y. P., Zhang, S. H. (2024). Triplane meets Gaussian splatting: Fast and generalizable single-view 3D reconstruction with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10324–10335).


Table 4. Comparative Results of 3D Reconstruction on the ShapeNet Dataset Under Different Proportions of Labeled Data, Evaluated by L1-CD ($\times 10^{-2}$).

Method                                           ShapeNet   Pix3D
w/o Heterogeneous feature attention mechanism    6.20       6.88
w/ Heterogeneous feature attention mechanism     5.60       6.29