A portable non-contact tongue imaging system with automated analysis for community and home settings

Abstract

Background

Traditional tongue inspection relies on visual assessment by practitioners, which introduces subjectivity and compromises reproducibility. Existing solutions often rely on enclosed, dedicated acquisition instruments with nontrivial operation, whereas mobile self-capture approaches are more accessible but sensitive to environmental variability, making reliable analysis challenging in real-world use.

Objective

To develop a portable non-contact tongue imaging and automated analysis system that is robust to real-world acquisition variability.

Methods

We designed a portable acquisition terminal that integrates a camera, touchscreen preview, touch-initiated capture with voice prompts, and supplementary illumination for acquisition assistance. For automated analysis, we developed TongueSegNet (TSegNet) for tongue segmentation, incorporating stage-dependent residual modulation, deep-stage attention enhancement, and gated skip-pathway feature fusion to improve feature representation and boundary delineation. For fissured-tongue feature recognition, we developed Residual Kolmogorov-Arnold Network (ResKAN), which combines a convolutional neural network feature extractor with a Kolmogorov-Arnold Network–based head to improve modelling capacity for fine-grained texture patterns.

Results

On tongue images acquired under unconstrained conditions, TSegNet achieved mean Dice of 98.16%, mean intersection over union of 96.42%, and mean pixel accuracy of 98.31%, outperforming representative baselines. ResKAN achieved mean accuracy of 92.48%, sensitivity of 92.67%, specificity of 92.31%, and a fissured-class F1 score of 92.34%.

Conclusion

The proposed system enables reliable non-contact tongue imaging with automated server-side analysis under unconstrained conditions. These findings support the feasibility of this integrated approach as an initial step toward more accessible automated tongue-image analysis in community and home settings.

Keywords

non-contact tongue imaging, deep learning, tongue segmentation, fissured tongue, tongue image analysis, feature recognition

Introduction

With the growing global emphasis on preventive healthcare, there is a rising public demand for continuous and personalized health assessment.^1,2 Traditional Chinese Medicine (TCM) emphasizes individualized assessment and preventive treatment, aligning naturally with these goals.^3–5 Tongue inspection is a key non-invasive diagnostic method in TCM for assessing health status through observation of tongue morphology and appearance.^6–8 Among these features, fissure patterns are representative and meaningful morphological signs characterized by grooves or cracks on the dorsal surface of the tongue, and they are often interpreted in relation to internal conditions such as syndrome-related hotness, blood deficiency, and spleen insufficiency.^9,10 Emerging clinical evidence also suggests that fissured tongue may have observational value in modern clinical settings. For instance, a hospital-based cross-sectional study reported that fissured tongue was independently associated with upper gastrointestinal precancerous lesions.¹¹ These findings indicate that fissured tongue is a meaningful tongue feature warranting further investigation.

However, conventional tongue inspection still relies heavily on manual assessment by experienced practitioners. This reliance on individual expertise and subjective judgment may limit diagnostic reliability and objectivity.^12,13 In addition, proficiency in tongue diagnosis requires substantial training and accumulated clinical experience, which limits the availability of qualified practitioners and constrains service scalability.¹⁴ These limitations highlight the need for more objective and automated tongue analysis. Furthermore, reliance on in-person assessment may introduce operational barriers, including patient scheduling difficulties, uncertain waiting times, and repetitive manual workloads for clinicians.¹⁵ In this context, portable digital health approaches that can be deployed in community and home settings may help extend access to tongue assessment beyond conventional clinical environments.^16,17 As a representative and visually recognizable morphological sign within tongue diagnosis, fissured tongue provides a practical initial target for developing and evaluating accessible automated assessment approaches.

Reliable automated tongue analysis begins with tongue image acquisition.^18,19 Existing acquisition solutions can be broadly categorized into dedicated imaging devices and mobile approaches. Dedicated imaging devices typically rely on constrained setups to provide relatively controlled imaging conditions, but their limited portability may restrict deployment outside clinical or laboratory environments.^20,21 Mobile approaches, particularly smartphone-based solutions, have been investigated to improve accessibility and enable more flexible use.^22–24 However, these methods remain susceptible to environmental variations, which may degrade image quality and affect subsequent analysis.²⁵ Moreover, self-capture typically requires active manual operation and may therefore present usability challenges for some users, especially older adults and individuals with limited digital literacy or sensorimotor capability.^26–28 A portable acquisition solution that improves imaging consistency while remaining simple to operate is therefore important for community and home use.

Tongue image segmentation is a fundamental step in automated tongue image analysis because it isolates the tongue region from surrounding structures, such as the lips, teeth, and facial skin, and provides a reliable basis for subsequent feature extraction and quantitative assessment.²⁹ Early studies primarily employed traditional image processing methods, including edge detection, thresholding, and region-growing techniques.^30–33 These approaches depended heavily on handcrafted features, required laborious parameter tuning, and were often sensitive to illumination changes and background interference. Deep learning-based methods have since substantially advanced tongue segmentation performance. Representative studies, such as TongueNet, TU-Net, and RTC TongueNet, have improved segmentation accuracy through enhanced shape modeling, feature representation, and contextual learning.^34–36 Despite these advances, tongue images acquired in real-world settings often exhibit substantial appearance variability and background complexity, whereas many existing segmentation models have been developed and evaluated on datasets acquired under relatively controlled conditions, such as uniform illumination and clean backgrounds.^37–39 This mismatch may reduce model robustness and generalizability when such methods are deployed in real-world settings.

Existing studies on fissured-tongue feature recognition have mainly formulated the task as image-level classification and have relied on transfer learning with pre-trained convolutional neural networks (CNNs) to extract discriminative representations.⁴⁰ Although these approaches have shown promising results, most of them still employ conventional linear or multi-layer perceptron (MLP)-based heads on top of CNN features. The potential value of more expressive classifier designs for fissured-tongue feature recognition therefore remains insufficiently explored.

Kolmogorov-Arnold Networks (KANs) have recently been proposed as an alternative to MLPs by replacing fixed node-wise activation functions with learnable univariate functions on edges, which may provide greater functional flexibility and improved interpretability.⁴¹ However, directly applying pure KAN architectures to high-dimensional image data remains challenging because of their computational and optimization burdens. A practical strategy is therefore to combine CNN-based feature extraction with KAN-based nonlinear modeling.^42–45 Nevertheless, the utility of such hybrid architectures for automated fissured-tongue feature recognition has not yet been adequately investigated.

To address these challenges, this study proposes an integrated tongue analysis system combining a custom-designed acquisition terminal with an advanced deep learning framework. Our main contributions are summarized as follows:

We design a portable tongue image acquisition prototype with supplementary illumination and multimodal interaction (touch initiation with voice guidance) to facilitate reliable data collection and ease of use.

We propose TongueSegNet (TSegNet) for robust tongue image segmentation in unconstrained conditions. The model incorporates stage-dependent residual modulation, deep-stage attention enhancement, and gated skip connections to improve feature representation and boundary delineation under complex backgrounds and illumination variations.

We develop Residual Kolmogorov-Arnold Network (ResKAN), a hybrid architecture combining a CNN feature extractor with a KAN-based head for fissured-tongue feature recognition. This design enhances the model’s non-linear modelling capacity, enabling more effective recognition of fine-grained texture patterns.

Methods

Overall system workflow

As illustrated in Figure 1, the system employs a client-server architecture. The terminal focuses on tongue image acquisition and interaction, while computationally intensive inference is performed on the server.

Figure 1.

Overall system workflow. The acquisition terminal captures and transmits a tongue image to the server. The server performs tongue-body segmentation for region of interest (ROI) extraction and then applies ResKAN for fissured-tongue recognition. The resulting report is returned to the terminal for visualization.

Specifically, the process begins with the user interacting with the acquisition device. Under controlled illumination, the terminal captures the tongue image and transmits it to the server. Upon receiving the image, TSegNet first performs tongue segmentation to generate a binary segmentation mask, which separates the tongue region from non-tongue pixels. The mask is then used to extract the tongue ROI, to which ResKAN is applied for fissured-tongue recognition. The analysis result is returned to the terminal and presented as a user-facing report.

Acquisition terminal

The acquisition terminal is a standalone and portable unit for non-contact tongue image acquisition in public and home environments. As shown in Figure 2, it integrates a camera, a touchscreen interface for on-device visualization, and internal control electronics.

Figure 2.

Acquisition terminal design and prototype. (a) Schematic illustration of the mechanical layout and the A–A sectional view, with key components labeled, including the camera, touchscreen, illumination module, Raspberry Pi 4B, printed circuit board, and power supply; (b) Photographs of the assembled prototype from multiple views: front (top-left), side (top-right), rear (bottom-left), and oblique (bottom-right).

Detailed hardware specifications are summarized in Table 1. To mitigate environmental variations, the terminal uses an ambient-light sensing mechanism to adaptively adjust LED intensity, providing supplemental illumination across different deployment conditions. User interaction is performed via the touchscreen, which is used to initiate image capture and to display the real-time preview, system status, and final analysis report.

Table 1.

Hardware configuration of the acquisition terminal.

Module	Configuration & specs	Primary function
Interaction	STM32F103 microcontroller; Ambient light sensor	Detects ambient light levels for adaptive brightness adjustment.
Data acquisition	Raspberry Pi 4B; 12-megapixel CMOS camera; 7-inch IPS touchscreen	Captures and transmits tongue images; displays the real-time preview and analysis results.
Illumination	Dual LED panels (CCT: 5500K; side-mounted)	Provides supplemental illumination to reduce shadows and image variability.
Enclosure	3D-PLA housing (170×120×150 mm)	Integrates the components into a portable form factor for community and home use.

Abbreviations: CMOS, complementary metal-oxide-semiconductor; IPS, in-plane switching; LED, light-emitting diode; CCT, correlated color temperature; PLA, polylactic acid.

TongueSegNet

To address challenges in tongue image segmentation under unconstrained acquisition conditions, including complex backgrounds, blurred edges, and variable object shapes, we propose TSegNet. The encoder is organized into four stages with stage-dependent residual modulation. The two shallow stages use plain residual units to preserve local details, whereas the two deep stages incorporate attention-enhanced residual modulation to strengthen semantic discrimination and suppress background-induced false positives. In addition, a gated feature fusion mechanism is introduced into the intermediate skip pathways before feature concatenation to reduce noise propagation and alleviate the semantic gap between encoder and decoder features. The overall architecture is shown in Figure 3.

Figure 3.

Overall architecture of TSegNet. The encoder has four stages: shallow stages use residual blocks (ResBlock) to preserve boundary details, while deep stages use ResEMA (residual block with EMA attention) for semantic refinement. CSAF-Gate (channel-spatial attention fusion gate) is applied to the intermediate skip connections before concatenation to suppress background-dominant responses. Down-sampling is performed by max pooling and up-sampling by transposed convolution followed by convolutional refinement.

Stage-dependent residual modulation with efficient multi-scale attention

In open environments, tongue image segmentation faces two challenges at shallow and deep feature levels. In the shallow stages, varying illumination often degrades the contrast of tongue boundaries, so the network needs to preserve local details such as edges and textures. Conversely, in deeper stages, complex backgrounds and surrounding tissues may induce false positives, which calls for stronger semantic context to distinguish the tongue region.

To meet these requirements, we employ a hierarchical encoder with stage-dependent feature modulation. Let $X_{l} \in \mathbb{R}^{C_{l} \times H_{l} \times W_{l}}$ denote the input feature map of the $l$-th block. We first project it to the target channel space via a convolution to obtain the basis feature $\tilde{X}_{l}$. The output $X_{l+1}$ is then computed as:

$$X_{l+1} = \tilde{X}_{l} + M_{l}(F(\tilde{X}_{l})) \tag{1}$$

where $F(\cdot)$ represents the residual mapping function composed of stacked convolutions, and $M_{l}(\cdot)$ is the stage-dependent modulation function.

In shallow high-resolution stages ($l \in \{1, 2\}$), we set $M_{l}$ as an identity mapping:

$$M_{l}(F) = F \tag{2}$$

This configuration reduces the block to a standard residual unit. The identity shortcut facilitates gradient propagation and helps preserve low-level boundary cues during downsampling.⁴⁶ As the network deepens ($l \in \{3, 4\}$), the receptive field expands and the features become more semantic, while the main difficulty shifts to suppressing background noise. We therefore activate a stage-dependent modulation based on the efficient multi-scale attention (EMA) module.⁴⁷

$$M_{\mathrm{deep}}(F) = \mathrm{EMA}(F) \tag{3}$$

Accordingly, the deep-stage block becomes:

$$X_{l+1} = \tilde{X}_{l} + \mathrm{EMA}(F(\tilde{X}_{l})) \tag{4}$$

which we refer to as ResEMA (Figure 4). This formulation retains the advantages of residual learning (stable optimization and direct information flow through the identity shortcut) while allowing the residual features to be selectively enhanced before being merged back.
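The control flow of Equations (1)–(4) can be illustrated with a minimal sketch that operates on plain lists of floats rather than feature maps. The function names, the doubling function used in place of the convolutional mapping $F$, and the sigmoid-based `toy_attention` used in place of EMA are all hypothetical placeholders; only the structure (identity modulation in stages 1–2, attention modulation in stages 3–4, followed by the shortcut addition) follows the text.

```python
import math
from typing import Callable, List

Vector = List[float]


def identity(f: Vector) -> Vector:
    """Shallow-stage modulation, M_l(F) = F (Eq. 2)."""
    return list(f)


def toy_attention(f: Vector) -> Vector:
    """Hypothetical stand-in for EMA: each value is scaled by a sigmoid weight."""
    return [v / (1.0 + math.exp(-v)) for v in f]


def modulation_for_stage(stage: int) -> Callable[[Vector], Vector]:
    """Stages 1-2 use the identity; stages 3-4 use the attention stand-in."""
    return identity if stage in (1, 2) else toy_attention


def modulated_residual(x_tilde: Vector,
                       residual_fn: Callable[[Vector], Vector],
                       stage: int) -> Vector:
    """X_{l+1} = X~_l + M_l(F(X~_l)) (Eq. 1)."""
    modulated = modulation_for_stage(stage)(residual_fn(x_tilde))
    return [x + m for x, m in zip(x_tilde, modulated)]


def double(x: Vector) -> Vector:
    """Placeholder residual mapping F; in the paper F is a stack of convolutions."""
    return [2.0 * v for v in x]


shallow = modulated_residual([1.0, -1.0], double, stage=1)
deep = modulated_residual([1.0, -1.0], double, stage=3)
```

In the shallow case the block reduces to a standard residual unit, whereas in the deep case the residual branch is reweighted before it is added back to the shortcut.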

Figure 4.

Structure of the ResEMA module. ResEMA applies a residual convolutional transform followed by efficient multi-scale attention (EMA) for feature modulation. EMA first groups channels into $G$ sub-features, extracts direction-aware descriptors using 1D pooling along height and width, and then aggregates two parallel branches via cross-spatial learning to capture pixel-level interactions. The resulting attention weights are applied to the residual features via reweighting, so that tongue-relevant regions are emphasized while background activations are suppressed.

Gated skip connection with channel-spatial attention fusion

Direct skip concatenation in U-Net may propagate background-dominant low-level activations (e.g., lips, teeth, or shadows) to the decoder, which can hinder mask reconstruction under unconstrained acquisition conditions. To mitigate this issue, we introduce a lightweight Channel-Spatial Attention Fusion gate (CSAF-Gate) into the skip pathways (Figure 5). Drawing inspiration from attention gating protocols and dual-domain feature refinement, the CSAF-Gate functions as a dynamic filter, which explicitly suppresses irrelevant background responses by recalibrating features along both channel and spatial dimensions before feature fusion.^48,49

Figure 5.

Structure of the CSAF-Gate. Channel attention is computed using global average pooling (GAP) and global max pooling (GMP), followed by a shared MLP. Spatial attention utilizes channel-wise pooling and convolution to highlight informative regions. The final gated feature is obtained by element-wise summation of the two reweighted streams.

Given an encoder feature map $E_{s} \in \mathbb{R}^{C_{s} \times H_{s} \times W_{s}}$ at stage $s$, CSAF-Gate computes channel and spatial attention in parallel and fuses them by summation:

$$\hat{E}_{s} = (M_{s}^{c} \odot E_{s}) + (M_{s}^{s} \odot E_{s}) \tag{5}$$

where $\odot$ denotes element-wise multiplication, $M_{s}^{c}$ is the channel-attention map, and $M_{s}^{s}$ is the spatial-attention map.

The channel-attention branch is designed to explicitly model inter-channel dependencies by aggregating global context via both global average pooling and global max pooling, followed by a shared two-layer MLP:

$$M_{s}^{c} = \sigma(\psi(\mathrm{GAP}(E_{s})) + \psi(\mathrm{GMP}(E_{s}))), \quad M_{s}^{c} \in \mathbb{R}^{C_{s} \times 1 \times 1} \tag{6}$$

where $\sigma(\cdot)$ denotes the sigmoid function and $\psi(\cdot) = W_{2}\,\delta(W_{1}(\cdot))$ is a shared two-layer MLP implemented by 1×1 convolutions. This branch emphasizes tongue-relevant channels while down-weighting channels dominated by background artifacts.

The spatial-attention branch highlights informative locations by pooling along the channel dimension and applying a lightweight convolution:

$$M_{s}^{s} = \sigma(f([\mathrm{Avg}_{c}(E_{s}); \mathrm{Max}_{c}(E_{s})])), \quad M_{s}^{s} \in \mathbb{R}^{1 \times H_{s} \times W_{s}} \tag{7}$$

where $[\cdot; \cdot]$ denotes channel-wise concatenation and $f(\cdot)$ represents a convolution operation with kernel size $k = 7$. This branch suppresses non-tongue regions that frequently appear in open environments.

The gated skip feature ${\hat{E}}_{s}$ is concatenated with the upsampled decoder feature and then refined by the decoder block. To balance boundary fidelity and computational overhead, CSAF-Gate is only applied to the deeper skip connections (channels 384), while the shallowest skips are kept unchanged.
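Equations (5)–(7) can be sketched in pure Python on a small feature map stored as nested lists indexed `[channel][row][column]`. This is an illustrative sketch under stated assumptions rather than the implementation used in this work: $\delta$ is assumed to be ReLU, the convolution $f$ uses zero padding without a bias term, and the weight arguments `w1`, `w2`, and `kernel` are hypothetical names for parameters that would normally be learned.

```python
import math
from typing import List

Tensor3 = List[List[List[float]]]  # indexed as [channel][row][col]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def shared_mlp(vec: List[float], w1: List[List[float]],
               w2: List[List[float]]) -> List[float]:
    """psi(v) = W2 * delta(W1 * v); delta is assumed to be ReLU here."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def channel_attention(e: Tensor3, w1, w2) -> List[float]:
    """Eq. 6: one weight per channel from the GAP and GMP descriptors."""
    gap = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in e]
    gmp = [max(map(max, ch)) for ch in e]
    a, m = shared_mlp(gap, w1, w2), shared_mlp(gmp, w1, w2)
    return [sigmoid(x + y) for x, y in zip(a, m)]


def spatial_attention(e: Tensor3, kernel: Tensor3) -> List[List[float]]:
    """Eq. 7: one weight per pixel from the channel-wise mean and max maps.

    kernel[0] filters the mean map and kernel[1] the max map (k x k each).
    """
    c, h, w = len(e), len(e[0]), len(e[0][0])
    k = len(kernel[0])
    r = k // 2
    avg = [[sum(e[ch][i][j] for ch in range(c)) / c for j in range(w)]
           for i in range(h)]
    mx = [[max(e[ch][i][j] for ch in range(c)) for j in range(w)]
          for i in range(h)]
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - r, j + dj - r
                    if 0 <= y < h and 0 <= x < w:  # zero padding elsewhere
                        acc += (kernel[0][di][dj] * avg[y][x]
                                + kernel[1][di][dj] * mx[y][x])
            out[i][j] = sigmoid(acc)
    return out


def csaf_gate(e: Tensor3, w1, w2, kernel: Tensor3) -> Tensor3:
    """Eq. 5: (M^c * E) + (M^s * E), written as (M^c + M^s) * E."""
    mc = channel_attention(e, w1, w2)
    ms = spatial_attention(e, kernel)
    return [[[(mc[ch] + ms[i][j]) * e[ch][i][j]
              for j in range(len(e[0][0]))]
             for i in range(len(e[0]))]
            for ch in range(len(e))]
```

Because the two branches are summed rather than multiplied, each gated value lies between 0 and 2 times the original activation. With all weights set to zero, both attention maps equal 0.5 and the gate returns the input unchanged, which gives a simple sanity check.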

Composite loss function

In addition to the network architecture, the loss function is critical for effective optimization. In open-environment tongue images, the tongue region typically occupies a relatively small portion of the image. This leads to class imbalance, which can result in incomplete or inaccurate segmentation of tongue boundaries.

The Dice Similarity Coefficient (DSC) loss is a region-based criterion that directly measures the overlap between the predicted segmentation and the ground-truth mask, making it robust to the imbalance between foreground and background pixels.⁵⁰ It is defined as:

$$L_{\mathrm{DSC}} = 1 - \frac{2 \sum_{i=1}^{N} p_{i} y_{i} + \varepsilon}{\sum_{i=1}^{N} p_{i}^{2} + \sum_{i=1}^{N} y_{i}^{2} + \varepsilon} \tag{8}$$

where $N$ is the total number of pixels in the image, $p_{i}$ is the predicted probability for pixel $i$, $y_{i}$ is the ground-truth label for pixel $i$, and $\varepsilon$ is a small constant to avoid division by zero and enhance gradient stability.

However, DSC loss alone may exhibit unstable gradients and optimization difficulty, especially at the early training stage or when the initial overlap between prediction and ground truth is very low. To complement this behavior, we additionally employ the Cross-Entropy (CE) loss, a distribution-based loss criterion that measures the difference between the model’s predicted probability distribution over classes and the true probability distribution.⁵¹ It is mathematically expressed as:

$$L_{\mathrm{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{M} y_{i,c} \log(p_{i,c}) \tag{9}$$

where $N$ is the total number of pixels, $M$ is the number of classes, $y_{i,c}$ is the binary indicator for pixel $i$ belonging to class $c$, and $p_{i,c}$ is the corresponding predicted probability.

To tackle the challenge of class imbalance in tongue segmentation while ensuring training stability, we adopt a composite loss function:

$$L_{c} = \omega_{1} L_{\mathrm{DSC}} + \omega_{2} L_{\mathrm{CE}} \tag{10}$$

where $\omega_{1}$ and $\omega_{2}$ are the weighting factors that balance the contribution of each loss component to the total loss $L_{c}$. In this work, we set $\omega_{1}$ and $\omega_{2}$ to the same value of 1.0, which provided a good trade-off between robustness to class imbalance and optimization stability in our experiments.
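As a minimal sketch of Equations (8)–(10), the composite loss can be evaluated on a flattened list of foreground probabilities and binary labels. The sketch assumes the two-class case ($M = 2$, tongue versus background), for which the cross-entropy reduces to its binary form; the function names, the value of `eps`, and the probability clipping constant are illustrative assumptions and are not taken from the training code.

```python
import math
from typing import Sequence


def dice_loss(p: Sequence[float], y: Sequence[float], eps: float = 1e-6) -> float:
    """Eq. 8 with squared terms in the denominator."""
    inter = sum(pi * yi for pi, yi in zip(p, y))
    denom = sum(pi * pi for pi in p) + sum(yi * yi for yi in y)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def cross_entropy_loss(p: Sequence[float], y: Sequence[float],
                       clip: float = 1e-12) -> float:
    """Eq. 9 for two classes; p holds the foreground probability per pixel."""
    total = 0.0
    for pi, yi in zip(p, y):
        pi = min(max(pi, clip), 1.0 - clip)  # avoid log(0)
        total += yi * math.log(pi) + (1.0 - yi) * math.log(1.0 - pi)
    return -total / len(p)


def composite_loss(p: Sequence[float], y: Sequence[float],
                   w_dsc: float = 1.0, w_ce: float = 1.0) -> float:
    """Eq. 10 with both weights defaulting to 1.0, as in the text."""
    return w_dsc * dice_loss(p, y) + w_ce * cross_entropy_loss(p, y)


labels = [1.0, 0.0, 1.0, 0.0]
uncertain = composite_loss([0.5, 0.5, 0.5, 0.5], labels)
perfect = composite_loss(labels, labels)
```

For the maximally uncertain prediction the Dice term is approximately 1/3 and the cross-entropy term is $\ln 2$, whereas a perfect prediction drives both terms to approximately zero.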

Fissured-tongue feature recognition

Following the segmentation of the tongue region, the system proceeds to the recognition of specific morphological features. In this study, we focus on the automated identification of the fissured-tongue feature. We propose ResKAN, a hybrid architecture that integrates the robust feature extraction capabilities of ResNet with the adaptive non-linear modeling of KANs, as illustrated in Figure 6(a).

Figure 6.

Architecture of ResKAN for fissured-tongue feature recognition. (a) The overall pipeline comprises a convolutional stem, four residual stages, a global average pooling (GAP) layer that extracts a compact descriptor, and a two-layer KAN head; (b) The downsampling bottleneck block uses a 1×1 convolution with stride $s$ in the skip path to align dimensions; (c) The identity bottleneck block uses a direct skip connection to preserve feature geometry; (d) The KAN recognition head, where learnable univariate edge functions $\phi(\cdot)$, parameterized with B-splines, replace fixed linear weights.

Feature extraction backbone

We employ a ResNet-50 backbone to extract high-level semantic representations from the input images. The backbone comprises a sequence of residual bottlenecks, including downsampling blocks for dimension reduction (Figure 6(b)) and identity blocks for depth expansion (Figure 6(c)). The residual connections mitigate the vanishing gradient problem and enable deep hierarchical feature learning.⁴⁶ Finally, GAP aggregates the spatial feature maps into a compact descriptor $v \in \mathbb{R}^{D}$ for subsequent recognition.

KAN-based recognition head

Instead of the standard linear fully-connected classifier, we employ a two-layer KAN head to map the global descriptor to recognition logits (Figure 6(d)). Different from conventional MLP heads that use fixed node activations with scalar edge weights, KAN parameterizes each edge by a learnable univariate function, typically implemented with spline bases, while nodes mainly perform summation.

Given the global feature vector $v \in \mathbb{R}^{D}$ after GAP, one KAN layer computes the $j$-th output as:

$$y_{j} = \sum_{i=1}^{D} \phi_{j,i}(v_{i}) \tag{11}$$

where $\phi_{j,i}$ represents the learnable activation function on the edge connecting the $i$-th input feature to the $j$-th output node. Each function $\phi_{j,i}$ is parameterized as a combination of a fixed base nonlinearity $b(\cdot)$ and a set of learnable B-spline basis functions $B_{m}(\cdot)$:

$$\phi_{j,i}(x) = w_{j,i}^{(b)} b(x) + \sum_{m=1}^{M} w_{j,i,m}^{(s)} B_{m}(x) \tag{12}$$

where $M$ is the number of B-spline basis functions, and $w^{(b)}$ and $w^{(s)}$ are learnable coefficients. This design provides a more flexible nonlinear mapping from deep features to recognition scores than a rigid linear projection.
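Equations (11) and (12) can be made concrete with a small pure-Python sketch of a single KAN layer. Several choices here are assumptions that the text does not specify: SiLU is used as the base nonlinearity $b(\cdot)$, the B-spline bases are cubic and evaluated on a uniform knot vector with the Cox-de Boor recursion, and the argument names (`w_base`, `w_spline`, `knots`) are hypothetical. The number of basis functions $M$ equals `len(knots) - degree - 1`.

```python
import math
from typing import List, Sequence


def silu(x: float) -> float:
    """Base nonlinearity b(x); SiLU is an assumption, the text only says 'fixed'."""
    return x / (1.0 + math.exp(-x))


def bspline_basis(x: float, knots: Sequence[float], degree: int) -> List[float]:
    """Values of all B-spline bases B_m(x) via the Cox-de Boor recursion."""
    # Degree 0: indicator functions on half-open knot intervals.
    basis = [1.0 if knots[i] <= x < knots[i + 1] else 0.0
             for i in range(len(knots) - 1)]
    for d in range(1, degree + 1):
        nxt = []
        for i in range(len(knots) - d - 1):
            left_den = knots[i + d] - knots[i]
            right_den = knots[i + d + 1] - knots[i + 1]
            left = ((x - knots[i]) / left_den * basis[i]) if left_den > 0 else 0.0
            right = ((knots[i + d + 1] - x) / right_den * basis[i + 1]
                     if right_den > 0 else 0.0)
            nxt.append(left + right)
        basis = nxt
    return basis  # length: len(knots) - degree - 1


def edge_function(x: float, w_base: float, w_spline: Sequence[float],
                  knots: Sequence[float], degree: int = 3) -> float:
    """Eq. 12: phi(x) = w_b * b(x) + sum_m w_s[m] * B_m(x)."""
    basis = bspline_basis(x, knots, degree)
    return w_base * silu(x) + sum(w * b for w, b in zip(w_spline, basis))


def kan_layer(v: Sequence[float], w_base, w_spline,
              knots: Sequence[float], degree: int = 3) -> List[float]:
    """Eq. 11: y_j = sum_i phi_{j,i}(v_i); nodes only sum the edge outputs."""
    return [sum(edge_function(x, w_base[j][i], w_spline[j][i], knots, degree)
                for i, x in enumerate(v))
            for j in range(len(w_base))]
```

With the uniform knots `[-5.0, -4.0, ..., 5.0]` and cubic splines, the bases sum to one for inputs in $[-2, 2)$, so setting every spline coefficient to 1 and every base weight to 0 yields an edge function that is constant on that interval. Inputs outside the knot range receive only the base-nonlinearity term, which is one reason the descriptor is usually normalized or the grid adapted in practice.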

Qualitative interpretability analysis

To provide qualitative interpretability analysis of the recognition model, Gradient-weighted Class Activation Mapping (Grad-CAM) was applied to visualize the image regions contributing most to ResKAN predictions.⁵² Heatmaps were generated from the final convolutional feature maps to support qualitative inspection of whether the model focused on fissure-related regions.

Dataset construction

Image source and acquisition setting

The tongue images used in this study were captured by clinicians using smartphones during routine diagnosis and treatment at Maoming Hospital of Traditional Chinese Medicine. The images were therefore acquired in routine clinical settings rather than with fully standardized imaging equipment or under tightly controlled acquisition conditions.

Annotation protocol

For tongue segmentation, 1012 images were manually annotated by experienced researchers at Guangzhou University of Chinese Medicine using the LabelMe tool. For each image, a closed polygon was drawn along the tongue boundary to delineate the tongue region of interest (ROI), as illustrated in Figure 7.

Figure 7.

Illustration of the tongue annotation process. (a) Original tongue image; (b) Manual polygon-based annotation of the tongue boundary; (c) Extracted tongue region after annotation.

Based on the extracted tongue ROIs, a subset of 684 samples was selected to construct the fissured-tongue feature recognition dataset, comprising 336 images with fissured-tongue features and 348 without fissured-tongue features. The labeling of fissured-tongue features was performed according to the Chinese National Standard GB/T 40665.1-2021. Three professional TCM physicians participated in a consensus-based labeling procedure. One physician first assigned the preliminary label, and the other two physicians then independently reviewed it. Samples with discordant opinions were excluded from the final classification dataset. The retained labels were further reviewed and confirmed by senior experts from Maoming Hospital of Traditional Chinese Medicine.

Data splitting

For each task, the corresponding dataset was randomly split into training, validation, and test sets at a ratio of 0.70:0.15:0.15, once for each random seed in {3, 33, 42}. The reported results were averaged across the three runs.
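A seeded 0.70:0.15:0.15 split of this kind can be sketched as follows. The function name and the rounding rule (training and validation sizes rounded down, with the remainder assigned to the test set) are assumptions made for illustration; the exact subset sizes used in this study may differ.

```python
import random
from typing import List, Tuple


def split_indices(n: int, seed: int,
                  ratios: Tuple[float, float, float] = (0.70, 0.15, 0.15)
                  ) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle indices 0..n-1 with a fixed seed and cut them into three parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # independent generator per seed
    n_train = int(n * ratios[0])      # rounded down (assumption)
    n_val = int(n * ratios[1])        # rounded down (assumption)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


# One split per seed; metrics would then be averaged over the three runs.
splits = {seed: split_indices(1012, seed) for seed in (3, 33, 42)}
```

Using a separate `random.Random(seed)` instance for each run keeps the splits reproducible and independent of any other use of the global random state.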

Experimental setup

Tongue segmentation experiments

Tongue segmentation was conducted on a workstation equipped with an NVIDIA Quadro RTX 5000 GPU, an Intel Xeon Silver 4210R CPU, and 128 GB RAM, running Ubuntu 22.04 with Python 3.10.14 and CUDA 12.1.

The hyperparameters adopted for the tongue segmentation model are summarized in Table 2. The initial learning rate was set to 0.0001, and the input images were resized to a uniform dimension of 448×448 pixels. To enhance model robustness and generalization, a comprehensive suite of data augmentation techniques was applied to the training set. This included diverse geometric manipulations, adjustments to image intensity and contrast, the introduction of Gaussian noise, the application of blurring and sharpening filters, and alterations in color space.

Table 2.

Hyperparameters for training tongue segmentation model.

Parameter	Value
Input Image Size	448 × 448
Initial Learning Rate	0.0001
Batch Size	4
Epochs	100
Number of Workers	8
Optimizer	Adam
Optimizer Betas	(0.9, 0.999)

Fissured-tongue feature recognition

The fissured-tongue recognition experiments were run on a separate machine equipped with an NVIDIA RTX 3090 GPU, an AMD EPYC 7K62 48-Core CPU, and 60 GB RAM, running Ubuntu 22.04 with Python 3.12.4 and CUDA 12.9.

Input tongue ROI images were resized to 448×448 pixels. We trained the classifier for 300 epochs using AdamW with an initial learning rate of 1e-4 and a batch size of 4. Cross-entropy loss was adopted for optimization. Table 3 summarizes the main settings.

Table 3.

Hyperparameters for training the fissured-tongue recognition model.

Parameter	Value
Input Image Size	448 × 448
Initial Learning Rate	0.0001
Batch Size	4
Epochs	300
Number of Workers	4
Optimizer	AdamW
Weight Decay	0.0001
Loss Function	Cross-Entropy

During training, we applied data augmentation including random horizontal flipping and color jittering (brightness/contrast/saturation), as well as random sharpness adjustment to improve robustness to illumination and appearance variations.

Evaluation metrics

To comprehensively evaluate the performance of our system, we employed distinct sets of standard metrics for the segmentation and recognition tasks, respectively.

Metrics for tongue segmentation

The performance of our tongue segmentation model was quantified using three widely adopted metrics. These metrics are calculated from the numbers of correctly and incorrectly classified pixels: True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).

• Intersection over Union (IoU): As a primary metric for segmentation, IoU measures the overlap ratio between the predicted masks and the ground truth for each class. It is defined as:

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN} \tag{13}$$

• Dice Coefficient (Dice): The Dice Coefficient is especially common in medical image segmentation. It measures the overlap between the predicted and ground truth regions and is sensitive to the completeness of the segmentation. It is calculated as:

$$\mathrm{Dice} = \frac{2 \times TP}{2 \times TP + FP + FN} \tag{14}$$

• Pixel Accuracy (PA): This metric provides an overall measure of correctness by calculating the percentage of correctly classified pixels in the entire image:

$$\mathrm{PA} = \frac{TP + TN}{TP + FP + TN + FN} \tag{15}$$
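The three segmentation metrics in Equations (13)–(15) can be computed from a pair of flattened binary masks as in the following sketch. The function names and the toy masks are illustrative; the sketch assumes that at least one pixel is predicted or labeled as tongue, so that the denominators are non-zero.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(pred: Sequence[int],
                     truth: Sequence[int]) -> Tuple[int, int, int, int]:
    """Count TP, FP, TN, FN over paired binary pixels (1 = tongue)."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return tp, fp, tn, fn


def segmentation_metrics(pred: Sequence[int],
                         truth: Sequence[int]) -> Dict[str, float]:
    """IoU (Eq. 13), Dice (Eq. 14), and pixel accuracy (Eq. 15)."""
    tp, fp, tn, fn = confusion_counts(pred, truth)
    return {
        "iou": tp / (tp + fp + fn),
        "dice": 2 * tp / (2 * tp + fp + fn),
        "pa": (tp + tn) / (tp + fp + tn + fn),
    }


# Toy example: six pixels, one false positive and one false negative.
scores = segmentation_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 1, 0, 0])
```

For a single mask, Dice and IoU are related by $\mathrm{Dice} = 2\,\mathrm{IoU} / (1 + \mathrm{IoU})$, so the two metrics rank individual predictions identically; their dataset-level means can nevertheless differ from this relation because averaging is performed per image.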

Metrics for fissured-tongue feature recognition

For the downstream fissured-tongue feature recognition task, the model’s performance was evaluated at the image level. We treat fissured tongue as the positive class and normal tongue as the negative class. Accordingly, TP, FP, TN, and FN are defined as follows:

• True Positive (TP): A fissured tongue image correctly identified as “fissured”.

• False Positive (FP): A normal tongue image incorrectly identified as “fissured”.

• True Negative (TN): A normal tongue image correctly identified as “normal”.

• False Negative (FN): A fissured tongue image incorrectly identified as “normal”.

Based on these counts, we report the following metrics:

• Accuracy (ACC): The overall proportion of correctly classified images.

$$\mathrm{ACC} = \frac{TP + TN}{TP + FP + TN + FN} \tag{16}$$

• Precision of the positive class ($P_{\mathrm{pos}}$): The reliability of fissured predictions.

$$P_{\mathrm{pos}} = \frac{TP}{TP + FP} \tag{17}$$

• Sensitivity (Sens.): The ability to correctly identify fissured-tongue images.

$$\mathrm{Sens} = \frac{TP}{TP + FN} \tag{18}$$

• Specificity (Spec.): The ability to correctly identify normal tongue images.

$$\mathrm{Spec} = \frac{TN}{TN + FP} \tag{19}$$

• F1-score of the positive class ($F1_{\mathrm{pos}}$): The harmonic mean of precision and sensitivity for the fissured class.

$$F1_{\mathrm{pos}} = 2 \times \frac{P_{\mathrm{pos}} \times \mathrm{Sens}}{P_{\mathrm{pos}} + \mathrm{Sens}} \tag{20}$$
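The image-level metrics in Equations (16)–(20) follow directly from the four confusion counts, as in the sketch below. The function name and the example counts are hypothetical and are not results from this study; the sketch assumes that every denominator is non-zero.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy, positive-class precision, sensitivity, specificity, and F1."""
    precision = tp / (tp + fp)      # Eq. 17
    sensitivity = tp / (tp + fn)    # Eq. 18
    specificity = tn / (tn + fp)    # Eq. 19
    return {
        "acc": (tp + tn) / (tp + fp + tn + fn),                            # Eq. 16
        "precision": precision,
        "sens": sensitivity,
        "spec": specificity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),    # Eq. 20
    }


# Hypothetical counts for illustration only.
report = classification_metrics(tp=8, fp=2, tn=9, fn=1)
```

The positive-class F1 score can equivalently be written as $2TP / (2TP + FP + FN)$, which makes explicit that true negatives do not contribute to it.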

Results

Tongue segmentation results

Comparison with representative models

To evaluate our model, we compared TSegNet with representative segmentation models, including U-Net, U-Net++, U2-Net, DeepLabV3+, TransUNet, and Swin-Unet.

Table 4 summarizes the quantitative results. TSegNet achieved the highest scores across all metrics, reaching a Dice score of 98.16%, an IoU of 96.42%, and a Pixel Accuracy of 98.31%. These results indicate that TSegNet provided the best overall segmentation performance among the compared models under the present experimental setting.

Table 4.

Comparative segmentation performance (mean ± std over three random seeds).

Model	Dice (%)	IoU (%)	PA (%)
U-Net⁵³	97.73±0.21	95.64±0.38	97.91±0.11
U-Net++⁵⁴	97.66±0.38	95.53±0.67	97.95±0.19
U2-Net⁵⁵	97.97±0.06	96.12±0.14	98.22±0.08
DeepLabV3+⁵⁶	97.92±0.09	95.99±0.15	98.20±0.06
TransUNet⁵⁷	97.88±0.10	95.96±0.18	98.13±0.10
Swin-Unet⁵⁸	97.24±0.15	94.92±0.35	97.69±0.13
TSegNet	98.16±0.10	96.42±0.18	98.31±0.08
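For reference, the three scores in Table 4 can be computed for one image from a predicted and a reference binary mask, as in the sketch below. It assumes the common definitions (Dice and IoU on the tongue class, PA as the fraction of correctly labelled pixels); the exact formulas and averaging used in this study are those given in the Methods, and the masks here are toy data.

```python
def segmentation_metrics(pred, target):
    """Dice, IoU and pixel accuracy for flat sequences of 0/1 pixel labels."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    pairs = list(zip(pred, target))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)  # tongue, predicted tongue
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)  # background, predicted tongue
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)  # tongue, predicted background
    tn = sum(1 for p, t in pairs if p == 0 and t == 0)  # background, predicted background
    dice = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    pa = (tp + tn) / len(pairs)
    return dice, iou, pa


# Toy 8-pixel masks (illustrative only).
pred = [0, 1, 1, 1, 0, 0, 1, 0]
target = [0, 1, 1, 0, 0, 0, 1, 1]
dice, iou, pa = segmentation_metrics(pred, target)
print(f"Dice={dice:.2f}  IoU={iou:.2f}  PA={pa:.2f}")
```

For a single image the two overlap scores are tied by IoU = Dice / (2 - Dice), so they rank models similarly; the relation need not hold exactly for means over many images.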

Figure 8 presents qualitative comparisons of tongue body segmentation results produced by different models, where panel (a) shows the input image and panels (b)-(h) show the predicted masks of U-Net, U-Net++, U²-Net, DeepLabV3+, TransUNet, Swin-Unet, and our TSegNet, respectively. The red overlay denotes the predicted tongue region.

Figure 8.

Comparative visualization of tongue segmentation. (a) Input image; (b) U-Net; (c) U-Net++; (d) U²-Net; (e) DeepLabV3+; (f) TransUNet; (g) Swin-Unet; (h) TSegNet. The red overlay denotes the region predicted as the tongue body by each model.

The first two rows present scenarios where the tongue target occupies a dominant proportion of the image. In these samples, the tongue body shares similar chromatic characteristics with the surrounding lips and facial skin, creating low-contrast boundaries. Under these conditions, some baselines produce less coherent masks: for instance, TransUNet (col. f) in row 1 shows locally inconsistent predictions within the tongue region, while U-Net (col. b) in row 2 exhibits slight over-segmentation near the mouth boundary.

The third row illustrates a small-target scenario with background interference. In this case, Swin-Unet (col. g) produces noticeable false positives on the red clothing region, and several methods show minor spurious activations around facial areas, indicating sensitivity to contextual distractors.

The last two rows further demonstrate challenges under unconstrained geometric and lighting conditions. The fourth row depicts an oblique viewing angle where shadows blur the distinction between the tongue and surrounding tissues. Multiple baselines show mislocalized predictions near the upper lip region. In the fifth row, the low color contrast between the tongue and lower lip leads to boundary leakage for some models.

In contrast, TSegNet generates more complete masks with cleaner and more consistent boundaries across these challenging scenarios, indicating improved robustness for real-world tongue image segmentation.

Ablation study of TSegNet

Effect of composite loss. We first evaluated the impact of different loss functions on TSegNet. As shown in Table 5, training TSegNet with only DSC loss or only CE loss already yields strong results, but the composite loss provides the best overall trade-off across the evaluated metrics: it attains the highest Dice and IoU, with PA within 0.01 percentage points of the CE-only setting. Qualitative results are shown in Figure 9. In these examples, training with CE or DSC alone occasionally produces small isolated false positives in non-tongue regions (e.g., near surrounding tissues or clothing) and slightly less consistent boundaries under challenging backgrounds, whereas the composite loss produces cleaner and more accurate tongue boundaries. These results suggest that combining DSC and CE loss offers a better optimization trade-off for the present segmentation task. Consequently, we adopt the composite loss for all subsequent experiments.
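The idea of a composite objective can be sketched as a weighted sum of a soft Dice term and a cross-entropy term. The sketch below is an assumption-laden illustration, not the training code of this study: it uses a binary (tongue versus background) formulation on per-pixel probabilities, and the weights `w_dice` and `w_ce` are hypothetical equal weights.

```python
import math


def dice_loss(probs, targets, eps=1e-6):
    """Soft Dice loss: 1 - 2|P.T| / (|P| + |T|), with eps for stability."""
    inter = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * inter + eps) / (denom + eps)


def ce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over pixels; probabilities are clamped."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def composite_loss(probs, targets, w_dice=0.5, w_ce=0.5):
    """Weighted sum of the two terms; the weights are hypothetical."""
    return w_dice * dice_loss(probs, targets) + w_ce * ce_loss(probs, targets)


targets = [1, 1, 0, 0]
good = composite_loss([0.9, 0.8, 0.2, 0.1], targets)  # mostly correct
bad = composite_loss([0.1, 0.2, 0.8, 0.9], targets)   # mostly wrong
print(f"good={good:.3f}  bad={bad:.3f}")
```

The Dice term responds to region overlap as a whole, while the cross-entropy term penalizes each pixel independently, which is one plausible reason the combination behaves more stably than either term alone.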

Table 5.

Ablation study on loss functions for TSegNet (mean ± std over three random seeds).

Loss function	Dice (%)	IoU (%)	PA (%)
DSC only	98.02±0.02	96.18±0.01	98.22±0.01
CE only	98.10±0.02	96.34±0.02	98.32±0.03
DSC+CE	98.16±0.10	96.42±0.18	98.31±0.08

Abbreviations: DSC, Dice loss; CE, cross-entropy loss.

Figure 9.

Qualitative comparison of segmentation results using different loss functions. (a) Input image; (b) Ground truth; (c) DSC loss only; (d) CE loss only; (e) Composite loss. The red overlay denotes the predicted tongue region.

Effect of attention modules. We further examine attention choices within ResEMA by replacing EMA with Coordinate Attention (CA⁵⁹), Efficient Channel Attention (ECA⁶⁰), and CBAM.⁴⁸ The results (Table 6) show that EMA attains the highest mean score on all three metrics in our application, although the margins are small (0.06 to 0.13 percentage points in Dice) and comparable to the seed-to-seed standard deviation.

Table 6.

Ablation study on attention mechanisms for TSegNet (mean ± std over three random seeds).

Attention	Dice (%)	IoU (%)	PA (%)
CA	98.03 ± 0.18	96.23 ± 0.29	98.21 ± 0.14
CBAM	98.10 ± 0.13	96.32 ± 0.23	98.25 ± 0.11
ECA	98.10 ± 0.07	96.32 ± 0.12	98.25 ± 0.07
EMA	98.16 ± 0.10	96.42 ± 0.18	98.31 ± 0.08

Effect of ResEMA and CSAF-Gate. We quantify the contribution of ResEMA and CSAF-Gate in TSegNet by disabling each component while keeping all other settings unchanged. Specifically, we evaluate two variants: removing the attention operation inside ResEMA (w/o ResEMA-attn), and removing CSAF-Gate from the skip connection (w/o CSAF-Gate).

As shown in Table 7, removing the attention operation inside ResEMA results in the largest performance degradation (Dice 97.94%, IoU 96.04%, PA 98.01%), indicating that deep-stage attention plays an important role in suppressing non-tongue distractions. In contrast, removing CSAF-Gate yields only a marginal drop (Dice 98.15%, IoU 96.41%, PA 98.29%), which lies within one standard deviation of the full model. The full TSegNet with ResEMA and CSAF-Gate achieves the best overall performance (Dice 98.16%, IoU 96.42%, PA 98.31%), suggesting that the two components are complementary, with ResEMA attention contributing most of the gain.

Table 7.

Ablation study on TSegNet architecture (mean ± std over three random seeds).

Model variant	Dice (%)	IoU (%)	PA (%)
w/o ResEMA-attn	97.94 ± 0.10	96.04 ± 0.15	98.01 ± 0.07
w/o CSAF-Gate	98.15 ± 0.06	96.41 ± 0.12	98.29 ± 0.10
Full TSegNet	98.16 ± 0.10	96.42 ± 0.18	98.31 ± 0.08

Abbreviations: ResEMA-attn, residual modulation with deep-stage attention enhancement; CSAF-Gate, channel-spatial attention fusion gate; w/o, without.

Fissured-tongue feature recognition results

Quantitative performance comparison

Table 8 compares fissured-tongue classification performance across several deep learning networks, including MobileNetV3, VGG16, ResNet-50, ConvNeXt, ResNeXt-50, DenseNet121, and our proposed ResKAN.

Table 8.

Fissured-tongue feature recognition performance (mean ± std over three random seeds).

Model	ACC (%)	Ppos (%)	Sens. (%)	Spec. (%)
MobileNetV3⁶¹	90.85 ± 2.45	91.26 ± 2.36	90.00 ± 4.32	91.67 ± 2.40
VGG16⁶²	86.93 ± 3.61	83.12 ± 5.19	92.67 ± 3.40	81.41 ± 7.25
ResNet-50	86.60 ± 4.55	84.32 ± 4.64	89.33 ± 5.25	83.97 ± 4.80
ConvNeXt⁶³	91.18 ± 1.60	91.40 ± 6.15	91.33 ± 4.98	91.03 ± 6.54
ResNeXt-50⁶⁴	87.25 ± 3.48	88.35 ± 4.14	90.67 ± 3.12	88.79 ± 2.45
DenseNet121⁶⁵	85.29 ± 2.12	88.38 ± 8.26	82.00 ± 4.89	88.46 ± 8.31
ResKAN	92.48 ± 1.22	92.05 ± 0.20	92.67 ± 2.49	92.31 ± 0.00
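The "mean ± std over three random seeds" entries can be produced as in the sketch below. The per-seed values are hypothetical (chosen only to be consistent with a mean of 92.67), and the source does not state whether the population or the sample form of the standard deviation was used, so both are shown.

```python
from statistics import mean, pstdev, stdev

# Hypothetical per-seed sensitivities in percent (not reported in the paper).
per_seed_sens = [90.0, 92.0, 96.0]

# Population form divides by n; sample form divides by n - 1.
summary_population = f"{mean(per_seed_sens):.2f} ± {pstdev(per_seed_sens):.2f}"
summary_sample = f"{mean(per_seed_sens):.2f} ± {stdev(per_seed_sens):.2f}"

print("population std:", summary_population)
print("sample std:    ", summary_sample)
```

With only three runs the two conventions differ by a factor of about 1.22, so the choice matters when comparing spreads across papers.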

Overall, ResKAN achieves the highest accuracy (92.48%) and F1pos (92.34%), while maintaining a favorable balance between sensitivity (92.67%) and specificity (92.31%). Compared with ConvNeXt, the most accurate baseline, ResKAN improves accuracy by 1.30 percentage points and F1pos by 1.29 percentage points, suggesting stronger discrimination of fissure-related patterns. Notably, VGG16 attains an equally high sensitivity (92.67%) but a substantially lower specificity (81.41%), indicating a tendency to over-predict the positive (fissured) class and thus produce more false alarms. In contrast, ResKAN keeps both sensitivity and specificity above 92%. Relative to ResNet-50, ResKAN improves accuracy and F1pos by 5.88 and 5.62 percentage points, respectively. These results support the effectiveness of the proposed ResKAN for fissured-tongue recognition on the present dataset.

Error analysis and threshold-independent evaluation

To gain deeper insight into the error patterns of different models, we present the pooled, row-normalized confusion matrices across the three test splits. As shown in Figure 10, the matrices reveal two typical error modes. The first is a false alarm (FP), where a normal image is incorrectly predicted as fissured. The second is a missed detection (FN), where a fissured image is incorrectly predicted as normal. VGG16 shows a noticeably higher FP rate, misclassifying 29 normal samples (18.6%) as fissured, which is consistent with its lower specificity. Conversely, DenseNet121 suffers from a relatively high FN rate, missing 27 fissured cases (18.0%). ResKAN mitigates both error types, reducing FP to 12 (7.7%) and FN to 11 (7.3%), suggesting a more balanced behavior between sensitivity and specificity.
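Pooling and row normalization can be sketched as follows. The three per-split matrices are hypothetical; they were chosen only so that their sum matches the pooled ResKAN counts quoted above (FP = 12, FN = 11), with class totals of 156 normal and 150 fissured images inferred from the reported percentages.

```python
def pool_confusion(matrices):
    """Sum 2x2 confusion matrices element-wise across splits."""
    pooled = [[0, 0], [0, 0]]
    for m in matrices:
        for i in range(2):
            for j in range(2):
                pooled[i][j] += m[i][j]
    return pooled


def row_normalize(matrix):
    """Divide each row by its total, i.e. by the size of the true class."""
    return [[count / sum(row) for count in row] for row in matrix]


# Rows: true normal, true fissured. Columns: predicted normal, predicted fissured.
# Hypothetical per-split counts (the paper reports only pooled values).
per_split = [
    [[48, 4], [5, 45]],
    [[48, 4], [4, 46]],
    [[48, 4], [2, 48]],
]

pooled = pool_confusion(per_split)
rates = row_normalize(pooled)
print("pooled counts:", pooled)
print(f"FP rate: {100 * rates[0][1]:.1f}%  FN rate: {100 * rates[1][0]:.1f}%")
```

Row normalization makes the off-diagonal cells directly readable as 1 - specificity (first row) and 1 - sensitivity (second row).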

Figure 10.

Comparison of pooled confusion matrices. (a) MobileNetV3; (b) VGG16; (c) ResNet-50; (d) ConvNeXt; (e) ResNeXt-50; (f) DenseNet121; (g) ResKAN. For each model, confusion counts are aggregated over three random splits by summation, and each row is normalized by the number of samples in the corresponding true class. Each cell reports the pooled count and the corresponding percentage.

We further evaluate the threshold-independent discrimination capability using pooled receiver operating characteristic (ROC) curves in Figure 11. The ROC curve plots the trade-off between the true positive rate (TPR) and false positive rate (FPR) as the decision threshold varies. The Area Under the Curve (AUC) provides a scalar summary and can be interpreted as the probability that the model ranks a randomly chosen fissured sample higher than a randomly chosen normal sample. ResKAN achieves the highest AUC (0.968), indicating superior overall discrimination capability across various classification thresholds. ConvNeXt follows with an AUC of 0.955, while DenseNet121 lags at 0.900. The ROC-AUC results suggest that ResKAN’s performance advantage is robust and not dependent on a specific operating point.
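The ranking interpretation of AUC can be checked directly by counting pairs, as in this minimal sketch. The function name `pairwise_auc` and the scores are illustrative assumptions; ties are counted as half a win, which matches the usual Mann-Whitney formulation.

```python
def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (fissured, normal) pairs in which the fissured image
    receives the higher score; ties count as 0.5."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical predicted probabilities of the fissured class.
fissured_scores = [0.9, 0.8, 0.4]
normal_scores = [0.7, 0.3, 0.2]
auc = pairwise_auc(fissured_scores, normal_scores)
print(f"AUC = {auc:.3f}")
```

Only the ordering of the scores enters this computation, which is why AUC does not depend on any particular decision threshold.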

Figure 11.

Comparison of ROC curves and AUC scores. The curves illustrate the trade-off between TPR and FPR across varying decision thresholds, and the AUC for each model is listed in the legend.

Grad-CAM visualization results

Grad-CAM heatmaps were used to qualitatively examine the image regions contributing to ResKAN predictions (Figure 12). In the generated heatmaps, warmer colors indicate greater contribution to the predicted class, whereas cooler colors indicate lower relevance. The visualizations show that ResKAN tends to focus on the fissure regions, including cases with varying illumination or subtle fissure presentations. These qualitative results suggest that ResKAN primarily relies on texture cues consistent with fissured-tongue features.
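The core Grad-CAM combination step can be sketched without a deep learning framework: each feature map is weighted by the spatial mean of its gradient, the weighted maps are summed, and negative values are clipped. The sketch assumes the activations and gradients of the chosen convolutional layer have already been extracted; the tiny arrays are illustrative, and upsampling the map to the input resolution is omitted.

```python
def grad_cam(activations, gradients):
    """activations, gradients: nested lists of shape [K][H][W].
    Returns an H x W heatmap scaled to [0, 1]."""
    k = len(activations)
    h = len(activations[0])
    w = len(activations[0][0])
    # Channel weights: global average pooling of the gradients.
    weights = [sum(sum(row) for row in gradients[c]) / (h * w) for c in range(k)]
    # Weighted sum of feature maps followed by ReLU.
    cam = [[max(0.0, sum(weights[c] * activations[c][i][j] for c in range(k)))
            for j in range(w)] for i in range(h)]
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Two hypothetical 2x2 feature maps and their gradients w.r.t. the class score.
activations = [[[1.0, 0.0], [0.0, 2.0]],
               [[0.0, 3.0], [1.0, 0.0]]]
gradients = [[[0.5, 0.5], [0.5, 0.5]],
             [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(activations, gradients)
print(heatmap)
```

The second channel has a negative weight, so locations where only that channel is active are suppressed by the ReLU and appear as cool regions in the heatmap.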

Figure 12.

Grad-CAM visualization of ResKAN for fissured-tongue feature recognition. Top row: input ROI images. Bottom row: Grad-CAM heatmaps, which highlight the regions of the input image that most influenced the prediction.

Integrated prototype demonstration

Figure 13 presents an end-to-end demonstration of the integrated prototype workflow. The prototype supports guided tongue-image acquisition at the terminal, followed by segmentation-based quality gating and server-side analysis. After image submission, TSegNet is first used to determine whether a valid tongue region can be extracted. Cases with a valid segmentation result proceed to the analysis dashboard for ROI visualization and fissured-tongue recognition, whereas unsuccessful cases are intercepted and returned with a recapture prompt. For valid cases, the system further provides user-facing feedback based on the predicted fissured-tongue feature status and confidence level.

Figure 13.

End-to-end demonstration of the integrated prototype workflow. (a) Guided image acquisition. The user completes profile setup and captures a tongue image through the guided interface; (b) Segmentation-based quality gate. TSegNet is used to verify whether a valid tongue region can be extracted. (1) If a valid tongue region is obtained, the system proceeds to the analysis dashboard, where the acquired image, segmented ROI, and model outputs are displayed. (2) If no valid tongue region is detected, the inference is interrupted and a recapture prompt is returned; (c) Confidence-stratified user feedback for valid cases. After valid segmentation, the ResKAN module performs fissured-tongue feature recognition and returns user-facing feedback with different confidence levels, including (1) feature-oriented feedback for high-confidence cases, (2) more cautious feedback for moderate-confidence cases, and (3) uncertainty-aware feedback for low-confidence or normal cases.
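The gating and feedback logic of this workflow can be sketched as below. Everything specific in the sketch is a hypothetical assumption: the source does not state how a "valid tongue region" is decided or where the confidence tiers are cut, so the area-fraction criterion, the thresholds `min_area_fraction`, `high` and `moderate`, and the function names are placeholders for illustration only.

```python
def quality_gate(mask, min_area_fraction=0.05):
    """Accept the capture if the predicted tongue mask covers at least a
    minimum fraction of the image (hypothetical validity criterion)."""
    pixels = [v for row in mask for v in row]
    return sum(pixels) / len(pixels) >= min_area_fraction


def feedback_tier(prob_fissured, high=0.9, moderate=0.7):
    """Map the predicted fissured probability to a feedback tier
    (hypothetical thresholds)."""
    if prob_fissured >= high:
        return "feature-oriented feedback"
    if prob_fissured >= moderate:
        return "cautious feedback"
    return "uncertainty-aware feedback"


def run_pipeline(mask, prob_fissured):
    """Segmentation gate first; classification feedback only for valid masks."""
    if not quality_gate(mask):
        return "recapture prompt"
    return feedback_tier(prob_fissured)


valid_mask = [[0, 1, 1], [0, 1, 1], [0, 0, 0]]
empty_mask = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
print(run_pipeline(valid_mask, 0.95))
print(run_pipeline(valid_mask, 0.75))
print(run_pipeline(empty_mask, 0.95))
```

Placing the gate before the classifier means that a failed segmentation never reaches the recognition model, so no feedback is generated from an image without a usable ROI.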

Discussion

Principal findings

This study developed an integrated portable system for non-contact tongue imaging and automated analysis in community and home settings. By combining a guided acquisition terminal with server-side deep learning models, the proposed framework supports image capture, tongue ROI extraction, fissured-tongue feature recognition, and user-facing feedback within a unified workflow. Taken together, these findings suggest that portable tongue-image analysis is technically feasible under relatively unconstrained acquisition conditions.

Interpretation of tongue segmentation performance

A notable finding of this study is the strong segmentation performance achieved by TSegNet on tongue images acquired in routine clinical settings rather than under highly standardized imaging conditions. Reliable tongue ROI extraction is an essential prerequisite for downstream analysis, particularly in portable deployment scenarios where background interference, pose variation, and illumination inconsistency may be more common. In this context, the quantitative and qualitative results suggest that the proposed segmentation strategy can provide a stable basis for subsequent analysis of tongue images acquired outside tightly controlled laboratory settings.

Interpretation of fissured-tongue feature recognition

The recognition results suggest that ResKAN is effective for the current recognition task targeting fissured-tongue features. Compared with the other evaluated models, ResKAN achieved the best overall mean performance and showed a more balanced error pattern in terms of false positives and false negatives. This may be related to the combination of residual feature extraction and KAN-based nonlinear modeling, which may help capture the subtle local texture variations associated with fissured-tongue features. In addition, the Grad-CAM visualizations provided qualitative support that ResKAN tended to attend to fissure-related regions, including cases with varying illumination and relatively subtle fissure presentations.

From a practical perspective, fissured tongue is a visually recognizable morphological sign and therefore provides a reasonable initial target for automated tongue-image analysis in portable deployment scenarios. Focusing first on a single and relatively explicit visual feature also allowed the feasibility of the overall workflow to be evaluated in a more controlled manner.

Limitations and future work

Despite these encouraging results, several limitations should be acknowledged. First, the usability and illumination robustness of the acquisition terminal require further quantitative validation across broader user demographics and diverse environments. Second, the current framework was limited to fissured-tongue feature recognition and did not jointly consider other tongue signs relevant to broader tongue analysis. In addition, the present model did not further characterize fissure-related properties such as severity, depth, number, or spatial distribution. Finally, several subject-related and acquisition-related factors may affect the visual appearance of tongue fissures, including hydration status, mouth breathing, smoking, oral hygiene, nutritional status, fungal infection, illumination, and tongue posture. These factors were not systematically controlled or analyzed in the present study. Accordingly, the current system should be interpreted as providing preliminary automated recognition of fissured-tongue imaging features rather than an independent clinical diagnosis.

Future work will therefore focus on broader real-world validation of the portable system and on extending the analytical scope of the framework. In particular, additional tongue signs, such as tongue body color, coating characteristics, and moisture, may be incorporated into a more comprehensive analysis pipeline. We will also investigate fissure-related morphology in greater detail, including finer-grained characterization of fissure severity and structural patterns through severity grading and quantitative morphological analysis.

Conclusion

This paper presents a portable non-contact tongue imaging system that integrates guided image acquisition with server-side automated analysis for community and home settings. The proposed framework combines TSegNet for tongue ROI extraction and ResKAN for fissured-tongue feature recognition, and the experimental results support the feasibility of this integrated approach under relatively unconstrained acquisition conditions. These findings provide an initial step toward more accessible and deployable automated tongue-image analysis.

Footnotes

Acknowledgements

The authors acknowledge Guangdong Polytechnic Normal University for providing the research platform and resources, and Guangzhou University of Chinese Medicine for its crucial data support.

Ethical considerations

This study was approved by the Ethics Committee of Maoming Hospital of Traditional Chinese Medicine (approval number: 2024061301). All participants provided informed consent before participating in the study.

Author contributions

The authors confirm contribution to the article as follows: Shaoyang Men, Peipei Zhou and Jiehan Wei did conceptualization; Jiehan Wei, Jun Song and Weiliang Lu did methodology; Jiehan Wei did software and writing (original draft); Jiehan Wei, Shaoyang Men, Peipei Zhou, Chuangquan Lin and Jun Song did writing (review and editing); Shaoyang Men and Peipei Zhou did supervision; Shaoyang Men did project administration; Peipei Zhou did funding acquisition. All authors have read and agreed to the published version of the manuscript.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (NSFC) (Grant Nos. 82575258 and T2341009).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The data that support the findings of this study are not publicly available due to ethical and privacy restrictions, but are available from the corresponding author upon reasonable request.

References

1. Zhang, Huang, Gao, et al. Deep sparse transfer learning for remote smart tongue diagnosis. In: Mathematical biosciences and engineering: MBE. AIMS Press, 2021. Epub ahead of print 2021. https://doi.org/10.3934/mbe.2021063

2. Wang, Yan, Guo, et al. All around suboptimal health: a joint position paper of the suboptimal health study consortium and european association for predictive, preventive and personalised medicine. EPMA Journal 2021; 12: 403–433. https://doi.org/10.1007/s13167-021-00253-2

3. Tsang, Huang, Koehler. Integration of Chinese medicine and western medicine in clinical practice (patient care): Past, present, and a proposed model for the future. Chinese journal of integrative medicine 2013; 19: 83–85. https://doi.org/10.1007/s11655-013-1350-9

4. Jiang, Q-Y, Zheng, et al. Constitution of traditional Chinese medicine and related factors in women of childbearing age. Journal of the Chinese Medical Association 2018; 81: 358–365. https://doi.org/10.1016/j.jcma.2018.01.005

5. Zhao, Guo, Fan, et al. Medical conditions and preference of traditional Chinese medicine: Results from the China healthcare improvement evaluation survey. Patient preference and adherence 2023; 17: 227–237. https://doi.org/10.2147/PPA.S398644

6. Liu, Yang, et al. A survey of artificial intelligence in tongue image for disease diagnosis and syndrome differentiation. Digital Health 2023; 9: 20552076231191044. https://doi.org/10.1177/20552076231191044

7. Zhou, Zhang. Constitution identification of tongue image based on CNN. In: 2018 11th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Beijing, China, 13–15 October 2018, pp. 1–5. IEEE.

8. Zuo, Wang, et al. TonguExpert: A deep learning-based algorithm platform for fine-grained extraction and classification of tongue phenotypes. Phenomics 2025; 5: 1–14. https://doi.org/10.1007/s43657-024-00210-9

9. Yan, Jiang. Classification of fissured tongue images using deep neural networks. Technology and Health Care 2022; 30: 271–283. https://doi.org/10.3233/THC-228026

10. Wen, Wang, et al. Complexity perception classification method for tongue constitution recognition. Artificial intelligence in medicine 2019; 96: 123–133. https://doi.org/10.1016/j.artmed.2019.03.008

11. Liu, Wang, et al. The relationship between abnormal tongue features and non-malignant upper gastrointestinal disorders: A hospital-based cross-sectional study. European Journal of Integrative Medicine 2021; 47: 101379. https://doi.org/10.1016/j.eujim.2021.101379

12. Chen, W-J, et al. The study on the agreement between automatic tongue diagnosis system and traditional Chinese medicine practitioners. In: Evidence-based complementary and alternative medicine. Hindawi Publishing Corporation, 2012. Epub ahead of print 2012. https://doi.org/10.1155/2012/505063

13. Wang, Liu, et al. Artificial intelligence in tongue diagnosis: Using deep convolutional neural network for recognizing unhealthy tongue with tooth-mark. Computational and structural biotechnology journal 2020; 18: 973–980. https://doi.org/10.1016/j.csbj.2020.04.002

14. Segawa, Iizuka, Ogihara, et al. Objective evaluation of tongue diagnosis ability using a tongue diagnosis e-learning/e-assessment system based on a standardized tongue image database. Frontiers in Medical Technology 2023; 5: 1050909. https://doi.org/10.3389/fmedt.2023.1050909

15. Jiamsanguanwong, Luo, Kaingam, et al. Redesigning telemedicine for traditional Chinese medicine: Service design approach to digital transformation. JMIR Human Factors 2025; 12: e76752. https://doi.org/10.2196/76752

16. Ezeamii, Okobi, Wambai-Sani, et al. Revolutionizing healthcare: How telemedicine is improving patient outcomes and expanding access to care. Cureus 2024; 16: e63881. https://doi.org/10.7759/cureus.63881

17. X-Z, Liu, Lin, X-X, et al. An AI-powered tongue image model for home-based monitoring of liver fibrosis. npj Digital Medicine 2025; 9: 67.

18. Lin, Ning, Zhang, et al. Computerized tongue image analysis for non-invasive disease screening: A review. Chinese Medicine 2025; 20: 196. https://doi.org/10.1186/s13020-025-01242-7

19. Xie, Jing, Zhang, et al. Digital tongue image analyses for health assessment. Medical Review 2021; 1: 172–198. https://doi.org/10.1515/mr-2021-0018

20. Zhang. Tongue images acquisition system design. In: Zhang, Zhang (eds) Tongue Image Analysis. Springer, 2017, pp. 19–44.

21. Tania, Lwin, Hossain. Advances in automated tongue diagnosis techniques. Integrative Medicine Research 2019; 8: 42–56. https://doi.org/10.1016/j.imr.2018.03.001

22. Qiu, Zhang, Wan, et al. A novel tongue feature extraction method on mobile devices. Biomedical Signal Processing and Control 2023; 80: 104271. https://doi.org/10.1016/j.bspc.2022.104271

23. M-C, Lan, K-C, Fang, W-C, et al. Automated tongue diagnosis on the smartphone and its applications. Computer Methods and Programs in Biomedicine 2019; 174: 51–64. https://doi.org/10.1016/j.cmpb.2017.12.029

24. M-C, Cheng, M-H, Lan, K-C. Color correction parameter estimation on the smartphone and its application to automatic tongue diagnosis. Journal of medical systems 2016; 40: 18. https://doi.org/10.1007/s10916-015-0387-z

25. Xian, Xie, Yang, et al. Automatic tongue image quality assessment using a multi-task deep learning model. Frontiers in Physiology 2022; 13: 966214. https://doi.org/10.3389/fphys.2022.966214

26. Cajita, Hodgson, Lam, et al. Facilitators of and barriers to mHealth adoption in older adults with heart failure. CIN: Computers, Informatics, Nursing 2018; 36: 376–382. https://doi.org/10.1097/CIN.0000000000000442

27. Murabito, Faro, Zhang, et al. Smartphone app designed to collect health information in older adults: Usability study. JMIR Human Factors 2024; 11: e56653. https://doi.org/10.2196/56653

28. Wilson, Byrne, Rodgers, et al. A systematic review of smartphone and tablet use by older adults with and without cognitive impairment. Innovation in Aging 2022; 6: igac002. https://doi.org/10.1093/geroni/igac002

29. Tian, Liu, et al. Tongue image segmentation algorithm based on deep convolutional neural network and attention mechanism. Journal of Intelligent & Fuzzy Systems 2023; 45: 1473–1480. https://doi.org/10.3233/jifs-221411

30. Ning, Zhang, et al. Automatic tongue image segmentation based on gradient vector flow and region merging. Neural Computing and Applications 2012; 21: 1819–1826. https://doi.org/10.1007/s00521-010-0484-3

31. Zhang, Wang, You, et al. Tongue color analysis for medical application. Evidence-Based Complementary and Alternative Medicine 2013; 2013: 264742. https://doi.org/10.1155/2013/264742

32. Liu, et al. Tongue image segmentation via thresholding and gray projection. KSII Transactions on Internet and Information Systems (TIIS) 2019; 13: 945–961.

33. Zhang. Robust tongue segmentation by fusing region-based and edge-based approaches. Expert Systems with Applications 2015; 42: 8027–8038. https://doi.org/10.1016/j.eswa.2015.06.032

34. Zhou, Fan. TongueNet: Accurate localization and segmentation for tongue images using deep neural networks. IEEE Access 2019; 7: 148779–148789. https://doi.org/10.1109/access.2019.2946681

35. Huang, Lai, Wang. TU-Net: A precise network for tongue segmentation. In: Proceedings of the 2020 9th International Conference on Computing and Pattern Recognition, New York, NY, USA, 30 October–1 November 2020, pp. 244–249. Association for Computing Machinery.

36. Tang, Tan, et al. RTC_TongueNet: An improved tongue image segmentation model based on DeepLabV3. Digital Health 2024; 10: 20552076241242773. https://doi.org/10.1177/20552076241242773

37. Zhang, Zhao. Hyperspectral-cube-based mobile face recognition: A comprehensive review. Information Fusion 2021; 74: 132–150. https://doi.org/10.1016/j.inffus.2021.04.003

38. Huang, Zhang, et al. Attention guided tongue segmentation with geometric knowledge in complex environments. Biomedical Signal Processing and Control 2025; 104: 107426. https://doi.org/10.1016/j.bspc.2024.107426

39. Yang, Wang, Liew, AW-C. Fine-grained lip image segmentation using fuzzy logic and graph reasoning. IEEE Transactions on Fuzzy Systems 2023; 32: 349–359. https://doi.org/10.1109/tfuzz.2023.3298323

40. Balasubramaniyan, Jeyakumar, Nachimuthu. Panoramic tongue imaging and deep convolutional machine learning model for diabetes diagnosis in humans. Scientific Reports 2022; 12: 186. https://doi.org/10.1038/s41598-021-03879-4

41. Liu, Wang, Vaidya, et al. KAN: Kolmogorov-Arnold networks. In: International Conference on Learning Representations (ICLR 2025), Singapore, 24–28 April 2025.

42. Wang, Dong, Zhang. KAN-HyperMP: An enhanced fault diagnosis model for rolling bearings in noisy environments. Sensors 2024; 24: 6448. https://doi.org/10.3390/s24196448

43. Al-qaness, MAA. Tcnn-Kan: Optimized CNN by kolmogorov-arnold network and pruning techniques for sEMG gesture recognition. IEEE Journal of Biomedical and Health Informatics 2025; 29: 188–197.

44. Zheng, Chen, Liu, et al. Milling cutter wear state identification method based on improved ResNet-34 algorithm. Applied Sciences 2024; 14: 8951. https://doi.org/10.3390/app14198951

45. Kumar, Singh, Sharma, et al. Hybrid CNN–KAN models for benchmark image classification. In: International conference on intelligent vision and computing, Agartala, India, 23–24 November 2024, pp. 157–166. Springer.

46. Zhang, Ren, et al. Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016, pp. 770–778. IEEE.

47. Ouyang, Zhang, et al. Efficient multi-scale attention module with cross-spatial learning. In: ICASSP 2023-2023 IEEE international conference on acoustics, speech and signal processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023, pp. 1–5. IEEE.

48. Woo, Park, Lee, J-Y, et al. CBAM: Convolutional block attention module. In: Proceedings of the European conference on computer vision (ECCV), Munich, Germany, 8–14 September 2018, pp. 3–19.

49. Oktay, Schlemper, Folgoc, et al. Attention U-Net: Learning where to look for the pancreas. In: Medical Imaging with Deep Learning (MIDL 2018), Amsterdam, the Netherlands, 4–6 July 2018.

50. Sudre, Vercauteren, et al. Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In: Cardoso, Arbel, Carneiro, et al. (eds) Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer International Publishing, 2017, pp. 240–248.

51. Milletari, Navab, Ahmadi, S-A. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In: 2016 Fourth International Conference on 3D Vision (3DV), Stanford, CA, USA, 25–28 October 2016, pp. 565–571.

52. Selvaraju, Cogswell, Das, et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In: IEEE International Conference on Computer Vision (ICCV 2017), Venice, Italy, 22–29 October 2017, pp. 618–626. IEEE Computer Society.

53. Ronneberger, Fischer, Brox. U-Net: Convolutional networks for biomedical image segmentation. In: Navab, Hornegger, Wells, et al. (eds) Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015. Springer International Publishing, pp. 234–241.

54. Zhou, Siddiquee, MMR, Tajbakhsh, et al. UNet++: Redesigning skip connections to exploit multiscale features in image segmentation. IEEE Transactions on Medical Imaging 2020; 39: 1856–1867. https://doi.org/10.1109/TMI.2019.2959609

55. Qin, Zhang, Huang, et al. U2-Net: Going deeper with nested U-structure for salient object detection. Pattern Recognition 2020; 106: 107404. https://doi.org/10.1016/j.patcog.2020.107404

56. Chen, L-C, Zhu, Papandreou, et al. Encoder-decoder with atrous separable convolution for semantic image segmentation. In: Ferrari, Hebert, Sminchisescu, et al. (eds) Computer Vision – ECCV 2018. Springer International Publishing, 2018, pp. 833–851.

57. Chen, et al. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021. https://doi.org/10.48550/arXiv.2102.04306

58. Cao, Wang, Chen, et al. Swin-Unet: Unet-like pure transformer for medical image segmentation. In: ECCV 2022 Workshops, Tel Aviv, Israel, 2023, pp. 205–218. https://doi.org/10.1007/978-3-031-25066-8_9

59. Hou, Zhou, Feng. Coordinate attention for efficient mobile network design. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Virtual, 19–25 June 2021, pp. 13713–13722.

60. Wang, Zhu, et al. ECA-Net: Efficient channel attention for deep convolutional neural networks. Epub ahead of print 7 April 2020. https://doi.org/10.48550/arXiv.1910.03151

61. Howard, Sandler, Chu, et al. Searching for MobileNetV3. In: Proceedings of the IEEE/CVF international conference on computer vision, Seoul, Korea, 27 October–2 November 2019, pp. 1314–1324.

62. Simonyan, Zisserman. Very deep convolutional networks for large-scale image recognition. In: International Conference on Learning Representations 2015 (ICLR 2015), San Diego, CA, USA, 7–9 May 2015, pp. 1–14.

63. Liu, Mao, C-Y, et al. A ConvNet for the 2020s. 2022. https://arxiv.org/abs/2201.03545

64. Xie, Girshick, Dollár, et al. Aggregated residual transformations for deep neural networks. 2017. https://arxiv.org/abs/1611.05431

65. Huang, Liu, van der Maaten, et al. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017, pp. 2261–2269. https://doi.org/10.1109/CVPR.2017.243