SegNet unveiled: Robust image segmentation via rigorous K-fold cross-validation analysis

Abstract

Background

In computer vision, image segmentation is crucial with applications ranging from autonomous driving to medical imaging.

Objective

To provide reliable segmentation across varied datasets, this study assesses the performance of an image segmentation model based on SegNet.

Method

The SegNet model is validated using K-fold cross-validation with five folds. Intersection over Union (IOU), Dice coefficient, precision, recall, accuracy, and loss are measured to assess how well the model performs and how well it is optimized during training.

Results

The SegNet model consistently performs well throughout the folds, with Dice Coefficient values ranging from 88.32% to 89.8% and IOU scores ranging from 94.53% to 95.05%. The model's dependability is confirmed by metrics like precision, recall, and accuracy, all of which often exceed 90%. Loss values between 0.495 and 0.547 show that training optimized the system effectively.

Conclusion

By making validation more reliable, K-fold cross-validation shows how well the SegNet model segments objects in images across a range of datasets. These outcomes strengthen confidence in the model's ability to generalize and highlight its potential for practical uses in image segmentation.

Keywords

image segmentation, SegNet model, K-fold cross-validation, computer vision, intersection over union, dice coefficient, precision, recall, accuracy, optimization

1 Introduction

The global incidence of skin cancer has seen an alarming upward trajectory, a phenomenon closely tied to several environmental factors. Excessive exposure to ultraviolet radiation, shifting climatic patterns, and the depletion of the ozone layer have been identified as significant contributors to this concerning rise. Skin cancer now holds the unfortunate distinction of being the most prevalent form of cancer worldwide. It manifests primarily in two varieties – non-melanoma and melanoma. Staggering statistics reveal that in 2018 alone, over one million new cases of non-melanoma skin cancer were reported globally, while melanoma skin cancer accounted for approximately 132,000 reported cases during the same period. Epidemiological evidence paints a grave picture, indicating that one in every three cancer diagnoses relates to skin cancer. Moreover, estimates by the highly esteemed World Health Organization suggest that a startling one in five individuals in the United States will eventually develop skin cancer over their lifetime. Advances in high-resolution imaging and artificial intelligence algorithms have made it possible for skin cancer patients in the United States to receive earlier and more accurate detection, which has a substantial positive impact on their prognosis. Timely interventions are essential for effective therapy, and these innovations make that possible.

The projected 10% depletion of the Earth's ozone layer is anticipated to trigger a substantial global increase in the incidence of skin cancers, with an estimated 4500 additional cases of melanoma and 300,000 cases of non-melanoma skin cancers. Notably, melanoma, a particularly virulent and lethal form of skin cancer, accounts for approximately 75% of all skin cancer-related mortalities, underscoring its status as a significant public health concern of grave import. To target the environmental causes of skin cancer, some actions are being taken into account, such as conducting public awareness campaigns, regulation of tanning beds, environmental protection policies, screening and early detection programs, and research and surveillance. Currently, the diagnostic paradigm for evaluating suspected skin lesions relies predominantly on manual visual examination and assessment by medical professionals. However, early detection of these malignancies can substantially mitigate the complexity and costs associated with their treatment. Recent advances in deep convolutional neural networks present a promising avenue for more effective analysis and classification of various skin cancer pathologies. This emerging artificial intelligence technology holds the potential to expedite and enhance the screening and evaluation processes for skin lesions, thereby potentially revolutionizing the field of skin cancer diagnosis and treatment modalities.

2 Literature survey

Brinker and Hekler¹ introduced an advanced classifier based on Convolutional Neural Networks (CNNs) that classifies skin cancer images with accuracy comparable to that of dermatologists. Deployed on portable devices, this approach holds the promise of life-saving diagnostics outside of hospitals. CNNs have also proven effective for skin lesion classification. In a comparative study of machine learning algorithms for skin cancer diagnosis, CNNs achieved a 97.6% accuracy rate, outperforming SVM, VGG16, and ResNet50, which produced accuracy rates of 83.4%, 82.4%, and 84.31%, respectively.² Yang et al.³ presented a deep learning method for single-image super-resolution that learns a mapping from low to high resolution. Their technique uses CNNs to perform this mapping, producing high-resolution outputs from low-resolution input images. Notably, conventional super-resolution techniques based on sparse coding can also be understood as deep CNNs. Deep CNNs provide better image enhancement despite their simple architecture. The potential benefits of integrating deep CNN-based diagnostic systems into existing healthcare workflows for skin lesion evaluation include higher diagnostic accuracy, earlier detection, greater efficiency, scalability, diagnostic consistency, resource optimization, and cost-effectiveness.

Convolutional neural networks employ visual texture and structure for picture classification, as demonstrated by Tamura et al.⁴ They propose that preprocessing methods for image enhancement can raise the general quality of images and increase CNN efficacy. Ly et al.⁵ offer CNNs that imitate picture enhancement and restoration to promote image categorization rather than human perception, in contrast to the current image enhancement strategies that typically try to improve human perception.

Medical image segmentation is an essential component of various medical analyses. Examples include skin lesion segmentation for early melanoma detection in dermoscopic images,^6,7 optic disc and cup segmentation and blood vessel segmentation for identifying structural details in retinal images,^8,9 breast lesion segmentation for auxiliary diagnosis in ultrasound imaging,^10,11 and lung segmentation for organ localization in computed tomography (CT) scans.^12,13 Traditionally, physicians manually segment these medical images, but this process is labour-intensive, subjective, and time-consuming due to the intricacies present, such as indistinct boundaries, ambiguous regions, and shadow artefacts. Consequently, there is an urgent need for highly accurate and reliable computer-aided segmentation approaches¹⁴ to overcome these challenges.

The application of machine learning techniques for skin cancer classification has been extensively investigated in various studies, each employing different feature extraction methodologies. Certain studies extracted features using the ABCD rule, GLCM, and HOG from a dataset comprising 328 benign melanoma images and 672 melanoma images sourced from the ISIC collection.¹⁵ Implementations of SVM classifiers yielded a high accuracy of 97.8% and an AUC of 0.94, while KNN classifiers exhibited a sensitivity of 86.2% and a specificity of 85%.¹⁶ Other approaches explored unsupervised learning with the k-means algorithm¹⁷ for skin cancer identification and categorization, achieving a classification rate of 52.63%. However, SVM outperformed both K-means and Back Propagation Neural Network, with accuracy levels ranging from 80% to 90%.¹⁸ In contrast, a proposed method¹⁹ focused on deep learning, specifically CNN methods, for skin lesion classification, utilizing transfer learning algorithms such as Inception V3, Resnet, VGG-16, and Mobilenet. Data augmentation and normalization techniques were also incorporated.²⁰ Supervised learning methods were employed²¹ for skin lesion categorization, attaining an accuracy of 86% with computer-aided diagnosis and MAP estimate techniques. These methods encompassed lesion segmentation, hair detection, and pigment network detection.²² The increasing number of cancer cases and fatalities highlights the critical need for early detection and treatment. Computer-assisted programs play a vital role in identifying diseased cells in high-resolution histopathological images at an early stage. Advanced segmentation techniques, including k-Means, Fuzzy C-Means, and superpixel segmentation algorithms like SLIC, Quickshift, Felzenszwalb, Watershed, and ERS, have shown improved performance over baseline methods. In particular, the Quickshift and SLIC approaches delivered outstanding results in the F-M test.²³

3 Methodology

3.1 Dataset

The increased incidence of melanoma has prompted the development of computer-aided diagnostic (CAD) systems designed for the classification of dermoscopic images. The PH2 dataset was created to enable comparative evaluations of segmentation and classification methods for dermoscopic images. Standardized datasets like PH2 assist the development and testing of CAD algorithms for melanoma classification by supporting benchmarking, training, and validation, facilitating research collaboration, supporting algorithmic development, and encouraging transparency, which in turn can improve patient outcomes and care. The PH2 dataset comprises dermoscopic images acquired at the Dermatology Service of Hospital Pedro Hispano in Matosinhos, Portugal. It is a valuable resource for dermatology and medical imaging researchers, enabling them to create, refine, and evaluate their algorithms. It also provides a consistent standard for evaluating image analysis algorithms and includes high-quality dermoscopic images, which are essential for the accurate diagnosis and analysis of skin lesions. The dataset covers a variety of skin lesions, including benign, malignant, and dysplastic nevi.

3.2 Image preprocessing

A critical step that is required to enhance the precision and functionality of our learning module is image preprocessing from the dataset. This process involves a series of methods for integrating, cleaning, normalising, and transforming image data.

Image pre-processing plays a crucial role in enhancing the overall accuracy and reliability of computer-aided diagnostic systems for melanoma classification through noise reduction, normalization, image resizing, colour space conversion, and segmentation. In general, these pre-processing steps can lead to more precise and reliable melanoma detection and classification.

Figures 1 and 2 show how our preprocessing techniques changed the image dataset from its original state to the pre-processed state. The preprocessing steps are illustrated in Figures 3–7.

Figure 1.

Architecture of SegNet.

Figure 2.

Input image dataset.

Figure 3.

Processed image dataset.

Figure 4.

Image augmentation.

Figure 5.

SegNet with detailed layers.

Figure 6.

5-fold cross-validation procedure for SegNet model training and evaluation.

Figure 7.

Graphical depiction of accuracy and loss.

4 Original image dataset

4.1 Processed image

Normalising Picture Size: To guarantee consistent measurements, we resized every image in the collection to 256 × 256 pixels. Resizing was crucial to provide uniformity and compatibility with our selected model.²⁴

The goal was to decrease noise for better clarity and highlight essential components in the image by sharpening it using image filters. This approach enables the model to extract meaningful information from the data more effectively.²⁵

Photographs captured in the BGR colour space were transformed into the RGB colour space. This conversion makes the images compatible with image analysis programs, represents colour accurately, and supports feature extraction. Its main advantage for colour stability is that RGB is a commonly used colour model, so colour representation is consistent across a range of platforms and devices. Its main drawback is that a conversion that is not handled appropriately may alter the colour representation, which could affect the model's ability to learn specific features. Although converting images to the RGB colour space generally enhances model performance and colour stability, the potential for information loss and the added complexity of the preprocessing pipeline should be considered.

To ensure color stability throughout the conversion of the BGR color space into the RGB color space, the following approaches are commonly used: channel rearrangement, data type consistency, color calibration, and testing and validation.
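The channel rearrangement step mentioned above can be sketched in a few lines. This is a minimal illustration, assuming each image is stored as nested lists of `(b, g, r)` integer tuples; the function name `bgr_to_rgb` and the sample pixels are hypothetical and are not taken from the original pipeline.

```python
def bgr_to_rgb(image):
    """Return a copy of `image` with each (b, g, r) pixel reordered to (r, g, b)."""
    return [[(r, g, b) for (b, g, r) in row] for row in image]


# Hypothetical 2 x 2 image in BGR order: pure blue, pure green, pure red, mixed.
bgr_image = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (10, 20, 30)],
]

rgb_image = bgr_to_rgb(bgr_image)
print(rgb_image[0][0])  # the pure-blue pixel is now (0, 0, 255)

# Reordering only permutes channels and keeps integer values,
# so converting back recovers the input exactly.
print(bgr_to_rgb(rgb_image) == bgr_image)
```

Because the values stay as integers and are only permuted, data type consistency is preserved, and the round trip serves as a simple validation check.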

Image scaling involves adjusting the pixel intensity levels of an image to conform to a specific range. Standardising pixel values enhances model convergence in the training process.²⁶
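As an illustration of this scaling step, the sketch below maps 8-bit pixel intensities to the range [0, 1]. The target range, the divisor of 255, and the name `scale_pixels` are assumptions made for the example; the text does not state which range was actually used.

```python
def scale_pixels(image, max_value=255.0):
    """Scale every pixel intensity of a 2D image into [0, 1]."""
    return [[value / max_value for value in row] for row in image]


# Hypothetical 2 x 2 grayscale patch with 8-bit intensities.
patch = [
    [0, 255],
    [51, 102],
]

scaled = scale_pixels(patch)
print(scaled)
```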

Image tagging is essential for precise and dependable supervised learning. We ensured precise categorization of the images in the dataset, a crucial step for effective model training.²⁷ Image tagging is important to ensure accurate and reliable supervised learning outcomes because it establishes a clear association between input data and expected outputs, improves model accuracy, supports data diversity, and enables evaluation and research.

Picture augmentation techniques like random rotation and horizontal flipping improve the resilience and generalisation abilities of deep learning models trained on image data.²⁸ These methods enhance model accuracy, especially when working with restricted or uneven datasets.²⁹

4.2 Random rotation

The random rotation function applies a random rotation to the input images within a specified range, typically between −40 and 40 degrees. This operation helps in enhancing the robustness of the model by exposing it to variations in the orientation of the input images.³⁰ Random rotational transformations to the image data simulate real-world scenarios where the orientations of objects are subject to variability, thereby enhancing the model's capacity to generalize to previously unseen data.³¹

The horizontal flip function horizontally mirrors the input images. This operation effectively augments the dataset by introducing variations in the spatial arrangement of objects within the images. For tasks where the orientation of objects lacks significance, such as object detection or classification, horizontal flipping proves particularly beneficial.^32–34 Incorporating horizontally flipped images into the training dataset makes the model more invariant to left-right orientation changes, allowing it to recognize objects regardless of their orientation in real-world scenarios and improving overall generalization performance.

4.3 Image augmentation

The img_augmentation function integrates the random rotation and horizontal flip operations to augment the dataset. Image augmentation is a critical technique in deep learning settings where training data is limited. By applying random transformations, such as rotation and flipping, to the input images, the augmented dataset presents a more diverse and representative collection of training examples. This helps mitigate overfitting and enhances the model's performance on unseen data,^35–37 which is critical for constructing robust models and eventually leads to better performance in image segmentation and object recognition tasks.
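The two augmentation operations can be sketched as follows. Only the name `img_augmentation` and the ±40 degree range come from the text; the implementation is an assumption. It uses nearest-neighbour rotation about the image centre with zero fill, takes hypothetical parameters `max_angle`, `flip_prob`, and `rng`, and applies the same transform to the segmentation mask so that image and label stay aligned.

```python
import math
import random


def horizontal_flip(image):
    """Mirror a 2D image (list of rows) left to right."""
    return [row[::-1] for row in image]


def rotate(image, angle_deg, fill=0):
    """Rotate a 2D image about its centre using nearest-neighbour sampling."""
    height, width = len(image), len(image[0])
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    output = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Map each output pixel back to its source location.
            dx, dy = x - cx, y - cy
            sx = int(round(cos_t * dx + sin_t * dy + cx))
            sy = int(round(-sin_t * dx + cos_t * dy + cy))
            if 0 <= sx < width and 0 <= sy < height:
                output[y][x] = image[sy][sx]
    return output


def img_augmentation(image, mask, max_angle=40.0, flip_prob=0.5, rng=random):
    """Apply one random rotation and an optional flip to an image and its mask."""
    angle = rng.uniform(-max_angle, max_angle)
    image, mask = rotate(image, angle), rotate(mask, angle)
    if rng.random() < flip_prob:
        image, mask = horizontal_flip(image), horizontal_flip(mask)
    return image, mask


# Hypothetical 4 x 4 image and binary mask.
sample = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
sample_mask = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
aug_image, aug_mask = img_augmentation(sample, sample_mask, rng=random.Random(0))
```

Passing a seeded `random.Random` instance makes the augmentation reproducible, which is convenient when comparing folds.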

SegNet Algorithm:

Algorithm: SegNet with Skip Connections

Input:

Number of epochs (epochs_num)

Model save name (savename)

Training images (x_train)

Training labels (y_train)

Validation images (x_val)

Validation labels (y_val)

Output:

Trained model

Training history

Step 1: Define the SegNet architecture with skip connections

1.1: Define input layer with shape (192, 256, 3)

1.2: Define encoding layers with Conv2D, Batch Normalization, and ReLU activation

1.3: Apply max pooling after every two encoding layers and save the skip connections

1.4: Define decoding layers with Conv2DTranspose, Batch Normalization, and ReLU activation

1.5: Concatenate skip connections with corresponding decoding layers

1.6: Define the output layer with sigmoid activation and reshape to (192, 256)

Step 2: Compile the model

2.1: Compile the model with SGD optimizer, binary cross-entropy loss, and evaluation metrics (iou, dice_coef, precision, recall, accuracy)

Step 3: Train the model

3.1: Fit the model on the training data, using the validation data to monitor performance.

3.2: Define the values for the number of epochs, batch size, and verbose mode parameters during model training.

3.3: Save the training history

Step 4: Save the trained model

4.1: Save the trained model with the specified save name

Step 5: Output

5.1: Return the trained model and training history
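Step 2.1 names binary cross-entropy as the training loss. The sketch below shows how that loss could be computed for a flattened mask. It is a plain-Python illustration rather than the framework implementation used in the study; the function name `binary_cross_entropy` and the clipping constant `eps` are assumptions.

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        # Clip probabilities away from 0 and 1 so the logarithm stays finite.
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


# Hypothetical flattened ground-truth mask and sigmoid outputs.
labels = [1, 0, 1, 0]
probabilities = [0.9, 0.2, 0.6, 0.1]
loss = binary_cross_entropy(labels, probabilities)
print(round(loss, 4))
```

An uninformative prediction of 0.5 for every pixel gives a loss of ln 2 (about 0.693), which is a useful reference point when reading loss values.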

SegNet, a convolutional neural network (CNN), is designed explicitly for semantic segmentation, the task of assigning each pixel of an image to a distinct class. To achieve pixel-wise segmentation, SegNet employs an encoder-decoder architecture that captures features from the input image and reconstructs them at full resolution. The encoder collects hierarchical feature representations through a sequence of convolutional layers, gradually reducing spatial dimensions while increasing feature depth, and thereby captures high-level semantic information. The decoder then maps these features back to the input resolution.^38,39 In summary, the architecture uses convolutional layers for hierarchical feature learning, pooling for dimensionality reduction, and upsampling with skip connections for accurate reconstruction of the segmented output. It thus makes effective use of features learned at different levels, refining the accuracy of segmentation (Tables 1 and 2).

Table 1.
Layers of convolution and deconvolution.

Conv	Filter size	No. of features	Deconv	Filter size	No. of features
conv-1-1	3 × 3	16	decv-1	3 × 3	256
pool-1	2 × 2	16	ups-1	2 × 2	256
conv-1-2	3 × 3	32	decv-2-1	3 × 3	256
pool-2	2 × 2	32	decv-2-2	3 × 3	128
conv-2-1	3 × 3	64	ups-2	2 × 2	128
pool-3	2 × 2	64	decv-3-1	4 × 4	128
conv-2-2	3 × 3	64	decv-3-2	3 × 3	128
pool-4	2 × 2	64	ups-3	2 × 2	128
conv-3-1	3 × 3	128	decv-4-1	3 × 3	64
pool-5	2 × 2	128	decv-4-2	3 × 3	32
conv-3-2	3 × 3	128	ups-4	2 × 2	32
pool-6	2 × 2	128	decv-5-1	3 × 3	16
conv-4-1	3 × 3	256
pool-7	2 × 2	256
conv-4-2	3 × 3	256
pool-8	2 × 2	256
conv-5	3 × 3	512
output	3 × 3	1

Table 2.

Comparative analysis: existing vs. proposed methodologies.

	Proposed method			Existing (Segnet-CNN + Kfold)
	Training	Validation	Testing	Training	Validation	Testing
IOU	70.53	75.05	85.45	91.3	92.56	91.76
Dice Coef	88.32	89.8	88.82	83.26	85.8	88.82
Precision	94.61	92.43	94.9	92.6	91.2	91.2
Recall	85.57	90.21	86.15	84.5	89.34	85.6
Accuracy	93.7	94.84	94.13	92.6	93.5	94.08
Loss	0.547	0.495	0.517	0.6	0.5	0.5

Encoding Layer:

$$ Z_{i}^{enc} = \mathrm{Conv2D}(X_{i-1}, W_{i}^{enc}) + b_{i}^{enc} \tag{1} $$

In the above equation, $X_{i-1}$ represents the input feature maps from the preceding layer, and $Z_{i}^{enc}$ represents the output of encoding layer $i$ after the convolution operation. It is obtained by applying the Conv2D operation, which convolves the input feature maps $X_{i-1}$ with the convolutional filters $W_{i}^{enc}$, and adding the bias term $b_{i}^{enc}$.

$$ Z_{i}^{enc} = \mathrm{BatchNormalization}(Z_{i}^{enc}) \tag{2} $$

After the convolution operation, batch normalization is applied to normalize the activations. It applies a learnable scale and shift, uses running averages for stability, and improves training speed and stability while providing some regularization. In the equation, $Z_{i}^{enc}$ is normalized using batch normalization, and the result is again denoted $Z_{i}^{enc}$.

$$ Y_{i}^{enc} = \mathrm{ReLU}(Z_{i}^{enc}) \tag{3} $$

The ReLU (Rectified Linear Unit) activation function is applied element-wise to the normalized activations $Z_{i}^{enc}$. It sets all negative values in $Z_{i}^{enc}$ to zero while positive values remain unaltered, thereby yielding the output $Y_{i}^{enc}$.

Skip Connection:

$$ Y_{i}^{skip} = Y_{i}^{enc} + X_{i} \tag{4} $$

The above equation represents a skip connection, where $Y_{i}^{enc}$ from encoding layer $i$ is combined with $X_{i}$ from an earlier layer. This allows the network to preserve crucial information from earlier stages and bypass specific layers, potentially improving gradient flow and enhancing the model's learning and performance. The benefits of improved gradient flow include more effective weight updates, faster convergence, and a reduced risk of vanishing gradients, leading to more accurate predictions and improved segmentation outcomes.

Decoding layer:

$$ Z_{i}^{dec} = \mathrm{Conv2DTranspose}(Y_{i}^{skip}, W_{i}^{dec}) + b_{i}^{dec} \tag{5} $$

The above equation represents the computation in decoding layer $i$. It up-samples the feature maps $Y_{i}^{skip}$ from the skip connection using transposed convolution with weights $W_{i}^{dec}$ and then adds the bias term $b_{i}^{dec}$ to generate the output feature maps $Z_{i}^{dec}$. This process helps reconstruct spatial information lost during downsampling, contributing to the final segmentation output. Reconstructing this information is crucial for maintaining detail, enhancing localization and contextual understanding, mitigating information loss, and facilitating effective feature integration in image segmentation tasks.

$$ Z_{i}^{dec} = \mathrm{BatchNormalization}(Z_{i}^{dec}) \tag{6} $$

The above equation applies batch normalization to the output feature maps $Z_{i}^{dec}$ of decoding layer $i$. By normalizing activations across the batch dimension, this step stabilizes training, accelerates convergence, and enhances the model's generalisation ability.

$$ Y_{i}^{dec} = \mathrm{ReLU}(Z_{i}^{dec}) \tag{7} $$

In Equations (1)–(7):

$X_{i-1}$ represents the input feature maps to the $i^{th}$ encoding layer.

$W_{i}^{enc}$ and $b_{i}^{enc}$ represent the weights and biases of the convolutional operation in the $i^{th}$ encoding layer.

$Y_{i}^{enc}$ represents the output feature maps of the $i^{th}$ encoding layer after ReLU activation.

$Y_{i}^{skip}$ represents the feature maps after the skip connection operation.

$W_{i}^{dec}$ and $b_{i}^{dec}$ represent the weights and biases of the convolutional operation in the $i^{th}$ decoding layer.

$Y_{i}^{dec}$ represents the output feature maps of the $i^{th}$ decoding layer after ReLU activation.

The 2D convolution applies learnable filters to the input feature maps to extract patterns. It is denoted Conv2D(X, W), where W represents the learnable convolutional filter. The filters are convolved across the spatial dimensions (height and width) of the input, producing feature maps that highlight specific patterns such as edges, textures, and shapes. By stacking Conv2D layers, SegNet extracts and maps features from input images, enabling precise pixel-wise segmentation. Batch normalization, denoted BatchNorm(X), normalizes the activations of the preceding convolutional layer to stabilize and accelerate training. It standardizes these activations by computing their mean and variance across a mini-batch. Mini-batch statistics are used during training, whereas running averages of the mean and variance are used at inference, ensuring consistent model behaviour on unseen data. This reduces internal covariate shift by keeping activation distributions stable, leads to faster convergence, and improves the performance of deep neural networks.

Standardizing the inputs to each layer also decreases the network's sensitivity to changes in the distribution of inputs as parameters are updated. The ReLU operation, denoted ReLU(X), introduces non-linearity by applying the Rectified Linear Unit activation function to the normalized features. This allows the network to model non-linear interactions, which aids in learning complicated patterns and helps mitigate the vanishing gradient issue during training. The encoding layer can be denoted as follows:

$$ E(X) = \mathrm{ReLU}(\mathrm{BatchNorm}(\mathrm{Conv2D}(X, W))) \tag{8} $$

where:

X represents the input features

W represents the learnable convolutional filter

E(X) represents the output feature after applying the encoding layer operations
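To make the composition in Equation (8) concrete, the sketch below chains the three operations on a single-channel feature map. It is a simplified illustration under stated assumptions: one input channel, one filter, no padding, and batch normalization computed over the values of one feature map with a scale of 1 and a shift of 0. The function names are hypothetical.

```python
import math


def conv2d(x, w, bias=0.0):
    """Slide kernel `w` over the 2D map `x` (no padding, stride 1) and add `bias`."""
    kh, kw = len(w), len(w[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [
        [
            sum(x[i + a][j + b] * w[a][b] for a in range(kh) for b in range(kw)) + bias
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def batch_norm(z, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize the values of `z`, then apply the scale `gamma` and shift `beta`."""
    values = [v for row in z for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [[gamma * (v - mean) / math.sqrt(var + eps) + beta for v in row] for row in z]


def relu(z):
    """Set negative values to zero and keep positive values unchanged."""
    return [[max(0.0, v) for v in row] for row in z]


def encode(x, w, bias=0.0):
    """E(X) = ReLU(BatchNorm(Conv2D(X, W)))."""
    return relu(batch_norm(conv2d(x, w, bias)))


# Hypothetical 3 x 3 input and 2 x 2 filter.
features = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]

print(conv2d(features, kernel))  # [[6.0, 8.0], [12.0, 14.0]]
print(encode(features, kernel))
```

After normalization, the two values below the mean become negative and are zeroed by ReLU, while the two values above the mean pass through.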

$$
\begin{array}{l}
\mathrm{Input} \to (E_{1} \to \mathrm{MaxPooling}) \to (E_{2} \to \mathrm{MaxPooling}) \to (E_{3} \to \mathrm{MaxPooling}) \to \dots \\
\to (E_{n-3} \to \mathrm{MaxPooling}) \to (E_{n-2} \to \mathrm{MaxPooling}) \to (E_{n-1} \to \mathrm{MaxPooling}) \\
\to (D_{1} \leftarrow \mathrm{Concatenate}(S_{1})) \to (D_{2} \leftarrow \mathrm{Concatenate}(S_{2})) \to \dots \\
\to (\dots \leftarrow \mathrm{Concatenate}(S_{\frac{n}{2}})) \to (O \to \mathrm{Reshape\ to\ } (192, 256))
\end{array}
\tag{9}
$$

In the above equation,

n represents the total number of encoding layers.

n//2 represents integer division, giving the index of the last encoding layer from which a skip connection is made.

The arrows indicate the flow of data through the layers.

→ denotes the encoding and max pooling operations.

← denotes the decoding operation.

Concatenate(S_i) represents concatenating the skip connection from the encoding layer with the decoding layer D_i.

The output layer O performs the final prediction, followed by reshaping to match the desired output shape (192, 256). The final prediction step thus comprises defining the output layer, applying the activation function, reshaping, generating the prediction, and, where relevant, post-processing before the output is returned.

5 Results and discussion

For cross-validation of the SegNet model on the image dataset, the dataset was split into five equal partitions to ensure a robust and reliable evaluation of the model's performance. Cross-validation is crucial when evaluating machine learning models such as SegNet because it helps detect overfitting and yields a dependable estimate of generalization performance. The model was trained five times, each time using four partitions for training and the remaining partition for validation, such that each partition was used exactly once for validation. Every example in the dataset was therefore used for both training and validation across the five iterations, and the model was trained on different data combinations. Performance metrics, such as accuracy, loss, and the evaluation scores, were computed for each fold. Averaging these metrics over all folds gives a robust estimate of the SegNet model's generalization performance and enhances confidence in its efficacy across diverse datasets. Generalization is also supported by the model's architecture and training techniques: the architecture, combined with data augmentation, regularization, and the validation strategy, helps minimize the risk of overfitting. Both the existing and the proposed systems were implemented, and they are compared in Table 2.
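The five-fold procedure described above can be sketched as an index split. This is a minimal illustration, assuming contiguous folds without shuffling and a hypothetical dataset of 10 samples; the names `k_fold_indices` and `mean_over_folds` and the per-fold scores are illustrative, not taken from the study.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, val_indices) so each fold validates exactly once."""
    indices = list(range(n_samples))
    # Distribute any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(indices[start:start + size])
        start += size
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


def mean_over_folds(fold_scores):
    """Average a metric over all folds."""
    return sum(fold_scores) / len(fold_scores)


for fold, (train_idx, val_idx) in enumerate(k_fold_indices(10, k=5), start=1):
    print(f"fold {fold}: train on {len(train_idx)} samples, validate on {val_idx}")

# Hypothetical per-fold scores, used only to show the averaging step.
print(mean_over_folds([0.9, 0.8, 1.0, 0.7, 0.6]))
```

In practice the indices would usually be shuffled once before splitting, so that each fold is a representative sample of the dataset.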

The values in Table 2 are reported for the proposed method alongside those of the traditional methodology.

5.1 Defining evaluation metrics

5.1.1 IOU (intersection over union)

Intersection over Union (IOU) estimates the overlap between the predicted and ground truth regions as the ratio of their intersection to their union. IOU plays a crucial role in quantifying the accuracy of model predictions, mainly in tasks such as image segmentation and object detection. It is often referred to as the Jaccard index or similarity coefficient and is also used in information retrieval.

$$ IOU = \frac{\text{Area of Intersection between predicted and ground truth regions}}{\text{Area of Union between predicted and ground truth regions}} $$

A higher IOU signifies better alignment between predicted and ground truth regions. The IOU values on training, validation, and testing sets are 94.53%, 95.05%, and 94.83%, respectively. These are quite high, indicating good performance compared to the existing approach.
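As a worked example of the IOU definition, the sketch below computes the metric for two small binary masks. The masks and the function name `iou` are hypothetical, and the convention of returning 1.0 when both masks are empty is an assumption.

```python
def iou(pred_mask, true_mask):
    """Intersection over Union of two binary masks of equal shape."""
    pairs = [
        (p, t)
        for p_row, t_row in zip(pred_mask, true_mask)
        for p, t in zip(p_row, t_row)
    ]
    intersection = sum(1 for p, t in pairs if p and t)
    union = sum(1 for p, t in pairs if p or t)
    # Two empty masks agree perfectly, so treat that case as IOU = 1.
    return intersection / union if union else 1.0


predicted = [[1, 1, 0],
             [0, 1, 0]]
ground_truth = [[1, 0, 0],
                [0, 1, 1]]

# 2 pixels are foreground in both masks and 4 are foreground in at least one.
print(iou(predicted, ground_truth))  # 0.5
```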

5.1.2 Dice coefficient

The Dice coefficient measures the overlap between predicted and ground truth regions. It is calculated as twice the ratio of the intersection to the sum of the areas of the predicted and ground truth regions.

Dice\; Coefficient = \frac{2 \times Area\; of\; Intersection}{Area\; of\; Predicted\; Region + Area\; of\; Ground\; Truth\; Region} \tag{10}

Higher values indicate better overlap. The Dice coefficient values on training, validation, and testing sets are 88.32%, 89.80%, and 88.82% respectively. Again, these are high values.
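The Dice coefficient can be computed for binary masks in the same way, using twice the intersection divided by the sum of the two region areas. This is a minimal sketch assuming NumPy; the function name `dice_coefficient` and the toy masks are illustrative, and returning 1.0 for two empty masks is an assumed convention.

```python
import numpy as np


def dice_coefficient(pred_mask, true_mask):
    """Dice coefficient for two binary masks of the same shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    intersection = np.logical_and(pred, true).sum()
    total_area = pred.sum() + true.sum()
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total_area == 0 else float(2.0 * intersection / total_area)


# Toy 2x3 masks: each region has 3 pixels and 2 of them overlap.
pred = np.array([[1, 1, 0],
                 [0, 1, 0]])
true = np.array([[1, 0, 0],
                 [0, 1, 1]])
print(dice_coefficient(pred, true))  # 2 * 2 / (3 + 3), about 0.667
```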

5.1.3 Precision

Precision is the ratio of correctly predicted positive observations to the total predicted positive observations.

Precision = \frac{TP}{TP + FP} \tag{11}

Here TP is the number of true positive predictions and FP is the number of false positive predictions, so the denominator TP + FP is the total number of predicted positives.

High precision indicates that an instance predicted as positive is very likely to be truly positive. The precision values on the training, validation, and testing sets are 94.61%, 92.43%, and 94.90%, respectively. These are also high values.

5.1.4 Recall

Recall is the ratio of true positive predictions to all actual positive instances, that is, true positives divided by the sum of true positives and false negatives.

Recall = \frac{True\; Positives}{True\; Positives + False\; Negatives} \tag{12}

High recall indicates that the model finds most of the actual positive samples. The recall values on the training, validation, and testing sets are 85.57%, 90.21%, and 86.15%, respectively.

5.1.5 Accuracy

Accuracy is the ratio of correct predictions to total observations, calculated as the sum of true positives and true negatives divided by the total number of samples.

Accuracy = \frac{True\; Positives + True\; Negatives}{Total\; Observations} \tag{13}

High accuracy indicates the overall correctness of the model's predictions. The model achieved 93.7%, 94.84%, and 94.13% accuracy on the training, validation, and test sets, respectively. These quantitative performance metrics evaluate the model's ability to correctly identify and classify regions of interest in the dataset. Compared to traditional image segmentation approaches, SegNet has a significant advantage, especially in quantitative performance measures like accuracy, precision, and loss values.
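Precision, recall, and accuracy all derive from the same four pixel-level counts: true positives, false positives, false negatives, and true negatives. The sketch below shows one way to compute them from flattened binary label sequences. The function names and the toy labels are illustrative assumptions, not the evaluation code used in this study, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def confusion_counts(pred, truth):
    """Count TP, FP, FN, TN over two flat sequences of 0/1 pixel labels."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    tn = sum(1 for p, t in zip(pred, truth) if not p and not t)
    return tp, fp, fn, tn


def precision_recall_accuracy(pred, truth):
    """Return (precision, recall, accuracy); inputs are assumed non-empty."""
    tp, fp, fn, tn = confusion_counts(pred, truth)
    # Assumed convention: report 0.0 when a ratio has a zero denominator.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy


# Toy example with 8 pixels: TP = 2, FP = 1, FN = 1, TN = 4.
pred_labels = [1, 1, 1, 0, 0, 0, 0, 0]
true_labels = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, accuracy = precision_recall_accuracy(pred_labels, true_labels)
print(precision, recall, accuracy)  # 2/3, 2/3, 0.75
```

In this toy case accuracy is higher than precision and recall because the many true negative background pixels count toward accuracy only, which is why segmentation studies usually report overlap metrics alongside accuracy.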

Compared to traditional methodologies, the proposed model performed better according to these evaluation results.

The diagram above visualizes the training loss, training accuracy, validation loss, and validation accuracy. It shows that the loss gradually decreases while the training and validation accuracy increase.

6 Conclusion and future work

The SegNet-based image segmentation model demonstrated strong performance and was validated through extensive K-fold cross-validation. The model performs consistently well, attaining scores that often exceed 90% across evaluation metrics such as Intersection over Union (IoU), Dice Coefficient, Precision, Recall, and Accuracy, thereby demonstrating its ability to segment objects in images accurately. The reported loss values indicate that training optimised the model effectively. The K-fold cross-validation approach strengthens the validation and increases confidence in the model's capacity to generalise across different datasets. The results highlight the SegNet model's capability as a dependable tool for image segmentation with robust performance.

This study also identifies possible topics for further development. Firstly, evaluating the model on larger and more diverse datasets might offer a deeper understanding of its generalization capabilities across a broad spectrum of real-life contexts. Investigating the adaptability of the SegNet model in domains beyond those examined in this study could uncover potential applications in a broader range of fields.

Furthermore, advanced techniques such as transfer learning or attention mechanisms could enhance the model's efficacy and efficiency. Exploring the interpretability of the model's segmentation outputs and their influence on future tasks could be a fruitful direction for future research. Ongoing research endeavours and further refinements to the SegNet model architecture carry the potential to propel advancements in the domain of image segmentation algorithms and their diverse practical applications.

Footnotes

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
