MSAM-YOLO: An improved YOLOv8 method based on an attention mechanism for grape leaf disease identification

Abstract

This paper proposes MSAM-YOLO, an improved YOLOv8 method for grape leaf disease identification based on an attention mechanism, to address the problem that grape leaf disease images contain many disease types whose small targets have similar features. The method introduces a multi-scale convolution attention module (MSAM) into the feature extraction network to strengthen the focus on diseased regions. By applying convolutions to features at different scales and using an attention mechanism to emphasize the features of the diseased area, the model better captures the subtle features of the disease. Experimental results show that, on the grape leaf disease identification task, MSAM-YOLO raises precision from 0.877 to 0.917 (about 4 percentage points) and mAP from 0.910 to 0.938 relative to the original YOLOv8n, with real-time performance. The method provides a new perspective for detecting plant leaf diseases and contributes to improving the quality and efficiency of agricultural production.

Keywords

grape leaf disease identification; YOLOv8; attention mechanism; multi-scale convolution; object detection; deep learning

Introduction

Agriculture is crucial for the development of a country; however, the diseases and pests of crops, as a major component of biological disasters, severely restrict the sustainable development of agriculture, causing significant economic losses to agricultural production. Direct losses caused by diseases and pests account for 20%–40% of agricultural production losses globally.¹ Furthermore, these diseases and pests not only affect the yield but also the nutritional value of crops, posing a threat to human health.² As for grapes, the main diseases affecting grape leaves include powdery mildew, downy mildew, grape leaf spot, and grape yellow spot disease, among which downy mildew alone results in a yield reduction of over 15%.³ Therefore, it becomes particularly important to manage the health of crops. The symptoms of grape diseases mainly appear on the leaves, and compared to expensive and complex laboratory pathogen identification methods such as pathogen isolation, molecular biology, and immunology,⁴ computer vision and deep learning provide an efficient, rapid, and relatively low-cost diagnostic method.

In the process of utilizing computer vision technology to identify diseases in crops, researchers use image segmentation techniques to separate the area of interest from the background, and then extract features such as the size, shape, and color distribution of the diseased areas on the leaves. Finally, machine learning algorithms or deep learning models are used for disease identification.⁵

Previous studies on the identification of grape leaf diseases primarily used traditional machine learning methods such as SVM.⁶ These methods perform well under specific conditions, but their performance is often limited in complex or changing environments. As data volumes grow and computing capabilities improve, deep learning models, especially convolutional neural networks, have gradually become more effective alternatives.⁷

One study used a convolutional neural network model to analyze leaf images for early identification of grape diseases, outperforming traditional machine learning methods,⁸ and another used an improved CNN model to detect grape diseases, achieving a 96.5% accuracy rate.⁹ Although these deep learning methods achieve good accuracy, they still face challenges such as limited receptive fields, difficulty modeling long-range dependencies, and poor noise resistance.

The YOLO series algorithm, introduced by Joseph Redmon and colleagues in 2016, stands as a quintessential example of an object detection algorithm. The main idea is to transform the object detection problem into a regression problem by predicting bounding boxes and class probabilities on the image to detect the objects. Compared to traditional object detection methods, YOLO’s main feature is its high real-time performance. The entire image serves as input, and through one network forward propagation, the positions of all objects in the image, their corresponding categories, and their respective confidence probabilities can be obtained.

In recent years, YOLO has achieved significant advancements and widespread applications in the field of target detection and object recognition. Since its proposal by Joseph Redmon in 2016, it has been continuously improved. YOLOv3 (2018) introduced a multi-scale prediction mechanism by incorporating additional paths into the network to enhance the model’s ability to detect small targets. YOLOv5 adopted CSPDarkNet53 + Focus for the backbone network and SPP + PAN for the neck, while YOLOv7 combined the neck and head layers into the head layer and incorporated Extended-ELAN (E-ELAN), an extended efficient layer aggregation network. YOLOv8, the latest version, utilizes a lightweight network structure, enabling smoother operation on mobile devices. Additionally, it employs dual-path prediction to better handle objects of different sizes, further enhancing the algorithm’s efficiency.

When YOLOv8 is used for detection on the grape leaf disease dataset, speed improves significantly. However, accuracy is limited by two issues: (1) the disease targets on grape leaves are usually small and difficult to detect; (2) some grape leaf diseases spread across the entire leaf area, so they must be inferred from global information. Given these characteristics of grape leaf disease data, and in order to achieve higher recognition performance, this paper proposes the improved MSAM-YOLO model, which combines attention over feature channels and feature space to improve the accuracy and speed of grape leaf disease recognition. Experimental results on the PlantVillage dataset and a self-collected grape leaf disease dataset show that MSAM-YOLO outperforms the other models.

Related work

YOLOv8 network structure

The YOLOv8 network is a major update of YOLOv5 and comprises a Backbone and a Head. The Head combines the original Neck and Head.

(1) The Backbone still uses the CSPDarkNet structure. The kernel of the first convolutional layer changes from 6 × 6 to 3 × 3, and all C3 structures are replaced by C2f structures. The number of blocks in C2f changes from 3-6-9-3 to 3-6-6-3.

(2) The Neck part also uses a PANet structure similar to YOLOv5, which is efficient and fast. The Backbone goes through an SPPF, then passes through 8x downsample layer and 16x downsample layer as inputs to the PANet structure. After upsampling and channel fusion, the three output branches of PANet are sent to the Head for loss calculation or result computation. PANet is a bidirectional path network. Compared with FPN, PANet introduces a bottom-up path, making it easier for low-level information to be transmitted to the top.

(3) Unlike the Coupled Head of YOLOv5, the Head of YOLOv8 uses a Decoupled Head, as in YOLOv6, separating the classification branch from the regression branch. The regression branch uses the Distribution Focal Loss (DFL) strategy, which represents each box coordinate as a discrete distribution rather than a single value.
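The idea in item (3) of turning coordinates into a distribution can be illustrated with a minimal sketch. It assumes that each box side is predicted as logits over 16 discrete distance bins and that the decoded distance is the expected bin index under a softmax; the bin count, the function names, and the anchor and stride values are illustrative assumptions, not details given in this paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode_dfl_side(logits):
    """Expected bin index under the softmax distribution (distance in grid units)."""
    probs = softmax(logits)
    return sum(i * p for i, p in enumerate(probs))

def decode_box(anchor_xy, side_logits, stride):
    """Convert four per-side distributions (left, top, right, bottom)
    into an (x1, y1, x2, y2) box in pixels around an anchor point."""
    ax, ay = anchor_xy
    left, top, right, bottom = (decode_dfl_side(s) for s in side_logits)
    return ((ax - left) * stride, (ay - top) * stride,
            (ax + right) * stride, (ay + bottom) * stride)

# Hypothetical example: a flat (maximally uncertain) distribution over 16 bins.
uniform = [0.0] * 16
box = decode_box((10.5, 10.5), [uniform] * 4, stride=8)
```

A sharply peaked distribution decodes to a distance close to its peak bin, while a flat one decodes to the middle of the range, which is how the representation can express localization uncertainty.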

The YOLOv8’s loss function directly uses the TaskAlignedAssigner to identify positive samples by leveraging weighted scores across classification and regression tasks. This computation encompasses distinct branches for classification and regression: employing BCE Loss for classification, while utilizing Distribution Focal Loss and CIoU Loss for regression. The integration of these three losses, each weighted appropriately, underpins the algorithm’s efficacy.
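As an illustration of the regression term, the following is a minimal sketch of the standard CIoU loss for two axis-aligned boxes in (x1, y1, x2, y2) form. It follows the commonly published CIoU definition (IoU minus a normalized center-distance penalty minus an aspect-ratio penalty); the function name and the `eps` stabilizer are assumptions, and the weights used to combine it with the BCE and DFL terms are not given in this paper and are therefore not shown.

```python
import math

def ciou_loss(box_a, box_b, eps=1e-7):
    """CIoU loss (1 - CIoU) for two boxes with positive width and height."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    wa, ha = ax2 - ax1, ay2 - ay1
    wb, hb = bx2 - bx1, by2 - by1

    # Intersection over union.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = wa * ha + wb * hb - inter
    iou = inter / union

    # Squared diagonal of the smallest box enclosing both boxes.
    cw = max(ax2, bx2) - min(ax1, bx1)
    ch = max(ay2, by2) - min(ay1, by1)
    c2 = cw ** 2 + ch ** 2

    # Squared distance between the two box centers.
    rho2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
         + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Aspect-ratio consistency term and its trade-off weight.
    v = (4 / math.pi ** 2) * (math.atan(wb / hb) - math.atan(wa / ha)) ** 2
    alpha = v / (1.0 - iou + v + eps)

    return 1.0 - (iou - rho2 / c2 - alpha * v)
```

Unlike plain IoU, the center-distance term still provides a useful signal when the predicted and ground-truth boxes do not overlap, which is relevant for small lesions where early predictions often miss the target entirely.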

Attention mechanism

The attention mechanism is a core feature of human intelligence, allowing us to focus our attention on the most important tasks or information at hand while ignoring less significant stimuli. In the research of deep learning, various attention models have been developed by simulating the principles of human attention mechanism, demonstrating outstanding performance in areas such as image recognition.¹⁰ The attention mechanism can be further divided into channel attention mechanism (CAM) and spatial attention mechanism (SAM). CAM calculates the importance of each channel to weight the feature maps, enabling the model to focus on the most important features. SAM performs self-attention calculation on each position of the feature map to obtain an attention weight matrix, which is then multiplied with the original feature map to obtain the weighted feature map. CBAM is an attention integrator that combines CAM and SAM,¹¹ as shown in Figure 1.

Figure 1.

Overview of CBAM.

CBAM applies channel attention and spatial attention separately, which not only saves parameters and computational resources but also allows it to be integrated as a plug-and-play module into existing network structures.
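The sequential channel-then-spatial weighting can be sketched as follows. This is a simplified illustration rather than the CBAM implementation: the shared MLP in the channel branch defaults to the identity, and the convolution that CBAM applies to the pooled spatial maps is replaced by a fixed two-weight mix (`mix`). Both simplifications, and all function names, are assumptions made to keep the example short. Feature maps are nested lists indexed as [channel][row][column].

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def channel_attention(x, shared_mlp=lambda vec: vec):
    """Weight each channel by a gate computed from its average- and
    max-pooled descriptors (shared_mlp is an identity placeholder)."""
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]
    mx = [max(map(max, ch)) for ch in x]
    weights = [sigmoid(a + m) for a, m in zip(shared_mlp(avg), shared_mlp(mx))]
    return [[[w * v for v in row] for row in ch] for w, ch in zip(weights, x)]

def spatial_attention(x, mix=(0.5, 0.5)):
    """Weight each position by a gate computed from the channel-wise
    mean and max at that position (placeholder for a learned convolution)."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for i in range(H):
        for j in range(W):
            column = [x[c][i][j] for c in range(C)]
            w = sigmoid(mix[0] * sum(column) / C + mix[1] * max(column))
            for c in range(C):
                out[c][i][j] = w * x[c][i][j]
    return out

def cbam(x):
    """Channel attention followed by spatial attention."""
    return spatial_attention(channel_attention(x))

# Hypothetical 2-channel, 2x2 feature map.
features = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
refined = cbam(features)
```

Because both gates are sigmoid outputs, the module only rescales the input and keeps its shape, which is what makes it easy to insert between existing layers.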

On public datasets, these additional attention mechanisms can effectively improve network performance. However, in experiments on the grape leaf disease dataset, they did not achieve the desired effect because grape leaf diseases vary across growth stages and disease types. Therefore, this paper proposes MSAM, an optimized version of CBAM that replaces the original channel attention mechanism with multi-scale convolution attention, and uses it to construct a grape leaf disease recognition network.¹²⁻¹⁸

Grape leaf disease detection

To improve the identification rate of grape leaf diseases on PlantVillage, this paper proposes MSAM-YOLO, a network architecture based on YOLOv8 for grape leaf disease detection, as shown in Figure 2.

Figure 2.

Network architecture of MSAM-YOLO.

Network architecture of MSAM-YOLO

As shown in Figure 2, in the Backbone section we adopt the Cross Stage Partial (CSP) network design of YOLOv8, which extracts features from the input image while reducing the number of model parameters and the computational cost without sacrificing performance, thus helping to improve the model’s operational speed.

The Head section performs multi-scale feature fusion on the image, and, as in YOLOv8, uses decoupled heads that separate the classification head from the regression head. The improvements in this paper are focused on the Head section.

MSAM

MSAM is a multi-scale convolutional attention module that uses multi-scale convolutional attention (MSCA) to replace the CAM of the original CBAM, giving channel attention multi-scale capabilities.

As shown on the left of Figure 3, MSCA consists of three parts: a depthwise convolution that aggregates local information, multi-branch depthwise strip convolutions that capture multi-scale contextual information, and a 1 × 1 convolution that models the relationships between different channels.

Figure 3.

Structure of MSCA.

Mathematically, MSCA can be expressed as:

$$\begin{array}{l} Att = \mathrm{Conv}_{1 \times 1}\left(\sum_{i = 0}^{3} \mathrm{Scale}_{i}\left(\mathrm{DW\_Conv}(F)\right)\right) \\ Out = Att \otimes F \end{array}$$

The variables Att and Out represent the attention map and the output, respectively; F denotes the input features, and $\otimes$ indicates element-wise multiplication. DW_Conv refers to depthwise convolution, $\mathrm{Scale}_i$, $i \in \{0, 1, 2, 3\}$, represents the $i$-th branch on the right of Figure 3, and $\mathrm{Scale}_0$ corresponds to the identity connection.
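The two MSCA equations can be traced step by step with a small sketch on nested lists indexed as [channel][row][column]. The kernel sizes (a 5 × 5 depthwise convolution and strip pairs of length 7, 11, and 21) are assumptions borrowed from common MSCA designs, since this paper does not state them. All kernels use all-ones placeholder weights and the 1 × 1 convolution is the identity, so only the data flow, not a trained model, is shown.

```python
def depthwise_conv(x, kh, kw):
    """Depthwise convolution with an all-ones kh x kw kernel per channel
    (placeholder weights), zero padding, stride 1, odd kernel sizes."""
    ph, pw = kh // 2, kw // 2
    out = []
    for ch in x:
        H, W = len(ch), len(ch[0])
        res = [[0.0] * W for _ in range(H)]
        for i in range(H):
            for j in range(W):
                for a in range(max(0, i - ph), min(H, i + ph + 1)):
                    for b in range(max(0, j - pw), min(W, j + pw + 1)):
                        res[i][j] += ch[a][b]
        out.append(res)
    return out

def pointwise_conv(x, weights):
    """1x1 convolution: weights[o][c] mixes input channel c into output o."""
    H, W = len(x[0]), len(x[0][0])
    return [[[sum(w * x[c][i][j] for c, w in enumerate(row_w))
              for j in range(W)] for i in range(H)] for row_w in weights]

def msca(f, strip_sizes=(7, 11, 21)):
    """Return (Att, Out) with Att = Conv1x1(sum_i Scale_i(DW_Conv(F)))
    and Out = Att * F (element-wise)."""
    C, H, W = len(f), len(f[0]), len(f[0][0])
    base = depthwise_conv(f, 5, 5)            # DW_Conv(F)
    branches = [base]                         # Scale_0: identity connection
    for k in strip_sizes:                     # Scale_1..3: 1xk then kx1 strips
        branches.append(depthwise_conv(depthwise_conv(base, 1, k), k, 1))
    summed = [[[sum(b[c][i][j] for b in branches) for j in range(W)]
               for i in range(H)] for c in range(C)]
    identity = [[1.0 if o == c else 0.0 for c in range(C)] for o in range(C)]
    att = pointwise_conv(summed, identity)    # placeholder 1x1 channel mixing
    out = [[[att[c][i][j] * f[c][i][j] for j in range(W)]
            for i in range(H)] for c in range(C)]
    return att, out
```

Pairing a 1 × k strip with a k × 1 strip covers a k × k neighborhood at a much lower cost than a full k × k kernel, which is what makes the long-range branches affordable.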

By integrating convolutional operations at different scales, we can simultaneously capture both the local details and the overall structure of grape leaf diseases, effectively extracting and utilizing the characteristic information of grape leaf diseases, thereby improving the accuracy of disease identification. Through experiments, we can conclude that the model with MSAM added can better identify grape leaf diseases.

Grad-CAM was used to visualize the features extracted just before the UpSample layer, taking black rot disease as an example, as shown in Figure 4. Before MSAM was introduced, the YOLOv8 network extracted features from the samples relatively randomly, with insufficient focus on the leaf disease area. After MSAM was introduced, the important feature channels gradually take on larger weights during forward propagation, so the network learns which regions need special attention and the improved MSAM-YOLO model extracts hard-to-distinguish feature information from the images more effectively.
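For readers unfamiliar with Grad-CAM, the heat map is computed from the activations of a chosen layer and the gradients of the target score with respect to those activations. The sketch below shows only that final combination step on nested lists; how the activations and gradients are captured from the network is not described in this paper and is left out, and the function name is illustrative.

```python
def grad_cam(activations, gradients):
    """Grad-CAM map from layer activations and gradients, both [K][H][W].

    Each channel weight is the spatial mean of its gradient; the map is
    the ReLU of the weighted sum of the activation channels.
    """
    weights = [sum(map(sum, g)) / (len(g) * len(g[0])) for g in gradients]
    H, W = len(activations[0]), len(activations[0][0])
    return [[max(0.0, sum(w * a[i][j] for w, a in zip(weights, activations)))
             for j in range(W)] for i in range(H)]

# Hypothetical 2-channel, 1x2 example.
acts = [[[1.0, 2.0]], [[3.0, 0.0]]]
grads = [[[1.0, 1.0]], [[-1.0, -1.0]]]
cam = grad_cam(acts, grads)
```

The ReLU keeps only the regions that push the target score up, so the highlighted areas indicate where the network found evidence for the predicted disease.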

Figure 4.

Comparison of Grad-CAM results. (a) The causal organism of leaf blight, (b) Grad-CAM image of YOLOv8, and (c) Grad-CAM image of MSAM-YOLOv8.

From Figure 4, it can be observed that, before the MSAM module is incorporated, there is a recognition error (marked 1) and there are missed detections (marked 2 and 3). With the MSAM module, the network pays more attention to the details of the diseased areas on grape leaves, which helps the model identify grape leaf diseases.

Experimental results and analysis

Dataset and preprocessing

The experiments were run on the Windows 11 Professional operating system with Python 3.8 and the PyTorch 2.1 deep learning framework, using an Intel(R) Xeon(R) CPU E5-26800v2, an NVIDIA GeForce GTX 1080 GPU, and CUDA 12.3.

The dataset consists of original images collected from PlantVillage and other online sources, covering four classes: three common diseases (black_rot, black_measles, and blight) and healthy leaves.

The training set consists of 7279 images, the validation set of 1479 images, and the test set of 485 images, all with a resolution of 640×640. The images were annotated in YOLO format using a labeling tool.
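In the YOLO annotation format, each object is stored as one text line containing a class index followed by the box center, width, and height, all normalized by the image size. The sketch below converts a pixel-space box to such a line; the function name and the class index assignment (0 for the first class) are hypothetical.

```python
def to_yolo_label(class_id, box_xyxy, img_w, img_h):
    """Convert a pixel box (x1, y1, x2, y2) to a YOLO label line:
    'class cx cy w h' with coordinates normalized to [0, 1]."""
    x1, y1, x2, y2 = box_xyxy
    cx = (x1 + x2) / 2 / img_w
    cy = (y1 + y2) / 2 / img_h
    w = (x2 - x1) / img_w
    h = (y2 - y1) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

# Hypothetical lesion box on a 640x640 image.
line = to_yolo_label(0, (160, 160, 480, 320), 640, 640)
```

Because the coordinates are normalized, the same label file remains valid if the image is later resized.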

Identification result

The improved MSAM-YOLOv8 model was used for the identification of grape leaf diseases, and the results are shown in Figure 5. It can be seen from the figure that MSAM-YOLOv8 is capable of accurate localization and identification of grape leaf diseases.

Figure 5.

Comparison of identification results before and after improvement. (a) Identification result of YOLOv8. (b) Identification result of MSAM-YOLOv8.

Comparison of different models

We conducted a thorough comparative analysis on a dataset to assess the capabilities of various models, including YOLOv3, Faster R-CNN, YOLOv5, and YOLOv8, in grape leaf disease identification tasks. As depicted in Table 1, the experimental results reveal that YOLOv8 stands out with remarkable advantages across multiple crucial metrics.

Table 1.

Comparison of evaluation metrics for different models.

Model	mAP	Recall	Parameters
YOLOv3	0.881	0.85	48,725,284
Faster R-CNN	0.744	0.826	40,604,403
YOLOv5s	0.88521	0.76413	12,183,321
YOLOv8n	0.91027	0.88767	3,157,200

First, YOLOv8n achieves the highest mAP (0.910) and the highest recall (0.888) among the four models in Table 1, which matters in scenarios that demand precise recognition of grape leaf diseases.

Second, YOLOv8n has far fewer parameters than the other models: about 3.2 million, compared with 12.2 million for YOLOv5s, 40.6 million for Faster R-CNN, and 48.7 million for YOLOv3. The lighter model is well suited to deployment on mobile devices or in resource-constrained environments. Because grape leaf disease identification often has to be done on site, a lightweight model such as YOLOv8n supports real-time recognition and lowers hardware requirements, which facilitates wider adoption of the technology.

Ablation experiment

To evaluate the contribution of MSAM, YOLOv8n was compared with the improved MSAM-YOLOv8n proposed in this paper. The evaluation metrics were Precision, Recall, and mAP, as shown in Table 2. With MSAM, precision rises from 0.877 to 0.917 and mAP from 0.910 to 0.938, while recall is essentially unchanged (0.888 versus 0.890).

Table 2.

Comparison of evaluation metrics of YOLOv8n and MSAM-YOLOv8n.

Model	Precision	Recall	mAP
YOLOv8n	0.87716	0.88767	0.91027
MSAM-YOLOv8n	0.91717	0.88993	0.93784

The F1_curve, P_curve, PR_curve, and R_curve of each model are shown in Figure 6.

Figure 6.

The F1_curve, P_curve, PR_curve, and R_curve of each model.

From the F1_curve and P_curve of the two models in Figure 6, it can be seen that the new model performs better. The PR_curve of the new model is closer to the top right corner at all thresholds, which means it has higher precision at all levels of recall. In addition, the R_curve of the new model rises faster than that of the old model at all thresholds, indicating that the new model achieves higher recall over a wider range of thresholds.
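The F1 score plotted in the F1_curve is the harmonic mean of precision and recall. As a small worked example, the sketch below applies that definition to the single precision and recall values reported in Table 2; the function and variable names are illustrative, and the curves in Figure 6 are computed over many confidence thresholds rather than from these two operating points.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs taken from Table 2.
table2 = {
    "YOLOv8n": (0.87716, 0.88767),
    "MSAM-YOLOv8n": (0.91717, 0.88993),
}
f1 = {name: f1_score(p, r) for name, (p, r) in table2.items()}
```

Since recall is nearly the same for both models, the higher F1 of MSAM-YOLOv8n at this operating point comes almost entirely from its gain in precision.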

Conclusions

The MSAM-YOLOv8 model proposed in this paper introduces the multi-scale convolutional attention module (MSAM) into the feature extraction network. By focusing more on the diseased parts, it helps to improve the detection rate of grape leaf diseases. The MSAM, by performing convolution operations at different scales and using attention mechanisms to emphasize the features of the diseased area, enables the model to capture subtle features of diseases more effectively. This design, which focuses on the diseased parts, is significant in the task of detecting grape leaf diseases.

Furthermore, the MSAM module not only improves the detection rate of grape leaf diseases but is also general in design, so it may be applicable to disease detection tasks for other types of plant leaves. This makes the MSAM-YOLOv8 model a promising tool in the field of plant leaf disease detection, helping to improve the quality and efficiency of agricultural production.

Despite the excellent performance of the MSAM-YOLOv8 model in grape leaf disease detection tasks, there may be some challenges, such as the model’s high computational complexity, which prevents effective deployment on end devices with limited computing resources. In future research, we may explore how to optimize the structure and parameters of the model to reduce computational complexity and improve the model’s real-time performance and practicality.

Statements and declarations

Footnotes

Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study is supported by the Zhejiang Provincial Colleges and Universities Domestic Visiting Scholars “Teacher Professional Development Program” Projects, FX2023114.

References

1. Tang, Yang, et al. Grape disease image classification based on lightweight convolution neural networks and channelwise attention. Comput Electron Agric 2020; 178: 105735. DOI: 10.1016/j.compag.2020.105735.

2. Rani, Thapa, Kanojia, et al. An extensive review on the consequences of chemical pesticides on human health and environment. J Clean Prod 2021; 283: 124657.

3. Keinath, de Figueiredo Silva. Economic impacts of reduced fungicide efficacy against downy mildew on slicing cucumber. Crop Prot 2022; 155: 105934.

4. Solairaj, Legrand NNG, Yang, et al. Isolation of pathogenic fungi causing postharvest decay in table grapes and in vivo biocontrol activity of selected yeasts against them. Physiol Mol Plant Pathol 2020; 110: 101478.

5. Tan, Jiang. Review on Convolutional Neural Network (CNN) applied to plant leaf disease classification. Agriculture 2021; 11(8): 707.

6. Jiang, Chen, et al. Image recognition of four rice leaf diseases based on deep learning and support vector machine. Comput Electron Agric 2020; 179: 105824.

7. Hussain, Khan, Tariq, et al. Multiclass cucumber leaf diseases recognition using best feature selection. Comput Mater Continua (CMC) 2022; 70(2): 3281–3294.

8. Geetha, Samundeswari, Saranya, et al. Plant leaf disease classification and detection system using machine learning. J Phys: Conf Ser 2020; 1712(1): 012012.

9. Abd Algani, Caro OJM, Bravo LMR, et al. Leaf disease identification and classification using optimized deep learning. Measurement: Sensors 2023; 25: 100643.

10. Niu, Zhong. A review on the attention mechanism of deep learning. Neurocomputing 2021; 452: 48–62.

11. Woo, Park, Lee J-Y, et al. CBAM: convolutional block attention module. In: Ferrari, Hebert, Sminchisescu, et al. (eds) Computer vision–ECCV 2018. Cham: Springer, 2018, pp. 3–19. DOI: 10.1007/978-3-030-01234-2_1.

12. Yang, Zhou, et al. A hybrid model of ghost-convolution enlightened transformer for effective diagnosis of grape leaf disease and pest. Journal of King Saud University-Computer and Information Sciences 2022; 34(5): 1755–1767.

13. Wang, Huang, et al. Enhancing YOLOv7-based fatigue driving detection through the integration of coordinate attention mechanism. In: 2023 IEEE international conference on image processing and computer applications, ICIPCA 2023, Changchun, China, 11–13 August 2023, pp. 725–729. DOI: 10.1109/ICIPCA59209.2023.10257685.

14. Bao, Meng, et al. Reformation of coal reservoirs by microorganisms and its significance in CBM exploitation. Fuel 2024; 360: 130642. DOI: 10.1016/j.fuel.2023.130642.

15. Cai, Jiang. An improved plant disease detection method based on YOLOv5. In: Proceedings - 2023 15th international conference on intelligent human-machine systems and cybernetics, IHMSC 2023, Hangzhou, China, 26–27 August 2023, pp. 237–240. DOI: 10.1109/IHMSC58761.2023.00062.

16. Guo, Hou, et al. Segnext: rethinking convolutional attention design for semantic segmentation. Adv Neural Inf Process Syst 2022; 35: 1140–1156.

17. Demilie. Plant disease detection and classification techniques: a comparative study of the performances. J Big Data 2024; 11(1): 5.

18. Chen, Zhu, Liu, et al. MSCA-UNet: multi-scale channel attention-based UNet for segmentation of medical ultrasound images. Cluster Comput 2024; 27: 1–18.