Improving arm segmentation in sign language recognition systems using image processing

Abstract

BACKGROUND:

For a traditional vision-based static sign language recognition (SLR) system, arm segmentation is a major factor restricting the accuracy of SLR.

OBJECTIVE:

To achieve accurate arm segmentation for different bent arm shapes, we designed a segmentation method for a static SLR system based on image processing and combined it with morphological reconstruction.

METHODS:

First, skin segmentation was performed using YCbCr color space to extract the skin-like region from a complex background. Then, the area operator and the location of the mass center were used to remove skin-like regions and obtain the valid hand-arm region. Subsequently, the transverse distance was calculated to distinguish different bent arm shapes. The proposed segmentation method then extracted the hand region from different types of hand-arm images. Finally, the geometric features of the spatial domain were extracted and the sign language image was identified using a support vector machine (SVM) model. Experiments were conducted to determine the feasibility of the method and compare its performance with that of neural network and Euclidean distance matching methods.

RESULTS:

The results demonstrate that the proposed method can effectively segment skin-like regions from complex backgrounds as well as different bent arm shapes, thereby improving the recognition rate of the SLR system.

Keywords

Static sign language recognition, image segmentation, bent arm shape, transverse distance, geometric feature

1. Introduction

Sign language involves communication through hand gestures and body language, which can convey the same degree of information as the spoken or written word. Sign language recognition (SLR) provides a simple and natural mode of human-computer interaction that has gradually developed various applications, such as motion sensing games, [1] robotic control, [2] and intelligent buildings [3]. Moreover, for people with hearing impairments, sign language is their predominant method of communication.

Currently, SLR methods can be categorized based on the method of data acquisition into sensor-based or vision-based approaches. Sensor-based approaches directly capture accurate spatial information of the hands, wrists, fingers, and other body parts using sensors such as power gloves, [4] surface electromyogram signal sensor, [5] cyber gloves, [6] and data gloves [7]. For example, previous studies using various types of sensor data have achieved high gesture recognition rates of over 90% [8, 9]. However, the sensor devices used in sensor-based approaches can restrict the movements of the signer and may require complicated and costly setup.

Conversely, vision-based approaches only require a camera to identify sign language gestures, which has the advantage of being more user friendly. Vision-based approaches typically comprise hand region acquisition, hand-arm segmentation, classification, and recognition. Currently, the typical methods of hand region acquisition are background subtraction [10] and methods based on a color space such as RGB [11]. For example, Zhang et al. achieved robust hand detection by precisely extracting features using background subtraction with a recognition rate of up to 92.5% [12]. Lee et al. also applied background subtraction to extract moving gestures using the difference between the foreground and the background [13]. Lim et al. proposed a background detection method based on a combination of median and template filters [14]. To obtain the foreground object, the absolute difference between the image and the current background was calculated; then, the absolute difference between two subsequent frames was obtained to segment the hand motion region [14]. For the background subtraction method, it is vital to determine the difference between the foreground and background in order to avoid substantial errors.

Due to the unique color of human skin, methods based on a color space are commonly used for hand detection. Rautaray and Agrawal implemented an effective hand tracking approach based on the L*a*b* color space. This method has the advantage of a fast tracking speed but is only effective against a simple background [15]. In order to improve the adaptability of such methods to different environments, Zhang and Huang proposed a hand tracking method based on the hue, saturation, value (HSV) color space, in which the central location of the hand was tracked by a super pixel cluster, integrated multiple orientation gradient feature and histogram feature, and skin-like mask [16]. Their method was able to reliably track hand gestures even in complex environments with different lighting conditions. Shao et al. proposed a feature learning method based on the SAE-PCA network to recognize human gestures in RGB-depth (RGB-D) images, in which SAE was used to learn features from the RGB and depth channels [17]. Their proposed feature learning model improved the recognition rate from 75% to 99%. Moreover, Bilal et al. proposed a face and hand detection method based on YCbCr color space, which employed Haar-like features and the AdaBoost technique to detect human skin-like pixels [18]. However, when other skin-color regions (e.g., face or arms) or skin-like regions are present in the image, these methods cannot accurately detect the hand region.

The process of arm segmentation removes the arm from the image and eliminates all redundant information. For example, Liu et al. used a method that detected the spindle of the minimum enclosing rectangle of the hand in order to determine the wrist line; however, this method does not work when the hand is rotated and it is not applicable to the case of a bare elbow or a bent arm [19]. Zare et al. proposed a wrist cropping algorithm based on the longest horizontal line of the hand to the hand-arm segment that requires the arm direction to be adjusted prior to the calculation [20]. Gao et al. used the narrow boundary width to determine the location of the horizontal wrist line and remove the area below this line [21]. However, these methods require geometric correction to adjust the image when the arm is tilted. Therefore, it is difficult to segment bare or bent arms using existing SLR systems. As such, signers must ensure that their arms are covered to avoid complex segmentation of the hand-arm region. Moreover, existing methods are only suitable for forearm segmentation and not for segmentation of the upper arm when displaying different bent arm shapes.

In this study, a segmentation method is proposed for different bent arm shapes based on image processing. Morphological reconstruction is then employed to design a static SLR system based on the proposed segmentation method. After skin segmentation and denoising, the hand-arm region is obtained based on both the area operator and the location of the mass center. For different bent arm shapes, the transverse distance is calculated to distinguish the arm shape. Then, images containing a hand plus small bent arm are segmented using the vertical cutting line and images containing a hand plus large bent arm or a hand plus long arm are segmented based on the Euclidean distance and the cutting line. Finally, the geometric features of the spatial domain are extracted and the gesture is identified using the support vector machine (SVM) model. In contrast to existing segmentation methods, the proposed method can resolve the problem of multiple skin and skin-like regions in an image, effectively segment different bent arm shapes, and accurately recognize all 26 individual English letters. Furthermore, this method has wide applicability as it is effective regardless of how the arm is rotated, the size of the hand, or the color of the skin.

The remainder of this paper is arranged as follows. Section 2 describes the proposed SLR system including acquisition of the hand-arm region, hand-arm segmentation, classification, and recognition. The experimental results are presented and discussed in Section 3. Finally, the conclusions are presented in Section 4.

2. SLR system design

A static SLR system was designed to extract the hand region from a complex background and identify the sign language image. After skin segmentation and image denoising, the image containing the hand-arm region was obtained by hand-arm localization. However, different types of hand-arm regions may exist in the image; therefore, if every image is segmented with the same method, the quality of the segmented image will be affected. As a result, different segmentation algorithms were designed to obtain the hand region for different hand-arm images. A vertical cutting line was employed to segment hands plus slightly bent arms and both the Euclidean distance and a cutting line were employed to segment hands plus highly bent arms or hands plus long arms. The geometric features of the spatial domain were then extracted. Finally, the sign language image was identified with an SVM model. The detailed algorithm is shown in Fig. 1.

Figure 1.

Structure of the proposed sign language recognition system.

2.1 Acquisition of the hand-arm region

2.1.1 Skin segmentation

Because luminance and chrominance are separate in YCbCr color space, it was employed in this study to extract skin regions. An example sign language image from the dataset used in the experiment is shown in Fig. 2a, in which the signer is wearing a skin-like color and has a bare arm. Furthermore, the color of the background is similar to that of the signer’s clothing. 100 sign language images were randomly selected from this dataset. The ranges of Cb and Cr values with variable brightness (Y) are shown in Fig. 2b for YCbCr color space, which were used to set the Y, Cb, and Cr value ranges for the experiments; i.e.:

$\displaystyle 50<Y<200,\quad 77<Cb<125,\quad 110<Cr<163$ (1)

According to Eq. (1), the skin regions in the sign language image were detected in YCbCr color space, as shown in Fig. 2c, which shows three skin or skin-like regions after skin segmentation. However, there is also noise in the image, which will influence the accuracy of subsequent hand-arm recognition.
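The thresholding in Eq. (1) can be sketched per pixel as follows. This is a minimal illustration, not the authors' implementation: the paper does not state which RGB-to-YCbCr conversion was used, so the full-range BT.601 (JPEG) weights below are an assumption, and the function names `rgb_to_ycbcr`, `is_skin`, and `skin_mask` are hypothetical.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to YCbCr.

    Assumption: full-range BT.601 (JPEG) weights; the paper does not
    specify the conversion it used.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def is_skin(r, g, b):
    """Apply the Y, Cb, and Cr ranges of Eq. (1) to a single pixel."""
    y, cb, cr = rgb_to_ycbcr(r, g, b)
    return 50 < y < 200 and 77 < cb < 125 and 110 < cr < 163


def skin_mask(image):
    """Map rows of (r, g, b) tuples to rows of 0/1 skin flags."""
    return [[1 if is_skin(*pixel) else 0 for pixel in row] for row in image]
```

Because the test uses only color, any clothing or background pixel that falls inside the three ranges is also marked, which is why the later localization step is needed.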

Figure 2.

(a) Original image, (b) distribution of Cb and Cr with brightness (Y), and (c) image after skin segmentation using the values derived from (b).

2.1.2 Image denoising

The median filter was used to remove the noise. Performing image denoising before or after skin segmentation results in a difference in image quality, as shown in Fig. 3. The image filtered after skin segmentation is superior to that filtered before skin segmentation. Hence, image denoising was performed after skin segmentation in the experiments.
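As a rough illustration of the denoising step, the sketch below applies a median filter to a binary skin mask. The 3x3 window, the edge handling (border pixels reuse the nearest valid neighbor), and the choice to filter the binary mask are all assumptions, since the paper does not give these details; `median_filter_3x3` is a hypothetical name.

```python
from statistics import median


def median_filter_3x3(mask):
    """3x3 median filter over rows of 0/1 values.

    Assumptions: window size 3x3 and clamped (edge-replicating) borders.
    """
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                mask[min(max(r + dr, 0), rows - 1)][min(max(c + dc, 0), cols - 1)]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
            ]
            out[r][c] = int(median(window))
    return out
```

On a binary mask the median acts as a majority vote over the window, so isolated specks disappear and single-pixel holes are filled.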

Figure 3.

A hand-arm region image filtered (a) before skin segmentation and (b) after skin segmentation.

Figure 4.

Images of the three types of hand-arm region extracted from the complex background: (a) a hand plus slightly bent arm, (b) a hand plus highly bent arm, and (c) a hand plus long arm. Left-hand images are after binarization and right-hand images are after positioning.

2.1.3 Localization of the hand-arm region

After skin segmentation and image denoising, three skin or skin-like regions are still apparent in the example filtered image (Fig. 3b), which will influence segmentation of the image; thus, all regions other than the hand-arm region must be eliminated prior to segmentation. Therefore, a positioning method based on both the area operator and the location of the mass center was employed to obtain the hand-arm region. First, the coordinates of the mass center of each region were calculated based on the zero moment and first moments, as shown in Eqs (2) and (3). Two regions had mass center coordinates that were smaller in the $X$ direction; these were the neck and the hand-arm. Then, the region occupying the largest area was selected from these two regions as the hand-arm region. As a result, the hand-arm region was extracted from the complex background for all three types of hand-arm shape (Fig. 4).

$\displaystyle m_{00}=\sum\limits_{j=1}^{J}\sum\limits_{i=1}^{I}V(i,j),\quad m_{10}=\sum\limits_{j=1}^{J}\sum\limits_{i=1}^{I}i\cdot V(i,j),\quad m_{01}=\sum\limits_{j=1}^{J}\sum\limits_{i=1}^{I}j\cdot V(i,j)$ (2)

$\displaystyle\bar{x}=\frac{m_{10}}{m_{00}},\quad\bar{y}=\frac{m_{01}}{m_{00}}$ (3)

Here, $V(i,j)$ is the grayscale value of the image at point $(i,j)$ , and $I$ and $J$ are the width and height of the image, respectively.
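The moment computation of Eqs (2) and (3) and the region selection described above can be sketched as below. The helper names are hypothetical, and `select_hand_arm` encodes the rule as it is stated for the example image (keep the two regions with the smallest mass center abscissa, then take the larger one); the `keep=2` default is an assumption tied to that example.

```python
def raw_moment(image, p, q):
    """m_pq = sum over j, i of i^p * j^q * V(i, j), with 1-based column i and row j."""
    return sum(
        (i ** p) * (j ** q) * value
        for j, row in enumerate(image, start=1)
        for i, value in enumerate(row, start=1)
    )


def mass_center(image):
    """Return (x_bar, y_bar) as in Eq. (3)."""
    m00 = raw_moment(image, 0, 0)
    if m00 == 0:
        raise ValueError("empty region: m00 is zero")
    return raw_moment(image, 1, 0) / m00, raw_moment(image, 0, 1) / m00


def select_hand_arm(regions, keep=2):
    """Pick the hand-arm label from (label, area, x_bar) tuples.

    Hypothetical helper: keep the `keep` regions with the smallest x_bar,
    then return the label of the one with the largest area.
    """
    candidates = sorted(regions, key=lambda region: region[2])[:keep]
    return max(candidates, key=lambda region: region[1])[0]
```

For a 2x2 block of ones occupying columns 2 to 3 and rows 1 to 2, the moments are $m_{00}=4$, $m_{10}=10$, and $m_{01}=6$, giving a mass center of (2.5, 1.5).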

Figure 5.

Illustrations of the $t d$ descriptor for (a) a hand plus slightly bent arm, (b) a hand plus highly bent arm, and (c) a hand plus long arm.

Figure 6.

Images of a hand plus slightly bent arm (a) before segmentation and (b) after segmentation.

2.2 Hand-arm segmentation

For traditional vision-based SLR, the signers typically wear long-sleeved clothes to avoid complex segmentation of the arm. However, bare arms are inevitable in natural interactions. To address this problem, different segmentation algorithms were designed for different bent arm shapes. First, the hand-arm contour was obtained using a chain code tracking algorithm. The leftmost point and the rightmost point of the contour were selected, and the abscissa difference between them was defined as the transverse distance and denoted as $t d$ (Fig. 5). The transverse distance was used to differentiate between the different bent arm shapes; i.e., a value of $t d$ less than a certain threshold ( $\alpha)$ defined a hand plus slightly bent arm, which corresponds to an arm with slight flexing and a close hand (Fig. 5a). A value of $t d$ greater than or equal to $\alpha$ corresponds to a hand plus highly bent arm or a hand plus long arm. The hand plus highly bent arm is an arm with a large amount of flexing (Fig. 5b) and the hand plus long arm describes an arm that is longer in the horizontal direction than the other types (Fig. 5c).
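The transverse distance itself is a one-line computation once the contour is available. The sketch below assumes the contour is given as a list of (x, y) points; the chain code tracking that produces it is not reproduced here, and `transverse_distance` is a hypothetical name.

```python
def transverse_distance(contour):
    """Return td: the abscissa difference between the rightmost and
    leftmost contour points. `contour` is an iterable of (x, y) pairs."""
    xs = [x for x, _ in contour]
    if not xs:
        raise ValueError("contour is empty")
    return max(xs) - min(xs)
```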

2.2.1 Segmentation of a hand with a slightly bent arm

For the hand plus slightly bent arm shape, the point whose ordinate is largest among all points in the upper boundary was selected and denoted as $P g$ , as shown in Fig. 6a. The abscissa of the reference point, denoted as $Pg_{1}$ , was the same as that of $P g$ , and its ordinate was the sum of the $P g$ ordinate and the width of the wrist. The line segment between $P g$ and $Pg_{1}$ was defined as the vertical cutting line. The point whose ordinate was smallest among all points of the hand-arm region was denoted as $P u$ , which was used as the seed for the morphological reconstruction. Using the contour of the hand-arm region as a template, the region of the hand plus slightly bent arm was divided into two parts based on the cutting line. All the points in the region containing the seed $P u$ were set to white and all others were set to black. The result of the morphological reconstruction is shown in Fig. 6b. The arm was removed, allowing the hand to be accurately extracted. This segmented image was then used to extract features (Section 2.3).
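The effect of the seeded reconstruction can be approximated with a flood fill that starts at the seed and is not allowed to cross the cutting line. This is a stand-in for morphological reconstruction rather than the authors' procedure: the 4-connectivity, the representation of the cutting line as a set of (row, col) pixels, and the name `keep_seed_side` are all assumptions.

```python
from collections import deque


def keep_seed_side(mask, cut_pixels, seed):
    """Keep the part of `mask` connected to `seed` without crossing the cut.

    mask: rows of 0/1 values; cut_pixels: set of (row, col) on the cutting
    line; seed: (row, col). Kept pixels are 1 (white), all others 0 (black).
    """
    rows, cols = len(mask), len(mask[0])
    out = [[0] * cols for _ in range(rows)]
    if not mask[seed[0]][seed[1]] or seed in cut_pixels:
        return out
    out[seed[0]][seed[1]] = 1
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if (0 <= nr < rows and 0 <= nc < cols and mask[nr][nc]
                    and not out[nr][nc] and (nr, nc) not in cut_pixels):
                out[nr][nc] = 1
                queue.append((nr, nc))
    return out
```

With a vertical cutting line, everything on the seed side of the line survives and the arm on the other side is set to black, which matches the behavior described for the seed $P u$.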

2.2.2 Segmentation of other hand-arm shapes

For the hand plus highly bent arm or hand plus long arm cases, the minimum distances between each point inside the hand-arm region and the points on the hand-arm contour were calculated based on the Euclidean distance. Then, the set including the minimum distances was built and denoted as $S_{1}$ and the maximum distance in set $S_{1}$ was selected and denoted as $R_{1}$ . The point inside the hand region corresponding to $R_{1}$ is the center of the palm ( $P c$ ), as shown in Fig. 7a. Finally, the distances between point $P c$ and each point on the hand-arm contour were calculated and the maximum distance was selected (maxDistance). A maxDistance of less than a certain threshold ( $\beta)$ , indicated a hand plus highly bent arm; otherwise, it was a hand plus long arm.
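The search for the palm center can be sketched by brute force, assuming a binary mask as input. The contour definition used here (a foreground pixel with a 4-neighbor that is background or outside the image) is an assumption, `palm_center` is a hypothetical name, and the quadratic cost makes this suitable only for small illustrative masks.

```python
import math


def palm_center(mask):
    """Return (Pc, R1, max_distance) for rows of 0/1 values.

    Pc is the region pixel whose minimum distance to the contour is largest,
    R1 is that distance, and max_distance is the largest distance from Pc
    to any contour pixel.
    """
    rows, cols = len(mask), len(mask[0])
    region = [(r, c) for r in range(rows) for c in range(cols) if mask[r][c]]

    def on_contour(r, c):
        # Assumed contour test: some 4-neighbor is background or off-image.
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if not (0 <= nr < rows and 0 <= nc < cols) or not mask[nr][nc]:
                return True
        return False

    contour = [p for p in region if on_contour(*p)]
    if not contour:
        raise ValueError("mask has no foreground pixels")
    # S1: minimum distance from every region pixel to the contour.
    s1 = {p: min(math.dist(p, q) for q in contour) for p in region}
    pc = max(s1, key=s1.get)
    r1 = s1[pc]
    max_distance = max(math.dist(pc, q) for q in contour)
    return pc, r1, max_distance
```

For a filled 7x7 square the center pixel is selected with $R_{1}=3$, and maxDistance is the distance to a corner, $\sqrt{18}$.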

For the hand plus highly bent arm case, the inscribed circle of the palm with a center $P c$ and a radius $R_{1}$ was drawn, as shown in Fig. 7a. The radius $R_{2}$ was defined as the product of the radius $R_{1}$ and 1.75, resulting in an outer circle with the same center but a larger radius, as shown in Fig. 7b. The distances between each point on the boundary of the outer circle and the points on the hand-arm contour were selected from set $S_{1}$ . A set including the distances was also built and denoted as $S_{2}$ . Then, the maximum distance in set $S_{2}$ was selected. The point on the boundary of the outer circle corresponding to the above maximum distance was denoted as Pmax (Fig. 7c) and the line segment between $P c$ and Pmax was denoted as PcPmax (Fig. 7c). The line perpendicular to PcPmax through Pmax was then calculated ( $L_{2}$ ). The two intersections between $L_{2}$ and the contour of the hand-arm region were denoted as $P a$ and $P b$ (Fig. 7d) and the line segment between $P a$ and $P b$ was defined as the cutting line. Finally, $P c$ was employed as a seed for the morphological reconstruction. Using the contour of the hand-arm as a template, the hand plus highly bent arm region was divided into two parts based on the cutting line. All pixels in the region containing the seed $P c$ were set to white and all others were set to black. The result of the morphological reconstruction is shown in Fig. 7e.

Figure 7.

Images of hand plus highly bent arm: (a) the inscribed circle, (b) the outer circle, (c) Pmax, (d) the cutting line, and (e) the morphological reconstruction.

For the hand plus long arm case, the set including the minimum distances was built according to the same method as that for set $S_{1}$ and denoted as $S_{3}$ . The center $P d$ and the radius $R_{3}$ were also calculated using the same method. However, the location of $P d$ was inside the elbow; therefore, the inscribed circle of the elbow with a center $P d$ and a radius $R_{3}$ was drawn as shown in Fig. 8a. The radius $R_{4}$ was defined as the product of the radius $R_{3}$ and 1.75. Then, the outer circle with a center $P d$ and a radius $R_{4}$ was drawn (Fig. 8b). For every point on the boundary of the outer circle, set $S_{4}$ was also built using the same method as that for set $S_{2}$ . Finally, the maximum distance in set $S_{4}$ was selected. The point on the boundary of the outer circle corresponding to the above maximum distance was denoted as $P_{1\max}$ (Fig. 8c). The line segment between $P d$ and $P_{1\max}$ was denoted as $PdP_{1\max}$ and the line perpendicular to $PdP_{1\max}$ through $P_{1\max}$ was denoted as $L_{3}$ . The two intersections between $L_{3}$ and the contour of the hand-arm region were denoted as $P e$ and $P f$ (Fig. 8d). The line segment between $P e$ and $P f$ was defined as the outer cutting line. Finally, $P d$ was used as a seed for the morphological reconstruction. Using the contour of the hand-arm as a template, the hand plus long arm region was divided into two parts based on the outer cutting line. All pixels in the region containing the seed $P d$ were set to black; otherwise, they were set to white. The result of the morphological reconstruction after the first segmentation is shown in Fig. 8e, in which the elbow has been removed but the forearm still exists. Therefore, a second segmentation was required to obtain the hand region alone. The result of the morphological reconstruction after the second segmentation is shown in Fig. 8f.

Figure 8.

Images of a hand plus long arm: (a) the inscribed circle, (b) the outer circle, (c) $P_{1\max}$ , (d) the cutting line, (e) after the first segmentation, (f) the morphological reconstruction (after the second segmentation), and (g) the extracted hand contour of (f).

2.3 Classification and recognition

2.3.1 Hand contour detection

Because hand contours are complex, a chain code tracking algorithm was employed to extract them. An example of an extracted contour is shown in Fig. 8g. The extracted hand contours were continuous and integrated; thus, they could be used to extract features.

2.3.2 Hand feature extraction

The performance of the SLR system can be significantly improved by extracting good quality features. In this study, the geometric features of the spatial domain were used to describe the two-dimensional projections of the hand image. Because they are invariant to translation, scaling, and rotation, the Hu moments were employed to capture the image area, center of gravity, symmetry, and other characteristics. Seven invariant moments of the hand contour were extracted: $\varphi_{1},\varphi_{2},\varphi_{3},\varphi_{4},\varphi_{5},\varphi_{6},\varphi_{7}$ . The central moments of the image were calculated and normalized based on Eq. (4) and then used to derive the seven invariant moments according to Eq. (5):

$\displaystyle\mu_{pq}=\sum\limits_{j=1}^{J}\sum\limits_{i=1}^{I}(i-\bar{x})^{p}\times(j-\bar{y})^{q}\times V(i,j),\quad p,q=0,1,2,3$
$\displaystyle\eta_{pq}=\frac{\mu_{pq}}{\mu_{00}^{r}},\quad r=\frac{p+q}{2}+1,\quad p,q=0,1,2,3$ (4)

$\displaystyle\varphi_{1}=\eta_{20}+\eta_{02},$
$\displaystyle\varphi_{2}=(\eta_{20}-\eta_{02})^{2}+4\times\eta_{11}^{2},$
$\displaystyle\varphi_{3}=(\eta_{30}-3\times\eta_{12})^{2}+(3\times\eta_{21}-\eta_{03})^{2},$
$\displaystyle\varphi_{4}=(\eta_{30}+\eta_{12})^{2}+(\eta_{21}+\eta_{03})^{2},$
$\displaystyle\varphi_{5}=(\eta_{30}-3\times\eta_{12})(\eta_{30}+\eta_{12})[(\eta_{30}+\eta_{12})^{2}-3\times(\eta_{21}+\eta_{03})^{2}]+(3\times\eta_{21}-\eta_{03})(\eta_{21}+\eta_{03})[3\times(\eta_{30}+\eta_{12})^{2}-(\eta_{21}+\eta_{03})^{2}],$
$\displaystyle\varphi_{6}=(\eta_{20}-\eta_{02})[(\eta_{30}+\eta_{12})^{2}-(\eta_{21}+\eta_{03})^{2}]+4\times\eta_{11}\times(\eta_{30}+\eta_{12})\times(\eta_{21}+\eta_{03}),$
$\displaystyle\varphi_{7}=(3\times\eta_{21}-\eta_{03})(\eta_{30}+\eta_{12})[(\eta_{30}+\eta_{12})^{2}-3\times(\eta_{21}+\eta_{03})^{2}]-(\eta_{30}-3\times\eta_{12})(\eta_{21}+\eta_{03})[3\times(\eta_{30}+\eta_{12})^{2}-(\eta_{21}+\eta_{03})^{2}]$ (5)

where $V(i,j)$ is the grayscale value of the image at the point $(i,j)$ and $I$ and $J$ are the width and height of the image, respectively.
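A compact way to check the seven invariants is to compute them directly from the pixel values. The sketch below follows the standard Hu definitions and uses the conventional normalization exponent $(p+q)/2+1$; it is an illustration on small nested lists, not the authors' code, and `hu_moments` is a hypothetical name.

```python
def hu_moments(image):
    """Return the seven Hu invariants of an image given as rows of values."""
    pixels = [
        (i, j, v)
        for j, row in enumerate(image, start=1)
        for i, v in enumerate(row, start=1)
        if v
    ]
    m00 = sum(v for _, _, v in pixels)
    if m00 == 0:
        raise ValueError("image has no foreground")
    x_bar = sum(i * v for i, _, v in pixels) / m00
    y_bar = sum(j * v for _, j, v in pixels) / m00

    def eta(p, q):
        # Normalized central moment; mu_00 equals m00.
        mu = sum((i - x_bar) ** p * (j - y_bar) ** q * v for i, j, v in pixels)
        return mu / m00 ** ((p + q) / 2 + 1)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    a, b = n30 + n12, n21 + n03            # recurring sums
    c, d = n30 - 3 * n12, 3 * n21 - n03    # recurring differences
    return [
        n20 + n02,
        (n20 - n02) ** 2 + 4 * n11 ** 2,
        c ** 2 + d ** 2,
        a ** 2 + b ** 2,
        c * a * (a ** 2 - 3 * b ** 2) + d * b * (3 * a ** 2 - b ** 2),
        (n20 - n02) * (a ** 2 - b ** 2) + 4 * n11 * a * b,
        d * a * (a ** 2 - 3 * b ** 2) - c * b * (3 * a ** 2 - b ** 2),
    ]
```

Translating a shape or rotating it by 90 degrees on the pixel grid leaves all seven values unchanged up to floating-point error, which is the property the feature vector relies on.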

Sign language gestures with similar hand shapes cannot be accurately classified based on the Hu moments alone. Therefore, two hand descriptors were designed to further distinguish similar gestures. First, the sum of all pixels in the hand contour was calculated (Fig. 9a). The polygon was defined as the convex hull connecting the outermost points of the hand contour and was colored in cyan (Fig. 9b). The sum of all pixels in the convex hull was also calculated. The first descriptor was the ratio of these two pixel sums, as shown in Eq. (6), and the second descriptor was the ratio of the long axis to the short axis of the ellipse, as shown in Eq. (7). The orientation angle was calculated using both the zero moment and the second moments, as shown in Eq. (8). Here, the ellipse has the same orientation angle as the hand contour and contains all points on the hand contour. The long axis and the short axis of the ellipse were calculated using Eq. (9).

$\displaystyle\gamma_{1}=\frac{N_{1}}{N_{2}}$ (6)

$\displaystyle\gamma_{2}=\frac{x}{y}$ (7)

Figure 9.

The convex hull of the hand contour.

Here, $N_{1}$ is the sum of all pixels in the hand contour, $N_{2}$ is the sum of all pixels in the convex hull, and $x$ and $y$ are the long and short axes of the ellipse, respectively.
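One way to evaluate $\gamma_{1}$ on a small mask is sketched below. It assumes that $N_{1}$ counts the foreground pixels of the filled hand region and that $N_{2}$ counts the grid points inside or on the convex hull of those pixels; the hull is built with Andrew's monotone chain, which is a choice made here rather than something stated in the paper, and all function names are hypothetical.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def gamma1(mask):
    """Ratio of foreground pixel count (N1) to convex hull pixel count (N2)."""
    points = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not points:
        raise ValueError("mask has no foreground pixels")
    hull = convex_hull(points)
    if len(hull) < 3:
        return 1.0  # degenerate hull: a single point or a straight segment
    edges = list(zip(hull, hull[1:] + hull[:1]))
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    n2 = sum(
        all(cross(a, b, (x, y)) >= 0 for a, b in edges)
        for x in range(min(xs), max(xs) + 1)
        for y in range(min(ys), max(ys) + 1)
    )
    return len(points) / n2
```

A convex shape gives a ratio of 1, and the ratio falls as the gaps between extended fingers take up more of the hull.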

$\displaystyle m_{20}=\sum\limits_{j=1}^{J}\sum\limits_{i=1}^{I}i^{2}\times V(i,j),\quad m_{02}=\sum\limits_{j=1}^{J}\sum\limits_{i=1}^{I}j^{2}\times V(i,j),\quad m_{11}=\sum\limits_{j=1}^{J}\sum\limits_{i=1}^{I}i\times j\times V(i,j),$
$\displaystyle\theta=\frac{1}{2}\times\arctan\frac{2\times\mu_{11}}{\mu_{20}-\mu_{02}}$ (8)

Here, $V(i,j)$ is the grayscale value of the image at the point $(i,j)$ , $I$ and $J$ are the width and height of the image, respectively, and $\mu_{11}$ , $\mu_{20}$ , and $\mu_{02}$ can be calculated according to Eq. (4).

$\displaystyle x=\sqrt{\frac{2\left(\mu_{20}+\mu_{02}+\sqrt{(\mu_{20}-\mu_{02})^{2}+4\mu_{11}^{2}}\right)}{\mu_{00}}},$
$\displaystyle y=\sqrt{\frac{2\left(\mu_{20}+\mu_{02}-\sqrt{(\mu_{20}-\mu_{02})^{2}+4\mu_{11}^{2}}\right)}{\mu_{00}}}$ (9)

Therefore, the feature vector $f v$ of any sign language image is represented as follows:

$\displaystyle fv=\{\varphi_{1},\varphi_{2},\varphi_{3},\varphi_{4},\varphi_{5},\varphi_{6},\varphi_{7},\gamma_{1},\gamma_{2}\}$ (10)
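The ellipse descriptor $\gamma_{2}$ in the feature vector can be computed from the second-order central moments as sketched below. The axis lengths follow Eq. (9); for the orientation the sketch uses `atan2` with the conventional factor of 2 in the numerator so that the case $\mu_{20}=\mu_{02}$ does not divide by zero. The function name and the return layout are hypothetical.

```python
import math


def ellipse_descriptor(image):
    """Return (theta, long_axis, short_axis, gamma2) for rows of pixel values."""
    pixels = [
        (i, j, v)
        for j, row in enumerate(image, start=1)
        for i, v in enumerate(row, start=1)
        if v
    ]
    m00 = sum(v for _, _, v in pixels)
    if m00 == 0:
        raise ValueError("image has no foreground")
    x_bar = sum(i * v for i, _, v in pixels) / m00
    y_bar = sum(j * v for _, j, v in pixels) / m00

    def mu(p, q):
        return sum((i - x_bar) ** p * (j - y_bar) ** q * v for i, j, v in pixels)

    mu20, mu02, mu11 = mu(2, 0), mu(0, 2), mu(1, 1)
    theta = 0.5 * math.atan2(2 * mu11, mu20 - mu02)
    root = math.sqrt((mu20 - mu02) ** 2 + 4 * mu11 ** 2)
    long_axis = math.sqrt(2 * (mu20 + mu02 + root) / m00)
    # Clamp at zero to guard against tiny negative values from rounding.
    short_axis = math.sqrt(max(0.0, 2 * (mu20 + mu02 - root) / m00))
    gamma2 = long_axis / short_axis if short_axis else math.inf
    return theta, long_axis, short_axis, gamma2
```

For a 2x6 block of ones, $\mu_{20}=35$, $\mu_{02}=3$, $\mu_{11}=0$, and $\mu_{00}=12$, so $x=\sqrt{35/3}$, $y=1$, and the orientation angle is zero.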

2.3.3 Recognition

The most common techniques used for classification and recognition in SLR are SVM, the Hidden Markov Model (HMM), neural networks, and Euclidean distance matching. Because SVM offers simple calculation and rapid operation, we employed it in this study to classify and identify sign language images.

3. Validation experiments

3.1 Dataset and threshold settings

The SLR image dataset was collected under natural light conditions according to the English sign language standard library for all 26 letters. A camera was placed directly in front of the signer’s face for image acquisition. Five signers were used, all of whom were wearing skin-colored clothing. The dataset consisted of 6240 images that included the face, neck, and other skin regions (Fig. 10). All signers used their right hand to make gestures.

The settings of the two key thresholds are described here. The first threshold, $\alpha$ , was used to determine the amount of bending of the hand-arm region based on variable $t d$ . The values of $t d$ were calculated for all hand-arm shapes; for example, the $t d$ values for the hand plus slightly bent arm images shown in Fig. 10a–d were 39, 56, 58, and 63, respectively, whereas the values of $t d$ for all images corresponding to other hand-arm shapes were equal to or greater than 64 (Fig. 10e–j). Thus, the threshold $\alpha$ was set to 64 in the experiments. The second threshold, $\beta$ , was used to distinguish between the hand plus highly bent arm and hand plus long arm shapes. The variable maxDistance for the hand plus highly bent arm case was consistently less than 60 whereas that for the hand plus long arm case was consistently greater than 60. Therefore, the threshold $\beta$ was set to 60 in the experiments.
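With these two thresholds, the choice of segmentation branch reduces to a two-level decision, sketched below. The constant and function names are hypothetical; the values 64 and 60 and the comparison directions are those reported above for this dataset and would likely need retuning for other image sizes or camera distances.

```python
ALPHA = 64  # threshold on the transverse distance td
BETA = 60   # threshold on maxDistance


def classify_arm_shape(td, max_distance=None):
    """Return the hand-arm shape label used to pick a segmentation branch."""
    if td < ALPHA:
        return "slightly bent"
    if max_distance is None:
        raise ValueError("max_distance is required when td >= ALPHA")
    return "highly bent" if max_distance < BETA else "long arm"
```

For the td values quoted for the slightly bent examples (39, 56, 58, and 63) the function returns "slightly bent" without needing maxDistance, which only has to be computed for the remaining shapes.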

Figure 10.

Images of (a)–(d) a hand plus slightly bent arm, (e)–(g) a hand plus highly bent arm, and (h)–(j) a hand plus long arm.

3.2 Feature extraction

In order to verify the validity of our extracted features, we calculated and compared the recognition rate of single letters and the average recognition rate for the entire dataset, derived using Hu, Hu and $\gamma_{1}$ , Hu and $\gamma_{2}$ , and our extracted features (Fig. 11). The recognition rate obtained using our extracted features is quantitatively superior to that obtained using other methods, indicating that the proposed feature extraction method is reliable and efficient.

Figure 11.

Comparison of the recognition rate for (a) the letter A and (b) the entire dataset obtained using Hu, Hu + $\gamma_{1}$ , Hu + $\gamma_{2}$ , and our extracted features.

Figure 12.

Change in recognition rates with an increasing number of images in the dataset.

Figure 13.

Comparison of recognition rates for the three classification and recognition methods.

3.3 Experimental results

To verify the feasibility and validity of the proposed method, two experiments were conducted. The first experiment was the feasibility test. The sign language recognition rates were calculated for different numbers of images in the dataset (Fig. 12). For the lowest number of images (3900), the recognition rate was only 82.14%, which increased gradually as more images were included in the dataset to 93.94% with 6240 images. The recognition rate did not improve notably with more than 6240 images. Therefore, 6240 images were used in the experiments.

The second experiment was a comparison with other common classification and recognition methods; i.e., the neural network and Euclidean distance matching. All methods were used to classify and identify 26 English sign language letters. First, different ratios between the training set and testing set were analyzed: 3:1, 4:1, 5:1, and 6:1. Then, the recognition rates were obtained using the above methods (Fig. 13). The recognition rate obtained by the neural network did not improve significantly with increasing ratio whereas Euclidean distance matching was effective with only a few training samples but the recognition rate decreased with increasing sample size. Conversely, the recognition rate of the proposed method improved substantially with increasing ratio. Furthermore, the recognition rate obtained by our proposed method was clearly superior to those of the other methods, but only for a ratio between training and testing sets of 5:1. Prior to this ratio, no obvious improvement was observed in the recognition rate with increasing ratio. Therefore, the ratio between training and testing sets was set to 5:1 in the experiments and the recognition rate and computational time were calculated for the above three methods (Table 1).

Table 1
Comparisons of the recognition rate and computational time for the three classification and recognition methods

Method	Recognition rate (%)	Computational time (s)
Neural network	91.28	0.510
Euclidean distance matching	89.16	0.011
Proposed method	93.94	0.012

The recognition rate obtained using our proposed method was quantitatively superior to that obtained by the neural network. In addition, the computational time was significantly shorter using the proposed method. Compared with Euclidean distance matching, the computational time of the proposed method was slightly longer; however, the recognition rate was substantially higher. These results indicate that our proposed hand-arm classification and recognition method is both feasible and effective.

4. Conclusions

In order to solve the problems of existing methods related to background interference and arm redundancy, a static sign language recognition system was proposed for different bent arm shapes based on image segmentation and arm removal. The system consists of hand-arm acquisition, hand-arm segmentation, contour detection, hand feature extraction, and recognition. The proposed segmentation algorithm can be adapted to different hand-arm shapes and guarantees the integrity and accuracy of the extracted hand region. The results of the proposed method were compared with existing classification and recognition methods according to the recognition accuracy and computational time. The experimental results proved the efficacy of the proposed method. However, the long computational time of the proposed method should be improved through future research.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (51405448).

Qiuhong Tian acknowledges financial support from the doctoral research start-up funding of Zhejiang Sci-Tech University (18032117-Y).

Qiaoli Zhuang acknowledges financial support from the doctoral research start-up funding of Zhejiang Sci-Tech University (19032141-Y).

This work was also supported by the Zhejiang Sci-Tech University 2019 National University Students Innovation and Entrepreneurship Training Program (201910338012).

Conflict of interest

There are no conflicts of interest to declare.

References

1. Tie, Zheng. Design and implementation of static sign language-spoken translation software based on motion perception and 3D virtual simulation technology. Software Guide. 2016; 15(7): 67-69.

2. Trigueiros, Ribeiro, Reis. Generic system for human-computer gesture interaction: Applications on sign language recognition and robotic soccer refereeing. Journal of Intelligent and Robotic Systems. 2015; 80(3): 573-594.

3. Quan, Min. Application of sign language recognition and synthesis technology in intelligent buildings. Microcomputer Information. 2007; 23(24): 219-221.

4. Mohandes, Aburaiky, Halawani, Albaiyat. Automation of the Arabic sign language recognition. Proceedings International Conference on IEEE. 2004; 479-480.

5. Sun, Jafari. A wearable system for recognizing American sign language in real-time using IMU and surface EMG sensors. IEEE Journal of Biomedical and Health Informatics. 2016; 20(5): 1281-1290.

6. Mohandes. Recognition of two-handed Arabic signs using the CyberGlove. Arabian Journal for Science and Engineering. 2013; 38(3): 669-677.

7. Mohandes, Deriche, Johar, Ilyas. A signer-independent Arabic Sign Language recognition system using face detection, geometric features, and a Hidden Markov Model. Computers & Electrical Engineering. 2012; 38(2): 422-433.

8. Kosmidou, Hadjileontiadis. Using sample entropy for automated sign language recognition on sEMG and accelerometer data. Medical & Biological Engineering & Computing. 2010; 48(3): 255-267.

9. Hassan, Assaleh, Shanableh. User-dependent sign language recognition using motion detection. 2016; 852-856.

10. Lim, Tan AWC, Tan. A feature covariance matrix with serial particle filter for isolated sign language recognition. Expert Systems With Applications. 2016; 54: 208-218.

11. Murthy GRS, Jadon. A review of vision based hand gestures recognition. International Journal of Information Technology and Knowledge Management. 2009; 2(2): 405-410.

12. Zhang, Chen, Fang, Chen, Gao. A vision-based sign language recognition system using tied-mixture density HMM. 2004; 198-204.

13. Lee, Shih, Lin. Computer-vision based hand gesture recognition and its application in iPhone. 2013; 487-497.

14. Lim, Tan AWC, Tan. Block-based histogram of optical flow for isolated sign language recognition. Journal of Visual Communication and Image Representation. 2016; 40: 538-545.

15. Rautaray, Agrawal. A real time hand tracking system for interactive applications. International Journal of Computer Applications. 2011; 18(6): 28-33.

16. Zhang, Huang. Hand tracking algorithm based on superpixels feature. 2013; 629-634.

17. Feature learning based on SAE-PCA network for human gesture recognition in RGBD images. Neurocomputing. 2015; 151: 565-573.

18. Bilal SMOS, Akmeliawati, Salami MJE, Shafie. Dynamic approach for real-time skin detection. Journal of Real-time Image Processing. 2015; 10(2): 371-385.

19. Liu, Liao, Yang. Research on algorithm and model of hand gestures recognition based on HMM. 2016; 81-90.

20. Zare, Zahiri. Recognition of a real-time signer-independent static Farsi sign language based on Fourier coefficients amplitude. International Journal of Machine Learning and Cybernetics. 2018; 9(5): 727-741.

21. Gao. Research on gesture recognition technology in the human-computer interaction. MS Thesis, Xidian University. 2013.
