Abstract
The use of command and control (C2) servers in cyberattacks has risen considerably, and attackers frequently employ the domain generation algorithm (DGA) technique to conceal their C2 servers. Various machine learning models have been suggested for the binary classification of domain names as either benign or DGA-generated. The existing techniques are inefficient, have real-time detection issues, and are highly sensitive to the amount of data; therefore, they can be circumvented by attackers. The main problem this article addresses is how to automatically detect DGAs in a way that does not rely solely on reverse engineering, is not strongly affected by data size, and allows detection in real time. This paper presents the DTFS-DGA model, which combines neural network models with traditional machine learning models to detect DGAs in real time and maintains its performance even if the data size changes. The model uses 15 linguistic and network features, together with features extracted by long short-term memory and convolutional neural networks, to classify domain names using random forest and support vector machines. Comprehensive experimental findings confirm the suggested model’s accuracy. To be precise, the model achieves an average accuracy of
Introduction
Botnets are the main tools that allow cybercriminals to carry out their destructive operations. A botnet is a network of malware-infected machines that is centrally managed by a botmaster through one or more command and control (C2) servers. The C2 server is a computer used by an attacker to send commands to systems infected with malware and to receive stolen data from them. To communicate with its C2 server, the malware needs to know the address of that server. An easy way to provide it is to hard-code a list of IP addresses or domain names into the malware, but this method ties access to particular addresses, so the domain can be trivially blocked and quickly seized. Moreover, it makes the C2 server vulnerable to simple blacklisting of the IP address or domain. Recent analyses have shown that some botnets contain over a million bots, which highlights the size of the risk. Security researchers also attempt to interrupt active botnets through takedown attempts. The key target of these activities is the botnet’s command and control communication infrastructure. As a result, the malware is prevented from connecting to its initial C2 servers. As a reaction to these efforts, botmasters keep inventing new strategies to secure their malware networks, and they continue to pose a significant danger to individuals and companies by infecting systems in order to perform illegal malicious activities. To accomplish its objectives effectively, the malware must be able to reach its C2 server. Many malware families reach a C2 server hosted behind domains created by a domain generation algorithm (DGA). DGAs generate pseudo-random domain names based on a seed, which can be anything from alphanumeric characters to dates and times. Malware that uses a C2 server relies on such domain names because malware with fixed domain names is easily blocked.
For example, as shown in Fig. 1, consider an attacker who infects a host with malware and has to hide the C2 server. In this case the malware uses the DGA to produce tens of thousands of random domain names, including the C2 server address, which has already been registered by the attacker. The other domain names are used only for camouflage. The malware then connects the infected device to the C2 server to obtain further instructions.

Camouflage technique using DGAs.
One active DGA can generate up to a few hundred domains every day, so over time and with many DGAs it becomes impossible for a human analyst to pick them out from the thousands of benign domains occurring simultaneously. However, machine learning and deep learning algorithms can solve these classification problems automatically.
Identifying DGA domains in network traffic facilitates the identification of malware-infected devices. Researchers have recently started to apply machine learning methods to detect DGA domain names automatically, especially since the traditional methods used for DGA detection, such as blacklisting and reverse engineering, have become inefficient, error-prone and time-consuming due to the large number of domain names to be blocked and their variation over time. Artificial intelligence approaches using machine learning and deep learning algorithms can learn the fundamental structure and patterns of DGA domain names from a given dataset, and these structures and patterns can be used to discriminate DGA domain names from other domains.
A single active DGA can generate up to a few hundred domains per day, making it impossible for a human analyst to detect the DGA in real time. Thus, the main problem posed is how to automatically detect the DGA in a way that does not depend exclusively on reverse engineering and allows real-time detection.
The focus of this paper is the development of a binary classifier that can detect DGA domains with high accuracy. Traditional machine learning models and neural network models each have their own strengths and shortcomings, and these differ from one family of models to the other. It is therefore reasonable to look for a method in which the two complement each other and provide high accuracy. The major contribution of this article is a hybrid model that combines deep feature selection, specifically Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN), with traditional machine learning models, specifically Random Forest (RF) and Support Vector Machine (SVM), to detect DGAs.
The remainder of this article is structured as follows: Section 2 will briefly review previous work. Section 3 is a statement of the DGAs problem. Section 4 presents the methodological formulation and the architecture of the proposed model. Section 5 is an experimental evaluation. In the last section, conclusion and further suggestions are given.
The world of the internet is growing as the internet of things (IoT) [18] [36], the network of connected vehicles and mobile devices are expanding. At the same time, threat agents can threaten a rising number of possible targets on a regular basis, as present internet and end-user applications are exposed to vulnerabilities, intrusions, and malware. Many such systems are susceptible to attacks due to mismanagement issues, poor patching behaviors, and dangerous 0-day attacks. The issue of separating legitimate domain names from those created by algorithms is not new, and it has been studied for many years. With the release of [14] and [25] in 2009, DGAs became well-known in the community. DGAs have become increasingly common in malware since then. The early attempts to combat this threat with machine learning approaches were hampered by a lack of training data [3]. As a result, the first methods and techniques proposed were largely statistical, and machine learning and deep learning methods were used later, once sufficient data became available. Machine learning advances have resulted in models that efficiently identify and cluster traffic for network security purposes in general and for the detection of DGAs in particular.
Work in [17] combined 18 lexical features (such as N-gram statistics, Shannon entropy, length and the number of vowels per consonant, among others) with 17 network features (such as TTL statistics and the number of IP subnets) derived from a passive DNS system, using Random Forest (RF) to detect DGAs. This research uses supervised learning to train a classifier with traditional machine learning tools. To train the classifier, the researchers first create a training set of domain names labelled as malicious or benign. They then extract the set of 36 features of their classifier from each domain name. These features attempt to characterize domain names so that malware-related domain names are distinguished from legitimate ones.
Authors in [33] proposed a traditional machine learning model with statistical features. N-gram analysis and principal component analysis (PCA) are used to investigate features such as Shannon entropy criteria, known word rate in a dictionary, domain name, length, consonant, and vowel rate, as well as domain analysis strings. Different machine learning method classifiers, such as decision tree, support vector machine, random forest, and logistic regression, were used to assess the performance of the suggested model. The random forest method can be successfully used in DGA identification and provides the best detection accuracy, according to experimental data.
Work in [31] presents a method for identifying and classifying DGA domain names in which the extraction of features is based on resolved IP addresses. The approach also relies on two features commonly used in linguistics: significant character ratios and the standard N-gram. Both attributes provide a measure of meaning and pronounceability that helps to identify random strings. The authors evaluated their model, and it correctly distinguished DGA-generated from non-DGA-generated domains in 94.8 percent of the cases.
The authors in [32] rely on machine-learning-based classification of NXDs (i.e. domain names used in negative DNS responses) into DGA-related and benign NXDs. The classification features are drawn exclusively from the individual NXD to be classified. This work tests the system with malicious data created by 59 DGAs from the DGArchive, data from a big university’s campus network, and data from a significant company’s internal network. The authors demonstrate that the approach has good classification accuracy with a low false positive rate, is highly generalizable, and can discover previously unknown DGAs.
Work in [15] uses a machine learning framework that encompasses multiple feature extraction techniques to classify DGA domains from normal domains, cluster the DGA domains, and predict DGAs. Over the course of a year, this work collects real-time threat data from live traffic. A deep learning model is also proposed to classify a large number of DGA domains. The proposed machine learning framework includes a two-level model and a prediction model. In the two-level approach, the authors first distinguish DGA domains from normal domains, and then use a clustering method to discover the algorithms that create these DGAs. In the prediction model, a time-series model based on the hidden Markov model (HMM) is built to predict incoming domain features. This work achieves a classification accuracy of 95.89 percent and a DNN model accuracy of 97.79 percent; the accuracy of second-level clustering is 92.45 percent, while the accuracy of HMM prediction is 95.21 percent.
The authors in [37] propose an implementation of a Long Short-Term Memory (LSTM) network for nonspecific DGA analysis. This method learns features automatically, thus offering the potential to bypass the human effort of feature engineering. Their experiments show that their deep learning approach outperforms a character-level HMM and a random forest model that use features such as the character distribution and entropy. Their analysis and implementation succeeded in identifying the majority of DGA families.
The work in [38] suggested a deep learning-based DGA domain name detection method. The classification model was built using the CNN and LSTM algorithms with a large amount of real data collected from real DNS traffic, which is more representative than small fictitious data sets. This work also used deep learning to leverage the benefits of automatic feature extraction and the potential for online learning to keep up with changes in DGA domain patterns. For comparison, this work used traditional methods and found that the deep networks are superior DGA detectors.
The authors in [30] implemented a convolutional neural network model that takes common short strings as its input and is trained to determine whether they indicate anomalous activity. The basic distinction between this model and a conventional CNN is that its convolutional layers are arranged in parallel rather than stacked, and that pooling takes place over the whole domain name rather than inside a specific pooling window.
The authors of [2] analyzed the real-world applicability of DGA classifiers and proposed two DGA classifiers based on residual neural networks (ResNets), one for binary and another for multiclass classification. These robust binary and multiclass classifiers can distinguish benign domains from domains generated by DGAs and are also capable of identifying previously unknown DGAs.
The previous models were claimed to have a high level of efficiency, but the findings presented are not without flaws. On the one hand, human-engineered features obtained through reverse engineering [5] and their use with traditional machine learning methods (Decision Tree, Random Forest, SVM, etc.) [17,33] remain insufficient, so they cannot be relied on alone to detect DGAs, especially since new and difficult categories of DGA appear day after day. On the other hand, deep neural networks (LSTM, CNN, etc.) [30,38] are known to be hungry for data: any lack of data leads to a lack of training, which inevitably affects the results. Each model has its own flaws, which differ from those of the others, and combining several models can increase the detection accuracy. This article proposes a model that combines different models to compensate for these flaws and achieve better results in detecting DGAs.
Problem statement
Most malware families have implemented DGAs to avoid the detection and removal of their servers over the internet, and the use of DGAs is an effective technique that has gained broad popularity in recent years. A DGA is employed to dynamically generate a huge number of seemingly random domain names, using unique parameters such as the year, the time, or other input as seeds for random initialization, and then to select a small subset of these domains for command and control communication.
DGAs are often linked to malicious network activities. The detection of a DGA poses a significant problem at the following levels.
Domain generation algorithms offer a process for making massive groups of pseudo-random domain names. Therefore, analyzing and understanding the behavior of a malware sample manually, only through reverse engineering, is a very error-prone and time-consuming task. As the filter rules are extended from multiple input sources, the DGA blacklist continues to grow. The sequences of a DGA, on the other hand, may not be recognized quickly from these input sources, which means that malicious domain names are detected retrospectively, after they have performed malicious actions, and not in real time. Given the massive number of DGAs, relying on reverse engineering to extract features has become an error-prone and time-consuming task, and using these features with traditional machine learning models does not yield good performance, as shown in Fig. 2 and explained in [29].
An illustration of the performance comparison between deep learning (DL) and other machine learning (ML) algorithms, where DL modeling from large amounts of data can increase the performance [29].
Models that rely only on deep neural networks (LSTM, CNN, etc.) need very large amounts of data and are known to be data-hungry, as shown in Fig. 2, i.e. a lack of data leads to a lack of training, which inevitably affects the results. Thus, the main problem is how to automatically detect DGAs in a way that is not solely based on reverse engineering, is not greatly affected by a lack of data, and enables detection in real time. This work aims to counteract algorithmically generated domains in an automatic way that enables real-time DGA detection using a combination of deep feature selection and traditional machine learning models. Specifically, three main components make up the proposed model. The first is data preprocessing, followed by the feature extractor and finally the classifier. Each of these components is introduced in depth in the next section.

This section presents the classification model designed to detect DGAs. As shown in Fig. 3, the model consists of three main components: data preprocessing, the feature extractor, and the classifier.

Architecture of the hybrid model.
In the remainder of this article, the term hybrid model (H.MODEL) designates any model with the architecture in Fig. 3 that uses at least one linguistic feature and one network feature and extracts features using LSTM and CNN. The term Deep and Traditional Feature Selection Model to Detect DGA (DTFS-DGA) designates the proposed hybrid model with all the features referred to in Section 4.2.
Real-world data is frequently in an unsuitable format that cannot be utilized directly for machine learning models. Data preprocessing is a necessary operation to prepare data for a machine learning model, which improves the precision and performance of the model. It involves the below steps:
Getting the dataset
Finding missing data
Encoding data
Splitting the dataset into training data and test data
Feature scaling
Getting the dataset:
There has been noteworthy research in the last decade that has provided significant assistance to the cybersecurity community by studying creative strategies for dealing with network threats. In our experiments, we train and evaluate the proposed model on data sets of 10000, 20000, 50000 and 100000 domain names drawn from the UMUDGA data set [41], with a similar number of DGA domains and legitimate domains. UMUDGA is a full-fledged, machine-learning-ready labelled dataset containing over 30 million DGA domains classified into 50 malware variant classes. This mature dataset attempts to bridge the gap in datasets used to train and evaluate machine learning models for DGA detection.
Finding missing data
The next stage in data preparation is to deal with missing data in the dataset. Missing data may pose a significant challenge to the machine learning model, so handling missing values is required. The UMUDGA data used in this article was analyzed for missing data before and after feature extraction, and no missing data was found.
Encoding data
The data might comprise a variety of different forms of data, and it must be encoded before it can be used in Machine Learning models.
Encoding categorical data: Because machine learning models are entirely based on mathematics and numbers, including a categorical variable in the dataset may cause problems while creating the model. As a result, categorical variables must be encoded as integers. We convert the domain type into 0 if the domain is legitimate and 1 if it is a DGA domain.
Encoding and decoding domain names: Encoding is the process of transforming a sequence of characters into a specific format for transmission or storage in computers. Decoding is the process of transforming an encoded format back into its original character sequence. The encode and decode approach is used in contemporary NLP techniques to turn words or characters into vector representations, and character-level domain name encoding is used here in a similar manner. The domain name is split into characters, and a dictionary is created by assigning a unique id to each character. Fig. 4 presents the algorithm for encoding and decoding domain names. After loading the data, the algorithm first maps over the domains to find the unique characters they contain, then joins all unique characters without duplicates and assigns an integer to each character. Thus, each character is encoded as a number, and domains can be easily encoded and decoded using simple functions.

Algorithm to encode and decode domains.
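The steps described for Fig. 4 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and starting the ids at 1 (so that 0 stays free for padding) is an assumption that the text does not state.

```python
def build_vocab(domains):
    """Map every unique character found in the domains to an integer id."""
    chars = sorted(set("".join(domains)))
    # Assumption: ids start at 1 so that 0 can be reserved for padding.
    char_to_id = {ch: i + 1 for i, ch in enumerate(chars)}
    id_to_char = {i: ch for ch, i in char_to_id.items()}
    return char_to_id, id_to_char


def encode(domain, char_to_id):
    """Turn a domain name into a list of integer ids."""
    return [char_to_id[ch] for ch in domain]


def decode(ids, id_to_char):
    """Turn a list of integer ids back into the original domain name."""
    return "".join(id_to_char[i] for i in ids)


domains = ["google.com", "xkqzvbtw.net"]  # illustrative examples only
char_to_id, id_to_char = build_vocab(domains)
encoded = encode("google.com", char_to_id)
restored = decode(encoded, id_to_char)
```

Encoding followed by decoding returns the original string, which is the property the text relies on when it says domains can be encoded and decoded with simple functions.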
Several previous studies have examined strategies for splitting data, such as [22,23], which concluded that splitting the data into 70% for training and 30% for testing gave high prediction accuracy. Based on that, the data set used with the DTFS-DGA model was divided into a training set and a test set. The data set is split as shown in Fig. 5: 70% of the data is always taken for training and 30% for testing, on data sets of 10000, 20000, 50000 and 100000 domain names, with no sample appearing in both the training and test sets. This is an important step in data preparation since it allows us to improve the performance of the DTFS-DGA model.

Splitting the dataset into the training set and test set.
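A 70/30 split with no sample shared between the two sets can be sketched as follows. The helper name, the fixed seed and the synthetic samples are illustrative assumptions; the paper does not say which tool or random seed was used.

```python
import random


def split_dataset(samples, train_fraction=0.7, seed=42):
    """Shuffle the samples and split them into disjoint train and test sets."""
    unique = list(dict.fromkeys(samples))  # drop duplicates, keep order
    rng = random.Random(seed)              # fixed seed: an assumption, for repeatability
    rng.shuffle(unique)
    cut = int(round(len(unique) * train_fraction))
    return unique[:cut], unique[cut:]


# Synthetic (domain, label) pairs standing in for the real data set.
samples = [(f"domain{i}.com", i % 2) for i in range(1000)]
train, test = split_dataset(samples)
```

Removing duplicates before the split is what guarantees that no domain appears in both the training and the test set.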
Feature scaling is a technique for keeping a dataset’s independent variables within a given range of values. In other words, feature scaling restricts the range of the features so that they can be compared on an equal footing. In machine learning models, there are two common approaches to scaling features. The first is standardization, given by Eq. (1), and the second is normalization, given by Eq. (2). Normalization is the approach used to scale the features in the Experimental Evaluation section, after they are computed in the next section.
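Eq. (1) and Eq. (2) are not reproduced in this text, so the sketch below assumes the usual definitions: standardization as the z-score (x − μ)/σ and normalization as min-max scaling (x − min)/(max − min). The function names are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score scaling: zero mean and unit variance (assumed form of Eq. (1))."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - mu) / sigma for v in values]


def normalize(values):
    """Min-max scaling into [0, 1] (assumed form of Eq. (2))."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant feature: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


lengths = [6, 10, 14, 22]  # e.g. domain name lengths
scaled = normalize(lengths)
z_scores = standardize(lengths)
```

With min-max scaling every feature ends up in the same [0, 1] range, which matters for the SVM in particular, since kernel values depend on distances between feature vectors.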
Component 2: Feature extractor
How to find powerful features is the central question of classification and pattern recognition. Humans have an incredible skill in extracting meaningful features, but that is not possible when the dataset gets complicated. Deep learning feature extractors, on the other hand, are very good at extracting useful features, but they are ineffective when there is not enough data. A feature extractor was used to extract the features from each domain name of the dataset, either by using deep learning, specifically LSTM and CNN, or by using statistical features, which are of two types: linguistic features and network features.
Linguistic features:
Each domain name is considered as a string. The following 9 linguistic features were extracted:
Length
Shannon entropy (F2): entropy is calculated using the following formula
Significant word ratio
Number of vowels
Number of consonants
The maximum consonant sequence
The maximum vowel sequence
Number of digits
The maximum digit sequence
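Most of the linguistic features listed above can be computed directly from the domain string. The sketch below is illustrative only: it assumes the standard base-2 Shannon entropy H = −Σ p(c) log2 p(c) over character frequencies, treats "y" as a consonant, and omits the significant word ratio because it requires a word dictionary that the text does not specify. All names are hypothetical.

```python
import math
from collections import Counter

VOWELS = set("aeiou")


def is_vowel(ch):
    return ch in VOWELS


def is_consonant(ch):
    return ch.isalpha() and ch not in VOWELS


def longest_run(text, predicate):
    """Length of the longest run of consecutive characters satisfying predicate."""
    best = current = 0
    for ch in text:
        current = current + 1 if predicate(ch) else 0
        best = max(best, current)
    return best


def shannon_entropy(text):
    """Base-2 Shannon entropy of the character distribution (assumed formula)."""
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def linguistic_features(name):
    name = name.lower()
    return {
        "length": len(name),
        "entropy": shannon_entropy(name),
        "vowels": sum(is_vowel(ch) for ch in name),
        "consonants": sum(is_consonant(ch) for ch in name),
        "max_consonant_run": longest_run(name, is_consonant),
        "max_vowel_run": longest_run(name, is_vowel),
        "digits": sum(ch.isdigit() for ch in name),
        "max_digit_run": longest_run(name, str.isdigit),
    }


features = linguistic_features("abc123xyz")
```

Randomly generated names tend to show higher entropy and longer consonant or digit runs than human-chosen names, which is why these counts are useful inputs for the classifier.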
Network features
DGA domains often include less information than legitimate domains due to their short lifespan. For that reason, some network features of the domain name were looked up in WHOIS [16] and given the value 1 if the feature exists and 0 if it does not. WHOIS is a well-known system for obtaining information on over 280 million registered domain names on the Internet. In addition to the 9 linguistic features mentioned above, DTFS-DGA uses the following 6 network features:
Neural network features extractors
The neural network, also known as the Artificial Neural Network (ANN) [44], is a machine learning model inspired by the human brain. It is based on several basic computing components called neurons or nodes, each of which computes a simple function. The neurons are closely interlinked in a layered model. Usually, the neurons are arranged in such a way that the first layer is the input layer, the last one is the output layer, and the layers between them are the hidden layers. Every connection between nodes has a weight that is learned during the training process. The outputs of the neurons are computed layer by layer, starting from the input layer, until the output layer is reached. Every neuron in each layer (excluding the input layer) accepts the results obtained from the neurons in the preceding layer as input.
The following are the most important terms to understand the neural network.
Bias: In addition to the weights, a linear component known as the bias is applied to the input. It is added to the result of multiplying the input by the weights. The bias is mostly used to shift the range of the weighted input.
Activation function: A non-linear function applied to the input after the linear component. Sigmoid in Eq. (3), ReLU in Eq. (4) and tanh in Eq. (5) are the most often used activation functions.
LSTM neural network features extractor. Long Short-Term Memory networks (LSTMs) are a kind of RNN that can learn long-term dependencies. They were introduced by [12], and many other researchers have refined and popularized them in later works. RNNs have received a lot of attention lately thanks to their successful applications.
Fig. 6 shows a common LSTM unit, which consists of a cell, an input gate, an output gate and a forget gate. The three gates control the flow of information into and out of the cell, and the cell remembers values across arbitrary time periods. The cell state
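The roles of the weights, the bias and the activation function can be illustrated with a single neuron. Eqs. (3) to (5) are not reproduced in this text, so the sketch assumes the standard definitions of sigmoid, ReLU and tanh; the `neuron` helper and its inputs are hypothetical.

```python
import math


def sigmoid(x):
    """Standard logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Rectified linear unit: passes positive values, zeroes out negatives."""
    return max(0.0, x)


def tanh(x):
    """Hyperbolic tangent, mapping any real number into (-1, 1)."""
    return math.tanh(x)


def neuron(inputs, weights, bias, activation):
    """Weighted sum of the inputs plus the bias, followed by the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


# 0.5 * 1.0 + 0.5 * 2.0 - 0.5 = 1.0, which ReLU leaves unchanged.
output = neuron([1.0, 2.0], [0.5, 0.5], -0.5, relu)
```

The bias shifts the weighted sum before the non-linearity is applied, which is what the definition above means by altering the range of the weighted input.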

An LSTM model’s internal architecture.
Gates are a means of optionally letting information through. They are built from a sigmoid neural network layer σ and a pointwise multiplication operation ⊗. The sigmoid layer produces values ranging from zero to one, indicating how much of each component should be allowed to pass. A value of zero indicates that “nothing should be let through,” whereas a value of one indicates that “everything should be allowed through”. The LSTM is made up of state boxes that receive inputs over time. An input vector is fed into the LSTM at each time step, and the output is computed according to:
Since domain names can be thought of as sequences of characters, LSTMs are a natural type of module for extracting features automatically. The LSTM architecture used to extract features from the data consists of four layers, with tanh as the activation function. These are the four layers used:
Embedding layer: the first hidden layer of a network that operates on text data.
LSTM layer: consists of a cell, an input gate, an output gate and a forget gate. The three gates regulate the flow of data into and out of the cell, and the cell remembers values over long periods of time. The key to LSTMs is the cell state
Dropout layer: helps to reduce overfitting by randomly setting input units to 0, with a frequency given by the dropout rate, at each step during training.
Dense layer: a densely connected neural network layer, meaning that each neuron in the dense layer receives input from all neurons in the preceding layer. The dense layer is the most commonly used layer in neural network models.
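The equations themselves are not reproduced in this text. For reference, the widely used formulation of an LSTM step is given below, on the assumption that the paper follows it. Here x_t is the input vector, h_{t-1} the previous output, C_t the cell state, and the W and b terms are learned weights and biases; this notation is an assumption, not taken from the source.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f \, [h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i \, [h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
\tilde{C}_t &= \tanh\left(W_C \, [h_{t-1}, x_t] + b_C\right) && \text{(candidate cell state)} \\
C_t &= f_t \otimes C_{t-1} + i_t \otimes \tilde{C}_t && \text{(cell state update)} \\
o_t &= \sigma\left(W_o \, [h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
h_t &= o_t \otimes \tanh\left(C_t\right) && \text{(output)}
\end{aligned}
```

Each gate is a sigmoid layer whose output, between zero and one, is multiplied pointwise with the quantity it controls, which matches the description of gates given above.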
CNN neural network features extractor. CNN [6] is used to automatically select features from domain names. One of the advantages of a CNN classifier is that the list of layers that convert the input volume to the output volume is not complicated.
In a CNN, there are three main types of layers: convolutional, pooling and fully connected. Each of these layers has its own set of parameters that can be optimized, and each does something different with the input data. The convolutional layers are where filters are applied to the original picture or to other feature maps. The number of kernels and the size of the kernels are the most important parameters. The pooling layers are similar to convolutional layers, but they execute a specific function, such as max pooling, which takes the largest value in a certain filter region, or average pooling, which takes the average value in a particular filter region. These are commonly employed to lower the network’s dimensionality. Fully connected layers are placed before the classification output of a CNN and operate on the flattened results.
CNNs are commonly used to extract features from images, but it is also possible to encode a string as a sequence and apply a CNN to it, as indicated in Section 4.1.3. Only a few of the layers explained above are used to translate input to output:
Conv1D layer: the convolutional layer is where filters are applied to the input. This layer generates a tensor of outputs by convolving the layer input with the convolution kernel across a single spatial (or temporal) dimension.
MaxPooling1D layer: takes the maximum value over a window whose size is given by the pool size.
The Dropout layer, as explained before for the LSTM layers.
Flatten layer: reduces the input’s spatial dimensions to the channel dimension.
The Dense layer, as explained before for the LSTM layers.
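To make the Conv1D and MaxPooling1D operations concrete, the sketch below applies them to an integer-encoded sequence in plain Python. It is a simplified single-channel illustration with one hand-picked kernel, stride 1 and no padding, not the network used in the paper; in a real model the kernel values are learned and a deep learning library is used.

```python
def conv1d(sequence, kernel):
    """Slide the kernel over the sequence (stride 1, no padding)."""
    k = len(kernel)
    return [
        sum(sequence[i + j] * kernel[j] for j in range(k))
        for i in range(len(sequence) - k + 1)
    ]


def relu(values):
    """Element-wise ReLU activation."""
    return [max(0, v) for v in values]


def max_pool1d(sequence, pool_size):
    """Keep the maximum of each non-overlapping window of pool_size values."""
    return [
        max(sequence[i:i + pool_size])
        for i in range(0, len(sequence) - pool_size + 1, pool_size)
    ]


encoded = [3, 1, 4, 1, 5, 9, 2, 6]   # an integer-encoded domain (illustrative)
kernel = [1, 0, -1]                  # hypothetical filter; learned in practice
feature_map = relu(conv1d(encoded, kernel))
pooled = max_pool1d(feature_map, pool_size=2)
```

The convolution turns 8 inputs into 6 local responses, and pooling halves that to 3 values, which shows how the two layers together reduce dimensionality while keeping the strongest responses.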
The last component builds a hybrid classifier from Random Forest (RF) [4] and Support Vector Machine (SVM) [24], combined through a voting classifier [43], specifically weighted soft voting [1], with a weight of 1 for SVM and 1 for RF. The hybrid classifier is fitted on the training data. Fig. 7 shows the flowchart of the classifier used in this component, which starts with the input data, including all features extracted in component 2 and preprocessed in component 1, and ends with the training of the model on this data.

Flowchart of the component 3.
A random forest is a set of decision trees [26,28]. It can be thought of as an ensemble model. A random forest model casts a vote based on the predictions of all its inner decision trees. Decision trees build models that can classify a given sample by extracting a collection of decision rules from the feature sets of the samples in the training data. A node in the tree can be thought of as an if-then-else decision node. The conditional test is made over the possible set of features and their respective ranges of values. To classify a given sample, one starts by testing the feature value at the root node and proceeds down the tree, following the branch corresponding to the value of that feature. This process is then repeated for the sub-tree that starts at the new node until a leaf node is finally reached. The classification result is given by the class of the training samples that belong to the same leaf node. A decision tree can also be thought of as a disjunction of conjunctions over the tests on the feature values of the samples. Each path from the root to a leaf node represents a conjunction of these constraints, and the tree as a whole is the disjunction of such paths.
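The traversal and voting described above can be sketched with hand-written trees. The trees, feature names and thresholds below are invented for illustration; a real random forest learns them from the training data.

```python
from collections import Counter


def classify(tree, sample):
    """Walk from the root to a leaf. A leaf is an int label (0 legit, 1 DGA);
    an inner node is a tuple (feature, threshold, left_subtree, right_subtree)."""
    while not isinstance(tree, int):
        feature, threshold, left, right = tree
        tree = left if sample[feature] <= threshold else right
    return tree


def forest_predict(trees, sample):
    """Majority vote over the predictions of all trees."""
    votes = Counter(classify(tree, sample) for tree in trees)
    return votes.most_common(1)[0][0]


# Hypothetical trees with made-up thresholds.
trees = [
    ("entropy", 3.5, 0, 1),
    ("length", 12, 0, ("digits", 2, 0, 1)),
    ("digits", 5, 0, 1),
]

suspicious = {"entropy": 3.8, "length": 20, "digits": 4}
ordinary = {"entropy": 2.5, "length": 6, "digits": 0}
suspicious_label = forest_predict(trees, suspicious)
ordinary_label = forest_predict(trees, ordinary)
```

For the first sample two of the three trees vote for class 1, so the forest outputs 1 even though one tree disagrees, which is the ensemble effect the paragraph describes.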
Support vector machine classifier
A support vector machine (SVM) is a computer algorithm that learns to assign labels to objects by example. To grasp the meaning of SVM classification, one just has to understand four fundamental concepts:
The separating hyperplane: it is defined by Eq. (6).
The maximum-margin hyperplane: the goal of SVM is to find the separating hyperplane. However, many such hyperplanes exist, and to find an optimal one it is necessary to maximize the width of the margin, as shown in Eq. (7).
The soft margin: this concept is based on a simple premise: allow the SVM to make a certain number of mistakes while keeping the margin as large as possible, so that the other points are still correctly categorized. The SVM’s objective can be modified to accomplish this.
The kernel function: to perform a linear separation, the data must be transformed into a higher-dimensional feature space. After the data is mapped into the new space, the inner product of the new vectors is taken. The kernel function computes this inner product of the images of the data directly from the original data. The following kernel functions are widely used for classification: Eq. (8) is the linear kernel, and Eqs (9), (10) and (11) are respectively the radial basis function (RBF), polynomial and sigmoidal kernels, which are non-linear kernels used when the class boundaries are overlapping or non-linear.
The kernel function used with SVM in component 3 is the RBF kernel expressed in Eq. (9). This kernel was chosen on the basis of studies of the effect of different kernels on the performance of SVM-based classification, the most important of which is [13], which shows that an RBF kernel leads to better performance than the others.
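Eq. (9) is not reproduced in this text; the sketch below assumes the common form of the RBF kernel, K(x, y) = exp(−γ‖x − y‖²). The function name and the value of γ are illustrative assumptions.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel exp(-gamma * ||x - y||^2); gamma=0.5 is an arbitrary choice."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


same = rbf_kernel([1.0, 2.0], [1.0, 2.0])   # identical points
near = rbf_kernel([0.0, 0.0], [1.0, 1.0])   # squared distance 2
far = rbf_kernel([0.0, 0.0], [3.0, 4.0])    # squared distance 25
```

The kernel value is 1 for identical points and decays towards 0 as the points move apart, so it acts as a similarity measure whose reach is controlled by γ.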
Why RF-SVM classifier
Although the RF algorithm is considered a good classifier in most cases, it is not without flaws, especially for more complex classification tasks. Moreover, noise or minor modifications in the training data can have a significant impact on the RF classification [27]. Relying on it alone may therefore reduce the accuracy of DGA detection. SVM, on the other hand, has become a very popular algorithm; its benefit is that it can capture more complex relationships between data points without the need to perform transformations. However, like other machine learning algorithms, SVM also has drawbacks; in particular, it does not perform well on large data sets [39].
The main idea of this article is to combine several different algorithms so that each compensates for the defects of the others and DGAs are detected with high accuracy. To this end, the RF-SVM classifier is used to increase the detection accuracy: each classifier is given a weight of 1, and the class label with the highest weighted average probability is chosen. As a result, the model achieves higher detection accuracy than either algorithm used separately, as confirmed in Section 5.3, Evaluation of model performance.
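The voting rule described above, in which each classifier has weight 1 and the label with the highest weighted average probability wins, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name soft_vote and the probability values for the two classifiers are hypothetical.

```python
def soft_vote(probabilities, weights):
    """Average per-classifier class probabilities and pick the best label.

    probabilities: one dict {label: probability} per classifier.
    weights: one non-negative weight per classifier.
    """
    total_weight = sum(weights)
    averaged = {
        label: sum(w * p[label] for w, p in zip(weights, probabilities)) / total_weight
        for label in probabilities[0]
    }
    best_label = max(averaged, key=averaged.get)
    return best_label, averaged


# Hypothetical outputs of the two classifiers for a single domain name.
rf_proba = {"benign": 0.40, "dga": 0.60}
svm_proba = {"benign": 0.70, "dga": 0.30}

label, averaged = soft_vote([rf_proba, svm_proba], weights=[1, 1])
print(label, averaged)
```

In this made-up case the two classifiers disagree, and the more confident one decides the outcome. scikit-learn offers the same rule through VotingClassifier with voting="soft" and a weights argument; whether the authors used that class is not stated in the text.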
Experimental evaluation
This section presents the implementation environment of DTFS-DGA, describes the tests performed on the model using a practical DGA dataset, discusses the experimental results, and compares them with those of other models such as CNN, LSTM, RF and SVM.
Table 1 shows an example of the data after all linguistic and network features have been calculated, the LSTM and CNN features have been extracted, and all features have been scaled using the feature scaling technique described in Section 4.1.
Example of the dataset with all features scaled
Apache Spark [34] is a unified open-source analytics engine for large-scale data processing. Spark provides an interface for programming whole clusters with implicit data parallelism. On a cluster, Spark applications run as independent sets of processes. To operate on a cluster, Spark can connect to one of several cluster managers, which allocate resources and deliver tasks to the executors. To facilitate the connection between Spark and the computational resources, we use the Databricks platform [21], a data analytics platform that combines data engineering, machine learning and collaborative data science. A Databricks workspace is a software-as-a-service (SaaS) environment that gives access to all Databricks assets; it organizes objects into folders and offers access to data and to computing resources such as clusters and jobs. The computational resource used to implement DTFS-DGA is Amazon Elastic Compute Cloud (Amazon EC2), provided by Amazon Web Services (AWS) [42].
Implementation details
DTFS-DGA has been implemented using several Python and Spark libraries, such as pandas [19], numpy, keras [9], scikit-learn [35], enchant, PySpark [7] and MLlib [20], as well as the WHOIS module, which was essential for extracting the network features. In this implementation, data were converted from Spark data types to pandas data types and back whenever needed, because pandas data structures offer very useful properties and work more easily with Python libraries.
Evaluation of model performance
The evaluation metrics most widely used in this field are: True Positive Rate (TPR), False Positive Rate (FPR), True Negative Rate (TNR), False Negative Rate (FNR), Precision, Accuracy and F-score. Table 2 presents these metrics. All of them lie in the interval [0,1]; for some, higher values are better, and for others, lower values are better. Many works use these metrics to evaluate their models, including but not limited to [8,10,11,40].
The most commonly used evaluation metrics
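For reference, the metrics named above can be computed from the four confusion-matrix counts as in the following sketch. It uses the standard definitions and assumes that the F-score is the F1 score (the harmonic mean of precision and TPR); the counts in the example are made up for illustration and are not results from this article.

```python
def evaluation_metrics(tp, fp, tn, fn):
    """Compute the usual binary-classification metrics from raw counts.

    tp, fp, tn, fn: true positives, false positives, true negatives and
    false negatives, where "positive" means "classified as DGA".
    """
    tpr = tp / (tp + fn)          # true positive rate (recall)
    fpr = fp / (fp + tn)          # false positive rate
    tnr = tn / (tn + fp)          # true negative rate
    fnr = fn / (fn + tp)          # false negative rate
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_score = 2 * precision * tpr / (precision + tpr)  # assumed to be F1
    return {
        "TPR": tpr, "FPR": fpr, "TNR": tnr, "FNR": fnr,
        "Precision": precision, "Accuracy": accuracy, "F-score": f_score,
    }


# Hypothetical counts for 200 test domains (100 DGA, 100 benign).
metrics = evaluation_metrics(tp=90, fp=5, tn=95, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

TPR, TNR, Precision, Accuracy and F-score are better when higher, while FPR and FNR are better when lower; note also that TPR + FNR = 1 and TNR + FPR = 1.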
The detailed experimental results show the accuracy of DTFS-DGA. Table 3 shows the metrics of DTFS-DGA, which achieves an average accuracy of
Our model performance
It should be noted that DTFS-DGA performed well across the different sizes of data used in training and testing; it achieved a detection accuracy of
Fig. 8 shows the accuracy of the different models as a function of data size: the accuracy of CNN and LSTM, which do not need manually extracted features, and the accuracy of the RF and SVM models using the 9 linguistic features and the 6 network features, without the features extracted by CNN and LSTM. All models were trained and tested with data sizes of 10000, 20000, 50000 and 100000, split as described in Section 4.1.4. The results show that the proposed model, labeled DTFS-DGA in Fig. 8, maintains its performance when the data volume changes, unlike the other models, which are directly affected by data volume. Traditional machine learning models such as RF and SVM perform well when the data size is small, and their performance decreases as the data size grows, which agrees with the result shown in Fig. 2 and explained in [29]. Deep learning models behave in the opposite way: their performance is poor with a small data size and improves as the data size increases, which also agrees with the result shown in Fig. 2. Therefore, DTFS-DGA can be regarded as a solution to the problem of data hypersensitivity: it maintains its performance when the data volume changes, and it can also be relied on for real-time DGA detection, unlike the models in existing related work.

Comparison of performance between models depending on the data.
To measure the impact of the human-engineered features obtained through reverse engineering on the machine learning models of related work and on the H.MODEL presented in Section 4.1, a case study was conducted on five classes of features and their impact on model performance. Each of the five classes contains only three features. The aim is to identify which model is negatively affected when reverse engineering lags behind and features cannot be extracted manually, a situation that has become very likely. The five classes used are as follows:
Class No. 1: Contains the features
Class No. 2: Contains the features
Class No. 3: Contains the features
Class No. 4: Contains the features
Class No. 5: Contains the features
Fig. 9 shows the results of the case studies conducted on the five classes of features, examining the impact of these classes on model performance with a data size of 10000. Reducing the number of features lowered the classification success of the approaches from other studies that use traditional machine learning models to detect DGAs, although these models performed well with this volume of data before the number of human-extracted features was reduced. The H.MODEL, on the other hand, achieved superior classification success with the same data size.

Case studies on the classes of features and their impact on model performance using a data size of 10000.
Classical machine learning approaches that classify DGA domains from the domain name string focus on retrieving predefined, human-engineered features obtained through reverse engineering. When such features are used, an adversary can deliberately design a DGA to evade them. In addition, reverse engineering makes the development of machine learning systems labor-intensive and time-consuming. In other words, using linguistic and network features to detect DGAs has a technical downside: the malware author can bypass them easily, while creating a new set of features is extremely difficult. Deep neural networks, on the other hand, can solve this challenge. However, because of the large number of parameters to be determined, deep neural networks are known to be data-hungry: a large number of training examples is needed for learning, and any deficiency or defect in the data negatively affects the model. The DTFS-DGA model proposed in this article avoids the deficiencies of both approaches. As shown in Fig. 8 and Fig. 9, it is not significantly affected by a lack of data or by a lack of reverse-engineered features, because it takes advantage of the strength of traditional models, which work well when the data size is small, and of the strength of deep neural networks as the amount of data increases. It also improves as the volume of data grows and as reverse engineering evolves.
Conclusion and future works
In this study, a combined approach was proposed that significantly improves detection performance. Many studies in the area of DGA domain identification have mostly employed neural network models, which have the drawback of learning poorly from limited data. Many other works have mainly used traditional machine learning, which has the disadvantage of depending on manual feature extraction. To address these problems, DTFS-DGA combines different models to compensate for these flaws and achieves better results in detecting DGAs. The experimental evaluations showed that the model achieved an improved average accuracy of 99.8% for the classification. DTFS-DGA is therefore an effective way to classify DGAs.
There are several possibilities for improving the performance of DTFS-DGA. It uses several linguistic features, such as the ratio of meaningful words from an English dictionary; these meaningful-word ratios could also be computed from dictionaries of many other languages. Neural networks, moreover, have a remarkable ability to extract useful features provided that they are given enough data, so collecting more data could improve the performance of our model. In component 3, the SVM-RF voting classifier was used with equal weights; to improve the classifier and compare the effects of its components, optimal weights could be used instead.
