Building an Efficient System to Detect Computer Worms in Websites Based on Ensemble Ada Boosting and SVM Classifiers Algorithms

Highlights
• Combining two feature selection methods strengthens the NIDS
• Ensemble learning increases the performance of the NIDS
• Bagging and boosting with SVM are considerably more powerful than with DT
• Worm detection is strongest when the NIDS uses two levels

Abstract
Computer worms perform harmful tasks in network systems, and their rapid spread has serious consequences for system security. However, existing worm detection algorithms still struggle to achieve good performance, for three reasons. First, a large amount of irrelevant data impacts classification accuracy (an irrelevant feature gives the estimator new ways to go wrong without any expected benefit and can also cause overfitting, which generally decreases accuracy). Second, the individual classifiers used extensively in such systems do not effectively detect all types of worms. Third, many systems are built on old datasets, making them less suitable for new types of worms. This research aims to detect computer worms in the network using data mining algorithms, because of their high ability to detect new types of computer worms automatically and accurately. The proposal uses misuse and anomaly detection techniques based on the UNSW_NB15 dataset to train and test the ensemble Ada Boosting algorithm using SVM and DT classifiers. To select the most important features, we propose using the features that Correlation and Chi-Square feature selection both select (since correlation finds the relations between features and classes, whereas Chi-Square finds whether features and classes are independent or not). The contribution is to use SVM instead of DT as the base estimator in the boosting ensemble algorithm in order to detect various types of worms efficiently. The system reached an accuracy of 100% with CFS+Chi2fs, and 99.38% and 99.89% with correlation and chi-square separately.

Introduction
The Internet has become an essential part of modern life because people use it for education, communication, work, entertainment, and data storage. The expansion of Internet use has led to several problems in which attackers exploit a weakness to carry out harmful actions, such as using computer worms to shut down a network or steal data. Computer worms are small, self-contained programs that do not require the assistance of other programs [1]. They are designed to carry out destructive activities, steal data from users while they are surfing the Internet, or harm the users or their contacts. Due to their superior ability to colonize, replicate, and elude detection, they spread quickly and are difficult to eradicate. Worms spread more widely and faster than viruses because they infect machines linked to the network automatically, without human intervention [2]. The danger of worms is that they are independent: they do not depend on other software to carry them, so they spread rapidly. Several approaches have been used to detect computer worms and limit their damage, such as firewalls, encryption, and machine learning techniques [3].
An IDS is software that classifies activity as normal or malicious. It is one of the most reliable systems for detecting penetrations and attacks [4]. However, an IDS generates many false alarms. This problem has encouraged many researchers to find ways to separate out alerts for less important incidents and to reduce false alarms, namely false positives (FP) and false negatives (FN). Data mining techniques can enhance an IDS in real time: they remove normal activity from the alarm data so that the focus is on real attacks, and they find abnormal activity that uncovers a real attack. Data mining is a computational framework for finding patterns in data sets that uses approaches from artificial intelligence, machine learning, and database systems. Data mining applications may use different parameters to analyze various data sets [5,6]. Network Intrusion Detection Systems (NIDS) have become the most important component of recent network infrastructure due to today's increased security threats. An Intrusion Detection System (IDS) generates a large number of alarms, so algorithmic procedures are deployed to reduce false positives [7, 8, 9].
Ensemble learning is a machine learning technique that trains a group of weak learners (models) to solve a problem and then combines their results to produce better results. The basic idea is that combining weak models in the right way yields more accurate and/or more robust models. Ensemble approaches are divided into three categories. Bagging combines homogeneous weak learners, trains and tests them in parallel, and then combines them using voting, averaging, or other methods. Boosting brings together homogeneous weak learners and trains and tests them sequentially (each iteration depends on the previous ones). Stacking is an ensemble method in which a new model learns how to combine the predictions of numerous existing models in the most effective way possible [10].
This article aims to detect computer worms in the network using data mining algorithms, because of their high ability to detect new types of computer worms automatically and accurately. The remainder of the paper is laid out as follows. In Section 2, we discuss related work on worm detection. In Section 3, we introduce the theoretical background of the model. In Section 4, we introduce the worm detection system architecture based on ensemble Ada boosting. Preprocessing is covered in Section 4, and Section 4.II introduces training and testing the model and building the classifiers (DT and SVM). Our comprehensive experiments assessing the proposed worm detection system are discussed in Section 5. Section 6 concludes the paper.

Related work
Worm detection, as an important tool for ensuring the security of cyber systems, regularly draws the research community's attention. While several solutions to improve worm detection efficiency have been suggested, in this section we only consider work that falls under the ML-based IDS umbrella, uses dimensionality reduction, ensemble classification, or other data mining techniques, and focuses on hybrid approaches.
Gautam and Doegar [11] suggested an intrusion detection approach based on ensemble methods. They used three algorithms: Naive Bayes, Adaptive Boost, and PART. They also used information gain to remove redundant features, and they combined the results of the three classifiers by averaging or majority voting.
Yuyang Zhou et al. [12] suggested an IDS based on ensemble classification that uses the Forest by Penalizing Attributes (Forest-PA), C4.5 decision tree, and random forest algorithms to train and test on three datasets: NSL-KDD, AWID, and CIC-IDS2017. Furthermore, the proposed model uses CFS-BA, which combines correlation and the Bat algorithm, to remove irrelevant features. Finally, a voting technique was used to combine the probability distributions of the base learners.
Jing and Chen [13] suggested an intrusion detection system (IDS) that uses a support vector machine (SVM) classifier with the UNSW-NB15 dataset. The authors did not use any feature selection method. Instead of the min-max normalization algorithm, they used nonlinear scaling in the preprocessing stage, which they report gives better results on the UNSW-NB15 dataset than min-max normalization.
Thanh and Lang [14] suggested a fuzzers detection system using the UNSW-NB15 dataset. The system uses ensemble methods such as Bagging, Ada Boost, Stacking, Decorate, and Random Forest for fuzzers detection. The Ada Boost decision tree method has the highest classification quality, with an F-measure of 96.76 percent.
Pelin Yildirim Taser [15] presents bagging and boosting methods based on six decision tree-based (DTB) classifiers, used to predict diabetes from experimental data. Individual implementation, boosting, and bagging of the DTB classifiers are compared in terms of accuracy; the experimental results show that AdaBoost with Naive Bayes Tree (NBTree) has the best accuracy score of 98.65 percent.
Shigeyuki et al. [16] examined loan defaults in a Taiwanese database, compared three learning models (boosting, bagging, and random forest) to eight neural network techniques with different activation functions, and computed the prediction accuracy of each one. The results show that boosting has the best classification power of the three learning models. The number of middle layers in the neural networks and the activation function used affect their performance. On the original data, the maximum accuracy ratio is 71.01 percent for the training set and 69.59 percent for the testing set. On normalized data, the maximum accuracy ratio is 71.14 percent for the training set and 68.75 percent for the testing set. We discuss this literature further in the Experimental Work and Results section.

Theoretical background
In this section, we explain the theory behind the algorithms used in this research.

Feature selection
Feature selection is one of the most important preprocessing steps in data mining. It is used to eliminate unnecessary and redundant features from the dataset, improve the model's performance by using the correct features, and minimize the time it takes to process the data. We used correlation feature selection and chi2 feature selection in this study.

Correlation Feature Selection
CFS (Correlation-based Feature Selection) uses a heuristic evaluation function based on correlations to rank attributes. The function evaluates subsets of attribute vectors that are correlated with the class label but not with each other. The CFS algorithm assumes that irrelevant features have a low correlation with the class and should thus be ignored. Redundant features, on the other hand, should be screened out because they are usually strongly correlated with one or more of the other attributes. The criterion for evaluating a subset of n features is:
$$M_S = \frac{n\,\overline{r_{cf}}}{\sqrt{n + n(n-1)\,\overline{r_{ff}}}}$$
where $M_S$ denotes the assessment of a subset S containing n features, $\overline{r_{cf}}$ denotes the average correlation between the features and the class label, and $\overline{r_{ff}}$ denotes the average correlation between pairs of features [17].
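As a small illustration of the subset criterion, the sketch below evaluates the merit of two hypothetical four-feature subsets. It assumes the usual CFS merit form (n times the average feature-class correlation, divided by the square root of n + n(n-1) times the average feature-feature correlation); the function name and the correlation values are illustrative and are not taken from the experiments.

```python
import math


def cfs_merit(n, avg_feature_class_corr, avg_feature_feature_corr):
    """Merit of a subset of n features under the assumed CFS heuristic."""
    numerator = n * avg_feature_class_corr
    denominator = math.sqrt(n + n * (n - 1) * avg_feature_feature_corr)
    return numerator / denominator


# Two hypothetical subsets with the same relevance to the class (0.6)
# but different redundancy among their features.
merit_low_redundancy = cfs_merit(4, 0.6, 0.1)
merit_high_redundancy = cfs_merit(4, 0.6, 0.8)

print(round(merit_low_redundancy, 4), round(merit_high_redundancy, 4))
```

With equal relevance, the subset whose features are less correlated with each other receives the higher merit, which is the behavior the heuristic is meant to reward.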

Chi-Square Feature Selection
The Chi2 test is a statistical test that determines the dependency between a class and a feature, which allows it to identify the more pertinent features for a specific dataset. As a result, we can remove features from the feature space that are not useful for classification [18]. For example, from the data of two features we obtain the observed count A and the expected count E. The Chi-Square test measures how far the expected count E and the observed count A deviate from each other:
$$\chi_C^2 = \sum_i \frac{(A_i - E_i)^2}{E_i}$$
where C is the degrees of freedom, A is the observed value(s), and E is the expected value(s). After calculating $\chi_C^2$, we compare it to the chi2 table value at alpha = 0.05 and drop the feature if $\chi_C^2$ is less than the table value (independent); otherwise, the feature is accepted.
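The test can be sketched for a single discretized feature as follows. This is a minimal sketch that assumes the counts are arranged in a contingency table (rows are feature values, columns are classes) with no zero expected counts; the table values are hypothetical. The critical value 3.841 is the chi2 table entry for one degree of freedom at alpha = 0.05.

```python
def chi_square(observed):
    """Chi-square statistic of a contingency table given as rows of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, a in enumerate(row):
            # Expected count E under independence of feature and class.
            e = row_totals[i] * col_totals[j] / total
            statistic += (a - e) ** 2 / e
    return statistic


# Chi2 table value for 1 degree of freedom at alpha = 0.05.
CRITICAL_VALUE = 3.841

# Hypothetical counts: rows = feature value (low, high), columns = (normal, attack).
table = [[30, 10], [10, 30]]
statistic = chi_square(table)
feature_is_kept = statistic >= CRITICAL_VALUE
print(statistic, feature_is_kept)
```

A table whose rows have identical proportions gives a statistic of zero, so such a feature would be treated as independent of the class and dropped.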

ADA-boost algorithm
The ADA-boost algorithm is a machine learning algorithm that starts by giving all instances in the training dataset equal weight. The learning algorithm then creates a classifier for this data by building as many stumps (nodes with two leaves) as there are features. Gini or Entropy is calculated for all of these trees, and only the stump with the lowest value is selected. The total error (TE) and the performance of the selected stump are then calculated. Based on the performance of the stump, the weights of all instances in the dataset are updated: the weight of misclassified records is increased and the weight of correctly classified instances is decreased. A new dataset is then created based on the normalized weights, and the algorithm creates a new stump from this new dataset. It repeats the same process until it has passed sequentially through all trees and finds less error than the normalized weight we had in the initial stage [19]. We suggest modifying the Ada-Boost algorithm to use the SVM classifier instead of stumps; all steps after creating the stumps remain the same.
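One round of the weight update described above can be sketched as follows. The sketch assumes the commonly used performance formula, 0.5 * ln((1 - TE) / TE), which the text does not state explicitly; the function name and the four-record example are hypothetical, and the base classifier (stump or SVM) is represented only by its list of misclassification flags.

```python
import math


def adaboost_round(weights, misclassified):
    """Return (performance, normalized new weights) for one boosting round.

    weights: current instance weights, assumed to sum to 1.
    misclassified: True where the base classifier got the instance wrong.
    """
    total_error = sum(w for w, wrong in zip(weights, misclassified) if wrong)
    eps = 1e-10  # guards against division by zero when TE is 0 or 1
    performance = 0.5 * math.log((1 - total_error + eps) / (total_error + eps))
    # Increase the weight of misclassified records, decrease the others.
    updated = [
        w * math.exp(performance if wrong else -performance)
        for w, wrong in zip(weights, misclassified)
    ]
    norm = sum(updated)
    return performance, [w / norm for w in updated]


# Four records with equal initial weight; the first one is misclassified.
performance, new_weights = adaboost_round(
    [0.25, 0.25, 0.25, 0.25], [True, False, False, False]
)
print(round(performance, 4), [round(w, 4) for w in new_weights])
```

After normalization the misclassified record carries half of the total weight, so the next base classifier is pushed to get it right.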

Support vector machine
SVM is a supervised learning model in which the input data is represented in an n-dimensional feature space. The space is then divided into two parts by an (n-1)-dimensional hyperplane. The n-dimensional input data $X_i$ (i = 1, 2, ..., l) is labeled $Y_i = 1$ for class 1 and $Y_i = -1$ for class 2. For linearly separable data, a hyperplane can be defined:
$$f(x) = (W \cdot X) + p = 0 \quad (3)$$
The decision function is Sgn(f(x)); W is an n-dimensional vector and p is a scalar in Eq. (3). These determine the position of the hyperplane that completely separates the space, and it must adhere to the following constraints:
$$Y_i\,[(W \cdot X_i) + p] \ge 1, \quad i = 1, 2, \ldots, l \quad (4)$$
An ideal hyperplane is the hyperplane that produces the maximum margin. In the following equation, $S_i$ is the slack variable and C is the error penalty. The hyperplane's minimal solution is:
$$\min \; \frac{1}{2}\lVert W \rVert^2 + C \sum_{i=1}^{l} S_i \quad (5)$$
Based on:
$$Y_i\,[(W \cdot X_i) + p] \ge 1 - S_i, \quad i = 1, 2, \ldots, l \quad (6)$$
$S_i$ measures the distance between the sample $X_i$ and the margin when the sample lies on the wrong side of the margin. This calculation can be made easier by using the following formula:
$$\max \; \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j Y_i Y_j \,\mathrm{Ker}(X_i, X_j) \quad (7)$$
Based on:
$$\sum_{i=1}^{l} \alpha_i Y_i = 0, \quad 0 \le \alpha_i \le C, \quad i = 1, 2, \ldots, l$$
The kernel function $\mathrm{Ker}(X_i, X_j)$ returns the dot product of the feature space mappings of the original data points [20].
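To make the roles of W, p, Si, and C concrete, the following sketch evaluates the linear decision function and the soft-margin objective for a fixed hyperplane. It assumes the standard soft-margin form, 0.5 * ||W||^2 + C * sum(Si), with Si taken as the smallest non-negative value that satisfies constraint (6); the hyperplane, samples, and C are hypothetical, and no optimization is performed.

```python
def decision_value(w, x, p):
    """f(x) = (W . X) + p for a linear hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + p


def predict(w, x, p):
    """Sgn(f(x)): +1 for class 1, -1 for class 2."""
    return 1 if decision_value(w, x, p) >= 0 else -1


def slack(w, x, p, y):
    """Smallest Si >= 0 such that Yi * f(Xi) >= 1 - Si holds."""
    return max(0.0, 1.0 - y * decision_value(w, x, p))


def soft_margin_objective(w, p, samples, labels, c):
    """0.5 * ||W||^2 + C * sum of slacks (assumed standard form)."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    slack_term = sum(slack(w, x, p, y) for x, y in zip(samples, labels))
    return margin_term + c * slack_term


# Hypothetical hyperplane and three labeled samples.
w, p = [1.0, -1.0], 0.0
samples = [(2.0, 0.0), (0.0, 2.0), (0.5, 0.0)]
labels = [1, -1, 1]

predictions = [predict(w, x, p) for x in samples]
slacks = [slack(w, x, p, y) for x, y in zip(samples, labels)]
objective = soft_margin_objective(w, p, samples, labels, c=10.0)
print(predictions, slacks, objective)
```

The third sample is classified correctly but lies inside the margin, so it is the only one with a non-zero slack, and the penalty C controls how much that violation costs.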

Proposed Worm Detection System
To increase the ability to detect worms in networks, we propose an efficient data mining model for worm detection that uses both misuse and anomaly detection techniques. Each instance in the dataset is labeled as "normal" or "attack" (worms are one type of attack), and a learning algorithm is trained on the labeled data. Figure 1 shows the framework of the proposed worm detection model, which is divided into four main phases:
• Dataset preprocessing: we first apply preprocessing steps to the original datasets to make the data fit for the classification algorithm.
• Dimensionality reduction: a feature selection approach based on correlation and chi-square feature selection is used to select the most relevant features and reduce the dimensionality of the dataset.
• Classifier training: we use the ensemble Ada boosting algorithm to build classifiers that improve the accuracy of worm detection.
• Classification (testing): the trained model is used to predict the results.

Preprocessing
The UNSW-NB15 dataset was created by the Australian Center for Cyber Security (ACCS) in collaboration with several international researchers for use in IDS research. The IXIA PerfectStorm tool was used to create a rich hybrid collection of normal and abnormal contemporary network traffic. This dataset is used to train the proposed model. It contains 2,540,044 instances [21], distributed across 4 large CSV files. In addition to those files, there are separate training and testing sets: the training set contains 175,341 records and the testing set contains 82,332 records. Each set contains 45 columns, 1 for the id and 44 for features. The names and descriptions of the dataset features are shown in Table 1. From these training and testing records, 5000 records were selected to train the proposed model; they contain 154 worm records, and the remainder are normal traffic and other types of attacks. The UNSW-NB15 dataset is described in detail in Table 2. Normal data and nine types of attacks are included in these training and testing datasets: Analysis, Backdoors, DoS, Exploits, Fuzzers, Generic, Reconnaissance, Shellcode, and Worms. Because the algorithm performs binary classification, we changed the last column (45) to column no. (44), since the latter represents the binary classification: 0 refers to normal traffic and 1 to all types of attacks, including worms.
Because the UNSW-NB15 dataset contains both continuous and discrete features, it is necessary to convert the continuous attributes to discrete ones to ensure the system's efficiency and to deal with new values that appear in the test dataset but are not present in the training dataset. Following discretization, we used Min-Max normalization to improve the model's efficiency and effectiveness by placing attribute values between 0 and 1. After discretization and normalization, we use correlation feature selection and chi-square feature selection (the work's originality lies in combining these two methods) to exclude unused and redundant features from the dataset (see Algorithm 1). We chose these methods because our experiments showed them to be the most effective at selecting the features that fit our proposal, increasing accuracy and reducing the false alarm rate, since the selected features are more relevant to our problem. Worms are malicious code that replicates itself; they consume an excessive amount of system memory and network bandwidth, which lowers the system's availability.
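The normalization and the combination of the two selections can be sketched as below. This is an illustration under two assumptions: min-max scaling maps each column to the range 0 to 1, and "combining" the two methods means keeping the features that both methods select (a set intersection, which is consistent with 27 features resulting from two sets of 33, but is not stated explicitly). The helper names and feature names are hypothetical.

```python
def min_max_normalize(values, low=0.0, high=1.0):
    """Rescale a column of numbers linearly into the range [low, high]."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        # A constant column carries no information; map it to the lower bound.
        return [low for _ in values]
    scale = (high - low) / (largest - smallest)
    return [low + (v - smallest) * scale for v in values]


def common_features(cfs_selected, chi2_selected):
    """Features chosen by both methods, in the order given by the first list."""
    chi2_set = set(chi2_selected)
    return [name for name in cfs_selected if name in chi2_set]


normalized = min_max_normalize([2, 4, 6])

# Hypothetical feature names standing in for the selected dataset columns.
cfs_features = ["f1", "f2", "f3", "f5"]
chi2_features = ["f2", "f3", "f4", "f5"]
combined = common_features(cfs_features, chi2_features)
print(normalized, combined)
```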

Classification
After feature selection, we split the dataset into two parts: a training part containing 67% of the total number of records and a testing part containing the remaining 33%. The two parts are used to train and test the proposed model. Note that the records for each part were selected at random. We then use the Ada-Boost algorithm (see Algorithm 2) with SVM as the base estimator. Algorithm 2 describes Ada boost with an SVM classifier for detecting normal or attack records in the UNSW-NB15 dataset. To train the SVM base classifier on a weighted sample, the weight function is computed from the error value of each training sample. If the weighted sample exceeds the threshold value ∅, the Ada boost technique makes the SVM classifier stronger. The resulting strong classifier labels each record as normal or attack based on the objective function.
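The random 67%/33% split can be sketched as follows. The function name, the fixed seed, and the use of integer record ids in place of real records are assumptions made for illustration; the seed only serves to make the example repeatable.

```python
import random


def split_dataset(records, train_fraction=0.67, seed=0):
    """Shuffle the records and split them into training and testing parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# 5000 record ids stand in for the 5000 selected dataset records.
train_part, test_part = split_dataset(range(5000))
print(len(train_part), len(test_part))
```

Shuffling before cutting keeps the two parts disjoint while letting every record land in either part with the same probability.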

Experimental work and Results
As previously stated, the goal of this paper is to create a high-accuracy worm detection system. A model called CFS-chi2, which combines CFS and chi2, determines a subset of the original features in order to eliminate irrelevant features and improve classification efficiency. In addition, an Ada boost ensemble classifier is trained and tested during the classification stage on the UNSW-NB15 dataset. The experiments are carried out on a desktop PC equipped with a 1.80 GHz Intel Core i3-3217U processor and 4 GB of RAM.
The two classification models are built after the two chosen classifiers (SVM and DT) have been trained on the training dataset using ensemble Ada Boosting. Then, to ensure that the built models are valid and accurate, these two models are applied to the test dataset records. The categorization outcomes of testing are true positive (TP, normal), true negative (TN, attack), false positive (FP, not normal), false negative (FN, not an attack), or unknown, such as user behavior or a new attack. The results of using the SVM and DT classifiers to classify the testing dataset records are shown in Table 3. These results show that TP is greater than TN, FP, FN, and unknown, both when all features are used and when feature selection methods are applied. For the SVM classifier, feature selection reduces FP from 15 to 10 with CFS, 6 with Chi2, and 0 with CFS+Chi2fs. To evaluate our proposed system, we use the metrics shown in Table 4: accuracy ((TP + TN)/(TP + TN + FP + FN)), detection rate (DR = TP/(TP + FN)), false alarm rate (FAR = FP/(TN + FP)), true negative rate (TNR = TN/(TN + FP)), positive predictive value (PPV = TP/(TP + FP)), false-positive rate (FPR = FP/(FP + TN)), false discovery rate (FDR = FP/(FP + TP)), error rate, and area under the curve (AUC), which is a summary of the ROC curve that measures the ability of a classifier to distinguish between classes. We apply all metrics to four different subsets of the UNSW-NB15 dataset features (all features, the 33 features selected by correlation, the 33 features selected by chi2, and the 27 features obtained by combining chi2 with correlation). Table 4 highlights the findings on the UNSW-NB15 dataset, including the ensemble SVM classifier results. The results suggest that, without feature selection, the ensemble classifier is not optimal on several criteria. When feature selection methods are used, however, performance improves to the best possible scenario. In detail, without feature selection our proposed system exhibits an accuracy of 99.15, FAR of 0.0203, TNR of 0.9796, PPV of 0.9838, FPR of 0.0203, FDR of 0.0161, error rate of 0.0090, and AUC of 99.23. The results improve when feature selection methods are used and reach the best possible case with CFS+Chi2fs, with the highest accuracy of 100, FAR of 0.0, TNR of 1.00, PPV of 1.00, FPR of 0.0, FDR of 0.0, error rate of 0.0, and AUC of 100.
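The count-based metrics listed above can be computed from the four confusion-matrix counts as in the sketch below. The counts are hypothetical and are not taken from Table 3. The error rate is assumed to be (FP + FN) divided by the total, since the text does not give its formula, and AUC is omitted because it requires classifier scores rather than counts.

```python
def evaluate(tp, tn, fp, fn):
    """Evaluation metrics computed from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "DR": tp / (tp + fn),          # detection rate
        "FAR": fp / (tn + fp),         # false alarm rate (same formula as FPR)
        "TNR": tn / (tn + fp),         # true negative rate
        "PPV": tp / (tp + fp),         # positive predictive value
        "FDR": fp / (fp + tp),         # false discovery rate
        "error_rate": (fp + fn) / total,  # assumed definition
    }


# Hypothetical counts for illustration only.
metrics = evaluate(tp=90, tn=95, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Under these definitions FAR and TNR always sum to 1, as do PPV and FDR, which gives a quick consistency check on any reported row of metrics.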
Table 5 shows the results of evaluating Ada boost with DT on four different subsets of the UNSW-NB15 features (all features, the 33 features selected by correlation, the 33 features selected by chi2, and the 27 features obtained by combining chi2 with correlation). When the DT classifier is applied with the ensemble Ada boosting algorithm without feature selection, our proposed system has an accuracy of 0.982, a DR of 0.973, a FAR of 0.0055, a TNR of 0.994, a PPV of 0.995, an FPR of 0.0055, an FDR of 0.004, an error rate of 0.017, and an AUC of 0.983. The results improve when feature selection methods are used. With CFS+Chi2fs, the best case is reached, with the highest accuracy of 0.992, a DR of 0.989, a FAR of 0.0, a TNR of 1.00, a PPV of 1.00, an FPR of 0.0, an FDR of 0.0, an error rate of 0.07, and an AUC of 0.993.
To better understand the benefits of the suggested methodology, we compare our system to the related work discussed in Section 2. The results of the comparison are shown in Table 6. The comparison covers the classification method, the selected dataset, the feature selection techniques, the number of selected features, accuracy, FAR, and DR for intrusion detection. Compared with Naive Bayes, PART, Adaptive Boost, C4.5, RF, Forest-PA, and the other comparable approaches in Section 2, our system gives the highest accuracy and detection rate, with a DR equal to 1.00. Comparing our system to the single SVM shows the advantages of the ensemble method, because the SVM is a single classifier with a high variance. Ensembles frequently reduce the variance component of the prediction errors made by the contributing models, resulting in a significant increase in accuracy (from 0.85 to 1.00) and a decrease in the false alarm rate, FAR (from 15.26 to 0.0). With SVM as the base estimator, our system likewise achieved the highest accuracy and detection rate when compared to other ensemble approaches such as Bagging(SVM), Stacking(SVM), and Boosting(SVM). Table 7 compares the work with the contribution and without the contribution. The results show that the SVM algorithm performs better than the DT algorithm in classification. The SVM method using CFS+Chi2fs has a total accuracy of 100%, a DR of 1.00, and a FAR of 0.0. The DT method has an overall accuracy of 0.982, a DR of 0.973, and a FAR of 0.0055.

Conclusions
The proposed system emphasizes the importance of using intrusion detection systems (IDS) in networks to detect worm attacks, which are considered the most dangerous attacks in a network and impact resource availability. Furthermore, the normalization and discretization processes make the proposed system more efficient. To improve the accuracy of the proposed system and reduce the amount of time required, the correlation and chi2 algorithms are suggested as feature selection methods. Using these algorithms improves classification accuracy, as shown in Tables 4 and 5. The accuracy of the Ada boost classifier that uses SVM supported by chi2+corr with 27 features is better than that obtained using all features or using the Ada boost classifier with Corr or chi2 alone with 33 features. In addition, chi2+corr reduces the false alarm rate compared to CFS or chi2 alone, as shown in Table 4. In contrast, when a decision tree classifier is used as the base estimator in Ada boost (without our contribution), the system is less accurate, detects less, and has a higher false alarm rate, as shown in Table 7.

Algorithm (1) Preprocessing
Input: subset of the UNSW-NB15 dataset
Output: data values ranging from zero to one, independent features with a strong connection to the class, and class-dependent features
Step 1: Min-max normalization
    Establish upper and lower bounds (P, Q) // a particular range
    Determine the minimum and maximum values (min, max)
    For each data item x, do
        norm(x) = Q + (x - min) * (P - Q) / (max - min)
    End For
Step 3: Chi-square feature selection
    For each UNSW-NB15 dataset feature
        Compute X_c^2 with the class (see equation (1)), alpha = 0.05
        From the chi2 table, find X_c^2' where alpha = 0.05 and compare it to X_c^2
        If X_c^2 < X_c^2' the feature is independent (dropped)
        Else it depends on the class (not dropped)
    End For
End

Figure 1: Flowchart of the proposed worm detection system

Table 3: Classification results of the SVM and DT classifiers

Table 4: Metrics to evaluate ensemble Ada boost with SVM

Table 5: Metrics to evaluate ensemble Ada boost with DT

Table 6: Comparison between the proposed system and the related work

Table 7: A comparison between the work with the contribution and without the contribution