Feature Selection using Wrapper approach

This part of the research aims to investigate and select critical features from a given dataset (KDD Cup 1999). The wrapper technique can be used to select critical features for both normal and malicious data. Figure 3.6 describes activity M2 in the research framework and shows how the technique is used to extract malicious and normal features from the dataset supplied to the wrapper.

Consequently, the result of this technique will be a critical feature set for each connection type.

The general aspects of feature selection with the wrapper approach are illustrated in Figure 3.5. This study adopts the wrapper approach to investigate critical features in traffic connections. The wrapper uses a classifier to determine the importance of the selected features, then selects and refines the correlated features that represent the behavior of each pattern in the dataset. The candidate classification technique used as the induction method in the wrapper is the decision tree (DT). The genetic algorithm (GA) is wrapped with the induction methods as a random search method because of its global search strategy, which is inspired by the principle of natural selection. Different wrappers will be investigated, each consisting of the GA as the random search method wrapped with one of the suggested induction functions (classifiers). More details about these techniques can be found in chapter 2. Chapter 4 discusses the design and experiments for phase 1, and a comparative analysis will compare the enhancements with other published studies.

Feature Selection using Wrapper approach Image 1

A feature selection method generally consists of the four steps described below.

  • Generate candidate subset: If the original feature set contains n features, the total number of competing candidate subsets is 2 to the power n, which is a huge number even for medium-sized n. Subset generation is a search procedure that produces candidate feature subsets for evaluation based on a certain search strategy. Search strategies are broadly classified into categories that include complete search and random search; the genetic algorithm (GA) is an example of the latter.
  • Subset evaluation: An evaluation function assesses the subset generated in the previous step (generate candidate subset) by using the wrapper approach. Wrapper strategies for feature selection use an induction algorithm to estimate the merit of feature subsets. Wrappers are tuned to the specific interaction between an induction algorithm and its training data.
  • Stopping condition: Since the number of subsets can be enormous, some sort of stopping criterion is necessary. Stopping criteria may be based on the generation procedure or on the evaluation function.
    Stopping criteria based on the generation procedure include:
    - Whether a predefined number of features has been selected.
    - Whether a predefined number of iterations has been reached.
    Stopping criteria based on the evaluation function can be:
    - Whether addition (or deletion) of any feature does not produce a better subset.
    - Whether an optimal subset according to some evaluation function has been obtained.
  • Validation procedure: This step checks whether the selected feature subset is valid. Usually, the result obtained with the original feature set is compared with the result obtained with the features selected by the wrapper, using each as input to some induction algorithm on the KDD Cup dataset. Another approach to validation is to use a different feature selection algorithm to obtain relevant features and then compare the results by applying classifiers to each relevant attribute subset.
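The four steps above can be sketched as a small search loop. This is only an illustrative skeleton, not the study's implementation: subsets are generated at random, the iteration budget acts as the generation-based stopping criterion, and `toy_merit` is a hypothetical stand-in for the accuracy an induction algorithm such as a decision tree would report. The count of 41 features is the figure usually cited for KDD Cup 1999 and is assumed here.

```python
import random

# Assumption: KDD Cup 1999 records are usually described with 41 features,
# so exhaustive search would have to consider 2**41 candidate subsets.
N_SUBSETS_KDD = 2 ** 41

RELEVANT = {0, 3, 5}  # hypothetical "truly useful" feature indices


def toy_merit(mask):
    """Stand-in for classifier accuracy: reward relevant features, penalise extras."""
    hits = sum(1 for i in RELEVANT if mask[i])
    extras = sum(mask) - hits
    return hits - 0.1 * extras


def wrapper_search(n_features, evaluate, max_iterations=100, seed=0):
    """Random-search wrapper skeleton: generate, evaluate, stop, return best."""
    rng = random.Random(seed)
    best_subset, best_score = None, float("-inf")
    # Step 3 (stopping condition): a predefined number of iterations.
    for _ in range(max_iterations):
        # Step 1: generate a candidate subset as a bit mask (1 = feature kept).
        mask = tuple(rng.randint(0, 1) for _ in range(n_features))
        if not any(mask):
            continue  # skip the empty subset
        # Step 2: estimate the merit of the subset with the induction algorithm.
        score = evaluate(mask)
        if score > best_score:
            best_subset, best_score = mask, score
    return best_subset, best_score


subset, score = wrapper_search(8, toy_merit, max_iterations=200)
# Step 4 (validation): compare against the score of the full feature set.
baseline = toy_merit((1,) * 8)
print(N_SUBSETS_KDD, subset, score, baseline)
```

In a real wrapper, `evaluate` would train and test the chosen classifier on the columns selected by `mask`, which is what makes each evaluation expensive and the stopping criterion necessary.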

Stopping criteria based on an evaluation function will be:

* Whether addition (or deletion) of any feature does not produce a better subset
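One way to read this criterion is as a local-optimality test: the search can stop once no single addition or deletion of a feature improves the evaluation score. The sketch below assumes subsets are encoded as bit masks, and `toy_merit` is again a hypothetical stand-in for the classifier-based evaluation function, not the one used in the study.

```python
def toy_merit(mask, relevant=(0, 2)):
    """Hypothetical evaluation function: relevant features help, extras cost 0.1."""
    hits = sum(1 for i in relevant if mask[i])
    extras = sum(mask) - hits
    return hits - 0.1 * extras


def is_locally_optimal(mask, evaluate):
    """True if adding or deleting any single feature gives no better subset."""
    base = evaluate(mask)
    for i in range(len(mask)):
        neighbour = list(mask)
        neighbour[i] = 1 - neighbour[i]  # flip: add if absent, delete if present
        if evaluate(tuple(neighbour)) > base:
            return False
    return True


print(is_locally_optimal((1, 0, 1, 0), toy_merit))  # only relevant features kept
print(is_locally_optimal((1, 1, 1, 0), toy_merit))  # deleting feature 1 helps
```

The check costs one evaluation per feature, so with a classifier in the loop it is usually applied only to the current best subset rather than to every candidate.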

The genetic algorithm is mainly composed of three operators: reproduction, crossover, and mutation. Reproduction selects good strings; crossover combines good strings to try to generate better offspring; mutation alters a string locally to attempt to create a better string. In each generation, the population is evaluated and tested for termination of the algorithm. If the termination criterion is not satisfied, the population is operated upon by the three GA operators and then re-evaluated. This procedure continues until the termination criterion is met. The working of the proposed wrapper method is shown in the figure below. GA is used as the random search method with one classifier, namely the decision tree (DT), as the induction method wrapped with GA. The relevant attributes identified by the proposed wrapper are then validated by different classifiers.

Feature Selection using Wrapper approach Image 2

For GA, the population size is 20, the number of generations is 20 (used as the terminating condition), the crossover rate is 0.6, and the mutation rate is 0.033.
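A minimal sketch of a GA loop with these parameter values is given below. Several details are assumptions made for illustration and are not stated in the text: binary tournament selection stands in for reproduction, crossover is single-point, and `toy_fitness` is a hypothetical placeholder for the decision tree accuracy that the wrapper would compute for each feature subset.

```python
import random

POP_SIZE = 20           # population size
N_GENERATIONS = 20      # number of generations (terminating condition)
CROSSOVER_RATE = 0.6
MUTATION_RATE = 0.033


def toy_fitness(bits, relevant=(1, 4, 7)):
    """Placeholder for classifier accuracy on the features selected by `bits`."""
    hits = sum(1 for i in relevant if bits[i])
    return hits - 0.1 * (sum(bits) - hits)


def select(population, scores, rng):
    """Reproduction, here as binary tournament selection (an assumed scheme)."""
    a, b = rng.randrange(len(population)), rng.randrange(len(population))
    return population[a] if scores[a] >= scores[b] else population[b]


def crossover(p1, p2, rng):
    """Single-point crossover, applied with probability CROSSOVER_RATE."""
    if len(p1) > 1 and rng.random() < CROSSOVER_RATE:
        cut = rng.randrange(1, len(p1))
        return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
    return p1, p2


def mutate(bits, rng):
    """Flip each bit independently with probability MUTATION_RATE."""
    return tuple(1 - b if rng.random() < MUTATION_RATE else b for b in bits)


def run_ga(n_features, fitness, seed=0):
    rng = random.Random(seed)
    population = [tuple(rng.randint(0, 1) for _ in range(n_features))
                  for _ in range(POP_SIZE)]
    best, best_score = None, float("-inf")
    for generation in range(N_GENERATIONS):
        # Evaluate the current population and remember the best string so far.
        scores = [fitness(ind) for ind in population]
        for ind, s in zip(population, scores):
            if s > best_score:
                best, best_score = ind, s
        if generation == N_GENERATIONS - 1:
            break  # termination criterion met; no further breeding
        # Apply the three operators to build the next generation.
        next_pop = []
        while len(next_pop) < POP_SIZE:
            c1, c2 = crossover(select(population, scores, rng),
                               select(population, scores, rng), rng)
            next_pop.extend([mutate(c1, rng), mutate(c2, rng)])
        population = next_pop[:POP_SIZE]
    return best, best_score


best, best_score = run_ga(10, toy_fitness)
print(best, best_score)
```

With 20 strings over 20 generations the search evaluates at most 400 subsets, which is the practical appeal of a random search when each evaluation requires training a classifier.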