Introduction
Big Data has contributed to the growing complexity of data mining and machine learning techniques (Rong et al., 2019), such as classification, whose algorithms already require considerable processing. Feature selection is a preprocessing step that aims to reduce data dimensionality as much as possible without degrading learning performance. The literature contains various feature selection approaches, including Filter (Fatima Bibi et al., 2015), Wrapper, and Hybrid methods. Filter methods target two types of features for deletion: irrelevant and redundant features (Yu et al., 2003), (Song et al., 2013), (Lashkia et al., 2004). This is done using statistical measures such as a correlation measure or a test of statistical independence. An irrelevant feature is one that has weak or no correlation with the target class, while a redundant feature is one that is strongly correlated with another feature (or features). “The first does not contribute to predictive accuracy and the second does not respond to obtain a better indicator. It mainly provides information already existing in one or more other attributes” (Song et al., 2013). Filter methods are fast to execute, but good results are not always guaranteed because there is no unified, comprehensive definition of statistical correlation. For example, two variables that are not linearly correlated are not necessarily independent; they may be non-linearly correlated (Shen et al., 2009). Wrapper methods rely on learning algorithms to evaluate the subsets of selected features. They usually produce high-quality results but suffer from high complexity and long execution times. Hybrid methods combine the Filter and Wrapper approaches to pair speed of execution with quality of results. Feature selection is the process of finding a subset of features in a large search space, which is known to be an NP-hard problem.
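As an illustration, a minimal filter step along these lines can be sketched with Pearson correlation: features weakly correlated with the class are dropped as irrelevant, and features strongly correlated with an already-kept feature are dropped as redundant. The thresholds, function name, and greedy ordering below are illustrative assumptions, not the specific measures used in any of the cited works.

```python
import numpy as np

def correlation_filter(X, y, relevance_threshold=0.1, redundancy_threshold=0.9):
    """Illustrative correlation-based filter (assumed thresholds).

    Keeps features whose |Pearson corr| with the class exceeds
    relevance_threshold, then greedily drops any feature that is
    too strongly correlated with a feature already kept.
    """
    n_features = X.shape[1]
    # Relevance: absolute correlation between each feature and the class.
    relevance = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
    )
    # Consider features from most to least relevant, dropping irrelevant ones.
    candidates = [j for j in np.argsort(-relevance) if relevance[j] > relevance_threshold]
    selected = []
    for j in candidates:
        # Redundancy: skip j if it is strongly correlated with a kept feature.
        if all(abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) < redundancy_threshold
               for k in selected):
            selected.append(j)
    return sorted(selected)
```

Note that, as the text above points out, such a linear measure can miss non-linear dependencies; this is precisely the weakness that motivates wrapper and hybrid methods.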
In such instances, evolutionary algorithms are among the most effective solutions. In this work, the authors use the BPSO algorithm (Kennedy et al., 1997), with some improvements, to find an appropriate set of features. Since the BPSO algorithm has limited convergence and restricted inertia weight adjustment, the authors propose a multiple inertia weight strategy, inspired by (Too et al., 2019), to influence the changes in particle motion so that the search process is more diverse. In the BPSO algorithm, the position of a particle is a sequence of bits, each holding the value 0 or 1, and the number of bits equals the dimensionality of the data. The second improvement to the BPSO algorithm modifies the computation of a particle's new position so that, in very limited cases, one or more bits of its position can take their new values directly from the best global solution. In PSO-based feature selection algorithms, fitness evaluation is the most time-consuming part, because it typically involves running a classifier. Therefore, we divided the particles into independent groups that execute simultaneously; within each group, each particle evaluates its own fitness value. Apache Spark is among the best frameworks for big data analytics, so we used it for distribution and parallel execution.
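A single BPSO update of this kind can be sketched as follows. This is a minimal illustration, assuming a standard sigmoid transfer function, a pool of inertia weights cycled across iterations as a stand-in for the multiple inertia weight strategy, and a small probability `gbest_copy_prob` of copying a bit from the global best; all names and parameter values are assumptions, not the authors' exact scheme.

```python
import math
import random

def bpso_update(position, velocity, pbest, gbest, inertia_weights, iteration,
                c1=2.0, c2=2.0, gbest_copy_prob=0.02):
    """One illustrative BPSO step over a bit-string position.

    inertia_weights: pool of weights; the particle uses a different one
    each iteration to diversify its motion (assumed cycling scheme).
    gbest_copy_prob: small chance that a bit is taken directly from the
    global best solution, as in the second improvement described above.
    """
    w = inertia_weights[iteration % len(inertia_weights)]
    new_position, new_velocity = [], []
    for d in range(len(position)):
        r1, r2 = random.random(), random.random()
        # Standard PSO velocity update with the selected inertia weight.
        v = (w * velocity[d]
             + c1 * r1 * (pbest[d] - position[d])
             + c2 * r2 * (gbest[d] - position[d]))
        # Sigmoid transfer function maps velocity to a bit probability.
        s = 1.0 / (1.0 + math.exp(-v))
        bit = 1 if random.random() < s else 0
        # In rare cases, the bit inherits its value from the global best.
        if random.random() < gbest_copy_prob:
            bit = gbest[d]
        new_velocity.append(v)
        new_position.append(bit)
    return new_position, new_velocity
```

In a Spark-based setting, the particle swarm would be partitioned into independent groups, with each group's fitness evaluations (classifier runs) executing in parallel on the cluster while updates like the one above remain cheap and local.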
Therefore, our proposed algorithm, which we named PHFS (Parallel Hybrid Feature Selection), has two steps. As a first step, all irrelevant features are removed (only the most relevant features are selected) (Song et al., 2013) to reduce the search space.