#8 Data Science 👩‍💻 | Data Preprocessing in Python using Scikit-Learn

Binal Kagathara
6 min read · Oct 27, 2021


Datasets nowadays are very detailed. Including more features makes the model more complex, and the model may end up overfitting the data. Some features can be noise and potentially damage the model. By removing those unimportant features, the model may generalize better.

The scikit-learn website lists several feature selection methods. Here, we will apply different feature selection methods to the same dataset and compare their performance.

Dataset Used

The dataset used for carrying out data reduction is the ‘Iris’ dataset, available in the sklearn.datasets library.

Importing all required libraries,

Loading the Iris dataset,
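A minimal sketch of these first two steps, assuming only numpy, pandas and scikit-learn are needed:

import numpy as np
import pandas as pd
from sklearn import datasets

# Load the Iris dataset bundled with scikit-learn
iris = datasets.load_iris()
X, y = iris.data, iris.target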

Now, let's see the information about the dataset.

Dataset shape,
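One way to inspect the dataset, as a sketch:

print(iris.feature_names)   # the four original feature names
print(iris.target_names)    # the three Iris species
print(X.shape)              # (150, 4): 150 samples, 4 features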

The data has four features. To test the effectiveness of different feature selection methods, we add some noise features to the dataset.

Adding noise,
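The exact way the noise was generated is not shown above; one common approach, used here as an assumption, is to append ten Gaussian random columns:

# Append 10 random (noise) features so the dataset has 14 columns in total
rng = np.random.RandomState(42)            # fixed seed, chosen here for reproducibility
noise = rng.normal(size=(X.shape[0], 10))
X = np.hstack([X, noise])
print(X.shape)                             # (150, 14)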

The dataset now has 14 features. Before applying a feature selection method, we need to split the data first. The reason is that we select features based only on information from the training set, not from the whole dataset. We should hold out part of the dataset as a test set to evaluate the performance of the feature selection and the model. Thus, the information from the test set is never seen while we conduct feature selection and train the model.

Splitting The Dataset,
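A typical split might look like the sketch below; the 70/30 ratio and stratification are assumptions, not necessarily the settings used originally:

from sklearn.model_selection import train_test_split

# Hold out 30% of the data as a test set, keeping the class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)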

We will apply the feature selection based on X_train and y_train.

Variance Threshold

Variance Threshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. By default, it removes all zero-variance features. Our dataset has no zero-variance features, so our data isn’t affected here. To read more about Variance Threshold, click here

Variance Threshold,
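A sketch of applying VarianceThreshold with its default threshold of 0:

from sklearn.feature_selection import VarianceThreshold

# Drop features whose variance is zero (the default threshold)
selector = VarianceThreshold()
X_train_vt = selector.fit_transform(X_train)
print(X_train_vt.shape)   # no zero-variance features here, so all 14 columns remain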

Univariate Feature Selection

  • Univariate feature selection works by selecting the best features based on univariate statistical tests.
  • We compare each feature to the target variable to see whether there is a statistically significant relationship between them.
  • When we analyze the relationship between one feature and the target variable, we ignore the other features. That is why it is called ‘univariate’.
  • Each feature has its own test score.
  • Finally, all the test scores are compared, and the features with top scores will be selected.
  • These objects take as input a scoring function that returns univariate scores and p-values (or only scores for SelectKBest and SelectPercentile):
  • For regression: f_regression, mutual_info_regression
  • For classification: chi2, f_classif, mutual_info_classif

1. f_classif

Also known as the ANOVA F-test. To read more about f_classif, click here

ANOVA Test,
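A sketch of univariate selection with f_classif via SelectKBest; keeping the 4 highest-scoring features is an assumption:

from sklearn.feature_selection import SelectKBest, f_classif

# Score each feature against the target with the ANOVA F-test
selector = SelectKBest(score_func=f_classif, k=4)
X_train_f = selector.fit_transform(X_train, y_train)
print(selector.scores_)                     # one F-score per feature
print(selector.get_support(indices=True))   # indices of the selected features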

2. chi2

To read about chi2, click here

chi2 Test,
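chi2 only accepts non-negative feature values, so the noisy columns have to be brought into a non-negative range first; using MinMaxScaler for that, and k=4, are assumptions in this sketch:

from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Rescale to [0, 1] because chi2 requires non-negative values
X_train_pos = MinMaxScaler().fit_transform(X_train)
selector = SelectKBest(score_func=chi2, k=4)
X_train_chi2 = selector.fit_transform(X_train_pos, y_train)
print(selector.get_support(indices=True))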

3. mutual_info_classif

Mutual information comes in two variants:

  1. for classification: click here
  2. for regression: click here

mutual_info_classif Test,
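The same pattern works with mutual_info_classif; again, k=4 is an assumption:

from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Estimate the mutual information between each feature and the class labels
selector = SelectKBest(score_func=mutual_info_classif, k=4)
X_train_mi = selector.fit_transform(X_train, y_train)
print(selector.get_support(indices=True))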

Recursive Feature Elimination

Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.

To read more about RFE on Sklearn, click here

RFE using Random Forest Classifier,
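A sketch of RFE wrapped around a RandomForestClassifier; the estimator settings and the number of features to keep are assumptions:

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Recursively drop the least important features until 4 remain
estimator = RandomForestClassifier(n_estimators=100, random_state=0)
rfe = RFE(estimator=estimator, n_features_to_select=4)
rfe.fit(X_train, y_train)
print(rfe.support_)    # boolean mask of the selected features
print(rfe.ranking_)    # rank 1 marks a selected feature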

Differences Between Before and After Using Feature Selection

a. Before using Feature Selection

Before Feature Selection,

b. After using Feature Selection

Using f_classif,

After using Feature Selection,
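The metrics shown here come from the original output; the comparison itself can be sketched as follows, where the choice of RandomForestClassifier and of f_classif with k=4 are assumptions:

from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

clf = RandomForestClassifier(random_state=0)

# a. Train and evaluate on all 14 features (original + noise)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))

# b. Train and evaluate on the features kept by f_classif
selector = SelectKBest(score_func=f_classif, k=4).fit(X_train, y_train)
clf.fit(selector.transform(X_train), y_train)
print(classification_report(y_test, clf.predict(selector.transform(X_test))))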

There are clear differences in precision, recall, f1-score and accuracy between the two outputs. This shows how feature selection can increase the performance of the model.

Principal Component Analysis (PCA)

We can speed up the fitting of a machine learning algorithm by changing the optimization algorithm. A more common way of speeding up a machine learning algorithm is by using Principal Component Analysis (PCA).

If your learning algorithm is too slow because the input dimension is too high, then using PCA to speed it up can be a reasonable choice. This is probably the most common application of PCA. Another common application of PCA is for data visualization.

For a lot of machine learning applications it helps to be able to visualize your data. Visualizing 2 or 3 dimensional data is not that challenging. The Iris dataset used is 4 dimensional. We will use PCA to reduce that 4 dimensional data into 2 or 3 dimensions so that you can plot and hopefully understand the data better.

So, now let's execute PCA for visualization on the Iris dataset.

Importing the libraries,

The dataframe after using StandardScaler,
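A sketch of this step; loading the features into a pandas DataFrame and the matplotlib import (used for plotting below) are assumptions about how the original code was organised:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
features = iris.feature_names                   # the four measurement columns
df = pd.DataFrame(iris.data, columns=features)

# Standardize the features to zero mean and unit variance before PCA
x = StandardScaler().fit_transform(df[features])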

PCA Projection to 2D

The original data has 4 columns (sepal length, sepal width, petal length, and petal width). In this section, the code projects the original data, which is 4 dimensional, into 2 dimensions. The new components are just the two main dimensions of variation.

PCA for 2D projection,

4 columns are converted to 2 principal columns,
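A sketch of the 2D projection; the column names are chosen to match those used in the plotting code further below:

from sklearn.decomposition import PCA

# Project the 4 standardized features onto 2 principal components
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(
    data=principalComponents,
    columns=['principal component 1', 'principal component 2'])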

Concatenating DataFrame along axis = 1. finalDf is the final DataFrame before plotting the data.

Concatenating target column into dataframe,
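A sketch of building finalDf; mapping the integer class labels to the ‘Iris-…’ names expected by the plotting code is an assumption:

# Map integer class labels to the species names used by the plotting code below
species = pd.Series(iris.target).map(
    {0: 'Iris-setosa', 1: 'Iris-versicolor', 2: 'Iris-virginica'})
target_df = pd.DataFrame({'target': species})

# finalDf holds the two principal components plus the target column
finalDf = pd.concat([principalDf, target_df], axis=1)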

Now, let's visualize the dataframe by executing the following code:

# Plot the two principal components, colouring the points by species
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1)
ax.set_xlabel('Principal Component 1', fontsize = 10)
ax.set_ylabel('Principal Component 2', fontsize = 10)
ax.set_title('2 component PCA', fontsize = 15)

targets = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica']
colors = ['r', 'g', 'b']
for target, color in zip(targets, colors):
    # Select the rows of finalDf belonging to the current species
    indicesToKeep = finalDf['target'] == target
    ax.scatter(finalDf.loc[indicesToKeep, 'principal component 1'],
               finalDf.loc[indicesToKeep, 'principal component 2'],
               c = color,
               s = 50)
ax.legend(targets)
ax.grid()

2D representation of dataframe,

PCA Projection to 3D

The original data has 4 columns (sepal length, sepal width, petal length, and petal width). In this section, the code projects the original data, which is 4 dimensional, into 3 dimensions. The new components are just the three main dimensions of variation.

Obtaining 3 principal component columns,
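A sketch of the three-component projection, mirroring the 2D code above:

# Project the same standardized data onto 3 principal components
pca3 = PCA(n_components=3)
components3 = pca3.fit_transform(x)
df3 = pd.DataFrame(
    components3,
    columns=['principal component 1', 'principal component 2',
             'principal component 3'])
finalDf3 = pd.concat([df3, target_df], axis=1)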

Now let's visualize the 3D graph,

Generating 3D graph,
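One way to draw the 3D scatter plot with matplotlib, as a sketch (the figure settings used originally are not known):

from mpl_toolkits.mplot3d import Axes3D   # registers the '3d' projection on older matplotlib versions

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(1, 1, 1, projection='3d')
ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3')
ax.set_title('3 component PCA')

for target, color in zip(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
                         ['r', 'g', 'b']):
    keep = finalDf3['target'] == target
    ax.scatter(finalDf3.loc[keep, 'principal component 1'],
               finalDf3.loc[keep, 'principal component 2'],
               finalDf3.loc[keep, 'principal component 3'],
               c=color, s=50)
ax.legend(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])
plt.show()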

Obtained 3D graph,

Summary

In this blog, I have tried out different feature selection methods on the same data and evaluated their performance. Compared to using all the features to train the model, the model performs better if we only use the features that remain after feature selection. After feature selection, PCA has been used to visualize the dataframe with the reduced components in 2D as well as 3D.

Visit the GitHub for code.

Thank You!
