Procedure Involved in Machine Learning
1. Business Problem Understanding & Problem Formulation
2. Data Collection According to Business Problem
3. Exploratory Data Analysis
a. Data Loading [Dealing with different types of file formats and errors while loading]
b. Lowering the Data
Data Cleaning
c. Checking for Duplicates
d. Treating Missing Values in Dependent and Independent Variables
i. Checking for Missing Values
ii. Handling Missing Data
iii. Deleting Missing Values OR
iv. Imputing Missing Values
a. Replacing with Arbitrary Values
b. Frequent Category Imputation [Filling with Mode]
c. Replacing with Mean
d. Replacing with Median
e. Replacing with Previous Values (Forward Fill)
f. Replacing with Next Values (Backward Fill)
g. Imputing Using Interpolation
h. Imputing with scikit-learn's Univariate and Multivariate Approaches
e. Outlier Detection and Treatment
1. Visualisation Methods
a. Detecting Outliers Using a Histogram
b. Detecting Outliers Using a Boxplot
c. Detecting Outliers Using a Scatter Plot
2. Statistical Methods
a. Z-Score Method for Normally Distributed Data
b. IQR Method for Skewed Data
c. Percentile-Based Approach
d. Special Case: Capping with the Winsorizer Method
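The imputation and outlier steps above can be sketched in a few lines of pandas. This is only an illustration under assumed data: the `income` column and its values are hypothetical, and capping at the IQR fences is one simple way to cap, not necessarily what a dedicated Winsorizer class would do by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value and one extreme value.
df = pd.DataFrame({"income": [32.0, 35.0, np.nan, 31.0, 36.0, 34.0, 33.0, 250.0]})

# Median imputation: the median is less sensitive to the extreme value than the mean.
df["income_filled"] = df["income"].fillna(df["income"].median())

# IQR rule: flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR].
q1 = df["income_filled"].quantile(0.25)
q3 = df["income_filled"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
df["is_outlier"] = ~df["income_filled"].between(lower, upper)

# Capping: clip values to the IQR fences instead of deleting the rows.
df["income_capped"] = df["income_filled"].clip(lower=lower, upper=upper)

print(df)
```

Whether to delete, cap, or keep a flagged value depends on whether it is a data error or a genuine observation, so the flag column is kept separate from the capped column here.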
4. Multicollinearity Checking for Regression Models
Commonly used methods to detect multicollinearity in a dataset:
a. Correlation Matrix
b. Variance Inflation Factor (VIF)
c. Eigenvalues or Condition Indices
d. Tolerance
Once multicollinearity is detected, there are several strategies to address or correct it:
a. Feature Selection: Drop redundant variables based on domain knowledge, statistical tests, or feature selection techniques (e.g., backward elimination, LASSO, or stepwise regression).
b. Data Collection: Collect additional data to reduce the correlation between variables. More diverse and independent data can help mitigate multicollinearity issues.
c. Data Transformation: Transform the variables to reduce the multicollinearity. Centering reduces the structural multicollinearity introduced by interaction or polynomial terms, and nonlinear transformations (e.g., logarithmic or power transformations) or combining correlated variables into one feature can help in some cases. Rescaling alone (standardization or normalization) does not change the correlation between two distinct variables.
d. Principal Component Analysis (PCA): Perform dimensionality reduction using techniques like PCA to create orthogonal components that capture most of the variation in the data while removing the correlation between predictors.
e. Ridge Regression or Lasso Regression: Use regularization techniques like ridge regression or lasso regression, which shrink the regression coefficients and reduce the impact of multicollinearity on their variance.
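As a rough illustration of the VIF check listed above, the sketch below computes VIF directly from its definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on the remaining features. The helper name `vif` and the synthetic columns are assumptions made for this example; tolerance is simply 1 / VIF.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of X."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        A = np.column_stack([np.ones(len(y)), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic data: x2 is nearly a copy of x1, x3 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)

vif_values = vif(np.column_stack([x1, x2, x3]))
tolerance = 1.0 / vif_values
print(vif_values)
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity, though the cutoff is a convention rather than a strict rule.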
5. Feature Engineering
1. Encoding Categorical Data
a. Label Encoding
b. One-Hot Encoding
c. Special Case: pandas get_dummies()
2. Feature Scaling
a. Standardization
b. Normalization
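6. Dimensionality Reduction Techniques
Dimensionality Reduction Techniques for Regression:
1. Principal Component Analysis (PCA)
2. Partial Least Squares Regression (PLS)
3. Ridge Regression
4. Lasso Regression
5. Elastic Net
Items 3 to 5 are regularized regression models rather than projections. They limit model complexity by shrinking coefficients, and Lasso and Elastic Net can remove features entirely by setting their coefficients to zero.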
Dimensionality Reduction Techniques for Classification:
1. Principal Component Analysis (PCA)
2. Linear Discriminant Analysis (LDA)
3. Quadratic Discriminant Analysis (QDA) (a classifier related to LDA; unlike LDA, it does not produce a lower-dimensional projection)
4. Partial Least Squares Discriminant Analysis (PLS-DA)
5. Kernel Principal Component Analysis (KPCA)
Dimensionality Reduction Techniques for Clustering:
1. Principal Component Analysis (PCA)
2. Non-Negative Matrix Factorization (NMF)
3. t-Distributed Stochastic Neighbour Embedding (t-SNE)
4. Autoencoders [How do they work for feature selection?]
5. Independent Component Analysis (ICA)
Some specialized dimensionality reduction techniques:
1. Independent Component Analysis (ICA)
2. Factor Analysis
3. Non-Negative Matrix Factorization (NMF)
4. Manifold Learning Techniques
5. Random Projection
6. Feature Agglomeration
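PCA appears in every list above, so a short sketch of it may help. The data here is synthetic and built on purpose so that five observed features are driven by two hidden factors; the variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 5 observed features generated from 2 hidden factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.01 * rng.normal(size=(300, 5))

# Project onto the first two principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

# Share of the total variance kept by the two components.
explained = pca.explained_variance_ratio_.sum()
print(X_reduced.shape, round(float(explained), 4))
```

On real data the number of components is usually chosen by looking at the cumulative explained variance rather than fixed in advance, and features are typically standardized first because PCA is sensitive to scale.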
7. Feature Selection
a. Filter Methods
1. Variance Threshold
2. Univariate Selection
Univariate Selection Statistical Tests for Regression:
• F-Test (ANOVA)
• Pearson's Correlation Coefficient
• Spearman's Rank Correlation Coefficient
• Partial Correlation
Univariate Selection Tests for Classification:
• Chi-Square Test
• ANOVA (Analysis of Variance)
Univariate Selection Criteria for Clustering (cluster validity indices, compared with and without a given feature):
• Silhouette Score
• Dunn Index
b. Embedded methods
Embedded methods perform feature selection as part of model training: the model itself determines which features matter, for example through L1 regularization or tree-based importance scores. The lists below also include embedding (representation learning) techniques such as PCA, LDA, and t-SNE. These are dimensionality reduction techniques and are not specifically designed for feature selection, but they can contribute to it indirectly by learning representations that capture the essential information in the input data.
Embedded and Embedding Methods for Regression:
• Principal Component Analysis (PCA)
• L1, L2 Regularization
• Random Forest Importance
Embedded and Embedding Methods for Classification:
• Linear Discriminant Analysis (LDA)
Embedded and Embedding Methods for Clustering:
• t-Distributed Stochastic Neighbour Embedding (t-SNE)
• Self-Organizing Maps (SOM)
• Hierarchical Clustering
The embedding techniques in these lists are not inherently feature selection techniques. They transform the original features into a lower-dimensional representation that captures the important structure and patterns in the data, which can indirectly lead to feature selection by focusing on the most informative aspects of the data. Their primary purpose, however, is dimensionality reduction and representation learning rather than explicit feature selection.
c. Wrapper Methods
• Forward Feature Selection
• Backward Feature Selection
• Exhaustive Feature Selection
• Recursive Feature Elimination
Wrapper Methods for Regression:
• Recursive Feature Elimination (RFE)
Wrapper Methods for Classification:
• Recursive Feature Elimination (RFE)
Wrapper Methods for Clustering:
• Sequential Feature Selection
• Genetic Algorithms (GA)
d. Common Feature Selection Methods
Feature Selection Techniques for Regression:
1. Univariate Selection
2. Recursive Feature Elimination (RFE)
3. Lasso Regression
4. Ridge Regression
5. Elastic Net
6. Feature Importance from Tree-Based Models
7. Forward/Backward Stepwise Selection
8. Principal Component Analysis (PCA)
Feature Selection Techniques for Classification:
1. Univariate Selection
2. Recursive Feature Elimination (RFE)
3. Lasso Logistic Regression
4. Ridge Logistic Regression
5. Elastic Net
6. Feature Importance from Tree-Based Models
7. Forward/Backward Stepwise Selection
8. Principal Component Analysis (PCA)
9. SelectKBest
10. SelectFromModel
Feature Selection Techniques for Clustering:
1. Silhouette Coefficient
2. Dunn Index
3. Gap Statistic
4. Elbow Method
5. Feature Importance from Clustering Algorithms (e.g., K-means, Hierarchical Clustering)
6. Mutual Information
7. Random Forest Importance
8. Feature Agglomeration
9. Principal Component Analysis (PCA)
10. Independent Component Analysis (ICA)
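To make the difference between a filter method and a wrapper method concrete, here is a small sketch using scikit-learn's SelectKBest and RFE on a synthetic classification problem. The dataset parameters and the choice of three features are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Synthetic problem: 10 features, of which 3 carry signal.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0, random_state=0
)

# Filter method: score each feature on its own with the ANOVA F-test, keep the top 3.
filter_selector = SelectKBest(score_func=f_classif, k=3).fit(X, y)

# Wrapper method: repeatedly fit a model and drop the weakest feature until 3 remain.
wrapper_selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3).fit(X, y)

print("Filter keeps:", filter_selector.get_support(indices=True))
print("Wrapper keeps:", wrapper_selector.get_support(indices=True))
```

The filter is cheap because it never trains a model, while the wrapper costs more but accounts for how features behave together inside the chosen estimator.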
8. Data Augmentation Techniques
Determine the resampling strategy: Decide how you want to resample the existing data to generate new samples. There are several methods you can consider:
• Random sampling: Randomly select a subset of existing samples without replacement. Applied to the majority class, this can help balance class distributions if you have imbalanced data.
• Bootstrapping: Randomly sample existing samples with replacement, allowing some samples to be selected multiple times. This can introduce variability and diversity into the dataset.
• SMOTE (Synthetic Minority Over-sampling Technique): Generate synthetic samples by interpolating between neighboring samples from the minority class. This is useful for addressing class imbalance.
• ADASYN (Adaptive Synthetic Sampling): Similar to SMOTE, ADASYN generates synthetic samples for the minority class, but it focuses on samples that are difficult to classify.
• Gaussian mixture models: Fit a Gaussian mixture model to the existing data and sample new data points from the learned model. This can be effective when the data is well approximated by a mixture of Gaussian components.
• Kernel density estimation: Estimate the probability density function (PDF) of the existing data using kernel density estimation and sample new data points from the estimated PDF.
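Two of the strategies above, bootstrapping and SMOTE-style interpolation, can be sketched with NumPy alone. The function `smote_like` is a simplified, hypothetical stand-in written for this example; it is not the implementation found in any particular library, and the minority-class samples are synthetic.

```python
import numpy as np

rng = np.random.default_rng(42)
X_min = rng.normal(loc=0.0, scale=1.0, size=(20, 2))  # hypothetical minority-class samples

# Bootstrapping: sample row indices with replacement.
idx = rng.integers(0, len(X_min), size=len(X_min))
X_boot = X_min[idx]

def smote_like(X, n_new, k, rng):
    """Create n_new synthetic rows by interpolating toward one of the k nearest neighbours."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]    # k nearest neighbours of each row
    base = rng.integers(0, len(X), size=n_new)       # starting point of each synthetic row
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random(size=(n_new, 1))                # position along the connecting segment
    return X[base] + gap * (X[chosen] - X[base])

X_syn = smote_like(X_min, n_new=30, k=5, rng=rng)
print(X_boot.shape, X_syn.shape)
```

Because each synthetic point lies on a segment between two real minority samples, the new data stays inside the region the minority class already occupies.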
9. Balancing the Data Set for Classification Models
1. Random Undersampling: Randomly removes instances from the majority class to match the number of instances in the minority class. This can help balance the class distribution but may result in loss of information.
2. Random Oversampling: Randomly duplicates instances from the minority class to increase its representation in the dataset. This technique can lead to overfitting if not carefully applied.
3. SMOTE (Synthetic Minority Over-sampling Technique): SMOTE generates synthetic instances in the minority class by interpolating between existing instances. This technique helps to increase the minority class representation and address class imbalance.
4. ADASYN (Adaptive Synthetic Sampling): ADASYN is an extension of SMOTE that adaptively generates synthetic instances based on the difficulty of learning examples in the minority class. It focuses on regions where the decision boundary is more ambiguous.
5. Class Weighting: Assigning higher weights to instances in the minority class or lower weights to instances in the majority class during model training can help address class imbalance. This gives more importance to the minority class during the learning process.
6. Ensemble Methods: Ensemble methods, such as Bagging and Boosting, can be effective in dealing with class imbalance, particularly when combined with resampling or class weights (for example, a Random Forest trained on balanced bootstrap samples). The ensemble does not rebalance the class distribution by itself, but it can improve performance on the minority class.
7. Cluster-Based Oversampling: This technique involves clustering the minority class instances and oversampling within each cluster. It aims to create diverse synthetic instances that better represent the underlying data distribution.
8. Cost-Sensitive Learning: Assigning different misclassification costs to different classes during training can help the model focus more on the minority class and reduce bias towards the majority class.
It's important to note that the choice of data balancing technique depends on the specific problem, dataset, and algorithm being used. Experimentation and evaluation of different techniques are often necessary to find the most effective approach for a particular scenario.
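As one possible illustration of random undersampling and class weighting from the list above, the sketch below uses NumPy and scikit-learn on a synthetic 90/10 imbalanced data set. The data and variable names are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Synthetic imbalanced data: 90 majority (class 0) and 10 minority (class 1) samples.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(90, 2)), rng.normal(2, 1, size=(10, 2))])
y = np.array([0] * 90 + [1] * 10)

# Random undersampling: keep as many majority rows as there are minority rows.
majority_idx = np.flatnonzero(y == 0)
minority_idx = np.flatnonzero(y == 1)
keep = np.concatenate([
    rng.choice(majority_idx, size=len(minority_idx), replace=False),
    minority_idx,
])
X_under, y_under = X[keep], y[keep]

# Class weighting: weights inversely proportional to class frequency.
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)

# The same weighting can be applied inside the model instead of resampling.
model = LogisticRegression(class_weight="balanced").fit(X, y)
print(np.bincount(y_under), weights)
```

Resampling should be applied to the training split only, so that the test set keeps the class distribution the model will face in practice.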
10. Train/Test Data Set Splitting
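A minimal sketch of the splitting step, assuming scikit-learn and a hypothetical 80/20 class ratio. The `stratify` argument keeps that ratio the same in both splits, which matters for imbalanced classification data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 2 features, 20% positive class.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

# Hold out 20% of the rows for testing, preserving the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible, which keeps later model comparisons fair.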
11. Model selection
12. Understand the Problem: Start by gaining a deep understanding of the problem you are trying to solve. Determine whether it is a regression, classification, or clustering problem based on the nature of the target variable and the desired outcome.
13. Analyse the Data: Perform exploratory data analysis (EDA) to understand the characteristics of the dataset. Identify the types of features (continuous, categorical, etc.), check for missing values, assess the distribution of the target variable, and evaluate the balance of classes (for classification tasks).
14. Consider the Dataset Size: Consider the size of your dataset. For small datasets, simpler models may be preferred to avoid overfitting, while larger datasets can potentially handle more complex models.
15. Evaluate Model Assumptions: Different models make different assumptions about the data. For example, linear regression assumes a linear relationship between the features and the target variable. Ensure that the assumptions of the chosen model hold for the problem and the dataset.
16. Select Model Types: Based on the problem type, consider the appropriate model types for regression, classification, or clustering tasks. Some common options include linear regression, decision trees, random forests, support vector machines (SVM), logistic regression, naive Bayes, k-nearest neighbors (KNN), and various clustering algorithms (k-means, hierarchical clustering, etc.).
17. Consider Complexity and Interpretability: Assess the complexity and interpretability requirements for your problem. Some models, like linear regression or decision trees, offer simplicity and interpretability, while others, such as neural networks, may provide more predictive power at the cost of being less interpretable.
18. Perform Model Comparison: Compare the performance of different models using appropriate evaluation metrics. For regression tasks, metrics like mean squared error (MSE) or R-squared can be used. For classification tasks, metrics like accuracy, precision, recall, or F1 score are commonly used. For clustering tasks, metrics like silhouette score or within-cluster sum of squares can be employed.
Use of GridSearchCV:
You can use GridSearchCV in the following situations:
1. Model Selection: When you are unsure about which model or algorithm to use, GridSearchCV can help you compare multiple models by searching for the best hyperparameters for each model and evaluating their performance.
2. Hyperparameter Tuning: Even if you have selected a specific model, you may still need to find the best combination of hyperparameters for that model. GridSearchCV allows you to define a grid of hyperparameters and exhaustively search through all possible combinations to find the optimal set.
3. Limited Hyperparameter Space: When the hyperparameter space is relatively small and computationally feasible to search through, GridSearchCV can be used to ensure that no combination of hyperparameters is missed.
4. Baseline Model Evaluation: GridSearchCV can serve as a baseline for evaluating model performance. By comparing the performance of the models with different hyperparameter combinations, you can assess the best achievable performance and use it as a benchmark for further experimentation.
5. Reproducibility and Transparency: GridSearchCV provides a systematic and reproducible way of evaluating and comparing models, as it considers all combinations of hyperparameters. This ensures transparency in the model selection and hyperparameter tuning process.
It's important to note that GridSearchCV can be computationally expensive, especially for larger hyperparameter spaces or larger datasets. In such cases, techniques like RandomizedSearchCV or Bayesian optimization can be used to sample a subset of hyperparameter combinations more efficiently.
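A small GridSearchCV run might look like the sketch below. The estimator (an SVM), the Iris data set, and the grid values are all choices made for this illustration rather than recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Illustrative grid: 3 values of C times 2 kernels = 6 combinations.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Each combination is scored with 5-fold cross-validation.
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The cost grows multiplicatively with each added hyperparameter: this grid already trains 6 x 5 = 30 models, not counting the final refit on the full data.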
19. Model Training
20. Model Evaluation
Model Evaluation Techniques for Regression:
• R-Squared (R2)
• Adjusted R-Squared
• Mean Squared Error (MSE)
• Root Mean Squared Error (RMSE)
• Mean Squared Logarithmic Error (MSLE)
• Absolute Error
• Mean Absolute Error (MAE)
• Mean Absolute Percentage Error (MAPE)
• Residual Standard Error (RSE)
• Mean Absolute Deviation (MAD)
• Maximum Residual Error (MRE)
• Root Relative Squared Error (RRSE)
• Bayesian Information Criterion (BIC)
• Mallows's Cp
• Correlation Coefficient
Model Evaluation Techniques for Classification:
• Accuracy
• Confusion Matrix
• Precision
• Recall (Sensitivity)
• Specificity
• F1 Score
• Receiver Operating Characteristic (ROC) Curve
• Hamming Loss
• Jaccard Score
• Cross-Entropy Loss
Model Evaluation Techniques for Clustering:
• Silhouette Score
• Within-Cluster Sum of Squares (WCSS)
• Davies-Bouldin Index
• Adjusted Rand Index (ARI)
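A few of the regression and classification metrics listed above can be computed with scikit-learn as sketched below. The true and predicted values are made up for the example, and `p = 1` (one predictor) is an assumption needed for the adjusted R-squared formula.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, mean_absolute_error,
    mean_squared_error, precision_score, r2_score, recall_score,
)

# Regression: made-up true and predicted values.
y_true_reg = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_reg = np.array([2.0, 5.0, 8.0, 9.0])
mse = mean_squared_error(y_true_reg, y_pred_reg)
rmse = float(np.sqrt(mse))
mae = mean_absolute_error(y_true_reg, y_pred_reg)
r2 = r2_score(y_true_reg, y_pred_reg)
n, p = len(y_true_reg), 1  # p: assumed number of predictors
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Classification: made-up binary labels.
y_true_clf = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_clf = np.array([1, 0, 0, 1, 0, 1, 1, 0])
cm = confusion_matrix(y_true_clf, y_pred_clf)
tn, fp, fn, tp = cm.ravel()
accuracy = accuracy_score(y_true_clf, y_pred_clf)
precision = precision_score(y_true_clf, y_pred_clf)
recall = recall_score(y_true_clf, y_pred_clf)  # sensitivity: tp / (tp + fn)
specificity = tn / (tn + fp)                   # true negative rate
f1 = f1_score(y_true_clf, y_pred_clf)

print(mse, rmse, mae, r2, adj_r2)
print(accuracy, precision, recall, specificity, f1)
```

Recall and specificity describe the two classes separately, which is why accuracy alone can be misleading on imbalanced data.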
21. Model Cross-Validation
Cross-Validation Techniques for Regression and Classification:
• K-Fold Cross-Validation
• Stratified K-Fold Cross-Validation
• Leave-One-Out Cross-Validation (LOOCV)
• Leave-One-Group-Out Cross-Validation
• Nested Cross-Validation
• Time-Series Cross-Validation
• Repeated K-Fold Cross-Validation
Cross-Validation Techniques for Clustering:
• Cluster-Focused Cross-Validation
• Cross-Validation with Silhouette Score
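Stratified K-fold cross-validation, one of the techniques listed above, can be sketched as follows. The model, the data set, and the choice of five folds are assumptions for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)

# 5 folds, each keeping roughly the same class proportions as the full data.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# One accuracy score per fold; the mean and spread summarize the model.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(scores.round(3), round(float(scores.mean()), 3))
```

Reporting the spread across folds alongside the mean gives a sense of how much the score depends on the particular split.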
22. Model Comparison Tests
Model Comparison Tests for Regression:
• F-Test or Analysis of Variance (ANOVA)
• t-Test
Model Comparison Tests for Classification:
• McNemar's Test
• Paired t-Test
Model Comparison Tests for Clustering:
• Rand Index
• Fowlkes-Mallows Index
• Dunn Index
• Calinski-Harabasz Index
• Silhouette Score
• Gap Statistic
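McNemar's test compares two classifiers on the same test set by looking only at the samples where they disagree. The sketch below uses the chi-square approximation with continuity correction; the helper name `mcnemar` and the toy predictions are assumptions for this example, and for small disagreement counts an exact binomial version is generally preferred.

```python
import numpy as np
from scipy.stats import chi2

def mcnemar(y_true, pred_a, pred_b):
    """Return (statistic, p_value) for McNemar's test with continuity correction."""
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    a_ok = pred_a == y_true
    b_ok = pred_b == y_true
    b = int(np.sum(a_ok & ~b_ok))   # model A right, model B wrong
    c = int(np.sum(~a_ok & b_ok))   # model A wrong, model B right
    if b + c == 0:                  # the models never disagree
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    return stat, float(chi2.sf(stat, df=1))

# Toy example: model A is always right, model B is wrong on the first 12 of 30 samples.
y_true = np.zeros(30, dtype=int)
pred_a = np.zeros(30, dtype=int)
pred_b = np.array([1] * 12 + [0] * 18)

stat, p_value = mcnemar(y_true, pred_a, pred_b)
print(round(stat, 3), round(p_value, 4))
```

A small p-value suggests the two models' error rates differ by more than chance would explain on this test set.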
23. Model Optimization and Tuning
24. Model Deployment and Monitoring
