Basic Machine Learning Procedure

Introduction

Machine learning is the transformative frontier of artificial intelligence that empowers computers to learn from data and make intelligent decisions without explicit programming. It's the driving force behind countless innovations, from self-driving cars and personalized recommendations to medical diagnoses and fraud detection. At its core, machine learning mimics the human learning process, allowing algorithms to evolve and adapt, continuously improving their performance. In this era of data abundance, machine learning is the compass guiding us through the vast digital landscape, unlocking valuable insights, automating complex tasks, and reshaping industries across the globe. Join me on a journey through the intricacies and possibilities of this remarkable field, where data becomes the ink, algorithms the quill, and intelligence the masterpiece.

Procedure Involved in Machine Learning

1. Business Problem Understanding & Problem Formulation  

2. Data Collection According to Business Problem

3. Exploratory Data Analysis

a. Data Loading [ Dealing with different types of file formats and Errors while Loading]

b. Lowering the Data

Data Cleaning

c. Duplicates checking

d. Treating Missing Values in Dependent and Independent Variables 

i. Checking For Missing Values

ii. Handle Missing Data

iii. Deleting Missing Values OR

iv. Imputing Missing Values

a. Replacing with Arbitrary Values

b. Frequent Category Imputation [ Filling with Mode ]

c. Replacing with Mean

d. Replacing with Median

e. Replacing with Previous Values – Forward Fill

f. Replacing with Next Value – Backward Fill

g. Imputing Using Interpolation

h. Imputing with scikit-learn's Univariate and Multivariate Approaches
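The checking, deleting, and imputing options listed above can be sketched with pandas and scikit-learn. This is a minimal illustration rather than a recommended recipe: the toy columns `age`, `city`, and `income` are hypothetical, and `SimpleImputer` and `KNNImputer` are used here as one example each of a univariate and a multivariate imputer.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical toy data with gaps in two columns
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0, np.nan],
    "city": ["Pune", "Delhi", None, "Delhi", "Delhi"],
    "income": [30.0, 42.0, 50.0, 65.0, 70.0],
})

# i. Check for missing values in each column
missing_counts = df.isna().sum()

# iii. Delete every row that contains a missing value
dropped = df.dropna()

# iv. Impute instead of deleting
mean_filled = df["age"].fillna(df["age"].mean())
median_filled = df["age"].fillna(df["age"].median())
mode_filled = df["city"].fillna(df["city"].mode()[0])  # frequent category
ffilled = df["age"].ffill()                            # previous value
bfilled = df["age"].bfill()                            # next value
interpolated = df["age"].interpolate()                 # linear interpolation

# Univariate: each column is filled from its own statistics
simple = SimpleImputer(strategy="median").fit_transform(df[["age"]])

# Multivariate: a gap is filled using rows that are similar in other columns
knn = KNNImputer(n_neighbors=2).fit_transform(df[["age", "income"]])
```

Which strategy fits depends on why the values are missing, so it is worth comparing a few of them on the actual data.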

e. Outlier Detection and Treatment

1. Visualisation Methods

a. Detecting Outliers Using a Histogram

b. Detecting Outliers Using a Boxplot

c. Detecting Outliers Using a Scatter Plot

2. Statistical Methods

a. Z-Score Method for Normally Distributed Data

b. IQR Method for Skewed Data

c. Percentile Based Approach

d. Special Case – Capping with Winsorizer method 
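The statistical methods above can be sketched with NumPy alone. The data here is synthetic (200 draws around 50 plus one injected value of 120), and the thresholds (|z| > 3, 1.5 × IQR, 1st and 99th percentiles) are common conventions assumed for illustration, not fixed rules.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic sample with one obvious outlier appended at the end
values = np.append(rng.normal(loc=50, scale=5, size=200), 120.0)

# Z-score method: flag points far from the mean in standard-deviation units
z_scores = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z_scores) > 3]

# IQR method: flag points outside 1.5 * IQR beyond the quartiles
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Percentile-based capping (winsorizing): clip instead of dropping
p1, p99 = np.percentile(values, [1, 99])
capped = np.clip(values, p1, p99)
```

Capping keeps the row count unchanged, which can matter when other columns in the same rows are still useful.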

4. Multicollinearity Checking for Regression Models

Here are some commonly used methods to detect multicollinearity in a given dataset:

a. Correlation Matrix

b. Variance Inflation Factor (VIF)

c. Eigenvalues or Condition Indices

d. Tolerance
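As a rough sketch of the detection methods above, the following computes a correlation matrix, VIF, and tolerance with NumPy. The helper `vif` is a hypothetical function written for this example: it regresses each column on the others and uses VIF = 1 / (1 - R²), with tolerance as the reciprocal. The data is synthetic, with `x2` built as a near copy of `x1`.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])  # intercept + other columns
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

rng = np.random.default_rng(0)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)  # nearly a copy of x1
x3 = rng.normal(size=300)                  # unrelated to the others
X = np.column_stack([x1, x2, x3])

corr = np.corrcoef(X, rowvar=False)  # pairwise correlation matrix
vifs = vif(X)
tolerance = 1.0 / vifs
```

A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a warning sign, but the cut-off is a convention rather than a strict test.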

Once multicollinearity is detected, there are several strategies to address or correct it:

a. Feature Selection: You can select variables based on domain knowledge, statistical tests, or feature selection techniques (e.g., backward elimination, LASSO, or stepwise regression).

b. Data Collection: Collect additional data to reduce the correlation between variables. More diverse and independent data can help mitigate multicollinearity issues.

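c. Data Transformation: Transform the variables to reduce the multicollinearity. Common techniques include centering (which helps when the model contains polynomial or interaction terms) and mathematical transformations (e.g., logarithmic or power transformations). Note that purely linear rescaling, such as standardization or min-max normalization, does not change the correlation between two variables.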

d. Principal Component Analysis (PCA): Perform dimensionality reduction using techniques like PCA to create orthogonal components that capture most of the variation in the data while minimizing multicollinearity.

e. Ridge Regression or Lasso Regression: Use regularization techniques like ridge regression or lasso regression, which can shrink the regression coefficients and reduce the impact of multicollinearity.

5. Feature Engineering

1. Encoding Categorical Data

a. Label Encoding

b. One-Hot-Encoding

c. Special Case – get_dummies()

2. Feature Scaling 

a. Standardization

b. Normalization
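The encoding and scaling options above can be illustrated on a tiny frame. The columns `size` and `price` are hypothetical; the calls are standard pandas and scikit-learn preprocessing APIs.

```python
import pandas as pd
from sklearn.preprocessing import (
    LabelEncoder, OneHotEncoder, StandardScaler, MinMaxScaler,
)

df = pd.DataFrame({
    "size": ["S", "M", "L", "M"],
    "price": [10.0, 20.0, 40.0, 30.0],
})

# Label encoding: one integer per category (classes are sorted: L=0, M=1, S=2)
labels = LabelEncoder().fit_transform(df["size"])

# One-hot encoding: one binary column per category
one_hot = OneHotEncoder().fit_transform(df[["size"]]).toarray()

# pandas shortcut that produces the same kind of indicator columns
dummies = pd.get_dummies(df["size"])

# Standardization: zero mean, unit variance
standardized = StandardScaler().fit_transform(df[["price"]])

# Normalization (min-max): rescale to the range [0, 1]
normalized = MinMaxScaler().fit_transform(df[["price"]])
```

Label encoding implies an order between categories, so one-hot encoding is usually the safer choice for nominal features fed to linear models.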

6. Dimensionality Reduction Techniques

Dimensionality Reduction Techniques for Regression:

1. Principal Component Analysis (PCA)

2. Partial Least Squares Regression (PLS)

3. Ridge Regression

4. Lasso Regression

5. Elastic Net

Dimensionality Reduction Techniques for Classification:

1. Principal Component Analysis (PCA)

2. Linear Discriminant Analysis (LDA)

3. Quadratic Discriminant Analysis (QDA)

4. Partial Least Squares Discriminant Analysis (PLS-DA)

5. Kernel Principal Component Analysis (KPCA)

Dimensionality Reduction Techniques for Clustering:

1. Principal Component Analysis (PCA)

2. Non-Negative Matrix Factorization (NMF)

3. t-Distributed Stochastic Neighbour Embedding (t-SNE)

4. Autoencoders [ How do they work for feature selection? ]

5. Independent Component Analysis (ICA)

Some specialized dimensionality reduction techniques that are commonly used in specific scenarios:

6. Independent Component Analysis (ICA)

7. Factor Analysis

8. Non-Negative Matrix Factorization (NMF)

9. Manifold Learning Techniques

10. Random Projection

11. Feature Agglomeration
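PCA appears in every list above, so here is a minimal sketch of it using scikit-learn's bundled Iris data. Scaling before PCA and keeping two components are choices made for this illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# PCA is sensitive to feature scale, so standardize first
X_scaled = StandardScaler().fit_transform(X)

# Project the 4 original features onto 2 orthogonal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# Share of the total variance captured by each component
explained = pca.explained_variance_ratio_
```

Inspecting `explained` is a simple way to decide how many components are worth keeping.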

7. Feature Selection

a. Filter Methods

1. Variance Threshold

2. Univariate Selection

Univariate Selection Statistical Tests for Regression:

F-Test (ANOVA)

Pearson's Correlation Coefficient 

Spearman's Rank Correlation Coefficient

Partial Correlation

Univariate Selection Tests for Classification

Chi-Square Test

ANOVA (Analysis of Variance)

Univariate Selection Tests for Clustering:

Silhouette Score 

Dunn Index
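The two filter methods named above (variance threshold and univariate selection) can be sketched on the Iris data for a classification task. The threshold of 0.2 and `k=2` are arbitrary values picked for the example, and `f_classif` is scikit-learn's ANOVA F-test scorer.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold, SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Variance threshold: drop features whose variance is below 0.2
vt = VarianceThreshold(threshold=0.2)
X_vt = vt.fit_transform(X)
kept_by_variance = vt.get_support(indices=True)

# Univariate selection: keep the 2 features with the highest ANOVA F-score
skb = SelectKBest(score_func=f_classif, k=2).fit(X, y)
selected = skb.get_support(indices=True)
f_scores = skb.scores_
```

Filter methods score each feature without training the final model, which makes them cheap but blind to feature interactions.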

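b. Embedded Methods

Embedded methods perform feature selection as part of model training: L1 regularization drives some coefficients to zero, and tree-based models such as random forests report feature importances. The lists below also include embedding techniques (PCA, LDA, t-SNE, and similar), which are typically applied as part of dimensionality reduction and are not specifically designed for feature selection. However, they can indirectly contribute to feature selection by learning meaningful representations that capture the essential information of the input data.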

Embedding Methods for Regression:

Principal Component Analysis (PCA)

L1, L2 Regularization

Random Forest importance

Embedding Methods for Classification:

Linear Discriminant Analysis (LDA)

Embedding Methods for Clustering:

t-Distributed Stochastic Neighbour Embedding (t-SNE)

Self-Organizing Maps (SOM)

Hierarchical Clustering

While these embedding methods are not inherently feature selection techniques, they can be used to transform the original features into a lower-dimensional representation that captures the important structure and patterns in the data. This can indirectly lead to feature selection by focusing on the most informative aspects of the data. However, it's important to note that the primary purpose of embedding methods is dimensionality reduction and data representation learning rather than explicit feature selection.

c. Wrapper Methods

Forward Feature Selection

Backward Feature Selection

Exhaustive Feature Selection

Recursive Feature Elimination

Wrapper Methods for Regression:

Recursive Feature Elimination (RFE)

Wrapper Methods for Classification:

Recursive Feature Elimination (RFE)

Wrapper Methods for Clustering:

Sequential Feature Selection

Genetic Algorithms (GA)
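Recursive feature elimination and forward selection from the lists above can be sketched with scikit-learn. Logistic regression as the wrapped estimator and two target features are assumptions made for the example; any estimator that exposes coefficients or importances could be wrapped by `RFE`.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
estimator = LogisticRegression(max_iter=1000)

# Recursive feature elimination: repeatedly drop the weakest feature
rfe = RFE(estimator, n_features_to_select=2).fit(X, y)
rfe_mask = rfe.support_     # True for the features that were kept
rfe_ranking = rfe.ranking_  # 1 = kept, larger = eliminated earlier

# Forward selection: start empty and add the feature that helps most
sfs = SequentialFeatureSelector(
    estimator, n_features_to_select=2, direction="forward"
).fit(X, y)
sfs_mask = sfs.get_support()
```

Wrapper methods retrain the model many times, so they tend to be more expensive than filter methods on wide datasets.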

d. Common Feature Selection Methods 

Feature Selection Techniques for Regression:

1. Univariate Selection

2. Recursive Feature Elimination (RFE)

3. Lasso Regression

4. Ridge Regression

5. Elastic Net

6. Feature Importance from Tree-Based Models

7. Forward/Backward Stepwise Selection

8. Principal Component Analysis (PCA)

Feature Selection Techniques for Classification:

1. Univariate Selection

2. Recursive Feature Elimination (RFE)

3. Lasso Logistic Regression

4. Ridge Logistic Regression

5. Elastic Net

6. Feature Importance from Tree-Based Models

7. Forward/Backward Stepwise Selection

8. Principal Component Analysis (PCA)

9. SelectKBest

10. SelectFromModel

Feature Selection Techniques for Clustering:

1. Silhouette Coefficient

2. Dunn Index

3. Gap Statistic

4. Elbow Method

5. Feature Importance from Clustering Algorithms (e.g., K-means, Hierarchical Clustering)

6. Mutual Information

7. Random Forest Importance

8. Feature Agglomeration

9. Principal Component Analysis (PCA)

10. Independent Component Analysis (ICA)

8. Data augmentation Techniques

Determine the resampling strategy: Decide how you want to resample the existing data to generate new samples. There are several methods you can consider:

• Random sampling: Randomly select a subset of the existing samples without replacement. This does not create new data points, but it can help balance class distributions if you have imbalanced data, for example by drawing fewer samples from the majority class.

• Bootstrapping: Randomly sample existing samples with replacement, allowing some samples to be selected multiple times. This can introduce variability and diversity into the dataset.

• SMOTE (Synthetic Minority Over-sampling Technique): Generate synthetic samples by interpolating between neighboring samples from the minority class. This is useful for addressing class imbalance.

• ADASYN (Adaptive Synthetic Sampling): Similar to SMOTE, ADASYN generates synthetic samples for the minority class, but it focuses on samples that are difficult to classify.

• Gaussian mixture models: Fit a Gaussian mixture model to the existing data and sample new data points from the learned model. This can be effective when the data is reasonably well approximated by a mixture of Gaussian components.

• Kernel density estimation: Estimate the probability density function (PDF) of the existing data using kernel density estimation and sample new data points from the estimated PDF.
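Two of the resampling strategies above, bootstrapping and SMOTE-style interpolation, can be sketched in plain NumPy. The function `smote_like` is a hypothetical, simplified stand-in written for this example and is not the reference SMOTE implementation (the imbalanced-learn package provides one); the minority-class points are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical minority-class samples: 20 points with 2 features
minority = rng.normal(loc=0.0, scale=1.0, size=(20, 2))

# Bootstrapping: draw row indices with replacement
boot_idx = rng.integers(0, len(minority), size=len(minority))
bootstrap = minority[boot_idx]

def smote_like(X, n_new, k, rng):
    """Interpolate between a sample and one of its k nearest neighbours."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(X), size=n_new)
    pick = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))     # position along the segment, in [0, 1)
    return X[base] + gap * (X[pick] - X[base])

synthetic = smote_like(minority, n_new=30, k=5, rng=rng)
```

Because every synthetic point lies on a segment between two real points, the new samples stay inside the region already covered by the minority class.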

9. Balancing the Data set for classification models

1. Random Undersampling: Randomly removes instances from the majority class to match the number of instances in the minority class. This can help balance the class distribution but may result in loss of information.

2. Random Oversampling: Randomly duplicates instances from the minority class to increase its representation in the dataset. This technique can lead to overfitting if not carefully applied.

3. SMOTE (Synthetic Minority Over-sampling Technique): SMOTE generates synthetic instances in the minority class by interpolating between existing instances. This technique helps to increase the minority class representation and address class imbalance.

4. ADASYN (Adaptive Synthetic Sampling): ADASYN is an extension of SMOTE that adaptively generates synthetic instances based on the difficulty of learning examples in the minority class. It focuses on regions where the decision boundary is more ambiguous.

5. Class Weighting: Assigning higher weights to instances in the minority class or lower weights to instances in the majority class during model training can help address class imbalance. This gives more importance to the minority class during the learning process.

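6. Ensemble Methods: Ensemble methods, such as Bagging and Boosting, can be effective in dealing with class imbalance. Techniques like Random Forest or AdaBoost do not rebalance the classes by themselves, but combined with class weights or with resampling of each training subset they can reduce the bias towards the majority class and improve model performance.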

7. Cluster-Based Oversampling: This technique involves clustering the minority class instances and oversampling within each cluster. It aims to create diverse synthetic instances that better represent the underlying data distribution.

8. Cost-Sensitive Learning: Assigning different misclassification costs to different classes during training can help the model focus more on the minority class and reduce bias towards the majority class.

It's important to note that the choice of data balancing technique depends on the specific problem, dataset, and algorithm being used. Experimentation and evaluation of different techniques are often necessary to find the most effective approach for a particular scenario.
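Random oversampling and class weighting from the list above can be sketched as follows. The dataset is synthetic (roughly 90% / 10% classes from `make_classification`), and logistic regression is just one example of an estimator that accepts `class_weight`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# Synthetic imbalanced data: about 90% class 0 and 10% class 1
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_maj, X_min = X[y == 0], X[y == 1]

# Random oversampling: duplicate minority rows until the classes match
X_min_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=0)
X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_min_up))

# Class weighting: keep the data as is and reweight the loss instead
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
```

Resampling should be applied to the training split only, otherwise duplicated rows can leak into the test set and inflate the scores.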

10. Train, Test Data sets splitting
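A minimal sketch of the split, assuming scikit-learn's `train_test_split` and the bundled Iris data. The 80/20 ratio, the fixed `random_state`, and stratifying on the target are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% for testing; stratify keeps class proportions equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```

Any scaling, imputation, or resampling is best fitted on `X_train` alone and then applied to `X_test`.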

11. Model selection

12. Understand the Problem: Start by gaining a deep understanding of the problem you are trying to solve. Determine whether it is a regression, classification, or clustering problem based on the nature of the target variable and the desired outcome.

13. Analyse the Data: Perform exploratory data analysis (EDA) to understand the characteristics of the dataset. Identify the types of features (continuous, categorical, etc.), check for missing values, assess the distribution of the target variable, and evaluate the balance of classes (for classification tasks).

14. Consider the Dataset Size: Consider the size of your dataset. For small datasets, simpler models may be preferred to avoid overfitting, while larger datasets can potentially handle more complex models.

15. Evaluate Model Assumptions: Different models make different assumptions about the data. For example, linear regression assumes a linear relationship between the features and the target variable. Ensure that the chosen model aligns with the assumptions of the problem and the dataset.

16. Select Model Types: Based on the problem type, consider the appropriate model types for regression, classification, or clustering tasks. Some common options include linear regression, decision trees, random forests, support vector machines (SVM), logistic regression, naive Bayes, k-nearest neighbors (KNN), and various clustering algorithms (k-means, hierarchical clustering, etc.).

17. Consider Complexity and Interpretability: Assess the complexity and interpretability requirements for your problem. Some models, like linear regression or decision trees, offer simplicity and interpretability, while others, such as neural networks, may provide more complexity and predictive power but may be less interpretable.

18. Perform Model Comparison: Compare the performance of different models using appropriate evaluation metrics. For regression tasks, metrics like mean squared error (MSE) or R-squared can be used. For classification tasks, metrics like accuracy, precision, recall, or F1 score are commonly used. For clustering tasks, metrics like silhouette score or within-cluster sum of squares can be employed.

Use of GridSearchCV:

You can use GridSearchCV in the following situations:

1. Model Selection: When you are unsure about which model or algorithm to use, GridSearchCV can help you compare multiple models by searching for the best hyperparameters for each model and evaluating their performance.

2. Hyperparameter Tuning: Even if you have selected a specific model, you may still need to find the best combination of hyperparameters for that model. GridSearchCV allows you to define a grid of hyperparameters and exhaustively search through all possible combinations to find the optimal set.

3. Limited Hyperparameter Space: When the hyperparameter space is relatively small and computationally feasible to search through, GridSearchCV can be used to ensure that no combination of hyperparameters is missed.

4. Baseline Model Evaluation: GridSearchCV can serve as a baseline for evaluating model performance. By comparing the performance of the models with different hyperparameter combinations, you can assess the best achievable performance and use it as a benchmark for further experimentation.

5. Reproducibility and Transparency: GridSearchCV provides a systematic and reproducible way of evaluating and comparing models, as it considers all combinations of hyperparameters. This ensures transparency in the model selection and hyperparameter tuning process.

It's important to note that GridSearchCV can be computationally expensive, especially for larger hyperparameter spaces or larger datasets. In such cases, techniques like RandomizedSearchCV or Bayesian optimization can be used to sample a subset of hyperparameter combinations more efficiently.
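As a small illustration of hyperparameter tuning with GridSearchCV, the sketch below searches six combinations for a support vector classifier on the Iris data. The estimator, the grid values, and 5-fold accuracy scoring are assumptions chosen to keep the example short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# 3 values of C x 2 kernels = 6 candidate combinations
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_params = search.best_params_  # combination with the best mean CV score
best_score = search.best_score_    # that mean cross-validated accuracy
n_candidates = len(search.cv_results_["params"])
```

The cost grows as the product of the grid sizes times the number of folds, which is why RandomizedSearchCV is often preferred for larger grids.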

19. Model Training 

20. Model evaluation 

Model Evaluation Techniques for Regression:

• R-Squared (R2) 

• Adjusted R-Squared 

• Mean Squared Error (MSE) 

• Root Mean Squared Error (RMSE) 

• Mean Squared Logarithmic Error (MSLE)

• Absolute Error 

• Mean Absolute Error (MAE)

• Mean Absolute Percentage Error (MAPE)

• Residual Standard Error (RSE) 

• Mean Absolute Deviation (MAD)

• Maximum Residual Error (MRE)

• Root Relative Squared Error (RRSE) 

• Bayesian Information Criteria (BIC) 

• Mallows’s Cp 

• Correlation Coefficient 
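A few of the regression metrics above, computed on four made-up predictions so the numbers can be checked by hand. The adjusted R² assumes a model with a single predictor (`p = 1`), which is an assumption of this example.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

mse = mean_squared_error(y_true, y_pred)   # mean of squared errors
rmse = np.sqrt(mse)                        # same units as the target
mae = mean_absolute_error(y_true, y_pred)  # mean of absolute errors
r2 = r2_score(y_true, y_pred)              # share of variance explained

# Adjusted R-squared penalizes extra predictors: n samples, p predictors
n, p = len(y_true), 1
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```

Here the squared errors are 0.25, 0.25, 0 and 1, so MSE is 0.375, MAE is 0.5, and R² is 1 - 1.5 / 20 = 0.925.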

Model Evaluation Techniques for Classification:

• Accuracy

• Confusion Matrix

• Precision

• Recall (Sensitivity) and Specificity

• F1 Score

• Receiver Operating Characteristic (ROC) Curve

• Hamming Loss

• Jaccard Score

• Cross Entropy Loss
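The main classification metrics above can all be derived from the confusion matrix. The labels below are made up for a binary problem; note that recall (sensitivity) is the true positive rate, while specificity is the true negative rate and is computed separately.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score,
)

# Hypothetical binary labels: 1 = positive class, 0 = negative class
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN), i.e. sensitivity
specificity = tn / (tn + fp)                 # TN / (TN + FP)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
```

On imbalanced data, accuracy alone can look good while recall on the minority class is poor, so it helps to read these metrics together.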

Model Evaluation Techniques for Clustering:

• Silhouette Score

• Within-Cluster Sum of Squares (WCSS)

• Davies-Bouldin Index

• Adjusted Rand Index (ARI)
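The four clustering metrics above can be computed for a k-means fit as in the sketch below. The data is synthetic (three well-separated blobs with centres chosen for this example), and the Adjusted Rand Index is only available here because the generating labels are known.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    adjusted_rand_score, davies_bouldin_score, silhouette_score,
)

# Three well-separated synthetic clusters
X, true_labels = make_blobs(
    n_samples=300, centers=[[0, 0], [5, 5], [-5, 5]],
    cluster_std=0.6, random_state=0,
)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

wcss = km.inertia_                                  # within-cluster sum of squares
sil = silhouette_score(X, km.labels_)               # in [-1, 1], higher is better
dbi = davies_bouldin_score(X, km.labels_)           # lower is better
ari = adjusted_rand_score(true_labels, km.labels_)  # needs ground-truth labels
```

Silhouette, WCSS, and Davies-Bouldin are internal measures that need no labels, whereas ARI is an external measure.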

21. Model Cross Validation

Cross-Validation Techniques for Regression and Classification:

• K-Fold Cross-Validation

• Stratified K-Fold Cross-Validation

• Leave-One-Out Cross-Validation (LOOCV)

• Leave-One-Group-Out Cross-Validation

• Nested Cross-Validation

• Time-Series Cross-Validation

• Repeated K-Fold Cross-Validation
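K-fold and stratified k-fold from the list above can be sketched with `cross_val_score`. Logistic regression, five folds, and shuffling with a fixed seed are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Plain k-fold: folds ignore the class labels
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
kfold_scores = cross_val_score(model, X, y, cv=kfold)

# Stratified k-fold: each fold keeps the overall class proportions
strat = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
strat_scores = cross_val_score(model, X, y, cv=strat)
```

Reporting the mean and spread of the fold scores gives a steadier estimate than a single train/test split.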

Cross-Validation Techniques for Clustering:

• Cluster-Focused Cross-Validation

• Cross-Validation with Silhouette Score

22. Model Comparing Tests

Model Comparing Tests for Regression:

• F-Test or Analysis of Variance (ANOVA)

• t-Test

Model Comparing Tests for Classification:

• McNemar's Test

• Paired t-Test
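McNemar's test compares two classifiers on the same test set using only the cases where they disagree. The sketch below uses the chi-square form with continuity correction; the helper `mcnemar` and the counts `b=15`, `c=5` are hypothetical. For one degree of freedom, the chi-square tail probability equals `erfc(sqrt(stat / 2))`, so only the standard library is needed.

```python
import math

def mcnemar(b, c):
    """McNemar's chi-square test with continuity correction.

    b: samples model A classified correctly and model B incorrectly
    c: samples model A classified incorrectly and model B correctly
    """
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical disagreement counts between two classifiers
stat, p_value = mcnemar(b=15, c=5)
```

With these counts the statistic is (10 - 1)² / 20 = 4.05, and a small p-value suggests the two models have different error rates. For very few disagreements, an exact binomial version of the test is usually preferred.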

Model Comparing Tests for Clustering:

• Rand Index

• Fowlkes-Mallows Index

• Dunn Index

• Calinski-Harabasz Index

• Silhouette Score

• Gap Statistic

23. Model optimization and tuning

24. Model deployment and monitoring 

