
4. Exploratory Data Analysis [ EDA ] _ Part 2 [ Checking for Duplicate Values ]

Note: Before going forward, please read the previous article, 3. Exploratory Data Analysis _ Part 1 [ How to Load Data & Lowering the Data ], for a better understanding.
In this section we are going to discuss:
1. Data Cleaning
2. Checking for Duplicate Values in a Dataset

What is data cleaning in EDA?

Data cleaning in Exploratory Data Analysis (EDA) is the process of identifying and addressing issues or anomalies in the raw data to ensure its accuracy, consistency, and reliability. The purpose of data cleaning is to prepare the data for analysis by removing errors, inconsistencies, and irrelevant information that could potentially distort the results of the analysis.

Key aspects of data cleaning in EDA include the following (a short pandas sketch illustrating a few of these steps appears after the list):

  1. Handling Missing Values: Identifying and addressing missing values in the dataset. This may involve imputing missing values using statistical methods, removing rows or columns with missing values, or making informed decisions about the impact of missing data on the analysis.
  2. Dealing with Outliers: Identifying and addressing outliers, which are extreme values that can significantly affect statistical measures. Depending on the nature of the data, outliers may be corrected, removed, or treated in a way that aligns with the goals of the analysis.
  3. Addressing Duplicates: Identifying and removing duplicate records or entries to avoid redundancy in the dataset, ensuring that each observation is unique.
  4. Consistency Checks: Verifying the consistency of data, such as checking for discrepancies between different variables or columns. Inconsistent data may be corrected or investigated further.
  5. Handling Data Format Issues: Ensuring that data is in the correct format for analysis, including dealing with issues such as mismatched data types, incorrect units, or inconsistent formatting.
  6. Addressing Data Integrity Issues: Verifying the accuracy and integrity of data by checking for logical inconsistencies or errors. This may involve cross-referencing data with external sources or applying domain knowledge to identify anomalies.
  7. Standardizing and Transforming Data: Standardizing variables, converting units, or transforming data to meet the requirements of the analysis. This may include normalizing numerical variables or encoding categorical variables.
  8. Handling Skewed Distributions: Identifying and addressing skewed distributions in the data, as skewed data can impact the performance of certain statistical models. Transformation techniques may be applied to achieve a more symmetrical distribution.
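
To make these steps concrete, here is a minimal pandas sketch covering a few of them: handling missing values, removing duplicates, fixing data types, and transforming a skewed column. The file name sample_data.csv and the column names (age, customer_id, signup_date, income) are hypothetical placeholders for illustration only, not taken from any dataset used in this series.

import numpy as np
import pandas as pd

# Hypothetical file and column names, used only for illustration
df = pd.read_csv("sample_data.csv")

# 1. Handling missing values: fill a numeric column with its median
#    and drop rows where a key identifier is missing
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(subset=["customer_id"])

# 3. Addressing duplicates: keep only the first occurrence of each row
df = df.drop_duplicates()

# 5. Handling data format issues: enforce the expected data types
df["signup_date"] = pd.to_datetime(df["signup_date"], errors="coerce")
df["customer_id"] = df["customer_id"].astype(str)

# 8. Handling skewed distributions: log-transform a right-skewed column
df["income_log"] = np.log1p(df["income"])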

Data cleaning is an essential step in the data analysis process because the quality of the analysis is highly dependent on the quality of the input data. By addressing issues during the data cleaning phase, analysts ensure that subsequent exploratory data analysis and modelling are based on reliable and accurate information, leading to more meaningful and trustworthy insights.

a. Checking for duplicate values:

Why do we check duplicates?

Checking for duplicates during Exploratory Data Analysis (EDA) is important for several reasons:

  1. Data Quality Assurance:

    • Duplicate records can introduce errors and inconsistencies in the analysis. Identifying and removing duplicates ensures that the data is of high quality and free from redundancies.
  2. Data Integrity:

    • Duplicate entries can compromise the integrity of the dataset. By removing duplicates, you maintain the accuracy and reliability of the data, preventing skewed results and misleading conclusions.
  3. Avoiding Redundancy:

    • Duplicate records contribute unnecessary redundancy to the dataset. Removing duplicates streamlines the data, making it more concise and efficient for analysis without repetitive information.
  4. Ensuring Unique Observations:

    • In many analyses, each observation or data point should be unique to avoid overcounting or undercounting specific cases. Checking for duplicates ensures that each record represents a distinct observation.
  5. Preventing Biased Analysis:

    • Duplicate entries can bias statistical measures, leading to inaccurate results. For example, mean and standard deviation calculations may be skewed if the same data point is repeated multiple times (see the short example after this list).
  6. Accurate Aggregation:

    • In cases where aggregation or summarization is required, duplicates can lead to incorrect calculations. Ensuring uniqueness is crucial for accurate aggregations, such as calculating averages or totals.
  7. Improving Computational Efficiency:

    • Removing duplicates can improve computational efficiency, especially when performing complex analyses. Reducing the dataset size by eliminating redundancies can lead to faster processing times.
  8. Enhancing Data Understanding:

    • Identifying and resolving duplicates contribute to a clearer understanding of the dataset. It allows analysts to focus on unique cases and patterns, leading to more meaningful insights.
  9. Avoiding Data Entry Errors:

    • Duplicate entries may result from data entry errors or system glitches. Identifying and rectifying duplicates at the EDA stage helps maintain data accuracy and rectify any errors in the data collection process.
  10. Compliance and Reporting:

    • In some industries or applications, compliance requirements mandate the removal of duplicates to ensure accurate reporting. This is particularly important in fields like finance, healthcare, and legal where precision is crucial.
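
As a quick illustration of point 5, the tiny example below (with made-up numbers) shows how a single duplicated row shifts the mean of a column:

import pandas as pd

# Hypothetical scores; the last row is an accidental duplicate of the first
scores = pd.DataFrame({"score": [10, 20, 30, 10]})

print(scores["score"].mean())                     # 17.5 (biased by the repeated value)
print(scores.drop_duplicates()["score"].mean())   # 20.0 (each observation counted once)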

In summary, checking for duplicates in EDA is a fundamental step in data preparation that contributes to the overall reliability, accuracy, and integrity of the data. It sets the foundation for meaningful analysis and ensures that subsequent insights and decisions are based on a clean and accurate representation of the underlying information.



When the .duplicated() function is applied to a DataFrame, it returns a Boolean Series: a row is marked "True" if it is a duplicate of an earlier row, and "False" otherwise. Inspecting each row of this output by eye is not practical for a dataset of any real size.
To recognize duplicates more easily, follow the approach below.
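
A minimal sketch of this behaviour, using a small hypothetical DataFrame, looks like this:

import pandas as pd

# A small hypothetical dataset in which row 2 repeats row 0
df = pd.DataFrame({"name": ["A", "B", "A", "C"],
                   "score": [10, 20, 10, 30]})

print(df.duplicated())
# 0    False
# 1    False
# 2     True   <- duplicate of row 0
# 3    False
# dtype: bool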

To count the duplicates:
If we wish to count the duplicates present in the dataset, we can chain the .sum() function onto .duplicated(), as demonstrated below. It returns the number of duplicate rows; if there are none, it returns 0.
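
Continuing with the same hypothetical DataFrame as above, counting and then removing the duplicates looks like this:

import pandas as pd

# Same hypothetical dataset as above
df = pd.DataFrame({"name": ["A", "B", "A", "C"],
                   "score": [10, 20, 10, 30]})

# Count the duplicate rows: True counts as 1, False as 0
print(df.duplicated().sum())   # 1

# Remove them, keeping the first occurrence of each row
df = df.drop_duplicates()
print(df.duplicated().sum())   # 0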

For more information, contact: venkatesh.mungi.datascientist@gmail.com
