Data plays a crucial role in the machine learning pipeline. It undergoes various processes, such as data munging, blending, and cleaning, collectively known as data wrangling. This step converts raw data into valuable, reliable information.
Data wrangling, combined with exploratory data analysis, ensures the data is complete and reliable before generating insights and training machine learning algorithms. A machine learning or data engineer spends a significant amount of time on data wrangling.
What is Data Wrangling?
Data wrangling is the process of cleaning and transforming raw data into a valuable form so we can extract meaningful insights. The wrangled data then feeds into data aggregation, data visualization, and the training of machine learning models.
We can perform data wrangling manually or automatically.
Why do we use Data Wrangling?
Data wrangling is a crucial part of machine learning algorithm development. We use data wrangling for the following purposes:
- Improve data quality.
- Remove noise from data.
- Handle missing data.
- Restructure data.
- Change data formats.
- Clean, enrich, and transform the data.
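As a minimal sketch of what these purposes look like in practice, here is a small pandas example (the column names and values are hypothetical) that removes duplicates, cleans noisy strings, fixes a data type, and handles missing data:

```python
import pandas as pd

# Hypothetical raw data with the issues listed above:
# noisy whitespace, a missing value, a duplicate row, and a wrong dtype.
raw = pd.DataFrame({
    "age": ["34", "41", None, "41"],
    "city": ["NY ", "LA", "SF", "LA"],
})

clean = (
    raw.drop_duplicates()                                # remove exact duplicate rows
       .assign(city=lambda d: d["city"].str.strip(),     # clean noisy whitespace
               age=lambda d: pd.to_numeric(d["age"]))    # fix the data type
       .dropna(subset=["age"])                           # handle missing data by dropping
       .reset_index(drop=True)
)
print(clean)
```

Real pipelines chain many more such steps, but each one is typically a small, explicit operation like these.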
Challenges in Data Wrangling
Data wrangling comes with its own set of challenges. Some common challenges in data wrangling include:
Incomplete Data: Missing values in data are a significant challenge. During data wrangling, you decide whether to ignore missing data or replace it with another value.
Inconsistent Data: Data from different sources can come in different formats, units, or scales. For example, values in a date column may not be in a date format. We need to address these inconsistencies during data wrangling.
Outliers: Outliers in data can distort machine learning parameters and statistical testing results. During data wrangling, you decide whether to remove an outlier or replace it with another value.
Duplicate Data: We identify and handle duplicate values to get accurate insights.
Reproducibility: We need to ensure data wrangling steps are reproducible for the smooth working of the machine learning pipeline.
Steps in Data Wrangling
- Data Discovery
- Understand Data Source
- Data Inspection
- Data Structuring
- Data Transformation
- Data Integration
- Data Cleaning
- Handle Missing Values
- Handle Duplicates
- Remove Outliers
- Data Enriching
- Feature Engineering
- Data Validation
- Publish Data
Data Discovery in Data Wrangling
Data discovery in data wrangling is the process of exploring and understanding the characteristics of raw data before cleaning and transforming it. This helps in gaining insights and finding potential issues in the data. Here is the list of steps in data discovery during data wrangling.
Understand the Data Source
The first step in data wrangling is to understand the data source and how the data is collected. This gives you a complete picture of the data pipeline, so you can analyze any problem that comes up.
Data Inspection
The next step is to inspect the data to understand its characteristics. Exploratory data analysis is one of the best ways to inspect data. It provides the following inputs:
- Data Profiling: Generate summary statistics to get an overview of the data's distribution and central tendency.
- Visualize and understand data using histograms, box plots, scatter plots, and correlation matrices.
- Identify the missing values.
- Recognize the data types.
- Detect outliers.
- Understand relationships in the data.
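As a small sketch of these inspection steps (the dataset below is hypothetical), pandas provides one-line operations for profiling, missing-value counts, and data types:

```python
import pandas as pd
import numpy as np

# Hypothetical dataset used only to illustrate the inspection steps.
df = pd.DataFrame({
    "price": [10.0, 12.5, np.nan, 11.0, 250.0],   # contains a missing value and an outlier
    "category": ["a", "b", "a", "b", "a"],
})

summary = df["price"].describe()   # data profiling: count, mean, std, quartiles
missing = df.isna().sum()          # identify missing values per column
dtypes = df.dtypes                 # recognize the data types

print(summary)
print(missing)
print(dtypes)
```

The quartiles from `describe()` are also a quick first pass at spotting the outlier (250.0) that a box plot would show visually.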
Data Structuring
Data structuring in data wrangling is the process of organizing and arranging raw data into a more usable form for the required analysis. Typical tasks include reshaping tables, parsing raw formats, and splitting or combining columns.
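Since the concrete structuring tasks depend on the target analysis, here is a minimal sketch of one common task, reshaping wide data into a long (tidy) format with pandas. The column names are hypothetical:

```python
import pandas as pd

# Wide format: one row per store, one column per year (hypothetical data).
wide = pd.DataFrame({
    "store": ["A", "B"],
    "sales_2022": [100, 150],
    "sales_2023": [120, 160],
})

# Restructure into long format: one row per (store, year) observation.
long = wide.melt(id_vars="store", var_name="year", value_name="sales")
long["year"] = long["year"].str.replace("sales_", "", regex=False).astype(int)
print(long)
```

The long format is usually what downstream grouping, plotting, and modelling code expects.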
Data Transformation
Data transformation is a crucial step in the data preparation pipeline for machine learning. It involves changing the input data's format, structure, or values to make it more suitable for machine learning algorithms. Here is the list of steps in data transformation:
- Convert categorical data into numerical form.
- Scale, normalize, and standardize numerical data.
- Implement a strategy to deal with missing data.
- Handle outliers.
- Handle time series data.
- Engineer features.
- Handle text data.
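Two of the most common transformations above can be sketched in a few lines of pandas (the feature names are hypothetical): one-hot encoding a categorical column and standardizing a numerical one:

```python
import pandas as pd

# Hypothetical training data with one categorical and one numerical feature.
df = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "height": [150.0, 160.0, 170.0],
})

# Convert categorical data into numerical form (one-hot encoding).
encoded = pd.get_dummies(df, columns=["color"])

# Standardize the numerical column to zero mean and unit variance.
mean, std = encoded["height"].mean(), encoded["height"].std(ddof=0)
encoded["height"] = (encoded["height"] - mean) / std

print(encoded)
```

In a real pipeline you would fit the mean and standard deviation on the training split only and reuse them for validation and test data, to avoid leakage.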
Data Integration
Data integration in data wrangling involves combining data from multiple sources into a restructured whole. The goal of data integration is to create a single dataset that we can use for detailed analysis or for training machine learning algorithms.
Before integration, you should understand each data source and have the data transformed, although nothing prevents further transformation after integration. Here is the list of steps in data integration for data wrangling:
- Understand data structure from each source.
- Make sure all preprocessing is done on data.
- Standardize the data across all sources. For example, ensure data from all sources uses the same units.
- Identify and resolve any differences in the schema or structure of data from different sources.
- Merge multiple datasets into a single dataset.
- Resolve conflicts and discrepancies between data from different sources.
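The merge step above can be sketched with pandas. The two source tables and their column names are hypothetical; here the key column already matches across sources, so no schema reconciliation is needed:

```python
import pandas as pd

# Two hypothetical sources describing the same customers.
orders = pd.DataFrame({"customer_id": [1, 2, 2], "amount": [30.0, 45.0, 10.0]})
profiles = pd.DataFrame({"customer_id": [1, 2, 3], "country": ["US", "DE", "FR"]})

# Merge into a single dataset for analysis; a left join keeps every order
# even if a customer has no profile.
merged = orders.merge(profiles, on="customer_id", how="left")
print(merged)
```

Choosing the join type (`left`, `inner`, `outer`) is itself a conflict-resolution decision: it determines which unmatched records survive integration.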
Data Cleaning during Wrangling
Data cleaning is an important step in data wrangling for machine learning. It involves correcting errors in the data to ensure it is complete and accurate for analysis. Here is the list of steps involved in data cleaning.
Handle Missing Values
Missing values can make data analysis results inconsistent. These values either need to be eliminated or replaced with other values. Here is the list of steps involved in handling missing values.
- Determine missing values during data discovery or exploratory data analysis.
- Decide whether to remove the affected rows/columns or to replace the missing values with estimated values.
- Remove the chosen rows or columns.
- Replace the remaining missing values with estimated values.
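Both options can be sketched in pandas (the columns are hypothetical): dropping rows where a key column is missing, or imputing a numeric column with an estimate such as the median:

```python
import pandas as pd
import numpy as np

# Hypothetical sensor data with missing values in both columns.
df = pd.DataFrame({"temp": [21.0, np.nan, 23.0, np.nan],
                   "site": ["x", "y", None, "z"]})

# Option 1: remove rows where the key column is missing.
dropped = df.dropna(subset=["site"])

# Option 2: replace missing numeric values with an estimate (the median).
filled = df.copy()
filled["temp"] = filled["temp"].fillna(filled["temp"].median())

print(dropped)
print(filled)
```

Which option is right depends on how much data you can afford to lose and whether the missingness is random or systematic.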
Handle Duplicates in data
Duplicates in data may skew the analysis results and impact the machine learning algorithm. Here is the list of steps to handle the duplicates.
- Identify duplicates during data discovery.
- Make sure the flagged instances are genuine duplicates, not originals. Sometimes original data only looks like a duplicate.
- Decide which duplicate values you want to remove.
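These steps map directly onto pandas operations (the data below is hypothetical): inspect the flagged rows first, then decide which copy to keep before removing anything:

```python
import pandas as pd

# Hypothetical records with one exact duplicate row.
df = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "a@x.com"],
    "score": [10, 20, 10],
})

# Identify duplicates first, before removing anything.
dup_mask = df.duplicated()   # marks the second occurrence as a duplicate
print(df[dup_mask])

# Decide which copy to keep; here we keep the first occurrence.
deduped = df.drop_duplicates(keep="first").reset_index(drop=True)
print(deduped)
```

Inspecting `df[dup_mask]` before dropping is the "make sure they are genuine duplicates" step: two different people can legitimately share the same values in some columns.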
Handle Outliers in Data
Outliers in data can affect the results of training machine learning algorithms. Therefore, either transform or remove outliers before data analysis or model training. Here is the list of steps to handle outliers in data.
- Identify the outliers during data discovery.
- Decide whether to remove the outliers or transform them, according to the nature of the analysis.
- Transform or remove the outliers.
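One common way to implement these steps is the interquartile-range (IQR) rule, sketched below on a hypothetical series; clipping is shown as the "transform" option and filtering as the "remove" option:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 300])   # 300 is an obvious outlier

# Identify outliers with the interquartile-range (IQR) rule.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (s < lower) | (s > upper)

# Option 1: remove the outliers.  Option 2: clip (transform) them instead.
removed = s[~is_outlier]
clipped = s.clip(lower, upper)
print(removed.tolist(), clipped.tolist())
```

The IQR rule is only one heuristic; z-scores or domain knowledge may be more appropriate depending on the distribution.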
Data Enriching during Data Wrangling
Data enriching in data wrangling is about enhancing the existing data. We can achieve this by adding new information or by combining existing data. Here is the list of steps involved in data enriching:
Feature Engineering
Feature engineering in machine learning ensures we use only relevant data to build the machine learning model. It involves the following activities:
- Derive new features from the data.
- Modify or transform existing features.
- Remove irrelevant features from the data.
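As a small sketch of these activities (the columns and the derived feature are hypothetical), here is deriving a new feature from existing ones and dropping an irrelevant one:

```python
import pandas as pd

# Hypothetical health data; "id" is irrelevant for modelling.
df = pd.DataFrame({
    "height_m": [1.6, 1.8],
    "weight_kg": [60.0, 81.0],
    "id": [101, 102],
})

# Derive a new feature from existing ones.
df["bmi"] = df["weight_kg"] / df["height_m"] ** 2

# Remove an irrelevant feature.
features = df.drop(columns=["id"])
print(features)
```

Derived features like this often carry more signal for a model than the raw columns they were computed from.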
Extract important information from data
This activity involves extracting additional information from the data or from external sources to enrich the dataset.
Validation in Data Wrangling
Data validation ensures the data's quality, correctness, and suitability for training ML algorithms. This step is done after data cleaning and transformation to confirm the data meets the quality criteria.
The steps involved in data validation are similar to those performed during data inspection. Here is the list of steps in data validation:
- Check data for completeness.
- Ensure missing values are not present in the data.
- Verify the consistency of the data. For example, verify that all relevant feature columns are in the same format.
- Check the data for outliers.
- Ensure duplicates are not present in the data.
- Make sure the data meets the proposed ML model's assumptions.
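These checks can be written as plain assertions over the cleaned dataset. The dataset and the specific criteria below are hypothetical, but the pattern of a named check dictionary is a common lightweight approach:

```python
import pandas as pd

# A cleaned dataset that should now satisfy our quality criteria (hypothetical).
df = pd.DataFrame({"age": [25, 32, 41],
                   "income": [40_000.0, 52_000.0, 61_000.0]})

checks = {
    "no_missing": df.isna().sum().sum() == 0,          # completeness
    "no_duplicates": not df.duplicated().any(),        # uniqueness
    "age_in_range": df["age"].between(0, 120).all(),   # consistency
    "numeric_income": pd.api.types.is_float_dtype(df["income"]),  # format
}
assert all(checks.values()), f"validation failed: {checks}"
print("all validation checks passed")
```

Dedicated validation libraries exist for larger pipelines, but even a dictionary of named boolean checks makes failures easy to diagnose.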
Publishing
After data validation, the next step is to publish the data in the required format for further analysis. Publishing is the process of sharing the clean, transformed, and validated data along with its documentation. Here is the list of activities done during data publishing:
- Implement version control to track change in published data.
- Ensure the shared data is meeting the data privacy and security standards.
- Write data documentation describing the data sources, collection methods, and a list of all changes made to the data.
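A lightweight sketch of publishing with version tracking (the metadata fields are illustrative, not a standard): export the dataset and record a checksum alongside minimal documentation, so later versions can be compared:

```python
import hashlib
import json
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})

# Publish as CSV and record a checksum so later versions can be compared
# (a lightweight stand-in for full data version control).
csv_bytes = df.to_csv(index=False).encode()
checksum = hashlib.sha256(csv_bytes).hexdigest()

# Minimal documentation published alongside the data (fields are illustrative).
metadata = json.dumps({
    "source": "example source",
    "rows": len(df),
    "sha256": checksum,
})
print(metadata)
```

Dedicated tools such as DVC provide full data version control; the checksum-plus-metadata pattern above is the minimal idea they build on.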
Conclusion
In essence, preparing and refining raw data aids informed decision-making, accurate predictions, and meaningful insights. The quality of decision-making outcomes depends on the integrity of our data. Therefore, data wrangling plays a key role in ensuring the quality of machine learning algorithms.