Executive Summary
Every data science endeavor begins with source data that will hopefully provide insight into a question (business, technical, scientific, etc.). Each data set presents its own characteristic data quality issues that must be identified, characterized, and (if problematic) corrected or mitigated.
The objective of data preparation is to yield a data set that can be effectively analyzed and, if desired, used as a training resource to make predictions with machine learning methods.
The Google Slides shown below walk you through preparing a challenging dataset for statistical analysis and use in machine learning (shown on a separate page LINK HERE). Below the slides you will find links to the GitHub repo that holds the Python code (in notebook format) used in this project.
Project files include:
This slide presentation walks you through preparing a challenging dataset for statistical analysis and use in machine learning (shown in a separate project).
Create DataFrame from CSV data and navigate through the unprocessed data to begin to identify data issues
Check for accuracy, completeness, consistency, etc.
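As a minimal sketch of this first inspection step, the example below loads CSV text into a pandas DataFrame and summarizes each column. The inline sample data, its column names, and the `summarize` helper are hypothetical stand-ins for illustration; the project's actual CSV file and notebook code may differ.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the project's CSV file.
RAW_CSV = """name,city,salary,dept
Ann,New York,"$52,000",Sales
Bob,new york,48000,sales
Cara,,61000,Engineering
Bob,new york,48000,sales
"""

# read_csv accepts a path or any file-like object; StringIO keeps this self-contained.
df = pd.read_csv(io.StringIO(RAW_CSV))


def summarize(frame):
    """Return a per-column summary of dtype, missing count, and unique count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
        "unique": frame.nunique(),
    })


summary = summarize(df)
n_duplicates = int(df.duplicated().sum())  # fully repeated rows

print(df.head())
print(summary)
print("duplicate rows:", n_duplicates)
```

A summary like this tends to surface issues early: a salary column that is not numeric, a missing city, and category labels that differ only by capitalization.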
Data Pre-Processing
Cleaning, transforming, and possibly reducing the complexity of data
Correct spelling and naming format issues
Drop unnecessary columns
Check and fix datatypes for correct format
Convert monetary data to integers instead of a heterogeneous mix of strings and ints
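The pre-processing steps listed above could be sketched roughly as follows. The sample DataFrame, the column names, the "NYC" spelling fix, and the `money_to_int` helper are all assumptions made for illustration, not the project's actual data or code.

```python
import pandas as pd

# Hypothetical messy data: inconsistent names, mixed money formats, an unneeded column.
df = pd.DataFrame({
    " Employee Name": ["Ann", "Bob", "Cara"],
    "City": ["New York", "new york ", "NYC"],
    "Salary": ["$52,000", 48000, "61,000.00"],
    "Start Date": ["2020-01-15", "2019-07-01", "2021-03-30"],
    "Internal ID": [1, 2, 3],
})

# Normalize column names: trim whitespace, lowercase, use underscores.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Correct spelling and naming format issues in a text column.
df["city"] = df["city"].str.strip().str.title().replace({"Nyc": "New York"})

# Drop columns that are not needed for analysis.
df = df.drop(columns=["internal_id"])

# Fix datatypes: parse date strings into datetime values.
df["start_date"] = pd.to_datetime(df["start_date"])


def money_to_int(value):
    """Convert values such as '$52,000', '61,000.00', or 48000 to whole dollars."""
    text = str(value).replace("$", "").replace(",", "").strip()
    return int(round(float(text)))


# Clean monetary data so the whole column holds integers.
df["salary"] = df["salary"].map(money_to_int).astype("int64")

print(df)
print(df.dtypes)
```

Rounding to whole dollars is a simplification; if cents matter, storing integer cents or using a decimal type would be a safer choice.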
Check for Missing Data
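One way to check for missing data is to count and rank the gaps per column before deciding how to handle them. The sketch below uses hypothetical data, and the fill strategy (median for numbers, an "Unknown" label for categories) is an assumption chosen for illustration; the right choice depends on the dataset and the analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
df = pd.DataFrame({
    "salary": [52000.0, np.nan, 61000.0, 48000.0],
    "dept": ["Sales", "Sales", None, "Engineering"],
})

# Report how much is missing per column, as a count and a percentage.
missing_report = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": (df.isna().mean() * 100).round(1),
})

# Assumed mitigation: median for numeric gaps, a placeholder for categorical gaps.
filled = df.copy()
filled["salary"] = filled["salary"].fillna(filled["salary"].median())
filled["dept"] = filled["dept"].fillna("Unknown")

print(missing_report)
print(filled)
```

Dropping rows with `dropna` is another option when only a small share of rows are affected.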
Encode categorical variables
Identify categories by looking for unique values
Implement a nominal encoding scheme
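The encoding steps above can be illustrated with a short sketch. It assumes one-hot encoding via `pd.get_dummies` as the nominal scheme, which is a common choice but not necessarily the one used in the project; the `dept` column and its values are hypothetical.

```python
import pandas as pd

# Hypothetical categorical column with no natural ordering (nominal).
df = pd.DataFrame({"dept": ["Sales", "Engineering", "Sales", "Support"]})

# Identify the categories by looking at unique values and their frequencies.
categories = sorted(df["dept"].unique())
counts = df["dept"].value_counts()

# One-hot encode: one 0/1 indicator column per category, no implied order.
encoded = pd.get_dummies(df, columns=["dept"], prefix="dept", dtype=int)

print(categories)
print(counts)
print(encoded)
```

One-hot encoding avoids implying an order among categories, at the cost of adding one column per category, which can become unwieldy for high-cardinality fields.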
Outliers
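For the outlier step, a common starting point is the interquartile range (IQR) rule, sketched below. The source does not say which detection method the project uses, so the IQR rule, the 1.5 multiplier, and the sample values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical salary values with one suspicious entry.
salary = pd.Series([48000, 52000, 50000, 61000, 55000, 950000], name="salary")

# Compute the interquartile range.
q1 = salary.quantile(0.25)
q3 = salary.quantile(0.75)
iqr = q3 - q1

# Flag values more than 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (salary < lower) | (salary > upper)
outliers = salary[is_outlier]

print(f"bounds: [{lower}, {upper}]")
print(outliers)
```

Flagged values still need a judgment call: they may be data entry errors to correct or genuine extremes to keep.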
Tech Stack shown below
Sample Outcomes
Throughout the notebook and slides shared above you will see a progression from raw data to data that has been cleaned and prepared for statistical analysis and machine learning.
Below are a couple of screenshots of data before and after the cleaning methods.