Data Preparation

Executive Summary
Every data science project begins with source data that, ideally, will provide insight into a question (business, technical, scientific, etc.). Each data set presents its own characteristic data quality issues, which must be identified, characterized, and (if problematic) corrected or mitigated.
The objective of data preparation is to yield a data set that can be effectively analyzed and, if desired, used as a training resource to make predictions with machine learning methods.
The Google Slides presentation shown below walks through preparing a challenging dataset for statistical analysis and for use in machine learning (the machine learning work is shown in a separate project).
Below the slides you will find links to the GitHub repo that holds the Python code (in notebook format) used in this project.

Project files include:

Slide presentation - walks through preparing a challenging dataset for statistical analysis and for use in machine learning (shown in a separate project)

Related Jupyter notebook - GitHub repo with the code for this project

Data Preparation Methods:

Data Quality Assessment

Create a DataFrame from the CSV data and explore the unprocessed data to begin identifying data issues

Check for accuracy, completeness, consistency, etc.
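
The assessment steps above can be sketched roughly as follows. This is a minimal illustration, not the project's notebook code: the CSV content, the column names, and the helper assess_quality are all hypothetical stand-ins.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the project's source file.
RAW_CSV = """name,city,price
Alice,Boston,"$1,200"
Bob,boston,950
Carol,,"$2,000"
Bob,boston,950
"""


def assess_quality(df: pd.DataFrame) -> dict:
    """Summarize basic completeness and consistency indicators."""
    return {
        "n_rows": len(df),
        # Completeness: count of missing values in each column.
        "missing_per_column": df.isna().sum().to_dict(),
        # Consistency: count of rows that exactly repeat an earlier row.
        "duplicate_rows": int(df.duplicated().sum()),
        # Accuracy hint: a numeric field read in as text suggests mixed formats.
        "dtypes": df.dtypes.astype(str).to_dict(),
    }


raw_df = pd.read_csv(io.StringIO(RAW_CSV))
report = assess_quality(raw_df)
print(report)
```

In a real project, raw_df.head(), raw_df.info(), and raw_df.describe() are the usual next steps for browsing the unprocessed data.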

Data Pre-Processing

Cleaning, transforming, and possibly reducing the complexity of data

Correct spelling and naming format issues
Drop unnecessary columns
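
As a hedged sketch of the two cleaning steps above, the following uses a small made-up DataFrame. The column names, the misspelling map CITY_FIXES, and the choice of which column counts as unnecessary are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical data with inconsistent column names and spellings.
df = pd.DataFrame(
    {
        " Customer Name": ["alice", "Bob "],
        "City": ["Bostn", "boston"],
        "Internal ID": [101, 102],
    }
)

# Normalize column names: trim whitespace, lower-case, use underscores.
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")

# Normalize text values and correct known misspellings (assumed mapping).
CITY_FIXES = {"bostn": "boston"}
df["customer_name"] = df["customer_name"].str.strip().str.title()
df["city"] = df["city"].str.strip().str.lower().replace(CITY_FIXES).str.title()

# Drop a column that is assumed to be irrelevant to the analysis.
df = df.drop(columns=["internal_id"])
print(df)
```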

Check and fix datatypes for correct format

Clean up monetary data so that values are integers rather than a heterogeneous mix of strings and ints
Check for Missing Data
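
One possible way to handle the monetary cleanup and the missing-data check is sketched below. The sample values and the helper parse_money are hypothetical, and rounding to whole currency units is an assumption about what "integers" means here.

```python
import re

import pandas as pd

# Hypothetical column mixing formatted strings, plain ints, and gaps.
df = pd.DataFrame({"price": ["$1,200", 950, "2,000 USD", None, "n/a"]})


def parse_money(value):
    """Return the amount as a whole number, or None if it cannot be parsed."""
    if pd.isna(value):
        return None
    # Keep only digits, the decimal point, and a minus sign.
    cleaned = re.sub(r"[^0-9.\-]", "", str(value))
    try:
        return round(float(cleaned))
    except ValueError:
        return None


# The nullable "Int64" dtype holds integers alongside missing values.
df["price"] = pd.array([parse_money(v) for v in df["price"]], dtype="Int64")

# Count the missing values in each column after conversion.
missing_counts = df.isna().sum()
print(df.dtypes)
print(missing_counts)
```

Converting unparseable entries to missing values, rather than dropping them immediately, keeps the decision about imputing or removing rows as a separate, visible step.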

Encode categorical variables
Identify categories by looking for unique values

Implement a nominal encoding scheme
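
The source does not say which nominal encoding scheme the notebook uses. The sketch below assumes one-hot encoding with pandas, which is one common choice for unordered categories. The column names and values are made up.

```python
import pandas as pd

# Hypothetical data with one unordered categorical column.
df = pd.DataFrame(
    {"color": ["red", "blue", "red", "green"], "qty": [1, 2, 3, 4]}
)

# Identify the categories by listing the unique values.
categories = sorted(df["color"].unique())

# One-hot encode: one 0/1 indicator column per category.
encoded = pd.get_dummies(df, columns=["color"], prefix="color", dtype=int)
print(categories)
print(encoded)
```

One-hot encoding avoids implying an order among categories, at the cost of adding one column per category.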

Outliers
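
The source does not describe how outliers are handled. As an illustration only, the sketch below flags outliers with the interquartile range (IQR) rule, a common default. The helper iqr_outlier_mask, the multiplier k=1.5, and the sample values are assumptions.

```python
import pandas as pd


def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Hypothetical values with one obvious extreme.
prices = pd.Series([10, 12, 11, 13, 12, 11, 500])
mask = iqr_outlier_mask(prices)

# Flagged values can then be inspected, capped, or removed.
trimmed = prices[~mask]
print(prices[mask])
```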

Tech Stack shown below

Sample Outcomes
Throughout the notebook and slides shared above you will see a progression from raw data to data that has been cleaned and prepared for statistical analysis and machine learning.

Below are a couple of screenshots of the data before and after the cleaning methods were applied.

The Tech Stack for this project includes:

Python

NumPy

matplotlib

Pandas

Seaborn

scikit-learn

Pyplot

SciPy

Parquet