What is Data Cleaning?
Hello world, my name is Data Mining Mike, and I am going to teach you about the marvelous industry of data mining. "Data mining" is informal shorthand for data science. In this article, we cover data cleaning, the most important step in all of data science. Data cleaning, or "scrubbing", is the act of preparing data for exploration and statistical analysis: the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.
The following is a wave-top overview of how basic data cleaning is conducted, from the data source to the industry standards for statistical quality. The methods and example techniques shown here reflect common, trusted practice in the industry.
Cleaning data is an art and a science.
Maintain the scientific method as you execute the data cleaning process (call it a kata, a flow, or a movement).
Document the variables you use and explain why other variables / columns are not used. Below is a flow chart of the overall data mining workflow.
Step one is assembling the data, i.e., matching or joining on key identifier variables and piecing together a master-joined dataset.
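As a rough sketch of what that join might look like in Pandas (the file names and the customer_id key below are made up purely for illustration):

```python
import pandas as pd

# Hypothetical source files joined on a shared key identifier.
customers = pd.read_csv("customers.csv")
orders = pd.read_csv("orders.csv")

# Merge on the key identifier to piece together the master-joined dataset.
master = customers.merge(orders, on="customer_id", how="inner")
```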
It's good practice to explore the data as a spreadsheet in Excel; this allows the human eye to instantly notice data patterns and game-plan the cleaning issues as you scroll slowly from top to bottom and inspect the data. In some cases, it may be more convenient to quickly engineer a column in Excel rather than in Python.
Generally, we load the data as a CSV file into a Python environment to take advantage of cloud computing and specialized statistical and data manipulation libraries. From this point, it's about making that perfect 'square' in the data. That 'square' is like a Rubik's Cube waiting to be solved. The Pandas package allows for this manipulation of big data.
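A minimal sketch of loading the CSV and getting a first look at the 'square' (the file name is a placeholder):

```python
import pandas as pd

# Load the CSV into a DataFrame -- the 'square' we will be shaping.
df = pd.read_csv("master_dataset.csv")

# Quick look at the shape, column types, and first few rows.
print(df.shape)
print(df.dtypes)
print(df.head())
```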
The first step is identifying which variables are workable and germane; redundant and trivial variables are dropped. Then all missing or 'null' values need to be handled. Dropping the rows / instances with empty or null values is common practice, but sometimes you may need to fill the missing values with the mean or some other statistic, depending on the variable's importance and its sparsity percentage. Generally, variables with over 50% sparsity are dropped, or more data is brought in if the variable / feature is important. Next, drop all rows that are duplicate instances (paramount).
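Here is a hedged sketch of those steps in Pandas; the column names ("internal_notes", "row_id", "income") are invented for the example:

```python
# Drop redundant or trivial columns (names are placeholders).
df = df.drop(columns=["internal_notes", "row_id"])

# Drop columns with over 50% sparsity (keep only columns with >= 50% non-null values).
df = df.dropna(axis=1, thresh=int(len(df) * 0.5))

# Fill an important numeric column with its mean instead of dropping its rows.
df["income"] = df["income"].fillna(df["income"].mean())

# Drop any remaining rows with missing values, then drop duplicate rows.
df = df.dropna()
df = df.drop_duplicates()
```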
Depending on the scope, text columns are sometimes lowercased and stripped of all white space and special characters, taking care of other common data quality issues such as inconsistent state abbreviations and capitalization along the way.
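For example, a text column might be normalized like this (the "city" and "state" columns and the tiny state mapping are illustrative only):

```python
# Lowercase, strip whitespace, and remove special characters from a text column.
df["city"] = (
    df["city"]
    .str.lower()
    .str.strip()
    .str.replace(r"[^a-z0-9 ]", "", regex=True)
)

# Standardize state values to two-letter abbreviations (mapping is illustrative).
state_map = {"california": "CA", "texas": "TX", "virginia": "VA"}
df["state"] = df["state"].str.lower().str.strip().replace(state_map)
```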
At this point, the dataset should have no missing values. With feature importance in mind, we then convert all categorical variables to binary indicator columns, a form of feature engineering. Sometimes numeric columns need to be scaled, or new ratio columns are created via a formula. There may also be some iterative cleaning and inspecting between Python and Excel, depending on what issues surface along the way.
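A sketch of that feature engineering in Pandas, assuming made-up column names ("state", "product_type", "income", "total_spend", "order_count"):

```python
# Convert categorical columns to binary indicator (one-hot) columns.
df = pd.get_dummies(df, columns=["state", "product_type"])

# Min-max scale a numeric column to the 0-1 range.
df["income_scaled"] = (df["income"] - df["income"].min()) / (
    df["income"].max() - df["income"].min()
)

# Engineer a simple ratio feature from two existing columns.
df["spend_per_order"] = df["total_spend"] / df["order_count"]
```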
The following is an example of how text data is handled prior to processing: remove all the common stop words and special characters like '#', then strip any leading or trailing white space, reducing each text entry to one string instance per row in the dataset.
For example, the phrase "man's best friend" would be reduced to the single string literal "mansbestfriend". From this point, string literals can be further tokenized for Natural Language Processing.
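A minimal sketch of that text-cleaning step; the tiny stop-word set is for illustration only (real projects typically use a fuller list, such as the one that ships with NLTK):

```python
import re

# A tiny stop-word list for illustration.
stop_words = {"the", "a", "an", "is", "of"}

def clean_text(text):
    # Lowercase, drop special characters like '#' and apostrophes,
    # remove stop words, then collapse what is left into one string.
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)
    words = [w for w in text.split() if w not in stop_words]
    return "".join(words).strip()

print(clean_text("Man's best friend!"))  # -> "mansbestfriend"
```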
Most of a project's time is spent in data cleaning, shaping the data to fit the intended machine learning model.
Then we export the cleaned dataset back to Excel to inspect the outputs. Next comes exploring the probability distributions of all variables via visualizations, followed by generating plots of two or more variables against each other.
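One way this export-and-explore step could look (file and column names are placeholders):

```python
import matplotlib.pyplot as plt

# Export the cleaned dataset so it can be inspected in Excel.
df.to_csv("cleaned_dataset.csv", index=False)

# Probability distribution of each numeric variable.
df.hist(figsize=(10, 8))
plt.tight_layout()
plt.show()

# Two variables plotted against each other (column names are placeholders).
df.plot.scatter(x="income", y="total_spend")
plt.show()
```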
At this point, with all relevant variables cleaned, the data frame becomes a 'solved Rubik's Cube', which is a data product. It is like refining crude oil into jet fuel: the fuel the machines need to run clean predictions. Now the notional data is fully prepared for machine learning. The cleaned dataset is used to train a machine learning model to classify or predict one or more of the variables in the data.
From this point, the data could remain in a database, be sold, or be concatenated with other, bigger datasets.
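To make the hand-off concrete, here is a hedged sketch of feeding the cleaned data to a simple scikit-learn classifier; the "churned" target column is invented for the example:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# 'churned' is a made-up target column; swap in whichever variable you predict.
X = df.drop(columns=["churned"])
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # accuracy on the held-out data
```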
In the final analysis, we covered data cleaning and where it fits in the world of machine learning production. We covered where data can be sourced and common industry practices for cleaning it, and we showcased the end state required of all variables prior to statistical testing. We also covered the value of cleaned data using the Rubik's Cube analogy. After this article, you should be more confident in your ability to understand the basics of data cleaning, from the best in the business.
Read more great data discoveries in the articles below:
How many people have been killed in Tesla Fires?
What words get the most views in YouTube video titles?
Reverse engineer Google political ads with Big Data
What are the odds of winning a government contract?
What is a cooperative purchase?