K Nearest Neighbors in Python
Abstract
Bottom Line Up Front: How accurately can K Nearest Neighbors (KNN) classify hospital readmission, given a set of variables?
A major hospital chain has a business problem: reducing patient readmission rates. Outside agencies put financial pressure on hospitals with high readmission rates. The dataset was cleaned, standardized, and explored with the K Nearest Neighbors supervised machine learning algorithm. Several variables were selected and examined for their impact on the target variable “Readmission”. One goal of the analysis is to tune the KNN model by finding the optimum number of neighbors “K” for the selected variables, yielding the most accurate model possible with the given dataset. The model is then cross-validated on stratified samples to simulate how it will perform on unseen data.
K Nearest Neighbors is easy to implement and has few parameters to tune. KNN is a supervised machine learning algorithm that takes a dataset of labeled examples and, in this analysis, classifies a discrete categorical variable. KNN can be used for both classification and regression problems. KNN assumes that similar things exist in close proximity. It captures this idea of similarity by calculating the distance between points in feature space and labeling each point according to the points closest to it. The variable ‘K’ (the number of nearest labeled neighbors consulted for each unlabeled point) is chosen first. The labels of the K closest points then decide, by majority vote, what the label of the unlabeled point should be. With ‘K’ = 1 the prediction rests on a single neighbor, so the model is expected to be at its least stable and least accurate on unseen data. “Inversely, as we increase the value of K, our predictions become more stable due to majority voting / averaging, and thus, more likely to make more accurate predictions (up to a certain point). Eventually, we begin to witness an increasing number of errors. It is at this point we know we have pushed the value of K too far” (towardsdatascience.com). One of the goals of this analysis is to find the best-performing number of neighbors “K”.
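The distance-and-vote idea can be illustrated with a minimal sketch in plain Python. The function name knn_predict and the toy points are hypothetical and serve only as an illustration; the analysis itself relies on Sklearn's KNeighborsClassifier rather than this code.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest labeled points."""
    # Euclidean distance from the query to every labeled point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest points and count how often each label occurs
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two features per point, label 1 = readmitted
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_points, train_labels, (1.1, 1.0), k=3))  # -> 0
print(knn_predict(train_points, train_labels, (4.0, 4.0), k=3))  # -> 1
```

An odd K is used here so that a two-class vote cannot tie.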
Python is used because it has the libraries needed for the analysis, notably scikit-learn (Sklearn). Sklearn provides functions for preprocessing, model fitting, and evaluation, and it includes several machine learning algorithms, KNN among them. Useful Sklearn tools include StandardScaler() for scaling the data and GridSearchCV() for finding the optimum number of ‘K’. The Pandas package is used to load the raw data into a data frame that can be easily manipulated. The best_params_ attribute of the fitted grid search reports the number of neighbors ‘K’ that performed best. Lastly, the cross_val_score() function can report Area Under the Curve scores on held-out folds, which simulate unseen data. Cross-validation runs the KNN algorithm several times on different slices, or “folds”, of the dataset. The result is a ‘K’ that reduces the number of errors and improves the model’s reliability on unseen data. The outcome of the analysis will be a KNN model tuned for the best classification performance on an unseen dataset with the same variables as the medical dataset.
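As a rough sketch of how these Sklearn pieces fit together, the example below tunes ‘K’ with GridSearchCV() and then scores the tuned model with cross_val_score(). The data is synthetic (generated with make_classification), and the names X_demo, y_demo, folds, and search, as well as the grid of 1 to 30 neighbors and the 5 folds, are assumptions made for illustration, not the settings of this analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     cross_val_score, train_test_split)
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the medical dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo)

# Stratified folds keep the class ratio similar in every fold
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Try K = 1..30 on the training data only, so the test set stays unseen
search = GridSearchCV(KNeighborsClassifier(),
                      {"n_neighbors": np.arange(1, 31)}, cv=folds)
search.fit(X_tr, y_tr)
best_k = search.best_params_["n_neighbors"]
print("Best K:", best_k)

# Area Under the Curve on each fold for the tuned model
auc_scores = cross_val_score(KNeighborsClassifier(n_neighbors=best_k),
                             X_tr, y_tr, cv=folds, scoring="roc_auc")
print("Fold AUC scores:", auc_scores.round(3))
print("Held-out accuracy:", search.score(X_te, y_te))
```

Fitting the search on the training portion only is a design choice in this sketch: it keeps the final held-out score from being influenced by the tuning step.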
To achieve this end state, the dataset is first imported into the Python environment. The variables to be analyzed are: Population (Continuous), Children (Continuous), Marital_Married (Categorical), ReAdmis_Yes (Categorical), HighBlood_Yes (Categorical), Stroke_Yes (Categorical), Overweight_Yes (Categorical), Diabetes_Yes (Categorical), Anxiety_Yes (Categorical), and Asthma_Yes (Categorical). Second, rows with missing values are dropped via the dropna() function so they do not affect the model or the summary statistics. Third, unnecessary features are dropped from the dataset via the drop() function. Fourth, categorical variables are converted into binary representation via the get_dummies() function; get_dummies() creates one indicator column for each level of a category, and with drop_first=True the first level is dropped, so a Yes/No variable becomes a single column. Fifth, to identify outliers, z-scores are computed with the stats.zscore() function and rows with any value more than 3 standard deviations from the mean are filtered out. Before creating the model, the cleaned dataset is separated into the features ‘X’ and the target ‘y’. The train_test_split() function is then called with ‘X’ and ‘y’ as arguments and returns four objects: ‘X_train’, ‘X_test’, ‘y_train’, and ‘y_test’, with 30% of the data allocated to testing and 70% to training.
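The encoding, splitting, and scaling steps can be sketched on a tiny, made-up data frame. The frame toy and its values are hypothetical, and placing StandardScaler() in a pipeline ahead of the classifier is one possible arrangement assumed here for illustration, not necessarily how the analysis below applies it.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Tiny hypothetical frame that mimics a few of the columns named above
toy = pd.DataFrame({
    "Population": [120, 45000, 3000, 800, 15000, 60, 22000, 9500, 400, 31000],
    "Children": [0, 2, 1, 3, 0, 5, 1, 2, 0, 4],
    "Diabetes": ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "Yes"],
    "ReAdmis": ["No", "Yes", "No", "No", "Yes", "No", "Yes", "No", "Yes", "No"],
})

# drop_first=True keeps one indicator column per Yes/No variable
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))

# Features and target
X_toy = encoded.drop(columns="ReAdmis_Yes")
y_toy = encoded["ReAdmis_Yes"].astype(int)

# 70% training, 30% testing, with the class ratio preserved
X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy)

# Scaling inside a pipeline puts Population and Children on comparable
# scales before distances are computed
toy_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
toy_model.fit(X_train_toy, y_train_toy)
print(toy_model.predict(X_test_toy))
```

Because KNN is distance-based, an unscaled feature with a large range, such as Population, would otherwise dominate the distance calculation.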
#-- Importing all necessary libraries --#
import statistics
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
from sklearn import metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import (roc_curve, roc_auc_score, confusion_matrix,
                             classification_report, mean_squared_error)
#-- Importing the dataset --#
med_df_raw = pd.read_csv("/Users/lindasegalini/Desktop/WGU/New Program/D208 Predictive Modeling/medical_clean.csv")
#-- Dropping missing values (dropna() returns a new data frame, so the result is assigned back) --#
med_df_raw = med_df_raw.dropna()
#-- Dropping unnecessary features --#
med_df_raw = med_df_raw.drop(columns = ['Gender','Complication_risk','Initial_admin','Additional_charges','TotalCharge','Initial_days','VitD_levels','Age','CaseOrder','Allergic_rhinitis','Reflux_esophagitis','Doc_visits','Full_meals_eaten','vitD_supp','Soft_drink','Customer_id','Interaction','UID','City','State','County','Zip','Lat','Lng', 'Area','TimeZone','Job','Income','Services','Item1','Item2','Item3','Item4','Item5','Item6','Item7','Item8', 'Arthritis','Hyperlipidemia','BackPain'])
#-- Changing categorical to binary with get_dummies() and dropping the first level of each category --#
med_df = pd.get_dummies(med_df_raw, drop_first =True)
#-- Dropping the remaining marital-status dummy columns so only Marital_Married is kept, for a tidier dataset --#
med_clean = med_df.drop(columns = ['Marital_Never Married','Marital_Separated','Marital_Widowed'])
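#-- Filtering the data frame to remove the rows with any value exceeding 3 standard deviations --#
#-- astype(float) ensures the dummy columns are numeric before the z-scores are computed --#
med_remove_df = med_clean[(np.abs(stats.zscore(med_df.astype(float))) < 3).all(axis=1)]
#-- Displaying what rows were removed (med_remove_df is used here for inspection; the steps below use med_clean) --#
med_clean.index.difference(med_remove_df.index)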
#-- Saving a copy of the cleaned dataset --#
med_clean.to_csv('/Users/lindasegalini/Desktop/WGU/New Program/D209 Data Mining/Submissions/KNN.csv')
#-- Split the data into CrossValidation and holdout sets --#
#-- Split the data into X (features) & y (target) --#
X = med_clean.drop('ReAdmis_Yes', axis = 1).values
y = med_clean['ReAdmis_Yes']
y = y.astype(int)
#-- test_size = 0.2 holds out 20% of the rows for testing; stratify = y keeps the class ratio the same in both sets --#
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify = y)
#-- Initializing the KNN classifier with 48 neighbors, fitting it, and predicting the holdout set --#
knn = KNeighborsClassifier(n_neighbors = 48)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
y_pred
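#-- Setting up the grid search; 'n_neighbors' is the hyperparameter name --#
#-- A separate, unfitted estimator is passed to the grid so the fitted 'knn' above is not overwritten --#
knn_grid = KNeighborsClassifier()
#-- The parameters the grid will cover --#
param_grid = {'n_neighbors': np.arange(1, 50)} #-- if we specify more parameters, all combinations will be tried --#
#-- Grid search with 10-fold cross-validation (fit here on the full X and y, which includes the holdout rows) --#
knn_cv = GridSearchCV(knn_grid, param_grid, cv=10)
knn_cv.fit(X,y)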
#-- Generating the confusion matrix --#
y_true = y_test # True values
cf_matrix = confusion_matrix(y_true, y_pred)
print("\n KNN \nTest confusion_matrix")
sns.heatmap(cf_matrix, annot=True, cmap='Blues')
plt.xlabel('Predicted', fontsize=12)
plt.ylabel('True', fontsize=12)
#-- Calculating False Positives (FP), False Negatives (FN), True Positives (TP) & True Negatives (TN)
FP = cf_matrix.sum(axis=0) - np.diag(cf_matrix)
FN = cf_matrix.sum(axis=1) - np.diag(cf_matrix)
TP = np.diag(cf_matrix)
TN = cf_matrix.sum() - (FP + FN + TP)
print('KNN Scores: ')
#-- Sensitivity, hit rate, recall, or true positive rate --#
TPR = TP / (TP + FN)
print("The True Positive Rate is:", TPR)
#-- Precision or positive predictive value --#
PPV = TP / (TP + FP)
print("The Precision is:", PPV)
#-- False positive rate or False alarm rate --#
FPR = FP / (FP + TN)
print("The False positive rate is:", FPR)
#-- False negative rate or Miss Rate --#
FNR = FN / (FN + TP)
print("The False Negative Rate is: ", FNR)
#-- Total averages --#
print("")
print("The average TPR is:", TPR.sum()/2)
print("The average Precision is:", PPV.sum()/2)
print("The average False positive rate is:", FPR.sum()/2)
print("The average False Negative Rate is:", FNR.sum()/2)
#-- Retrieving the hyperparameter value that performed best in the grid search --#
#-- (the matching mean cross-validation score is available as knn_cv.best_score_) --#
knn_cv.best_params_
#-- Compute and print metrics --#
print("Accuracy: {}".format(knn_cv.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print("Tuned Model Parameters: {}".format(knn_cv.best_params_))
knn.score(X_test,y_test)
#-- Generating the ROC /AUC --#
fpr, tpr, _ = metrics.roc_curve(y_true, y_pred)
auc = metrics.roc_auc_score(y_test, y_pred)
#-- Plotting the ROC curve --#
plt.plot(fpr,tpr,label="AUC="+str(auc))
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc=4)
plt.show()
The final accuracy of the model is 63.7%. However, the Area Under the Curve scores are poor: the six cross-validation sets of “unseen” data each scored around 50%, no better than flipping a coin. “When AUC is 0.5, it means the model has no class separation capacity whatsoever” (Towardsdatascience.com). The Receiver Operating Characteristic (ROC) curve is a probability curve that plots the True Positive Rate (y-axis) against the False Positive Rate (x-axis). It can be understood through the two bell-shaped distributions of predicted scores, one for each class. “When two curves don’t overlap at all means model has an ideal measure of separability. It is perfectly able to distinguish between positive class and negative class” (Towardsdatascience.com). If the distributions overlap at all, then Type 1 and Type 2 errors can occur. In the case of this model, the two distributions overlap completely, effectively forming one distribution. This is indicated by the straight 45-degree line running from the lower left to the upper right of the above graph.
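For readers who want to see how the points of a ROC curve arise, the sketch below computes one from predicted probabilities rather than hard class labels. It uses synthetic data, and the names X_roc, y_roc, clf_roc, and the choice of 15 neighbors are hypothetical; it shows one common way of calling roc_curve() and roc_auc_score(), not a reproduction of the result reported above.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (an assumption; not the medical dataset)
X_roc, y_roc = make_classification(n_samples=400, n_features=5,
                                   n_informative=3, random_state=0)
X_roc_tr, X_roc_te, y_roc_tr, y_roc_te = train_test_split(
    X_roc, y_roc, test_size=0.3, random_state=0, stratify=y_roc)

clf_roc = KNeighborsClassifier(n_neighbors=15).fit(X_roc_tr, y_roc_tr)

# Probability of the positive class: the share of the 15 neighbors labeled 1
prob_pos = clf_roc.predict_proba(X_roc_te)[:, 1]

# Each threshold on the probability gives one (FPR, TPR) point on the curve
fpr_demo, tpr_demo, thresholds_demo = roc_curve(y_roc_te, prob_pos)
auc_from_probs = roc_auc_score(y_roc_te, prob_pos)

# Hard 0/1 predictions collapse the curve to a single operating point
auc_from_labels = roc_auc_score(y_roc_te, clf_roc.predict(X_roc_te))

print("Points on the curve:", len(fpr_demo))
print("AUC from probabilities:", round(auc_from_probs, 3))
print("AUC from hard labels:", round(auc_from_labels, 3))
```

An AUC near 0.5 from either calculation would correspond to the diagonal line described above.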
In the final analysis, the variables chosen do not have much impact on hospital readmission. Furthermore, the optimum number of neighbors is 48, which seems rather large. The model’s accuracy improved only slightly between 1 and 48 neighbors. Additionally, the ratio of testing to training data was varied, and the highest accuracy came from 30% testing and 70% training. The silver lining is that the selected variables can be eliminated from further analysis, since they do not contribute to an effective KNN model. Overall, this is not an effective model for predicting hospital readmission.
Works Cited
Harrison, O. (2018, September 10). Machine Learning Basics with the K-Nearest Neighbors Algorithm. towardsdatascience.com. Retrieved from https://towardsdatascience.com/machine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761
Narkhede, S. (2021, June 15). Understanding AUC - roc curve. Medium. Retrieved January 16, 2022, from https://towardsdatascience.com/understanding-auc-roc-curve-68b2303cc9c5