Principal Component Analysis
Bottom Line Up Front: Customer yearly bandwidth (Bandwidth_GB_Year) may not impact churn.
A major telecommunications company operates in a very competitive industry, where every variable may matter when customers leave for a different provider. The company intends to minimize overall customer churn. Given the dataset intended to explore and mitigate customer churn: how much variance is in the continuous variables? Additionally, one goal is to discover which features account for the most variance, then trim the feature set with the Kaiser criterion. The following paper explains the overall concept of Principal Component Analysis (PCA) and how it is applied. After dimensionality reduction with PCA, the question was answered and the goal was reached.
PCA is an unsupervised dimensionality reduction technique that relies on feature extraction and orthogonality to decompose a dataset's continuous variables into a ranked set of components. The ranking is then judged with the Kaiser criterion, in which each Principal Component (PC) is scored by its Eigenvalue. The concept behind the reduction process is as simple as finding a vector that a matrix only rescales: multiplying the matrix by that vector, the Eigenvector, returns the same vector stretched by a single number, the Eigenvalue of that component. According to StatQuest, the Eigenvalue is the sum of the squared distances between the projected points and the origin (2018). All Eigenvalues can be graphed on a Scree plot for a visual assessment. According to the Kaiser criterion, all components with Eigenvalues greater than 1 should be selected because of their overall effect on the dataset, or, alternatively, the components before the "elbow" in the Scree plot. Prior to the linear transformation, the data is scaled so that each variable contributes proportionally to the total variance.
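To make the Eigenvector and Eigenvalue relationship concrete, here is a minimal NumPy sketch using a toy 2x2 matrix (not the telecom data). It shows that multiplying a matrix by one of its Eigenvectors only rescales that vector by the matching Eigenvalue, and how the Kaiser criterion's greater-than-1 rule would filter the Eigenvalues:

    import numpy as np

    # Toy 2x2 covariance-style matrix (an illustration, not the telecom data).
    A = np.array([[2.0, 0.8],
                  [0.8, 0.5]])

    # Eigen-decomposition: columns of vecs are Eigenvectors,
    # vals holds the matching Eigenvalues (in ascending order).
    vals, vecs = np.linalg.eigh(A)

    v = vecs[:, 1]    # Eigenvector with the largest Eigenvalue
    lam = vals[1]     # its Eigenvalue

    # Multiplying the matrix by its Eigenvector only rescales it.
    print(np.allclose(A @ v, lam * v))    # True

    # Kaiser criterion: keep components whose Eigenvalue exceeds 1.
    print(vals[vals > 1])                 # only the larger Eigenvalue survives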
Imagine the points of two features plotted on a 2D graph; through the cloud of points runs a line of best fit. After finding the line of best fit, the data is centered so that the midpoint of the line sits at the (0,0) mark. Repeating the process for each remaining direction creates orthogonal lines that all pass through (0,0). Each line created from this process is a scaled combination of all the features, still aligned with the variance of the data (DataCamp). On the other hand, if too many PCs travel through the (0,0) point, the visualization looks like a spiny sea urchin, which suggests overfitting. If the length of each spine represents variance, then the goal is to reduce the urchin to its longest spines, avoiding the tight ball of little spines in the middle. Unlike feature selection, which creates a mask to "select" particular features, PCA combines the features into linear combinations. Furthermore, one assumption of PCA is that the features are continuous variables (Towards Data Science); converting categorical variables to dummy variables does not work, so all variables must be continuous. LeiosOS explains the idea behind Eigenvectors further: "Graph two arbitrary points and draw a straight line between them, then rotate the 2D graph in your mind's eye in one direction; the distance and angle of the line may change. However, if the two points fall on an Eigenvector, only the distances will change" (2016).
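The geometry described above can be checked directly in code. The following sketch uses two made-up continuous features rather than the telecom variables: it centers and scales the data, fits PCA, and confirms that each PC is a linear combination of the original features and that the PCs are orthogonal to one another:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Two correlated synthetic continuous features (PCA assumes continuous data).
    rng = np.random.default_rng(0)
    x = rng.normal(size=200)
    X = np.column_stack([x, 0.6 * x + rng.normal(scale=0.5, size=200)])

    # Standardize so every feature is centered at the origin with unit variance.
    X_std = StandardScaler().fit_transform(X)

    pca = PCA().fit(X_std)

    # Each PC is a linear combination (a weighted mix) of the original features...
    print(pca.components_)

    # ...and the PCs are orthogonal: their dot product is approximately 0.
    print(np.dot(pca.components_[0], pca.components_[1]))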
The continuous variables used from this dataset are: Lat, Lng, Population, Children, Age, Income, Outage_sec_perweek, Email, Contacts, Yearly_equip_failure, Tenure, MonthlyCharge, and Bandwidth_GB_Year.
Below is a matrix of all the PCs:
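As a hedged sketch, such a matrix of PC loadings could be produced with scikit-learn along the following lines; the file name churn_clean.csv is an assumption about how the cleaned dataset is stored:

    import pandas as pd
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    # Continuous variables listed above.
    continuous = ["Lat", "Lng", "Population", "Children", "Age", "Income",
                  "Outage_sec_perweek", "Email", "Contacts",
                  "Yearly_equip_failure", "Tenure", "MonthlyCharge",
                  "Bandwidth_GB_Year"]

    # "churn_clean.csv" is an assumed file name for the cleaned dataset.
    df = pd.read_csv("churn_clean.csv")
    X_std = StandardScaler().fit_transform(df[continuous])

    pca = PCA(n_components=len(continuous)).fit(X_std)

    # Loadings matrix: rows are the original features, columns are PC1 - PC13.
    loadings = pd.DataFrame(
        pca.components_.T,
        index=continuous,
        columns=[f"PC{i + 1}" for i in range(len(continuous))],
    )
    print(loadings.round(2))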
According to the Scree plot and the Kaiser criterion, about 12 principal components should be retained. Notice that the line runs diagonally from the bottom left to the upper right, with no clear "elbow". Notice also the mildly tapered distribution of total variance, going from 15.34% for PC1 down to 5.66% for PC12 and dropping off to 0.04% for PC13. According to the Captured Variance per PC chart, PC13 would not meet the Kaiser criterion. The same is true of the heatmap, which reflects the low variance of PC13. Moreover, PCs 1-12 carry 99.95% of the total variance.
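A minimal sketch of how the Eigenvalues, the captured variance per PC, and the Scree plot behind these figures could be computed, continuing from the PCA fitted in the previous sketch:

    import numpy as np
    import matplotlib.pyplot as plt

    # pca is the PCA object fitted on the standardized continuous variables above.
    eigenvalues = pca.explained_variance_
    variance_pct = pca.explained_variance_ratio_ * 100

    for i, (ev, pct) in enumerate(zip(eigenvalues, variance_pct), start=1):
        print(f"PC{i}: eigenvalue={ev:.2f}, variance={pct:.2f}%")

    # Kaiser criterion: retain PCs with Eigenvalues greater than 1.
    keep = int(np.sum(eigenvalues > 1))
    print(f"PCs retained by the Kaiser criterion: {keep}")
    print(f"Cumulative variance of the retained PCs: {variance_pct[:keep].sum():.2f}%")

    # Scree plot of captured variance per PC.
    plt.plot(range(1, len(variance_pct) + 1), variance_pct, marker="o")
    plt.xlabel("Principal component")
    plt.ylabel("Captured variance (%)")
    plt.title("Scree plot")
    plt.show()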
In the final analysis, the result of this PCA resembles the volatility of a stock index fund. The volatility of an industry sector is reduced, and therefore less risky, because the fund is scaled to buying an equal share of stock in every company within the sector. Each company has its own variance, or volatility, but in the long run the winners compensate for the losers. In the case of this telecom company, the feature variances are fairly evenly distributed, indicating that PCs 1-12 meet the standard for further exploration of the dataset. The variance in the dataset is like water: it seeks its own level, and that level is at PC12. Therefore, Bandwidth_GB_Year has little impact on the overall variance of the dataset and may not necessarily factor into customer churn. After dropping Bandwidth_GB_Year, the dataset could then be used for a faster and more accurate prediction of customer churn.
Works Cited:
joshstarmer. (2018, April 2). StatQuest: Principal Component Analysis (PCA), step-by-step. YouTube. Retrieved July 10, 2022, from https://www.youtube.com/watch?v=FgakZw6K1QQ
LeiosOS. (2016, June 25). What is an eigenvector? YouTube. Retrieved July 10, 2022, from https://www.youtube.com/watch?v=ue3yoeZvt8E
DataCamp. (n.d.). PCA applications: Python. Retrieved July 10, 2022, from https://campus.datacamp.com/courses/dimensionality-reduction-in-python/feature-extraction?ex=9
Wong, W. (2020, April 7). Think twice before you use principal component analysis in supervised learning tasks. Medium. Retrieved July 10, 2022, from https://towardsdatascience.com/think-twice-before-you-use-principal-component-analysis-in-supervised-learning-tasks-70fbb68ebd0c