Who is Moving to Washington State: Exploring Immigration 2024
What are the top 20 countries people move from?
What are the top 20 U.S. States people move from?
What are the top 20 counties settled in Washington State?
What is the yearly immigration rate?
What is the monthly immigration rate?
What are the statistics on people who move from an unknown origin?
What are the statistics on people who move from Mexico?
What are the statistics on people who move from South Korea?
What are the statistics on people who move from India?
What are the statistics on people who move from California?
Can yearly immigration be forecasted?
Null Hypothesis H0:
The distribution of immigration to Washington is random.
Alternative Hypothesis H1:
The distribution of immigration to Washington is not random.
Context:
The contribution of this study to the field of data analytics and to taxpayer awareness is an attempt to accurately forecast immigration to Washington State. With this information, Washington State policymakers and leadership can establish situational awareness.
An article titled Random Forrest Analysis of Household Surveys showcases a study that uses a random forest to forecast immigration with the identical variables of ‘Year’ and frequency count (National Science Foundation, 2020). The authors found that these variables are key factors in overall prediction accuracy. A random forest is a predictive model rather than a statistical test, so in this study its forecast accuracy is used as evidence for or against the null hypothesis. Random forest models are an ensemble method built from decision trees. They work by fitting many decision trees to random subsets of the data and then averaging across them for a final prediction. This lets random forest models achieve high predictive accuracy while reducing the risk of overfitting the data. One strength of random forest models, especially over other “black box” statistical models, is their ability to assess variable importance and account for complex, nonlinear interactions between variables. They can also take categorical, factored, or continuous inputs without requiring dummy variables or scaled data (National Science Foundation, 2020).
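The "fit many trees on random subsets, then average" idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the code used in this study: each "tree" is a single-split stump, the helper names (fit_stump, fit_forest, predict_forest) are hypothetical, and the yearly counts are made up.

```python
import random
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree (a stump) by minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # every split that leaves both sides non-empty
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # the sample held a single distinct x: predict the overall mean
        overall = mean(ys)
        return (float("inf"), overall, overall)
    return best[1:]  # (threshold, left_mean, right_mean)

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def fit_forest(xs, ys, n_trees=50, seed=0):
    """Fit each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest

def predict_forest(forest, x):
    """Average the individual stump predictions."""
    return mean(predict_stump(stump, x) for stump in forest)

# Made-up yearly counts, for illustration only (not the DOL figures).
years = [2018, 2019, 2020, 2021, 2022, 2023, 2024]
counts = [100, 120, 90, 160, 150, 140, 130]

forest = fit_forest(years, counts)
forecast_2025 = predict_forest(forest, 2025)
forecast_2030 = predict_forest(forest, 2030)
print(forecast_2025, forecast_2030)
```

Because every tree predicts a constant value within each leaf, any year beyond the training range lands in the same leaf of every tree. The two forecasts printed above are therefore identical, which is a general property of tree ensembles when they are asked to extrapolate a trend.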
Data:
A link to the dataset can be found here.
The code in python to replicate the findings can be found here.
The data come from an open-source dataset of driver's license records containing the necessary variables about licenses newly exchanged by out-of-state persons. The Washington State Department of Licensing (DOL) dataset was available via Data.gov, the open-source repository / organization that hosts the datasets. The dataset contains 994,065 rows (before any rows were removed) and 4 columns, and it covers only the years 2018 – 2024. The dataset has multiple columns for possible exploration. People move to Washington State from all over the planet and at different rates. As a delimitation, this analysis uses only the 4 columns that are workable and relevant: ‘Year’, ‘Month’, ‘Prior State’, and ‘County’. The dataset is easy to work with because the columns hold whole integer values. The time will need to be separated from the date. One limitation of this study is the short 2018 – 2024 window of data. Another is that the location of immigration is not narrowed past the county level.
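The "top 20" and yearly-rate questions reduce to frequency counts over these four columns. Below is a minimal sketch using only the Python standard library. The column names come from the description above; the sample rows, the numeric month encoding, and the helper name top_values are assumptions made for illustration.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; the real file has 994,065 rows with these four columns.
SAMPLE_CSV = """Year,Month,Prior State,County
2021,9,California,King
2021,10,California,Pierce
2022,3,Not Applicable,Yakima
2022,9,Oregon,King
2023,10,California,King
"""

def top_values(rows, column, n=20):
    """Return the n most frequent values in a column as (value, count) pairs."""
    return Counter(row[column] for row in rows).most_common(n)

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))

top_states = top_values(rows, "Prior State")    # top origins
top_counties = top_values(rows, "County")       # top counties settled
yearly = sorted(Counter(int(row["Year"]) for row in rows).items())  # moves per year

print(top_states)
print(top_counties)
print(yearly)
```

To run this against the real data, replace io.StringIO(SAMPLE_CSV) with an open file handle to the downloaded CSV. The same Counter pattern applied to the ‘Month’ column gives the monthly counts.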
From 2018 through 2024, a total of 994,065 people moved to Washington State.
While California is number 1 with 208,080, “Not Applicable” or unknown origin is a glaringly high number at 90,747.
Most people move to King County.
While immigration increased in 2021, it has been in a slow decline for the last 4 years.
It appears that the most popular months to move are September and October.
Immigration from “Not Applicable” states increased in 2022, peaked in 2023, and dropped again in 2024.
Yakima County ranks number 3.
The frequency distributions for “Not Applicable” and Mexico appear to be similar.
There was a major spike of Ukrainian immigrants in 2022, and the number has dropped drastically since.
Immigration from South Korea appears to match the Washington State distribution.
The frequency distribution of immigrants from India appears to match the baseline for the state.
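The "appears to match" comparisons above are visual. One hedged way to put a number on them is to convert each group's yearly counts into shares and measure how far apart two sets of shares are. The sketch below uses total variation distance (0 means identical shape, 1 means no overlap). That metric is a choice made here for illustration, not one used in this study, and all counts are made up.

```python
def yearly_shares(counts_by_year):
    """Convert raw yearly counts into the fraction of the total in each year."""
    total = sum(counts_by_year.values())
    return {year: count / total for year, count in counts_by_year.items()}

def total_variation(p, q):
    """Half the sum of absolute differences between two share distributions."""
    years = set(p) | set(q)
    return 0.5 * sum(abs(p.get(y, 0.0) - q.get(y, 0.0)) for y in years)

# Made-up counts, for illustration only.
state_baseline = {2021: 200, 2022: 100, 2023: 100}
same_shape = {2021: 20, 2022: 10, 2023: 10}   # smaller group, same yearly pattern
single_spike = {2021: 0, 2022: 40, 2023: 0}   # everything arrives in one year

baseline = yearly_shares(state_baseline)
distance_same = total_variation(baseline, yearly_shares(same_shape))
distance_spike = total_variation(baseline, yearly_shares(single_spike))
print(distance_same, distance_spike)
```

A group whose distance to the statewide baseline is near 0 follows the state pattern, while a one-year spike such as the one described for Ukraine would score much higher.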
Understanding the above scores: the random forest machine learning model performed poorly, outputting the same forecast number for each of the next 5 years. The Mean Squared Error (MSE) should be as close to ‘0’ as possible for a good score; an MSE of 4,113,316,874 is not acceptable. Additionally, the R2 score, or Coefficient of Determination, is -19; a good R2 score should be in the 70s as a percentage (about 0.7 or higher) and at the very least a positive number. The model's predictions are not accurate enough to generalize to unseen data.
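For readers unfamiliar with the two scores, the sketch below computes them from their standard definitions on made-up numbers (not the study's data). It also shows why R2 can be negative: the score compares the model against simply predicting the mean of the observed values, so a model that does worse than the mean scores below zero.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r2_score(actual, predicted):
    """1 minus (residual sum of squares / total sum of squares around the mean)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up values, for illustration only.
actual = [100, 110, 120]
perfect = [100, 110, 120]          # exact predictions
mean_only = [110, 110, 110]        # always predict the mean of the actual values
flat_and_far = [200, 200, 200]     # one constant forecast, far from the data

r2_perfect = r2_score(actual, perfect)
r2_mean_only = r2_score(actual, mean_only)
r2_flat = r2_score(actual, flat_and_far)
mse_flat = mean_squared_error(actual, flat_and_far)
print(r2_perfect, r2_mean_only, r2_flat, mse_flat)
```

MSE is expressed in squared units of the target, so with yearly counts in the tens or hundreds of thousands even moderate errors produce values in the billions. The R2 score is the easier of the two to compare across datasets.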
In final analysis:
No, we fail to reject the null hypothesis: based on the random forest results, the frequency distribution of the Washington State immigration data appears random and not predictable. That is not surprising, since the residuals of the dataset are visibly random. While California leads as the number one state of origin, 90,747 people come from “Not Applicable” states. The yearly frequency distribution of “Not Applicable” states, or “Unknown Origin”, appears to be similar in nature to that of settlers from Mexico. Additionally, the mass spike of Ukrainian immigrants may be correlated with the Russo-Ukrainian War. 2021 was the highest year on record in this dataset for immigration, and most people came from California. It appears the 2021 - 2023 migration has come to a decline. However, more data needs to be compiled for future study.
After exploring the data, the questions were answered:
1. What are the top 20 countries people move from?
2. What are the top 20 U.S. States people move from?
3. What are the top 20 counties settled in Washington State?
4. What is the yearly immigration rate?
5. What is the monthly immigration rate?
6. What are the statistics on people who move from an unknown origin?
7. What are the statistics on people who move from Mexico?
8. What are the statistics on people who move from South Korea?
9. What are the statistics on people who move from India?
10. What are the statistics on people who move from California?
11. Can yearly immigration be forecasted?