Data Mining YouTube Keywords 2025

Research Questions:

 

1. What are the top trending keywords in video titles?

2. What are the top 5 video categories?

3. Is there evidence of a video maturation cycle?

 

Null Hypothesis H0:

YouTube videos have no maturation process; engagement is random.

 

Alternative Hypothesis H1:

YouTube videos have a maturation process; engagement is NOT random.

Above is the WordCloud output of the comedy category.

 

Context:

 

This study contributes to the fields of data analytics and video marketing by investigating the top 1,000 YouTube trending videos for January 2025. With this information, a media company can make the most of the investment it puts into video content.

An article titled “The Influence of Titles on YouTube Trending Videos” presents a study that uses big data to explore engagement strength with the same variables of ‘Likes’, ‘Views’, and ‘Comment Count’ (ResearchGate, 2020). An additional study, published by the Royal Society, states that title, abstract, and keywords are practical for improving the visibility and impact of content. It found that these variables are key factors in overall content engagement and brand awareness (2023). Understanding these variables can help describe the relationship between the independent variable (IV) and the dependent variables (DVs).

Additionally, this exploration is a follow-up to this initial YouTube keyword analysis.

 

 

Data:

 

The data come from an open-source dataset of YouTube data containing the necessary variables about video uploads. It is hosted on Kaggle (www.kaggle.com), an open repository and organization that hosts datasets. The dataset contains almost 1,000 rows (before any rows were removed) and 7 columns. It is limited to a single month of YouTube’s trending videos, with upload years ranging from 2005 to 2025. The columns available for exploration are “Rank”, “Video”, “Video views”, “Likes”, “Dislikes”, “Category”, and the year “Published”. “Published” is the key variable for indicating a video maturity lifecycle. A feature called Engagement Per View (EPV) was engineered, where:

 

EPV = Views / (Dislikes + Likes)
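
As a minimal sketch, the EPV feature can be computed per row in plain Python. The function name `epv` and the sample numbers are hypothetical. The sketch follows the formula exactly as written above (views divided by total reactions), so a lower value means more reactions per view. Returning `None` for a video with no likes or dislikes is an assumption, not something the study specifies.

```python
from typing import Optional


def epv(views: float, likes: float, dislikes: float) -> Optional[float]:
    """EPV as defined in this study: Views / (Dislikes + Likes)."""
    reactions = likes + dislikes
    if reactions == 0:
        # Assumption: leave EPV undefined when a video has no reactions.
        return None
    return views / reactions


# Hypothetical rows that reuse the dataset's column names.
rows = [
    {"Video": "example A", "Video views": 1000, "Likes": 90, "Dislikes": 10},
    {"Video": "example B", "Video views": 500, "Likes": 0, "Dislikes": 0},
]
for row in rows:
    row["EPV"] = epv(row["Video views"], row["Likes"], row["Dislikes"])
```

In a pandas workflow the same arithmetic would apply to whole columns at once; the row-by-row version here only keeps the example free of dependencies.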

The dataset can be accessed at the link below:

https://www.kaggle.com/datasets/samithsachidanandan/most-popular-1000-youtube-videos

The dataset is publicly available via Kaggle.com, which means it may be limited in accuracy and completeness.

Access to the source code for scientific replication can be found here: Source Code

Data Gathering:

 

Data gathering was planned and directed toward open-source repositories found through Google, using search keywords such as “YouTube”, “Videos”, “Keywords”, and “+ .csv”. Next, the first- to third-ranked results with a reachable CSV file were selected, and each CSV file was inspected for quality: length (at least 7k rows), data cleanliness, large gaps in the data, and enough relevant variables to create an ‘X’ and ‘Y’ axis.

To study how the popularity of YouTube videos grows over time, the dependent variable must be measured at a continuous (interval or ratio) level (Chowdhury, 2013).

Fortunately, all dependent variables are continuous. The dataset is 0% sparse, and any missing or null values will be dropped when cleaning the dataset.
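
The cleaning step can be sketched with the standard library, assuming the CSV uses the column names listed in the Data section. The sample rows, the helper name `load_clean_rows`, and the handling of thousands separators are all hypothetical; the real file may be formatted differently.

```python
import csv
import io

# Hypothetical sample in the dataset's column layout; real values will differ.
SAMPLE_CSV = """Rank,Video,Video views,Likes,Dislikes,Category,Published
1,Example one,"1,000",90,10,Comedy,2015
2,Example two,500,,5,Sports,2013
3,Example three,250,20,5,,2007
"""

NUMERIC_COLUMNS = ("Video views", "Likes", "Dislikes", "Published")


def load_clean_rows(text: str) -> list:
    """Drop rows with any missing field, then convert numeric columns to int."""
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        # A field is treated as missing if it is absent or blank.
        if any(not isinstance(v, str) or not v.strip() for v in row.values()):
            continue
        for column in NUMERIC_COLUMNS:
            # Assumption: counts may contain thousands separators.
            row[column] = int(row[column].replace(",", ""))
        cleaned.append(row)
    return cleaned


clean_rows = load_clean_rows(SAMPLE_CSV)
```

With pandas, `DataFrame.dropna()` plays the same role as the missing-field check here.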

 

 

Data Analytics Tools and Techniques:

A WordCloud output was used to visualize the frequency of the trending words in the video titles. Overall, this is an exploratory quantitative technique that relies on descriptive statistics. The analysis will run in a Jupyter Notebook using Python, with the statsmodels API as a reliable open-source statistical library. Because of the data size, the data will be loaded into a pandas DataFrame alongside NumPy, and Seaborn will be used for visualizations.
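
A word cloud is a picture of word frequencies, so the underlying counts can be sketched without any plotting library. The titles, the stop-word list, and the function name `title_word_counts` below are hypothetical and are not taken from the study's notebook.

```python
import re
from collections import Counter

# Assumption: a small illustrative stop-word list, not the one used in the study.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "for", "my"}


def title_word_counts(titles):
    """Count how often each non-stop word appears across video titles."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z']+", title.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return counts


# Hypothetical titles for illustration only.
titles = [
    "Funny Cat Compilation",
    "The Funny Dog and the Cat",
    "Best of Funny Moments",
]
counts = title_word_counts(titles)
top_words = counts.most_common(2)  # the two most frequent words
```

If the `wordcloud` package is used for the drawing step, a mapping like `counts` can be passed to `WordCloud.generate_from_frequencies`, which scales each word by its count.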

 

 

Justification of Tools/Techniques:

Python will be used for this analysis because its NumPy and pandas packages can manipulate large datasets (IBM, 2021). The tools and techniques are common industry practice and are widely trusted. The technique is justified by the integer variables needed to plot against a timeline, which may reveal different modes in the frequency distribution. Data exploration is also ideal because the data is based on human viewing behavior, which is notoriously skewed. Python is selected over SAS because Python has better visualizations (Pandey, 2022).

Project Outcomes:

To find statistically significant differences, the proposed end state is several WordCloud outputs that compare the unique keywords and their frequencies, reflected by word size, and that showcase at a glance the differences between video categories. Further outcomes are: ranked outputs of years against continuous variables (Statology, 2019); a visualization of the frequency distribution of EPV against category and year published; a cleaned dataset with correctly labeled columns and rows, for replication; and a better understanding of the previously stated groups through exploratory graphs, giving support as to when engagement may be highest. Lastly, a copy of the Jupyter Notebook with the Python code will be available, along with a video presentation aided by PowerPoint. According to the Royal Society study (2023), exploratory data analysis was instrumental in supporting the alternative hypothesis against other categorical variables.

Now exploring research question #3:

What is the maturation cycle for YouTube videos?

The output below shows that videos published in 2015 have the highest number of views.

The output below shows upload year vs. number of ‘Likes’. Notice that videos published in 2013 have the highest number of ‘Likes’ as of January 2025.

The output below, which plots EPV against upload year, shows that videos published in 2007 have the highest EPV today. Secondly, the metaverse has a maturation process too. One might speculate that part of it is “evergreen” content.
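
The year-by-year comparisons above amount to grouping videos by the “Published” year and summarizing a metric within each group. The sketch below uses the mean, which is an assumption: the study does not state whether totals, means, or medians were plotted. The rows and the function name `mean_by_year` are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_by_year(rows, metric):
    """Average a numeric metric for each value of the 'Published' column."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["Published"]].append(row[metric])
    return {year: mean(values) for year, values in groups.items()}


# Hypothetical rows; the numbers are illustrative, not the study's results.
rows = [
    {"Published": 2007, "Video views": 400},
    {"Published": 2007, "Video views": 600},
    {"Published": 2015, "Video views": 900},
    {"Published": 2020, "Video views": 300},
]
views_by_year = mean_by_year(rows, "Video views")
# Year whose videos have the highest average for the chosen metric.
peak_year = max(views_by_year, key=views_by_year.get)
```

The same function works for ‘Likes’ or EPV by changing the `metric` argument.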

Final Analysis:

After parsing and exploring the data, we have identified the top video categories and showcased the trending keywords. The top 5 video categories are Comedy, Pets, Travel, Autos, and Sports. Moreover, it appears that YouTube videos have a maturation timeline: videos published in 2015, 2013, and 2007 have the most ‘Views’, ‘Likes’, and EPV, respectively. Therefore, we reject the null hypothesis in favor of the alternative. Importantly, video maturation appears plausible, as older, more established videos would generate more gravity over time, like a planet in space adding to its size and compounding its gravity, or a green hedge of evergreen video content that bushes out and increases in scale. In some cases, the videos were published 20 years ago and are still trending in the top 1,000 as of 2025. With bigger data, other things can be explored, such as the average life of a video in a particular category. Therefore, more data needs to be compiled and explored to inspect trends and monitor the alternative hypothesis.

Works Cited

Popularity growth patterns of YouTube videos: A category-based study. (n.d.-a). ResearchGate. https://www.researchgate.net/publication/287035081_Popularity_growth_patterns_of_youtube_videos_A_category-based_study

Promoting social media engagement via branded content communication: A fashion brands study on Instagram. (n.d.-b). ResearchGate. https://www.researchgate.net/publication/358322977_Promoting_Social_Media_Engagement_Via_Branded_Content_Communication_A_Fashion_Brands_Study_on_Instagram

The influence of titles on YouTube trending videos. (n.d.-c). ResearchGate. https://www.researchgate.net/publication/379971848_The_Influence_of_Titles_on_YouTube_Trending_Videos

IBM products. IBM. (n.d.). https://www.ibm.com/cloud/blog/python-vs-r

Pandey, Y. (2022, May 25). SAS vs Python. LinkedIn. https://www.linkedin.com/pulse/sas-vs-python-yuvaraj-pandey/

Title, abstract and keywords: A practical guide to maximize the ... (n.d.-d). https://royalsocietypublishing.org/doi/10.1098/rspb.2024.1222


Powered by Data Mining Mike: America’s Big Data Authority

Michael Segaline

A Data Scientist and Search Engine Optimization Expert.

https://www.bloomingbiz.marketing