Machine Learning the Trolls: Exposing Scandals in Patent Law 2024

Bottom Line Up Front: U.S. patent infringement cases continue to rise.

Research Questions:

 

1. How long does the average patent case take to be settled?

2. What are the top 20 Courthouses that entertain patent law?

3. What are the top 20 disputing parties by frequency count?

4. What is the yearly frequency of patent cases?

5. Who are the top 20 patent trolls?

6. Who are the top 20 respondents of patent trolls?

7. Who are the top 20 judges that preside over most patent-law cases?

8. How often are juries requested and by whom?

9. Can a machine learning model be developed to forecast the patent case costs in the next 10 years?

10. Where do trolls get patents?

 

Null Hypothesis H0:

The distribution of patent infringement cases is random.

 

Alternative Hypothesis H1:

The distribution of patent infringement cases is not random.

 

Context:

 

This study aims to raise voter and policymaker awareness by building an advanced statistical forecasting test for patent infringement claims filed in the U.S. federal court system, and to encourage legislative proposals for federal anti-trolling laws and for regulation of intangible property lending.

The information will be used to support legislative reforms of patent law, with the predicted annual dollar amount being assessed. An article titled Improving Supreme Court Forecasting Using Boosted Decision Trees showcases a study that uses ensemble methods to explore prediction strength using the identical variables of ‘Date’ and ‘value’ (Cambridge University, 2019). The authors found that these variables are key factors in overall case engagement and judicial awareness. Ensemble learning is a machine learning technique that aggregates two or more learners (e.g., regression models, neural networks) to produce better predictions. In other words, an ensemble model combines several individual models to produce more accurate predictions than a single model alone (IBM, 2020). Understanding these variables can help describe the relationship between the independent variable and the dependent variable.

 

 

Data:

 

This study explores an open-source dataset of patent infringement case data containing the necessary variables. The dataset comes from the Patent Litigation Docket Reports Data page on USPTO.gov, the open-source repository that hosts the datasets and offers free, limited access to the PACER database. The dataset contains 96,996 rows (before any rows were removed) and 23 columns. It is limited to patent law cases from 1980 to 2020 and has multiple columns available for exploration.

 

Delimitations: only two columns of the dataset will be used for this analysis, because they factor into the prediction of a gross annual dollar value: ‘year_filed’ and ‘value’. The ‘date_filed’ column is a timestamp index. The year will need to be extracted from ‘date_filed’ to create ‘year_filed’, and the variable ‘value’ will be derived from the frequency count of ‘year_filed’. However, other columns will also be explored and visualized.
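As a rough illustration of how ‘year_filed’ and ‘value’ could be derived, here is a minimal pandas sketch. The rows are made-up toy data, not the USPTO file, and only the column names ‘date_filed’, ‘year_filed’, and ‘value’ come from this article; the notebook linked below may do this differently.

```python
import pandas as pd

# Toy stand-in for the docket data (assumption: real rows come from the USPTO CSV).
cases = pd.DataFrame({
    "date_filed": ["2012-03-01", "2012-07-15", "2013-01-20",
                   "2013-05-02", "2013-11-30", "2014-02-10"],
})

# Parse the timestamp and keep only the calendar year.
cases["date_filed"] = pd.to_datetime(cases["date_filed"])
cases["year_filed"] = cases["date_filed"].dt.year

# 'value' is the number of cases filed in each year.
yearly = (
    cases.groupby("year_filed")
    .size()
    .rename("value")
    .reset_index()
)
print(yearly)
```

The resulting two-column table (one row per year) is the shape a yearly forecasting model would train on.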

 A link to the dataset:

https://www.uspto.gov/ip-policy/economic-research/research-datasets/patent-litigation-docket-reports-data

 

The dataset is available to the public via USPTO.gov, meaning that it may be limited in accuracy and completeness.

 A link to the python code for replication can be found here:

https://github.com/Bloomingbiz/Blog_codes/blob/main/US%20Patient%20Case%20Exploration%202024.ipynb

Below are the initial variables of the dataset before any feature engineering:

Below are the variables for exploration and statistical testing after feature engineering:

Data Gathering:

 

Plan and direct data gathering toward open-source repositories (Google), looking for keywords such as “patent law”, “patent cases”, “annual count”, and “+ .csv”. Next, select the first to third ranked pieces of content (a reachable CSV file) and inspect each CSV file for quality: length (at least 7k rows), data cleanliness, massive gaps in the data, and enough relevant variables to create an ‘X’ and a ‘Y’ axis. In ensemble methods, the dependent variable may be continuous (interval or ratio level of measurement) (IBM, 2019). Fortunately, all dependent variables here are continuous. The dataset is 9% sparse, and all rows with missing or null values will be dropped when cleaning the dataset, yielding 87,373 instances. The year will be extracted from ‘date_filed’ to create ‘year_filed’. From the ‘jury_demanded’ variable, three additional classifications were derived: ‘jury_demanded_Petitioner’, ‘jury_demanded_Respondent’, and ‘jury_demanded_None’. The original column ‘case_name’ was split into two separate variables: ‘Petitioner’ and ‘Respondent’.
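The cleaning and feature engineering steps described above can be sketched roughly as follows. This is an illustration on invented rows, not the notebook's actual code: the " v. " separator in ‘case_name’ and the raw category labels in ‘jury_demanded’ are assumptions about how the source file is formatted.

```python
import pandas as pd

# Invented example rows; the real data comes from the USPTO docket file.
cases = pd.DataFrame({
    "case_name": ["Acme LLC v. Widget Inc.", "Foo Corp v. Bar Ltd",
                  "Baz Inc v. Qux Co", None],
    # Assumption: the raw column holds these three labels as plain strings.
    "jury_demanded": ["Petitioner", "None", "Respondent", "Petitioner"],
})

# Drop rows that contain any missing value.
clean = cases.dropna().copy()

# Split 'case_name' on the first " v. " into the two parties.
parties = clean["case_name"].str.split(r" v\. ", n=1, expand=True)
clean["Petitioner"] = parties[0].str.strip()
clean["Respondent"] = parties[1].str.strip()

# One indicator column per jury-demand category.
dummies = pd.get_dummies(clean["jury_demanded"], prefix="jury_demanded")
clean = pd.concat([clean, dummies], axis=1)

print(clean[["Petitioner", "Respondent"]])
```

Real case captions are messier than this (for example "et al" suffixes or "vs." variants), so the split pattern would likely need tuning against the actual data.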

 

 

Data Analytics Tools and Techniques: A KDE plot was used to visualize the distribution, and visual observation of the distribution will be the first indicator of a Gaussian vs. non-parametric probability distribution. If the residuals appear to be parametric, then the Shapiro-Wilk test will be used to test for normality. Ensemble methods are germane to studying this data because they can handle non-parametric data: the ensemble models do not assume normality in the data (Scikit-learn, 2019). Overall, this is an exploratory quantitative data analysis technique paired with descriptive statistics. The tools used will be Jupyter Notebook running Python code, with the statsmodels API as a reliable open-source statistical library. Due to the data size, a pandas DataFrame will be used along with NumPy, and Seaborn will be used for visualizations. A Random Forest, XGBoost, and Linear Regression will be used in an ensemble and combined via sklearn’s VotingRegressor(), which averages the individual predictions. There will also be a presentation layer with univariate and bivariate graphs, as well as word clouds to visually explore name frequency.
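For readers who want to see what the normality check looks like in practice, here is a small sketch using scipy.stats.shapiro. The sample is synthetic, skewed data generated for illustration only (it is not the patent data), and the 0.05 threshold is a conventional choice assumed here rather than something stated in the study. The visual KDE step is left out to keep the example free of plotting.

```python
import numpy as np
from scipy import stats

# Synthetic, right-skewed sample standing in for the variable being checked.
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=7.0, sigma=0.8, size=40)

# Shapiro-Wilk: the null hypothesis is that the sample is normally distributed.
statistic, p_value = stats.shapiro(sample)

ALPHA = 0.05  # assumed significance level
if p_value < ALPHA:
    print(f"W={statistic:.3f}, p={p_value:.4f}: normality rejected")
else:
    print(f"W={statistic:.3f}, p={p_value:.4f}: no evidence against normality")
```

A small p-value points toward non-parametric handling, which is the case the ensemble approach is meant to cover.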

 

 

Justification of Tools/Techniques:

Python will be used for this analysis because its NumPy and pandas packages can manipulate large datasets (IBM, 2021). The tools and techniques are common industry practice and are widely trusted.

The technique is justified by the integer variables that need to be plotted against a timeline; doing so may reveal different modes of the frequency distribution. Another reason ensemble testing is ideal is that the data is based on human behavior, which is notoriously skewed. Because of the size of the dataset, pandas and NumPy will be used. Python is being selected over SAS because Python has better visualizations (Pandey, 2022).

 

Project Outcomes: In order to find statistically significant differences, the proposed end state is an ensemble predictive statistical model that can forecast the amount of money patent infringement cases will cost in future years. Additionally, a visualization of the forecasted residuals will be graphed. A cleaned dataset with all columns and rows correctly labeled will also be generated for replication. The goal of the analysis is to make policymakers and voters aware of the situation. Lastly, a copy of the Jupyter Notebook with the Python code will be available, along with a video presentation aided by PowerPoint. According to the same study, an ensemble method was instrumental in supporting the alternative hypothesis against other categorical variables (Cambridge University, 2019).

1. How long does the average patent case take to be settled?

The average patent infringement case takes 423 days to close.
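A figure like this can be computed by subtracting the filing date from the closing date and averaging. The sketch below uses three invented cases; ‘date_filed’ is named in this article, while ‘date_closed’ is an assumed column name for the closing date.

```python
import pandas as pd

# Invented cases; 'date_closed' is an assumed column name.
cases = pd.DataFrame({
    "date_filed": ["2010-01-01", "2011-06-01", "2012-03-15"],
    "date_closed": ["2011-01-01", "2012-06-01", "2013-09-15"],
})

filed = pd.to_datetime(cases["date_filed"])
closed = pd.to_datetime(cases["date_closed"])

# Duration of each case in whole days, then the mean across cases.
cases["days_open"] = (closed - filed).dt.days
mean_days = cases["days_open"].mean()
print(f"Average days to close: {mean_days:.0f}")
```

Cases that are still open have no closing date, so they would need to be excluded (or handled separately) before averaging.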

2. What are the top 20 Courthouses that entertain patent law?

Figure: Most common courts for patent case law in the United States, 2024

3. What are the top 20 disputing parties by frequency count?

4. What is the frequency of patent cases year over year?

Figure: Yearly frequency of patent infringement cases

According to the above graph, there was a bullwhip increase in patent infringement cases, peaking in 2013 and then quickly decreasing after 2016. Overall, however, the trend line rises sharply from left to right, with a right-skewed probability distribution. The distribution appears to be non-parametric in nature.

5. Who are the top 20 patent trolls?

6. Who are the top 20 respondents of patent trolls?

After tallying the top 100 respondents, their combined total equaled about 5% of cases; therefore, big tech companies being sued account for only about 5%. This implies that the other 95% of patent lawsuits are happening to small businesses.
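The share held by the most frequent respondents is a simple tally. Here is a minimal sketch with invented names and counts, chosen so the top group works out to 5%; the helper top_n_share is a hypothetical name introduced for this example.

```python
from collections import Counter


def top_n_share(names, n):
    """Return the fraction of all cases that involve the n most frequent names."""
    if not names:
        raise ValueError("names must not be empty")
    counts = Counter(names)
    top_total = sum(count for _, count in counts.most_common(n))
    return top_total / len(names)


# Invented data: two frequent respondents and 95 one-off respondents.
respondents = (
    ["BigTech A"] * 3
    + ["BigTech B"] * 2
    + [f"Small Co {i}" for i in range(95)]
)

share = top_n_share(respondents, 2)
print(f"Top 2 respondents: {share:.0%} of cases")
```

Note that this only measures how concentrated the lawsuits are. Whether the remaining respondents are small businesses would need company size data that the tally itself does not contain.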

7. Who are the top 20 judges that preside over most patent-law cases?

8. How often are juries requested and by whom?

The above graph indicates that a jury is requested in approximately one third of patent cases and that, of those requests, the majority are submitted to the court by the petitioner.
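The two proportions in that statement (how often a jury is requested, and by whom) can be computed along these lines. The values are invented and the category labels are assumed to match the ‘jury_demanded’ classifications named earlier.

```python
import pandas as pd

# Invented jury-demand labels, one per case.
jury = pd.Series([
    "None", "None", "None", "None", "None", "None",
    "Petitioner", "Petitioner", "Respondent",
])

# Share of all cases in which any party requested a jury.
requested = jury[jury != "None"]
request_rate = len(requested) / len(jury)

# Among the requests, the share made by each party.
by_party = requested.value_counts(normalize=True)

print(f"Jury requested in {request_rate:.0%} of cases")
print(by_party)
```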

9. Can a machine learning model be developed to forecast the cost of patent cases in the next 10 years?

From these predicted counts of patent infringement cases, a base dollar value per case can later be calculated and multiplied by the forecasted yearly count.
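To make the forecasting step concrete, here is a hedged sketch of a voting ensemble built with scikit-learn's VotingRegressor. Several things are assumptions made for illustration: the yearly counts are synthetic, GradientBoostingRegressor stands in for XGBoost so the example needs only scikit-learn (xgboost's XGBRegressor could be swapped in), and COST_PER_CASE_USD is a hypothetical placeholder, not a figure from this study.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.linear_model import LinearRegression

# Synthetic yearly case counts for 1980-2020 (not the real USPTO numbers).
years = np.arange(1980, 2021)
rng = np.random.default_rng(42)
counts = 50 + 120 * (years - 1980) + rng.normal(0, 200, size=years.size)
counts = np.clip(counts, 0, None)

X = years.reshape(-1, 1)

# VotingRegressor averages the predictions of its member models.
ensemble = VotingRegressor([
    ("rf", RandomForestRegressor(n_estimators=200, random_state=0)),
    ("gb", GradientBoostingRegressor(random_state=0)),  # stand-in for XGBoost
    ("lr", LinearRegression()),
])
ensemble.fit(X, counts)

# Forecast the next 10 years of case counts.
future_years = np.arange(2021, 2031).reshape(-1, 1)
predicted_cases = ensemble.predict(future_years)

# Hypothetical per-case dollar value, used only to show the multiplication.
COST_PER_CASE_USD = 1_000_000
predicted_cost = predicted_cases * COST_PER_CASE_USD

for year, cases_n in zip(future_years.ravel(), predicted_cases):
    print(year, round(cases_n))
```

One design caveat: tree-based models cannot extrapolate beyond the range of years they were trained on, so in a forecast like this the upward movement past 2020 comes only from the linear regression member of the ensemble.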

10. Where do the patent trolls get all these patents?

 

They often buy up patents cheaply from companies down on their luck who are looking to monetize what resources they have left, such as patents. Unfortunately, the Patent Office has a habit of issuing patents for ideas that are neither new nor revolutionary, and these patents can be very broad, covering everyday or commonsense types of computing – things that should never have been patented in the first place (EFF, 2017). According to the Harvard Business Review: The impact on American innovation is devastating. Each year, patent trolls create $29 billion in direct, out-of-pocket costs from the companies they go after. The companies that settle with patent trolls, or lose to them in court, wind up reducing investments in research and development by an average of more than $160 million over the next two years. Massive amounts of money are being drained from the hardworking people who are driving our economy forward to instead line the pockets of wealthy investors who are offering no goods or services of their own (2017).

According to ScienceDirect, patent trolls pose as venture capitalists (VCs), business incubators, or “seeders” (2018). From anecdotal experience, the troll-feeder’s business model is to target young, tech-savvy entrepreneurs with no business degree. Commonly, the troll-feeders host “pitch competitions” while selling start-up founders the lie:

“How are you ever going to secure funding without a patent? You need a patent.”

The start-up founders produce a patent in exchange for a patent-backed loan (IP loan). Then the troll-feeders sell the patent to the trolls. Some lenders even tell the start-up founders, “Sorry, the funding didn’t come through.” The VCs may even be the owners of the building, commonly with a coffee shop and some separate office rooms for rent. “Tell you what, why don’t you just use this basement room (janitor’s closet)? You could have it as your business headquarters. I’ll even put your name on the door. Staying here, you can volunteer with our start-ups here in this building for ‘experience’ and ‘collaboration’.” Some troll-feeders are also willing to exchange one year’s worth of office-space rent (in the coffee shop) for a patent. Small start-ups are using patents as a form of intangible currency (only because it holds value to the troll). A patent is only as good as the attorney representing it. Most situations end with the troll-feeder making money and the troll holding a patent. It appears that IP-backed loans are generally “unregulated”.

(FYI: don’t ever waste your life securing patents. Learn how to win a government IT contract with a 75% chance in this article here.)

 

In final analysis

The study answers the following questions:

1. How long does the average patent case take to be settled?

2. What are the top 20 Courthouses that entertain patent law?

3. What are the top 20 most frequent settlement disputes?

4. What is the frequency of patent cases year over year?

5. Who are the top 20 patent trolls?

6. Who are the top 20 respondents of patent trolls?

7. Who are the top 20 judges that preside over most patent-law cases?

8. How often are juries requested and by whom?

9. Can a machine learning model be developed to forecast the cost of patent cases in the next 10 years?

10. Where do trolls get patents?

 

We reject the null hypothesis in favor of the alternative hypothesis: the probability distribution of patent infringement cases is not random. We can forecast the yearly economic costs of patent trolls. Additionally, 95% of patent infringement cases are against smaller companies. The data will be compiled and monitored as new data is generated at USPTO.gov. With this information, voters and policymakers can be informed about the current state of U.S. patent law and its effects on the economy. As for the startup founders and inventors with ideas: don’t feed the trolls; it’s unethical.

Works Cited

1.11. Ensembles: Gradient boosting, random forests, bagging, voting, stacking. scikit-learn. (n.d.). https://scikit-learn.org/stable/modules/ensemble.html

Patent trolls and startup employment. (2019, January 10). Journal of Financial Economics. https://www.sciencedirect.com/science/article/abs/pii/S0304405X19300030

IBM Products. (2020, November 9). https://www.ibm.com/cloud/blog/python-vs-r


Kaufman, A. R., Kraft, P., & Sen, M. (2019, February 19). Improving Supreme Court forecasting using boosted decision trees: Political analysis. Cambridge Core. https://www.cambridge.org/core/journals/political-analysis/article/improving-supreme-court-forecasting-using-boosted-decision-trees/166AA006B8DA7C87F1B17291B0BB8B63


Mullin, J. (n.d.). Patent trolls. Electronic Frontier Foundation. https://www.eff.org/issues/resources-patent-troll-victims#:~:text=Instead%2C%20trolls%20are%20in%20the,have%20left%2C%20such%20as%20patents.


Pandey, Y. (2022, May 25). SAS vs python. LinkedIn. https://www.linkedin.com/pulse/sas-vs-python-yuvaraj-pandey/

What is ensemble learning?. IBM. (2024, February 9). https://www.ibm.com/topics/ensemble-learning




Michael Segaline

A Data Scientist and Search Engine Optimization Expert.

https://www.bloomingbiz.marketing