Data Science | Artificial Intelligence & Machine Learning | Sports Science
Analyzing Hamstring Injury & Performance within a Professional Baseball Organization
Professional sports organizations use data in numerous ways, across all departments and levels. Teams have added data scientists and sports scientists to answer questions and provide key insights to the various departments that make up the front and back offices. Data has become a requisite tool for teams trying to gain an edge over their competition.
​
Major League Baseball (MLB) gained a lot of attention as one of the first sports to apply statistics and data science, thanks to the notoriety of the "Moneyball" story. MLB teams have departments such as 'Baseball Research & Development' that collect and use data to build software products and machine learning models, in an effort to put the best players on the field and to capture the best young players in the MLB Draft.
​
Sports science is another area that has seen tremendous growth across all sports in the past decade. The sports science department is intertwined with the medical staff, performance staff, coaching staff, and the players themselves. The aim of sports science is to use data and technology to aid player development and to ensure that players stay healthy and perform optimally. This is done through physical and physiological testing in collaboration with the medical and strength & conditioning staffs. One example use case where sports science plays a role is injury management and prevention.

The Problem
Hamstring strain injury (HSI) is a commonly occurring injury within team sports. Mitigating HSI is a primary concern among the human performance teams that support athletes and organizations at all levels of athletics. Within professional sports, HSI can be costly due to the financial impact of having players on the injured list as well as missing training and developmental opportunities. Additionally, having players out on the injured list due to HSI may cause a lack of depth at certain positions and place an increased stress load on the players that are available.
A primary role of the human performance team is to perform ongoing monitoring of the athletes and use the data collected to make the best decisions for the individual athlete’s training, recovery, and/or rehabilitation programs. This is often a collaborative effort, with the sports science team being responsible for the data collection and analysis which is then reported to the other disciplines within the organization. Looking for trends in the data and identifying variables that correlate to an increased injury risk allow for interventions to be implemented before a problem occurs.

The Project
The instructions for this project were simply to "take the numbers given, and provide insight"; the brief was left very open-ended by design.
This project utilizes a de-identified dataset consisting of data on 65 players from different levels within the ranks of an MLB organization. The dataset contains demographic information such as age, height, and weight, along with performance testing data on trials of the NordBord (nb) and the Long Lever, Single-Leg Isometric Bridge test (ib). Our goal is to use this data to develop insights that can help support the athletes, while also providing helpful information to the medical and performance staffs. The NordBord, by Vald Performance, is a system that "combines advanced sensors, real-time data visualization and cloud analytics" and is used to quantify and monitor an athlete's hamstring strength and imbalance. The Single-Leg Isometric Bridge test is used to assess the capacity of the hamstrings, with particular focus on between-limb asymmetries.


Data Processing and Analysis
The DeepNote notebook for this project can be found HERE.
Before trying to gain insights from the dataset, I first needed to clean and process it. Using Python, I examined the dataset to determine how many rows and columns it contained and to identify any missing values, null values, or obvious errors in the data.
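A first-pass inspection along these lines can be done in a few lines of pandas. This is only a sketch; the column names and values below are hypothetical stand-ins for the real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical slice of the player dataset
df = pd.DataFrame({
    "player_id": [1, 2, 3],
    "age": [21, 24, np.nan],
    "nb_left_1": [310.0, np.nan, 295.0],
})

n_rows, n_cols = df.shape          # dataset dimensions
missing_per_col = df.isna().sum()  # null counts by column
```

From here, `df.describe()` is a quick way to spot obvious errors such as impossible values in a numeric column.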
The dataset consisted of 65 rows and 24 columns, and several areas showed missing or incorrect data. There are a few considerations when deciding how to handle missing data. One option is to simply remove the entire row, which can be reasonable when the amount of missing data is relatively small and the sample size is large. A second option is to replace the missing data with a statistic based on the other values in that column; the most common methods are imputing the mean for numerical data or the mode for categorical data. K-Nearest Neighbors (KNN) imputation is also used, where the KNN algorithm fills in missing values based on the nearest neighbors in the dataset. Another option is to create a new column and variable to represent the missing data, which is particularly useful if the absence of the data carries meaning. For example, in our dataset a player may have missing data because they were held out of testing due to an injury; more context would help in making this determination. Depending on the analysis being performed, some algorithms, such as XGBoost and Random Forests, can handle missing values without requiring imputation.
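The options above can be sketched with pandas and scikit-learn. This is a minimal illustration on a toy frame; the column names and values are hypothetical, not the real player data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy frame standing in for the player dataset (hypothetical values)
df = pd.DataFrame({
    "age": [21.0, 24.0, 22.0, 25.0, 23.0],
    "nb_left_1": [310.0, np.nan, 295.0, 340.0, 328.0],
    "position": ["IF", "OF", None, "IF", "C"],
})

# Option 1: drop any row with a missing value
dropped = df.dropna()

# Option 2: mean imputation for numeric data, mode for categorical
mean_filled = df.copy()
mean_filled["nb_left_1"] = mean_filled["nb_left_1"].fillna(df["nb_left_1"].mean())
mean_filled["position"] = mean_filled["position"].fillna(df["position"].mode()[0])

# Option 3: KNN imputation over the numeric columns
knn_filled = KNNImputer(n_neighbors=2).fit_transform(df[["age", "nb_left_1"]])

# Option 4: flag the absence explicitly, in case missingness carries meaning
df["nb_missing"] = df["nb_left_1"].isna().astype(int)
```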
For this analysis, I dropped the ib_left_3 and ib_right_3 columns, since they were missing 60/65 and 57/65 values, respectively. The row for Player 33 was also dropped, as they were missing all testing data. This may be a scenario where a dummy "Missing" variable would be useful if the player was held out of testing due to an active injury, but without that added context, dropping the player seemed most appropriate. Additionally, the ib_trial_date, nb_date, and ortho_eval_date columns were dropped, as that information did not provide additional value.
Player 3 had testing data for the ib trials but no nb testing data, while Player 16 had nb testing data but no ib data. To retain as many players as possible, I imputed the mean of the nb testing columns for Player 3 and the mean of the ib testing columns for Player 16. Four players had missing values for hamstring_rom_l, hamstring_rom_r, and leg_length_cm, so the mean of each respective column was imputed. Additionally, Player 14 showed a leg_length_cm value of 10 cm, which was assumed to be a typo and was replaced with the column mean.
Hamstring testing data for the NordBord and the Long Lever, Single-Leg Isometric Bridge test both include multiple trials for each leg. For the analysis, I created new columns that take the average of the trials for each leg, resulting in four new columns: avg_ib_left, avg_ib_right, avg_nb_left, and avg_nb_right.
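Creating the per-leg averages is a row-wise mean over the trial columns. A sketch with hypothetical trial values:

```python
import pandas as pd

# Hypothetical trial columns mirroring the dataset's naming convention
df = pd.DataFrame({
    "ib_left_1": [40.0, 35.0], "ib_left_2": [42.0, 37.0],
    "ib_right_1": [41.0, 36.0], "ib_right_2": [43.0, 34.0],
})

# Average the trials for each leg into a single column
df["avg_ib_left"] = df[["ib_left_1", "ib_left_2"]].mean(axis=1)
df["avg_ib_right"] = df[["ib_right_1", "ib_right_2"]].mean(axis=1)
```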
​
I then used these averages to create a percentage-difference column for each test, adding two new columns: percent_diff_ib and percent_diff_nb. One factor in the development of an injury is an asymmetry in range of motion and/or strength between the left and right limbs. I wanted to see whether the percentage difference between right and left ib and nb scores correlated with sustaining an injury during the 2021 season. The plot below shows no correlation between the between-limb percentage difference in ib or nb testing scores and injury occurrence.
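One common way to compute a between-limb percentage difference is relative to the mean of the two limbs. The exact formula used in the notebook may differ, so treat this as an illustrative assumption on hypothetical values:

```python
import pandas as pd

# Hypothetical per-leg averages
df = pd.DataFrame({"avg_nb_left": [300.0, 280.0], "avg_nb_right": [320.0, 270.0]})

# Percentage difference between limbs, relative to the mean of the two legs
pair_mean = (df["avg_nb_left"] + df["avg_nb_right"]) / 2
df["percent_diff_nb"] = (df["avg_nb_left"] - df["avg_nb_right"]).abs() / pair_mean * 100
```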

Data Visualizations and Exploration
In evaluating the data, we see that the age range of players in this dataset is 17 to 27 years old, with a mean age of 23. We do not have any other demographic information regarding ethnicity, city or country of origin, college or high school attended, etc.
​
We also do not have any information regarding medical history or history of previous injuries. This information can be useful in the context of discussing players who sustained an injury during the previous season, as we know that the greatest predictor of injury is the occurrence of a past injury.

The 'highest_level' variable refers to the player's highest level of play within the organization. The players in this dataset are all within the different developmental levels of the organization, and none currently play for the Major League team. The levels in the legend below represent the following:
​
- 'r' = Rookie - often the first assignment for the development of young players
- 'a' = Class A - the lowest level at which teams play a full Minor League season
- 'adva' = Class A Advanced - some players move up to this level before going on to the 'Upper Minors,' which are AA and AAA
- 'aa' = Double-A - a significant achievement that separates players with the potential to reach Major League play. Some players jump from AA to MLB, while others move between the AA, AAA, and MLB ranks based on performance, injury rehabilitation, etc.
- 'aaa' = Triple-A - the highest level of Minor League play, made up of the Pacific Coast League and the International League. Players often move up and down between AAA and Major League play.

One step that can be helpful when dealing with multiple variables is to create a correlation matrix and assess the data for multicollinearity. Multicollinearity is a statistical concept where two or more predictor variables are correlated. This is a problem because independent variables are supposed to be truly independent; if multicollinearity exists, it can cause problems when applying your model and interpreting results, because a change in one variable is accompanied by changes in one or more others.
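A correlation matrix is one line in pandas. The sketch below uses synthetic stand-in columns (not the real player data) and flags any off-diagonal pair with |r| > 0.8 as a multicollinearity candidate; the 0.8 threshold is a common rule of thumb, not a fixed standard:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-ins for a few numeric columns from the dataset
df = pd.DataFrame({
    "height_cm": rng.normal(185, 6, 50),
    "weight_kg": rng.normal(95, 8, 50),
    "avg_nb_left": rng.normal(320, 40, 50),
})

# Pairwise Pearson correlations; large off-diagonal |r| flags multicollinearity
corr = df.corr()
high_pairs = (corr.abs() > 0.8) & (corr.abs() < 1.0)
```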
Looking for Relationships
​
While no obvious relationships exist between variables in the data, we can plot variables against each other to check for correlation. Using the data provided, I first looked for a relationship between height_cm and weight_kg, and between height_cm and leg_length_cm. The plots below show that no correlation exists between either pair of variables.


Applying Algorithms
Since the focus of the dataset, and the project as a whole, is on the occurrence of hamstring strain injuries, the last column, injured_2021, is of particular interest. The data shows that 13 of the 65 players sustained an HSI in 2021. The binary nature of this column, "injury" or "no injury," allows it to serve as our target variable for both Decision Tree and Logistic Regression models.
Decision Tree versus Logistic Regression
​
My last goal for this project was to see whether a model could be built to identify players likely to be injured based on testing scores similar to those in the current dataset. In this instance, we are attempting to predict the likelihood of an event occurring based on historical data. Since the target variable is binary, "injury" (1) versus "no injury" (0), I chose to evaluate both a Decision Tree and a Logistic Regression model.
​
I utilized the scikit-learn (sklearn) library in Python for both the Decision Tree and Logistic Regression models. Scikit-learn is a machine learning library consisting of many statistical, mathematical, and general-purpose algorithms.
The Decision Tree is a supervised learning algorithm that can be used for both classification and regression problems. It uses a hierarchical structure analogous to a tree, with a root node, branches, internal nodes, and leaves. The full initial dataset is the root node, which has no incoming branches and outgoing branches to its initial internal nodes. The internal nodes continue to branch until reaching the leaves, which cannot be split further and represent the possible outcomes.
​
Decision Trees are beneficial in that they require little to no data preparation, can be applied to small datasets, are more interpretable than many other models, and are flexible enough to handle both classification and regression problems. Their drawbacks are a tendency to overfit and a sensitivity to noise, both of which can degrade results. One consideration for this dataset was its class imbalance; to address it, I used scikit-learn's class weight adjustment. Setting class_weight='balanced' automatically adjusts weights based on the class distribution, with the goal of penalizing misclassification of the minority class. The Decision Tree produced an accuracy score of 0.85 with "gini" as the criterion and 0.65 with "entropy."
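A minimal version of this setup is sketched below on a synthetic stand-in for the roughly 13-of-65 class split; the feature data here is generated, not the real player data, so the accuracy will differ from the 0.85 reported above:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy problem standing in for the injury dataset
X, y = make_classification(n_samples=65, n_features=8,
                           weights=[0.8, 0.2], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# class_weight='balanced' reweights classes inversely to their frequency
tree = DecisionTreeClassifier(criterion="gini", class_weight="balanced",
                              random_state=42)
tree.fit(X_train, y_train)
acc = accuracy_score(y_test, tree.predict(X_test))
```

Swapping `criterion="gini"` for `"entropy"` reproduces the comparison described above.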
Logistic Regression
​
I chose Logistic Regression because this is a binary classification problem that aims to predict an outcome where 0 = "no injury" and 1 = "injury." This can be confusing at first glance: even though 'regression' is in the name, logistic regression is used for classification.
Logistic Regression is a statistical method that uses historical data to make predictions about future outcomes. The algorithm measures the relationship between the dependent variable and one or more independent variables. In this context, the dependent variable is what we want to predict (the injured_2021 column) and the independent variables are the features (the columns related to test scores and demographics).
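A comparable logistic regression sketch, again on synthetic stand-in data; the scaling step and class_weight setting are my additions for illustration, not necessarily what the notebook used:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same style of imbalanced toy problem as the tree example
X, y = make_classification(n_samples=65, n_features=8,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Scaling helps logistic regression converge; predict_proba gives a likelihood
model = make_pipeline(StandardScaler(),
                      LogisticRegression(class_weight="balanced"))
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]  # estimated probability of "injury"
acc = model.score(X_test, y_test)
```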
The Logistic Regression model produced an accuracy score of 0.80, slightly less than the 0.85 from the Decision Tree. It is important to note that neither algorithm is inherently better than the other; which one performs better often depends on the data.
Shortfalls and Considerations
​
There are several considerations that could improve the results of either or both algorithms. One potential shortfall is that the dataset is small, consisting of only 65 players, one of whom was removed for missing all testing data. A more robust dataset would increase confidence in the model outcomes. One option, which I do not yet have experience with but am researching, is generating a larger synthetic dataset based on this smaller one.
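Short of full synthetic data generation, a simpler related idea is bootstrap oversampling of the minority (injured) class with scikit-learn's `resample`. This only duplicates existing rows rather than creating genuinely new ones, so it rebalances classes without adding information; the arrays below are random stand-ins:

```python
import numpy as np
from sklearn.utils import resample

rng = np.random.default_rng(1)
X = rng.normal(size=(64, 4))            # stand-in feature matrix (64 players)
y = np.array([1] * 13 + [0] * 51)       # 13 injured, 51 healthy

# Bootstrap-oversample the injured minority class up to the majority size
X_min, X_maj = X[y == 1], X[y == 0]
X_up = resample(X_min, replace=True, n_samples=len(X_maj), random_state=1)
X_bal = np.vstack([X_maj, X_up])
y_bal = np.array([0] * len(X_maj) + [1] * len(X_up))
```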
​
Another consideration is that additional information not currently available could show a direct correlation with injury. For example, it may be beneficial to know whether a player is right- or left-handed, whether the HSI was sustained on the right or left leg, and whether they have had previous HSIs or other injuries during their athletic career. It is also possible that the testing being performed is not sensitive or specific enough for building a prediction model. For example, published research questions the validity of the Long Lever, Single-Leg Isometric Bridge test (see "Single Leg Bridge Test is Not a Valid Clinical Tool to Assess Maximum Hamstring Strength"). While this is only one study, it gives reason to further assess the utility and validity of the testing being performed. Conversely, are there other tests and measures in the literature that, if implemented, would increase the predictive ability of our models?
Thank you for taking the time to review my project! Feel free to reach out with any questions or feedback.
Don't forget to Connect with me on LinkedIn as well!