The authors have declared that no competing interests exist.

Sports such as diving, gymnastics, and ice skating rely on expert judges to score performances accurately. Human error and bias can affect the scores, sometimes leading to controversy, especially at high levels of competition. Instant replay or recorded video can be used to assess, and sometimes update, judges’ scores during a competition. For diving in particular, judges are trained to look for certain characteristics of a dive, such as angle of entry, height of splash, and distance of the dive from the end of the board, and to score each dive on a scale of 0 to 10, where 0 is a failed dive and 10 is a perfect dive. In an effort to obtain objective comparisons for judges’ scores, a diving meet was filmed and the video footage used to measure certain characteristics of each dive for each participant. The variables measured from the video were the height of the dive at its apex, the angle of entry into the water, and the distance of the dive from the end of the board. These measurements were then used as explanatory variables in a regression model with the judges’ scores as the response. The measurements from the video provide a gold standard that is specific to the athletic performances at the meet being judged, and supplement judges’ scores with complementary quantitative and visual information. In this article we show, via a series of regression analyses, that certain aspects of an athlete’s performance measured from video after a meet provide information similar to the judges’ scores. The model fit the data well enough to warrant using characteristics from video footage to supplement judges’ scores in future meets. In addition, we calibrated the results from the model against those of meets where the same divers competed and found that the measurement data ranks divers in approximately the same order as in other meets, demonstrating meet-to-meet consistency in both the measured data and the judges’ scores.
Eventually, our findings could lead to use of video footage to supplement judges’ scores in real time.

There are many competitive sports where expert judges score performance of participants, such as gymnastics [

The main problem with determining the best judges, or conversely determining bias in judging, is a lack of a gold standard, other than the expertise of the judges themselves. By a gold standard, we mean a measure of the overall ability of an athlete to inform the fairness of scores for the current competition. Heiniger and Mercier [

Heiniger and Mercier [

Sports fans are familiar with the use of instant replay as a device for reviewing plays and calls in order to ensure the fairness of the game. Instant replay was controversial at first, but has become mainstream [

Although video is prevalent, and some would say ubiquitous in sport, the focus of previous work has been on the analysis of video after a match or a meet [

We seek to provide a gold standard for diving meets that is specific to the meet being judged and supplemental to the judges’ scores, as mentioned in [

In January 2019, footage was collected from a regional high school diving meet using a Canon T-3i filming at 60 frames per second. Twenty-six divers, 14 female and 12 male, competed in eleven rounds of diving on a one-meter springboard. As is typical in club and high school diving meets, divers received scores ranging from 0 to 10 from five independent judges [
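To make the scoring procedure concrete, a common five-judge rule (the one typically used at club and high school meets) drops the highest and lowest awards, sums the middle three, and multiplies by the dive’s degree of difficulty. A minimal sketch, with a function name of our own choosing:

```python
def dive_score(judge_awards, degree_of_difficulty):
    """Common five-judge rule: drop the highest and lowest
    awards, sum the middle three, and multiply the sum by
    the dive's degree of difficulty."""
    if len(judge_awards) != 5:
        raise ValueError("expected awards from five judges")
    middle_three = sorted(judge_awards)[1:4]
    return sum(middle_three) * degree_of_difficulty
```

For example, awards of 5.5, 6.0, 6.0, 6.5, and 7.0 on a dive with degree of difficulty 2.0 give (6.0 + 6.0 + 6.5) × 2.0 = 37.0.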

Using recorded video, selected characteristics of each dive were measured. There are many important features of a dive that can contribute to the score such as height in the air, distance from the diving board, speed of rotation, compactness while in the air, angle of entry, and amount of splash. Due to time constraints and software limitations, only a few measurements from each dive could be collected. This subset included:

Maximum height in the air: maximum vertical distance from the diver’s center of gravity to the surface of the water.

Distance from the diving board: horizontal distance from the edge of the diving board to where the diver’s body enters the water.

First angle of entry: the number of degrees past vertical of the diver’s body in the first three frames once the diver has broken the water’s surface.

Second angle of entry: the number of degrees past vertical of the diver’s body once the diver’s waist has entered the water.

These measurements were collected manually using the film editing software Final Cut Pro X [

Still photo illustrating how distance of the diver from the board was measured from the recorded footage.

Still photo illustrating how height of the dive was measured from the recorded footage.

Angle of entry was measured in a similar fashion. It was necessary to include two angle measurements because many divers would enter the water bent at the waist. Having the legs at an angle to the torso causes more splash, which negatively impacts judges’ scores. Using one angle measurement was not enough to accurately describe the diver’s body position. Angle measurements are given as the positive number of degrees past vertical; therefore, a small angle measurement means the diver entered the water nearly vertically. An illustration of the angle measurement is given in

Still photo illustrating how angle of diver entry to the water was measured from the recorded footage.

In addition to the variables measured from the footage, there were four other explanatory variables recorded. These were gender of diver, degree of difficulty for the dive, position of the dive (tuck, pike, straight, free), and the round number (one through eleven). Scores for each dive were obtained by taking still photos of the scoreboard after each dive. This data was then manually entered into an Excel spreadsheet along with the characteristics measured from the footage.

All athletes participating in the meet are members of the Amateur Athletic Union (AAU), and written permission for filming was granted from the AAU [

Missing data is known to have a deleterious effect on analysis outcomes. Results based on complete case analysis, the subset of the data that contains no missing values, are inefficient and biased [

We were unable to measure elements of seven of the 286 total dives because footage of these dives was not obtained. This was due to the camera overheating, a camera battery needing to be replaced, an SD card being full, or adjustment of the camera angle at the wrong time. None of these causes of missingness is related to the data values; therefore, the missing observations can be considered missing completely at random (MCAR). The missing data for these dives was imputed using the package
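The idea behind imputation under MCAR can be illustrated with a deliberately simple mean-imputation sketch. This is only a stand-in to convey the idea: a proper multiple-imputation package, as used in the study, would also propagate the uncertainty due to imputation.

```python
import statistics

def mean_impute(column):
    """Fill missing values (None) with the mean of the observed
    values. Under MCAR the observed cases are a random subsample,
    so the observed mean is an unbiased estimate of the true mean;
    single imputation, however, understates the variance."""
    observed = [x for x in column if x is not None]
    fill = statistics.mean(observed)
    return [fill if x is None else x for x in column]
```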

There is undeniably measurement error in the data. Placing a pixel in the correct spot was sometimes difficult. Final Cut Pro X has a “snapping” function that will occasionally pull the pixel to a certain (

The goal of this project is to reduce the subjectivity and bias that arise from human judges’ preferences. However, even with footage, a human must still decide on the video frame in which to take measurements such as angle of entry and maximum height. In other words, judgment calls when collecting measurements from the footage were unavoidable. Angle of entry, for example, is not fixed; it changes frame by frame for some divers. Choosing when to measure angle of entry was based partly on convenience: the choice was limited to no more than three frames, but it was still subjective. The same was true of measuring height in the air, since it is not always clear when a diver has reached the apex of his or her dive. Many of these problems could be remedied by using computer automation to measure the height and angle over multiple frames and averaging over these measurements.

Information from 285 out of the 286 dives was used. One dive was removed from the data set because it was a failed dive and had a score of 0. Including failed dives greatly influences the coefficients and causes problems with the normality of the error terms. Furthermore, failed dives are decided by a consensus of all five judges [

Note that the set of judges’ scores for separate dives is denoted s_{ij}, where

A preliminary model that included all eight explanatory variables was fit to the data. It was determined that round number as a categorical variable with 10 levels was not a good predictor as p-values for each level ranged from 0.226 to 0.973. The first angle measurement had a p-value of 0.855 and was also deemed a poor predictor of the average score. A model that included the remaining six explanatory variables was then fit to the data. Looking at dive positions, the four categories can be collapsed into two categories: Pike and Other. This makes sense because the pike position is generally considered the most difficult, and often has a higher degree of difficulty associated with it than a tuck position for the same dive.

Box plots showing the distribution of mean scores for all four possible dive positions: free, pike, straight, and tuck.

Box plots showing the distribution of mean scores once the positions free, straight, and tuck have been collapsed into the category of other.

Once the diving positions had been collapsed, careful attention was paid to the order in which the variables were added to the model. Gender was added first, since there are effectively two different competitions, one for females and one for males. The difference between male and female mean scores is shown in the box plot in

Box plot comparing male and female average scores.

The other variables were included in an order that prioritized the ones that took the most time to measure, and each variable’s contribution was assessed by its effect on the adjusted R^{2} value. The coefficients and p-values associated with each variable are summarized in

Variable Name | Description | Order |
---|---|---|
Gender | Male or Female Division | 1 |
Second Angle | Number of degrees the diver’s body is from vertical after their waist has entered the water | 2 |
Distance | The distance the diver is from the diving board when they break the surface of the water, measured in “board marks” | 3 |
Height | The diver’s height in the air at the apex, measured in “board marks” | 4 |
Position | Pike or other | 5 |
Degree of Difficulty | Degree of difficulty determined by dive | 6 |
First Angle | Number of degrees the diver’s body is from vertical in the first three frames after entering the water | 7 |
Round Number | Categorical indicator variable for each round of competition | 8 |

Variable | p-value | Coefficient |
---|---|---|
Intercept | <0.001 | 2.54 |
Male | 0.007 | 1.79 |
Second Angle | <0.001 | -0.02 |
Distance | <0.001 | -0.08 |
Height | <0.001 | 0.22 |
Pike | <0.001 | 0.46 |
Degree of Difficulty | 0.051 | 0.32 |
Male*Height | <0.001 | -0.18 |
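Reading the fitted coefficients off the table, the predicted mean score for a dive can be written out directly. The sketch below hard-codes the rounded coefficients from the table; the example inputs are hypothetical and use the “board mark” units described earlier.

```python
def predicted_mean_score(male, second_angle, distance, height, pike, dd):
    """Predicted mean judges' score using the rounded coefficients
    from the fitted model. male and pike are 0/1 indicators;
    distance and height are in board marks; second_angle is degrees
    past vertical; dd is the dive's degree of difficulty."""
    return (2.54 + 1.79 * male - 0.02 * second_angle
            - 0.08 * distance + 0.22 * height
            + 0.46 * pike + 0.32 * dd
            - 0.18 * male * height)
```

For example, a hypothetical female diver performing a pike dive of difficulty 2.0 with a second angle of 10 degrees, a distance of 2 marks, and a height of 5 marks would have a predicted mean score of 2.54 - 0.20 - 0.16 + 1.10 + 0.46 + 0.64 = 4.38.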

The coefficients make sense in terms of how the variables should affect the score. For example, as distance from the board increases, the judges’ scores decrease. This is expected, because a closer distance to the board implies that the diver has more control over the dive, which indicates a better dive. Also, as the height of the dive increases, the score increases. Males start out with a higher base score, but as the height of their dives increases by one mark length, their average score increases by only 0.04 points, whereas a female diver’s score would increase by 0.22 points. The only coefficient that defies intuition is the one for degree of difficulty, which is positive in the full model. If degree of difficulty is the only regressor in the model, the coefficient is negative, because more difficult dives are harder to execute well. This can be seen in

A scatter plot of Average Score vs Degree of Difficulty. Average score is on the vertical axis and degree of difficulty for each dive is on the horizontal axis. The least squares regression line has been added.

Overall, the model fit the data well. The adjusted R^{2} value is approximately 0.46, and there is no evidence of lack of fit. If we fit a model that includes only the characteristics measured from the video and gender, the adjusted R^{2} value decreases to 0.42. This is neither a statistically significant nor a practically important decrease, indicating that the video-based variables predict the judges’ scores well.
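For reference, the adjusted R^{2} penalizes the ordinary R^{2} for the number of predictors, so models with different numbers of regressors can be compared fairly; the standard formula is easy to state:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
    where n is the number of observations (285 dives here)
    and p is the number of regressors in the model."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)
```

With n = 285 and only a handful of regressors the adjustment is small, so the ordinary and adjusted values are close.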

For the regional diving competition, the divers in the top three spots are eligible to move on to the state diving championship. Therefore, it is important that the top three places be correct in the sense that these divers are the ones with the greatest ability, and thus the best representatives for a given region at the state meet. Measuring the judges’ ability to obtain the correct top three rankings requires an outside measure of the divers’ ability for comparison. Fortunately, many high school divers also compete in club diving, and their results over the history of their competition are available on DiveMeets.com [

The DiveMeets scores are from the 2018-19 season so that they are comparable with the 2019 high school regional competition. Our main assumption is that the ability level of a diver does not change substantially over a season. Ten of the 13 female divers had participated in club diving in the three years prior to the regional competition, while only 2 of the 12 boys had records on DiveMeets.com. This difference in the percentage of club divers among girls and boys might indicate a higher level of competition for the girls.

Place in Regional | Score at Regional | Best Score on DiveMeets |
---|---|---|
First | 400.8 | 362.95 |
Second | 397.55 | 328.2 |
Third | 378.05 | 253.4 |

Of the two boys who participated in club diving, both finished in the top three. For the girls, of the 10 who participated in club diving, 6 finished in the top 8. Two divers who finished in the top 8 did not have a record of club diving in DiveMeets. Even so, the club diving scores are good independent measures of diver ability for judging the accuracy of the diving judges at the regional meet.

We compared the ranking of competitors from the regional diving meet to the rankings produced by the regression model and the rankings produced from the raw measurements taken from the video. To get the rankings from the regression model, we mimicked the way meet scores are calculated from subjective judges’ scores [

Obtaining ranks based on the raw measurements required some adjustment. First, we standardized each measurement by subtracting the mean and dividing by the standard deviation. Standardization puts all measurements on the same scale so that they are comparable [
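A minimal sketch of this standardization, plus one way a composite raw score could be formed from it. Negating the “lower is better” measurements (angle and distance) before summing is our assumption, not a detail stated in the text:

```python
import statistics

def standardize(values):
    """Z-score a measurement column: subtract the mean and
    divide by the sample standard deviation."""
    m = statistics.mean(values)
    s = statistics.stdev(values)
    return [(v - m) / s for v in values]

def raw_composite(columns, signs):
    """Sum standardized measurement columns into one raw score
    per dive. signs holds +1 for 'higher is better' columns
    (e.g. height) and -1 for 'lower is better' columns (angle,
    distance) -- the sign convention is our assumption."""
    z = [standardize(col) for col in columns]
    n_dives = len(columns[0])
    return [sum(sg * col[i] for sg, col in zip(signs, z))
            for i in range(n_dives)]
```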

The rankings determined by the regression model and by the raw measurements are compared to the divers’ actual rankings in Tables

Diver | Meet Total | Raw Total | Regression Total | Ranks | Raw Ranks | Regression Ranks |
---|---|---|---|---|---|---|
3 | 400.80 | 226.52 | 372.34 | 1 | 2 | 2 |
8 | 399.25 | 235.65 | 376.22 | 2 | 1 | 1 |
4 | 378.05 | 195.43 | 328.34 | 3 | 4 | 3 |
14 | 355.20 | 196.14 | 320.04 | 4 | 3 | 5 |
11 | 329.80 | 169.10 | 324.52 | 5 | 8 | 4 |
13 | 289.85 | 187.99 | 290.75 | 6 | 5 | 7 |
7 | 288.03 | 175.94 | 302.59 | 7 | 7 | 6 |
2 | 282.30 | 167.55 | 274.56 | 8 | 10 | 10 |
5 | 260.75 | 157.65 | 271.85 | 9 | 11 | 11 |
10 | 257.05 | 178.76 | 282.33 | 10 | 6 | 8 |
9 | 252.10 | 143.28 | 265.90 | 11 | 12 | 12 |
6 | 250.30 | 141.76 | 274.81 | 12 | 13 | 9 |
12 | 242.05 | 168.84 | 263.99 | 13 | 9 | 13 |
1 | 215.50 | 128.38 | 253.54 | 14 | 14 | 14 |

Diver | Meet Total | Raw Total | Regression Total | Ranks | Raw Ranks | Regression Ranks |
---|---|---|---|---|---|---|
18 | 317.75 | 201.55 | 285.77 | 1 | 1 | 1 |
16 | 282.95 | 160.48 | 261.66 | 2 | 10 | 3 |
24 | 281.60 | 188.24 | 273.34 | 3 | 3 | 2 |
22 | 271.80 | 173.44 | 258.02 | 4 | 7 | 7 |
21 | 261.05 | 185.76 | 260.39 | 5 | 5 | 5 |
23 | 258.60 | 181.25 | 259.52 | 6 | 6 | 6 |
19 | 250.55 | 165.79 | 261.64 | 7 | 8 | 4 |
25 | 239.10 | 201.32 | 249.77 | 8 | 2 | 8 |
20 | 233.15 | 188.10 | 246.76 | 9 | 4 | 9 |
26 | 232.95 | 161.22 | 241.50 | 10 | 9 | 11 |
17 | 224.50 | 141.07 | 245.78 | 11 | 12 | 10 |
15 | 202.50 | 141.24 | 210.43 | 12 | 11 | 12 |
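One way to quantify the agreement visible in these tables is Spearman’s rank correlation. Using the boys’ meet ranks and regression ranks from the table above (there are no ties, so the closed-form formula applies):

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman rank correlation for two tie-free rankings:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d2 / (n * (n * n - 1))

meet_ranks = list(range(1, 13))  # boys finished 1 through 12
regression_ranks = [1, 3, 2, 7, 5, 6, 4, 8, 9, 11, 10, 12]  # from the table
rho = spearman_rho(meet_ranks, regression_ranks)  # about 0.92
```

The high correlation mirrors the qualitative impression from the tables: the model ranks divers in roughly the order the judges did.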

Several popular sports employ panels of human judges to assign scores to competitors. It is impossible to eliminate all subjectivity when human judges are employed, as evidenced by several scandals in which judging bias was alleged at high-level competitions [

Diving is similar to ice skating in that there are certain technical aspects that judges expect in an exemplary performance. These include a vertical entry, a lack of splash on entry, and a minimal distance from the diving board on entry, among others [

In order to build a better model that can be applied to any diving meet, we need to make use of current machine learning techniques. For example, angle of entry changes over the course of a dive. To get an accurate static measure of an instantaneous variable, we would want a computer to measure the angle of entry on every frame by recognizing the difference in color between pixels. We could then average over all these angle measurements to get a single mean angle of entry measurement. We would want to use a similar process with height in the air and distance from the diving board. This way we could get an objective measurement of maximal height and an objective measurement of distance. We believe that with these changes, it would be possible to build a real-time scoring device.
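The per-frame angle computation being proposed is straightforward once two body points are tracked in each frame. The sketch below assumes hypothetical pose-tracking output giving shoulder and waist pixel coordinates (with y increasing downward, as in image data); the averaging step replaces the single subjective frame choice described earlier.

```python
import math

def angle_past_vertical(shoulder, waist):
    """Angle, in degrees, between the shoulder-to-waist line and
    the vertical, from one video frame. Points are (x, y) pixel
    coordinates with y increasing downward, as in image data."""
    dx = waist[0] - shoulder[0]
    dy = waist[1] - shoulder[1]
    return abs(math.degrees(math.atan2(dx, dy)))

def mean_entry_angle(frames):
    """Average the per-frame angles over the frames spanning entry,
    as proposed above, to get a single objective measurement."""
    angles = [angle_past_vertical(s, w) for s, w in frames]
    return sum(angles) / len(angles)
```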

We filmed three different meets: a high school regional diving meet, a high school district meet, and an intercollegiate meet, and used the footage to measure characteristics of each dive after the meet. The regional meet is a high-stakes meet for high school divers, as the top three finishers earn the right to go to the state meet, while the other two meets have relatively low stakes. A linear regression model was used to model the judges’ scores as a function of the dive difficulty, the distance from the diving board on entry, the height of the diver at the apex of the dive, and the angle of the diver’s body at entry, in order to determine whether these features of the dive were related to the judges’ scores. For the low-stakes meets (data not shown), the dive features measured on video are not good predictors of the judges’ scores. This is likely because the judges’ scores themselves are less accurate: low-stakes meets often employ readily available staff who are untrained as diving judges. Indeed, personal observation revealed judges for the low-stakes meets chatting with one another or coaching a participant while other divers were competing. This is further evidence that supplementing the judges’ scores with data-dependent, video-based scores is worthwhile. In a high-stakes meet, however, the model fit the data well enough to show that if the process of filming and measuring characteristics of a dive could be automated, a computer could produce an objective score for each dive to supplement judges’ scores. This marriage of computer-based and subjective scores would combine the precision of the diver, which a computer can measure, with the artistic quality of a dive, which only a human can assess.