Data augmented Autoformer for runner injury prediction

1. Introduction

Running is one of the most popular sports in the world, which has drawn growing attention to exercise management, muscle overuse monitoring, and injury prediction [1]. Injury prediction has become a crucial component of modern athlete management, supporting both individual athlete longevity and overall team success. Because the competitive performance of track athletes depends on physical function, injuries incur extra costs for medical treatment and rehabilitation training, and lead to missed competitions and interrupted career progression. Accurate forecasting of injury risk allows healthcare teams to develop customized individual training plans covering exercise load and match schedules [2]. However, runners are susceptible to various injury types with substantial individual variability, and such injuries are often caused by unpredictable incidents, which hinders the generalization and reliability of injury risk prediction models [3].

As the volume and complexity of available runner records continue to expand, artificial intelligence (AI) techniques have overcome the limits of traditional techniques. Data-driven models streamline the recognition of high-dimensional patterns and expose the factors underlying injury risk in massive and complex datasets [4]. Nevertheless, the complex and multi-modal contributors limit the prediction accuracy of existing solutions [5]. The main challenges lie in counter-intuitive influencing factors. Compared with healthy records, injury data are rare and influenced by complex, random and sometimes contradictory factors, including workload, the environment, and the player’s physical status and emotional condition [6], [7]. Injury records form imbalanced datasets with ambiguous correlations, which calls for a comprehensive analysis of multi-modal influencing factors [8]. In addition, injuries occur accidentally and are hard to forecast even for a seasoned sports manager. Training plans based on experience alone often lead to wasted resources or unexpected accidents. Moreover, most existing methods generalize poorly across sports, teams, or even individual players because of variability in playing style and personal status [9]. Therefore, the sports world still needs a comprehensive, general and reliable injury prediction solution for practical and personalized runner management.

With the growing demand for comprehensive and reliable injury prediction solutions, data-driven approaches such as deep learning (DL) have garnered much attention in athletic health management. DL is effective at uncovering general patterns in complex physical condition records and capturing the relationship between contributors and injury risk. Its strong generalization ability offers a promising trade-off between personalized injury diagnosis and general risk prediction [4], [10], [11]. However, several open problems remain in DL-based runner injury prediction. The first is that existing models suffer from biased runner datasets. Real-world runner training settings produce numerous healthy records and rare injury events, which raises concerns about model overfitting to normal records [12]. Although techniques such as resampling and cost-sensitive loss functions have been used to balance the training data, they can introduce new problems, such as unfair training and improper feature recognition, that degrade forecasting accuracy. The second problem lies in the construction of valuable features, which is crucial to effective modeling and explainable prediction. Since runner injury records form a multivariate time series, DL models are designed to determine the causality and correlation between injury risk and observed features. However, identifying the major contributors to injury risk depends heavily on domain knowledge, such as expert experience and body conditions, and the relationships between features and historical injury records are complex [13]. Moreover, feature weights differ between athletes, even for the same injury. These challenges hinder DL models from recognizing general patterns across numerous variables, raising the concern that forecasting performance may degrade on unseen records.

To tackle the imbalanced injury records and complex influencing factors, this paper proposes a variational autoencoder based Autoformer (VAE-Former) model for precise runner injury risk prediction. To deal with the imbalanced injury data, a variational autoencoder (VAE) is designed to construct features and generate synthetic injury records for balanced deep learning. Based on the augmented data, Autoformer [14] introduces an attention mechanism and subseries-wise correlation to infer the intrinsic features of injury risk without domain knowledge. Moreover, the memory mechanisms allow Autoformer not only to consider historical injury records but also to analyze the impact of constructed features for more precise injury risk prediction. Our main contributions are as follows:

• We propose VAE-Former to predict injury risk from multiple features. The proposed model improves runner injury prediction by considering both statistical indicators and constructed features. Our model achieves 79.76% accuracy on a public runner injury dataset, an accuracy increase of up to 19.2% over cases without data augmentation. A comparative study reveals that VAE-Former is more efficient than other similar models in injury forecasting.
• Considering the complex and imbalanced runner injury dataset, we design a VAE module that serves as both a data generator and a feature extractor, learning the underlying patterns and generating synthetic injury data. Instead of raw data, the intermediate features of the trained VAE are used in injury prediction to capture the essential characteristics of the runner’s features. These features represent the sensitivity of injury risk to influencing factors. Simulation results demonstrate that the VAE module can generate reliable injury data and provides balanced and comprehensive inputs for training a stable and general prediction model.
• To fully exploit the feature information and statistical indicators of injury time series, we leverage Autoformer for precise injury prediction. The attention-based series correlation mechanism helps analyze the temporal characteristics and feature interactions of multiple inputs. The improved cross-correlation module accelerates training with multiple variables. The incorporated series decomposition module provides a finer-grained solution for precise and explainable prediction.
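
The series decomposition and series correlation ideas named in the last contribution can be illustrated with a minimal Python sketch. It splits a series into a moving-average trend and a seasonal remainder, then estimates the dominant period with an FFT-based autocorrelation. This is not the VAE-Former implementation: the function names, the window length of 7, the zero-padded autocorrelation and the synthetic weekly training-load series are all assumptions made for the example.

```python
import numpy as np

def series_decomp(x, kernel_size=7):
    """Split a 1-D series into a seasonal remainder and a moving-average trend."""
    x = np.asarray(x, dtype=float)
    front = (kernel_size - 1) // 2
    back = kernel_size - 1 - front
    # Replicate the end values so the trend keeps the input length.
    padded = np.concatenate([np.full(front, x[0]), x, np.full(back, x[-1])])
    kernel = np.ones(kernel_size) / kernel_size
    trend = np.convolve(padded, kernel, mode="valid")
    seasonal = x - trend
    return seasonal, trend

def autocorrelation(x):
    """Normalised autocorrelation for lags 0..n-1, computed with the FFT."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    # Zero-pad to length 2n so the result is linear rather than circular.
    spectrum = np.fft.rfft(x, n=2 * n)
    corr = np.fft.irfft(spectrum * np.conj(spectrum), n=2 * n)[:n]
    return corr / corr[0]

# Hypothetical daily training load: slow upward drift plus a weekly cycle.
days = np.arange(56)
load = 0.05 * days + np.sin(2 * np.pi * days / 7)

seasonal, trend = series_decomp(load, kernel_size=7)
acf = autocorrelation(seasonal)
# Strongest non-zero lag within the first half of the series.
period = int(np.argmax(acf[1:len(load) // 2 + 1])) + 1
print("estimated period (days):", period)
```

Away from the series boundaries the trend recovers the linear drift, because a 7-day average cancels a cycle of period 7. The boundary values are distorted by the replicated padding, which is a known cost of this simple scheme.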

The rest of this paper is organized as follows. Section 2 reviews the literature on runner injury and relevant techniques, highlighting the main challenges and research directions in this field. Section 3 formulates the model and details the data augmentation model and the learning model. Section 4 describes the public runner injury dataset used in this paper, analyzes the features that influence injury risk, and demonstrates the prediction performance of our model in a comparative study. Section 5 concludes the paper and summarizes our achievements.

2. Literature review

2.1. Runner injury analysis

With advances in wearable sensor technology, research on runner injury analysis now has access to rich data. Smart wearable devices provide vast quantities of data through real-time physical status assessment and risk monitoring for athletes [15]. Although these varied runner records lay a solid foundation for forecasting injury risk, the abundance of sources makes it harder to uncover the logical relationship between injury and its numerous contributors. In the literature, the contributors to injury include athletes' body conditions (e.g. age, gender and body mass index) [16], training load [3], [7], sports type [17] and mental health [18]. Although some statistical models have been introduced to verify the significance of given factors, there remains no accepted theory regarding which factors contribute to an increased risk of injury. Some studies confirm the importance of the training plan for a runner’s injury risk through linear analysis [7], but other results indicate that individual sensitivity to training intensity differs significantly with body condition [19]. Jonge et al. report that a positive mental attitude helps prevent injuries, but do not explain how this works [18]. Kluitenberg et al. summarize the injury risk of different types of runners, such as marathon, cross-country and short-distance track runners, but do not evaluate them under the same criteria, which prevents a general conclusion about the influence of running type [17]. Hence, further study is urgently needed to clarify the open questions in runner injury analysis.

2.2. Data augmentation

Since runner injury datasets typically contain imbalanced labels, with numerous normal records and rare injury records, data-driven prediction solutions need a proper data analysis mechanism to ensure that models learn fairly from the imbalanced dataset. Mainstream solutions employ downsampling or resampling methods to enlarge the proportion of injury records in the training set [9], [20], [21]. However, sampling methods inevitably incur significant information loss by discarding many majority-class records. The limited samples lead to substantial performance reduction and instability, especially when training large and complex models. Some studies incorporate cost-sensitive loss functions to alleviate the performance degradation on imbalanced datasets by assigning larger weights to rare records, but the weighted learning process does not increase the diversity of minority samples, which still leads to unfair performance on the minority class. Moreover, designing the weights increases the modeling overhead for complex healthy record inputs.
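
To make the two baseline strategies concrete, the following Python sketch shows random oversampling of the minority class and inverse-frequency class weights for a cost-sensitive loss. It is a generic illustration, not the procedure of any cited study: the function names, the 95/5 class split and the random features are assumptions chosen for the example.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until all classes have equal size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(y == c)
        # Draw the missing rows with replacement from the same class.
        extra = rng.choice(idx, size=target - n, replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], y[keep]

def inverse_frequency_weights(y):
    """Per-class loss weights: n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Hypothetical imbalanced dataset: 95 healthy records (0), 5 injury records (1).
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))
y = np.array([0] * 95 + [1] * 5)

X_bal, y_bal = random_oversample(X, y)
weights = inverse_frequency_weights(y)

n_distinct_injuries = len(np.unique(X_bal[y_bal == 1], axis=0))
print("balanced class counts:", np.bincount(y_bal).tolist())
print("distinct injury rows after oversampling:", n_distinct_injuries)
print("class weights:", weights)
```

After oversampling, the injury class has as many rows as the healthy class but still only five distinct records. This is the limitation described above: neither duplication nor reweighting adds diversity to the minority class.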

Data augmentation has been used to enhance model generalization with limited or improper datasets in healthcare and medical computer vision. The objective of data augmentation is to enrich the characteristics of the minority class and thereby alleviate data skew, and generative models have attracted particular attention for this purpose. Yang et al. introduce LSTM into generative adversarial networks for more reliable electrocardiogram signal generation, so that the augmented time series contain realistic spatio-temporal information [22]. Nishizaki proposes a VAE model to extract latent features for corpus data augmentation and feature vector extraction [23]. Despite the prevalence of class imbalance in runner injury prediction datasets, current methodologies have rarely adopted mature or standardized data augmentation frameworks to address this challenge. Padmanandam et al. use augmented data in a runner injury prediction system but do not describe the data augmentation process or the influence of the augmented data. Thus, we propose a VAE model to capture injury features and enrich the original dataset.
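
As a rough illustration of how a VAE can act as both a feature extractor and a data generator, the Python sketch below runs the forward pass of a tiny VAE: encoding, the reparameterization step, decoding, and the usual loss (reconstruction error plus the KL divergence to a standard normal prior). It is not the model proposed in this paper. The layer sizes are hypothetical, the weights are random and untrained, and the input batch is random noise standing in for injury records, so the outputs only demonstrate the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden, n_latent = 10, 16, 2  # hypothetical sizes

# Randomly initialised (untrained) weights, for illustration only.
W_enc = rng.normal(0, 0.1, (n_features, n_hidden))
W_mu = rng.normal(0, 0.1, (n_hidden, n_latent))
W_logvar = rng.normal(0, 0.1, (n_hidden, n_latent))
W_dec1 = rng.normal(0, 0.1, (n_latent, n_hidden))
W_dec2 = rng.normal(0, 0.1, (n_hidden, n_features))

def encode(x):
    """Map records to the mean and log-variance of the latent Gaussian."""
    h = np.tanh(x @ W_enc)
    return h @ W_mu, h @ W_logvar

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def decode(z):
    """Map latent codes back to the record space."""
    return np.tanh(z @ W_dec1) @ W_dec2

def vae_loss(x):
    """Mean over the batch of squared reconstruction error plus KL term."""
    mu, logvar = encode(x)
    x_hat = decode(reparameterize(mu, logvar))
    recon = np.sum((x - x_hat) ** 2, axis=1)
    kl = -0.5 * np.sum(1 + logvar - mu**2 - np.exp(logvar), axis=1)
    return np.mean(recon + kl)

# Stand-in for a batch of minority-class (injury) records.
injury_batch = rng.normal(size=(5, n_features))

loss = vae_loss(injury_batch)
latent_features, _ = encode(injury_batch)                 # features for a predictor
synthetic = decode(rng.standard_normal((20, n_latent)))   # new synthetic records
print(loss, latent_features.shape, synthetic.shape)
```

In a trained model the weights would be fitted by minimizing this loss on real records. The latent means could then serve as constructed features, and codes sampled from the prior and decoded could serve as synthetic minority records.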