Wearable-based Human Activity Recognition with spatial-temporal Graph Convolutional Transformer Network (https://doi.org/10.63386/618561)

Lu Ma ¹,²,*, Xiaodong Yang ³,⁴

¹ School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China.

² Suzhou Vocational and Technical College, Suzhou 234000, China.

³ School of Computer Science and Technology, China University of Mining and Technology, Xuzhou 221116, China

⁴ School of Electronic Science and Engineering, Nanjing University, Nanjing 210023, China.

malu_2025@163.com

First author and Corresponding author: Lu Ma, malu_2025@163.com ORCID: 0009-0005-4568-8405

Second author: Xiaodong Yang, yxdnju@163.com ORCID: 0000-0001-9618-0891

Acknowledgement:

Key research project of Natural Science in Anhui Province, Project No.: 2023AH052958

Abstract: Human Activity Recognition (HAR) is an important topic in wearable computing because it involves the study of Spatio-Temporal (ST) interactions. Algorithms based on Graph Convolutional Networks (GCN) are now widely used. Viewed as a graph problem, this approach captures a large amount of spatial information effectively; however, it ignores the connections between different sensors at different times and their relation to global information. To overcome this limitation, we propose a novel network, the Graph Convolutional Transformer Network (GCTNet), which combines Transformer and GCN blocks and uses a fully-connected (FC) graph to optimize the graph structure constructed by the GCN model. We validate the model on the UCI-HAR dataset. Experimental results show that our network achieves roughly 6% higher accuracy than the Transformer model and about 0.5% higher accuracy than the latest FC-STGNN model on the UCI-HAR dataset. Compared to state-of-the-art methods such as DeepConvLSTM, our model improves accuracy by about 1.5%.

Keywords: Human Activity Recognition · Spatio-Temporal interaction · GCTNet.

1 Introduction

With the rapid development of wearable technologies such as mobile phones, watches and other smart gadgets, the healthcare sector is placing increasing emphasis on monitoring user actions and behaviours. In this context, HAR systems use the sensors in smart wearable devices, such as accelerometers, gyroscopes, and electroencephalography (EEG) sensors, to identify and predict users' behaviour, in support of elderly care, smart fitness solutions, medical diagnostics, sports injury detection, and health management [7, 10]. With the development of deep learning, researchers have used convolutional layers in Artificial Neural Networks (ANNs) [12], trained by gradient descent, to enhance model performance. Transformer models, known for their distinctive structure, have led to impressive achievements in computer vision and natural language processing [16]. GNN approaches are also growing in popularity: the actions in the data are strongly correlated, and transforming time series data into activity graphs makes it easier to extract the relationships between spatial features. These techniques begin by generating ST graphs, one graph per timestamp, that describe the interactions among sensors over time and space. As shown in Fig. 1, graph neural networks (GNNs) are first used to capture the spatial dependencies between the sensors at each timestamp. A temporal encoder is then used to capture the temporal dependencies of each sensor across multiple timestamps. Such systems perform better than traditional approaches that use only temporal encoders.

Fig. 1: Constructing ST graphs from Multivariate Time Series (MTS) data, with a separate graph for each timestamp to capture ST dependencies. In Step 1, the GNNs capture the spatial dependencies between different sensors within each graph simultaneously. In Step 2, the temporal encoder captures each sensor's temporal dependencies over multiple timestamps. Nevertheless, this method does not effectively capture correlations between different sensors at different timestamps and is therefore unable to represent the complete ST dependencies.

However, existing research has largely neglected the spatio-temporal connections among all sensors globally, which limits how well such models can be constructed. This paper therefore addresses the problem of extracting the connections between different sensors at different timestamps, so that the data can be learned and represented more completely.

To address the limitations of current methods and accurately capture the spatio-temporal dependencies between timestamps and sensors, we propose GCTNet, a framework that aims to improve representation learning for MTS data. Our model uses a Transformer module and a GCN module to process the data in parallel. Because the two modules capture different features, directly merging their outputs is not effective. We therefore use the Model Fully Connected (MFC) structure to connect the two parallel branches, removing semantic divergence in an interactive and additive manner and mapping the outputs to the same dimension. This fusion has the potential to significantly improve the overall understanding of both local features and specific elements of the global representation. In this paper, we investigate and experiment with GCTNet in the field of HAR.

The model has the following benefits:

1) The relationship between timestamps and sensors can be more accurately represented by focusing on local elements, which allows for the inclusion of more valuable spatial features.

2) Global features provide a more accurate way to obtain correlations between timestamps.

3) Correlations between different sensors at different timestamps can be better handled.

2 Related Work

2.1 GNN for MTS data

Recently, a growing number of researchers have recognized the importance of incorporating spatial dependencies when learning representations of human skeleton pose and MTS data [8, 9, 11, 23]. A common approach is to employ Graph Neural Networks (GNNs), typically combined with separate temporal encoders so that spatial and temporal dependencies are captured separately [14]. For instance, HierCorrPool (Wang et al., 2023a) generated sequential graphs [15] and employed CNNs to capture temporal dependencies; GNNs were then applied to capture spatial dependencies across sensors within each graph. Incorporating spatial dependencies into MTS data with GNNs has led to notable progress [18]. The recently proposed FC-STGCN addresses the limitations of earlier GNN methods in graph construction and graph convolution [17]. However, it does not capture the correlations between different timestamps well.

Fig. 2: GCTNet's network design. First, the sensors capture information about human behaviour. The data is then filtered and normalized. Next, it undergoes preliminary processing: encoding, convolution, and positional encoding. The data is then sent to both the Transformer Block and the GCN Block. Finally, the outputs of both blocks are fused using the MFC structure to form the classifier.

2.2 Transformer for MTS data

The Transformer model was initially developed for natural language processing [3, 22] and soon achieved great success in computer vision and MTS data prediction [13]. It typically includes a self-attention mechanism and a feed-forward network layer. To ease training, residual connections and layer normalisation are typically used. Early Transformer studies in MTS data prediction focused primarily on capturing and modelling temporal dependencies at a single time step. However, these methods suffer from quadratic time complexity in the sequence length, which limits the input length, a key factor in MTS prediction [8]. Informer [4] proposed ProbSparse self-attention to reduce the time complexity to O(n log n). FEDformer [21] enhanced the Transformer with Fourier and wavelet transforms to achieve linear computational complexity and memory cost. Although Transformers are beginning to be applied to MTS data with promising results, existing methods still fall short in capturing the spatial relationships of the data at different timestamps, which limits their progress on the HAR task.
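The generic encoder layer described above can be sketched as follows. This is a minimal single-head NumPy illustration, assuming the layer normalisation and residual layout of Vaswani et al. [3]; the function names, dimensions, and random weights are hypothetical and are not taken from GCTNet. The `scores` matrix has one entry per pair of positions, which is the source of the quadratic cost mentioned above.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalise each position's feature vector to zero mean, unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, w_q, w_k, w_v):
    # Scaled dot-product attention over a sequence x of shape (n, d).
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (n, n): quadratic in n
    return softmax(scores) @ v

def encoder_layer(x, w_q, w_k, w_v, w_1, w_2):
    # Attention sub-layer with residual connection and normalisation.
    x = layer_norm(x + self_attention(x, w_q, w_k, w_v))
    # Position-wise feed-forward sub-layer (ReLU), again with a residual.
    hidden = np.maximum(0.0, x @ w_1)
    return layer_norm(x + hidden @ w_2)

# Illustrative sizes only (assumptions, not the paper's settings).
rng = np.random.default_rng(0)
n, d, d_ff = 8, 16, 32
x = rng.normal(size=(n, d))
shapes = [(d, d)] * 3 + [(d, d_ff), (d_ff, d)]
params = [rng.normal(scale=0.1, size=s) for s in shapes]
out = encoder_layer(x, *params)
```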

3 Methodology

3.1 Feature Capturing

As shown in Fig. 2, starting from a HAR sample X, we first segment the signal from each sensor into patches, each covering a fixed time interval. Segmenting X with a given patch size yields a sequence of patches, where the patch index T serves as a timestamp. The number of patches is the signal length divided by the patch size, truncated to an integer. Each patch contains the segmented signals from all N sensors.
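The segmentation step can be illustrated with a short NumPy sketch. It assumes non-overlapping patches and that any trailing remainder is dropped, in line with the truncation described above. The function name `segment_into_patches`, the 9-channel by 128-step sample, and the patch size of 20 are illustrative assumptions rather than values given in the paper.

```python
import numpy as np

def segment_into_patches(sample, patch_size):
    """Split a (num_sensors, length) sample into non-overlapping patches.

    Returns an array of shape (num_patches, num_sensors, patch_size),
    where num_patches = length // patch_size (remainder is discarded).
    """
    num_sensors, length = sample.shape
    num_patches = length // patch_size           # truncation operation
    trimmed = sample[:, :num_patches * patch_size]
    patches = trimmed.reshape(num_sensors, num_patches, patch_size)
    return patches.transpose(1, 0, 2)            # patch index first

# Hypothetical sample: 9 sensor channels, 128 time steps.
sample = np.arange(9 * 128, dtype=float).reshape(9, 128)
patches = segment_into_patches(sample, patch_size=20)
```

With these values, the last 8 time steps of each channel are discarded, and `patches[T]` holds the signals of all sensors for the T-th patch.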

Subsequently, a one-dimensional convolution is applied to the data to capture temporal features. We then employ an encoder to process the segmented signals within each window. The encoder operates at the sensor level, so the features it learns are sensor-level features. To preserve the ordering of the patches, i.e., their relative positional information, we adopt positional encoding, inspired by Vaswani et al. [3]. Specifically, for the i-th sensor, the positional encoding of Equation (1) is added to the sensor features of each patch, giving sensor features enhanced by positional information. Here m denotes the m-th feature of the sensor features.
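As an illustration of this step, the sketch below implements the standard sinusoidal positional encoding of Vaswani et al. [3], indexed by patch position and feature index m. Whether GCTNet uses exactly this form is an assumption, since Equation (1) is not reproduced in this text; the function name and the array sizes are hypothetical.

```python
import numpy as np

def sinusoidal_positional_encoding(num_patches, feature_dim):
    """Standard sinusoidal encoding: sin on even features, cos on odd ones."""
    positions = np.arange(num_patches)[:, None]            # (T, 1)
    even_idx = np.arange(0, feature_dim, 2)[None, :]        # feature index 2m
    angles = positions / np.power(10000.0, even_idx / feature_dim)
    encoding = np.zeros((num_patches, feature_dim))
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles[:, : feature_dim // 2])
    return encoding

# Hypothetical sensor features: 6 patches, 9 sensors, 16 features each.
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 9, 16))
pe = sinusoidal_positional_encoding(num_patches=6, feature_dim=16)
# The same patch-level encoding is added to every sensor in that patch.
encoded = features + pe[:, None, :]
```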

3.2 Transformer Branch

The Transformer branch is a modified Transformer based on the self-attention mechanism. It uses sensor-modality attention and self-attention blocks to construct feature representations for classification, applying attention in different ways to build effective representations of the sensor data.

As shown in Fig. 3, a sensor attention layer is applied to the input data first. A 1D convolution then extracts local spatio-temporal characteristics. Subsequently, two distinct modules extract, represent, and learn the spatial and temporal data respectively. Following the residual network structure, the processed data enters the multi-head attention together with the original input. After the data passes through the feed-forward layer, feature extraction is repeated in a recurrent fashion to refine the result.

In this block, the HAR Spectral-Spatial Representation is a residual-network-based module consisting of multiple convolutional layers that extract spatial information. The HAR Temporal Representation is a temporal-window-based module consisting of multiple one-dimensional convolutional and pooling layers that extract temporal information between different sensors. Fusing the outputs of the two modules allows the Transformer to overcome its original limitations and extract global information about both time and space. Such a structure can then replace global attention with better results.

Fig. 3: Detailed internal structure of the Transformer Block in GCTNet. The data first passes through the sensor attention, and a 1D convolution then captures local features. After that, feature representation and learning are performed separately on the temporal and spatial data. Finally, the processed data passes through the multi-head attention and then the feed-forward layer to produce the output.

3.3 GCN Branch

As shown in Fig. 4, the GCN Block consists of two parts: Fully Connected (FC) Graph Construction and Fully Connected (FC) Graph Convolution.

Fig. 4: Detailed internal structure of the GCN Block in GCTNet. The block consists of multiple identical structures. Each structure performs graph construction and graph convolution, and the values obtained from all structures are then fused to produce the output.

When constructing the FC graph, we assume that related sensors should exhibit similar characteristics so that their features are close in the feature space. This allows us to use similarity to represent the correlation between sensors; the greater the similarity, the higher the correlation. In this case, we use a simple but effective metric, the dot product, to quantify the similarity between two sensors, defined as

(2)

where t, r ∈ [1, L] index the patches. A function with learnable weights is applied to the sensor features to enhance the expressive capacity, drawing inspiration from the attention computation in the Transformer [3]. Finally, we can create an FC graph whose nodes are the position-encoded sensor features and whose adjacency matrix is E. The graph connects all the sensors across different patches.
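The construction can be sketched in NumPy as follows. Because Equation (2) is not reproduced in this text, the sketch assumes that the learnable function is a single linear projection and that the similarity is the plain dot product of the projected features, with no further normalisation. The function name `build_fc_graph`, the node ordering, and the sizes are illustrative assumptions.

```python
import numpy as np

def build_fc_graph(features, weight):
    """Build a fully-connected graph over every sensor in every patch.

    features: (num_patches, num_sensors, dim) position-encoded features.
    weight:   (dim, dim) projection, assumed here to be learnable.
    Returns the node matrix (num_patches * num_sensors, dim) and the
    dense adjacency matrix of pairwise dot-product similarities.
    """
    num_patches, num_sensors, dim = features.shape
    # Node index t * num_sensors + i corresponds to sensor i in patch t.
    nodes = features.reshape(num_patches * num_sensors, dim)
    projected = nodes @ weight
    adjacency = projected @ projected.T   # dot-product similarity
    return nodes, adjacency

# Hypothetical sizes: 6 patches, 9 sensors, 16 features.
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 9, 16))
weight = rng.normal(scale=0.1, size=(16, 16))
nodes, adjacency = build_fc_graph(features, weight)
```

Every node is connected to every other node, including sensors in other patches, which is what distinguishes this graph from one graph per timestamp.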

Subsequently, to make the model more complete and improve its performance, we use a decay matrix to simulate forgetting. Each row of this matrix describes the connectivity of one sensor with the other sensors across all patches. Consider, for instance, the row for the first sensor in the (T-1)-th patch. Sensors in that same patch are observed simultaneously, so they are expected to be more strongly correlated with it than sensors in other patches. For sensors located in other patches, the correlation should decrease over time, as determined by the decay rate. Based on this description, we formulate the decay matrix C:

(3)

where each element .

(4)

With Equations (2), (3) and (4), we obtain the final decayed graph. The decay models how connections weaken with the passage of time and has the potential to significantly improve the accuracy of the FC graph representation.
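One plausible form of the decay can be sketched as follows. The sketch assumes that each element of C is the decay rate raised to the absolute distance between the two patches, and that the decay is applied to the adjacency matrix element-wise. Both are assumptions consistent with the description above, since Equations (3) and (4) are not reproduced in this text. The function name and sizes are hypothetical.

```python
import numpy as np

def decay_matrix(num_patches, num_sensors, decay_rate):
    """Assumed form: C[a, b] = decay_rate ** |patch(a) - patch(b)|."""
    # Patch index of each node, with nodes ordered patch by patch.
    patch_index = np.repeat(np.arange(num_patches), num_sensors)
    distance = np.abs(patch_index[:, None] - patch_index[None, :])
    return decay_rate ** distance.astype(float)

# Small illustrative case: 3 patches, 2 sensors, decay rate 0.7.
C = decay_matrix(num_patches=3, num_sensors=2, decay_rate=0.7)
adjacency = np.ones((6, 6))          # placeholder FC-graph adjacency
decayed_adjacency = C * adjacency    # element-wise decay (assumed)
```

Under this assumption, sensors in the same patch keep their full weight, and a decay rate of 1 corresponds to no forgetting at all.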

After constructing the FC graph, the next task is to capture the ST dependencies in the HAR data for representation learning. We use a moving-pooling GNN that combines a moving window, to capture local ST dependencies, with temporal pooling, to extract high-level features. A moving window of fixed size M slides along the patches, advancing by a fixed stride each time it moves. A GNN is then applied within each individual window. For example, a central node in a given window and layer has a set of neighbouring nodes across the M patches in the same window. The central node has correlations with these neighbours, given by the corresponding entries of the FC graph's adjacency matrix.
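The moving window and temporal pooling can be illustrated with the sketch below. It is a simplified stand-in rather than the paper's exact layer: it assumes a row-normalised dot-product adjacency inside each window, a single linear graph-convolution step with ReLU, and average pooling over the patches of a window. The function names, the stride value, and all sizes are hypothetical.

```python
import numpy as np

def row_softmax(x):
    # Normalise each row so that neighbour weights sum to one.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def moving_pooling_gnn(features, weight, window, stride):
    """features: (num_patches, num_sensors, dim); weight: (dim, out_dim).

    For each window of `window` consecutive patches, run one graph
    convolution over all nodes in the window, then average over patches.
    Returns an array of shape (num_windows, num_sensors, out_dim).
    """
    num_patches, num_sensors, dim = features.shape
    outputs = []
    for start in range(0, num_patches - window + 1, stride):
        nodes = features[start:start + window].reshape(-1, dim)
        adjacency = row_softmax(nodes @ nodes.T)          # local FC graph
        updated = np.maximum(0.0, adjacency @ nodes @ weight)
        per_patch = updated.reshape(window, num_sensors, -1)
        outputs.append(per_patch.mean(axis=0))            # temporal pooling
    return np.stack(outputs)

# Hypothetical sizes: 6 patches, 9 sensors, 16 features, window M = 2.
rng = np.random.default_rng(0)
features = rng.normal(size=(6, 9, 16))
weight = rng.normal(scale=0.1, size=(16, 8))
out = moving_pooling_gnn(features, weight, window=2, stride=1)
```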

4 Experiments and Analyses

4.1 Datasets

The HAR dataset was acquired from 30 volunteers affiliated with the University of Genoa. These individuals collected the dataset using smartphone devices and subsequently submitted it to the UCI database [2]. The dataset contains accelerometer data for all three axes, in addition to gyroscope data. The data was collected at a frequency of 50 Hz, and feature engineering techniques were used to extract a 561-dimensional feature vector from each window of data. The dataset was partitioned in a 7:3 ratio, with 70% allocated to the training set and 30% to the test set.

4.2 Evaluation

To evaluate the performance of HAR prediction, we used accuracy, the macro-averaged F1-score (MF1), and running time (s). In addition, to reduce the effect of random initialisation, we performed all experiments ten times and averaged the results for comparison.
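The two classification metrics can be computed as in the sketch below. It follows the usual definitions: accuracy is the fraction of correct predictions, and MF1 is the unweighted mean of the per-class F1 scores. The function names and the toy labels are illustrative assumptions, not data from the experiments.

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Fraction of samples whose predicted label matches the true label.
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def macro_f1(y_true, y_pred, num_classes):
    # Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN).
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

# Toy example with three classes (illustrative labels only).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred, num_classes=3)
```

Because MF1 weights every class equally, a weak class such as a frequently confused activity lowers it more than it lowers accuracy.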

4.3 Ablation Study

Fig. 5: (a) Sensitivity analysis for the number of moving windows. (b) Sensitivity analysis for positional embedding (Pos.Embed).

Numbers of moving windows. The moving window is a crucial structural element, responsible for graph convolution within the GCN module. Fig. 5(a) shows the effect of varying the number of moving windows. Increasing the number of moving windows improves performance up to a point, which affirms the value of modelling ST dependencies using multiple layers. For example, a model with 2 moving windows performs better than a model with 1, and a model with 3 moving windows performs better than models with fewer. However, when more windows are introduced, the performance gains diminish or even reverse, possibly because of overfitting. We therefore set the number of moving windows to 3.

Fig. 6: (a) Effect of decay rate on accuracy. (b) Effect of learning rate on accuracy. The model achieves the best results at a decay rate of 0.7 and a learning rate of 1e-4.

Positional Embeddings. Positional embedding plays a crucial role in analysing and extracting the ST links in the data, and its presence or absence affects GCTNet's capacity to learn from the data. Fig. 5(b) shows that when Pos.Embed is added to the model, the accuracy is close to 94%, whereas when it is omitted, the accuracy is below 93%, a decrease of about 1.5%. The MF1 score is reduced as well. We therefore include positional embedding in the model.

Decay and Learning Rate Analysis. The decay matrix improves the FC graph by modelling the forgetting of sensor information over time, so that correlations between sensors across different times and locations are represented correctly. The choice of decay rate is therefore important and requires careful evaluation. As shown in Fig. 6(a), the variants with higher decay rates generally perform better, and accuracy reaches its maximum at a value of 0.7. When the decay rate is too large, little or no forgetting takes place and the model has too much information to learn from effectively; when it is too small, key data is effectively ignored, resulting in poor performance. We therefore chose a value of 0.7.

It is also vital to consider the impact of the learning rate on our model. We performed experiments with learning rates of 1e-3, 1e-4 and 1e-5. The results in Fig. 6(b) show that the model achieves the highest test accuracy when the learning rate is set to 1e-4.

4.4 Results

Our model achieves strong results on the HAR dataset. As shown in Table 1, it recognises most activities with high accuracy. Laying has the highest accuracy, 99.9% over a total of 21990, so essentially all instances are recognised. Walking, walking upstairs and walking downstairs are also recognised accurately, each above 95%. For sitting and standing, however, the accuracy is lower: only 79.9% for sitting and 88.2% for standing. Fig. 7 shows that these two activities are easily confused with each other, leading to misclassification. Fig. 8 further shows that the model tends to misclassify sitting as standing, which slightly inflates the number of standing predictions in the figure and explains why the total for standing in Table 1 is large while its accuracy is low. The confusion may be due to the similarity of the two postures, which the accelerometer signals do not distinguish well. This suggests that the confusion stems mainly from the limitations of the sensing equipment rather than from our model.

Table 1: Different activities in UCI-HAR

Activity	Accuracy (%)	Total number
WALKING	96.7	19656
WALKING_UPSTAIRS	95.1	18366
WALKING_DOWNSTAIRS	97.4	16774
SITTING	79.9	16079
STANDING	88.2	19235
LAYING	99.9	21990

Fig. 7: Confusion matrix of GCTNet. The results show that sitting and standing are easily confused, resulting in lower accuracy, while the other activities do not show this problem.

Fig. 8: Kernel density estimation of GCTNet. The density curves show that some sitting samples are predicted as standing, so standing has more predicted values than true values.

Table 2: Comparisons with other models in UCI-HAR

Model	Accuracy (%)	MF1 (%)
FC-STGCN [1]	95.62 ± 0.33	94.37 ± 0.42
Transformer [3]	90.22 ± 0.33	90.91 ± 0.34
LSTM [19]	88.63 ± 0.42	84.22 ± 0.21
FCN [20]	92.23 ± 0.34	88.67 ± 0.32
DeepConvLSTM [5]	94.51 ± 0.27	92.60 ± 0.24
AutoFormer [6]	54.72 ± 0.53	52.14 ± 0.61
InFormer [4]	91.23 ± 0.48	90.23 ± 0.47
GCTNet (ours)	96.06 ± 0.42	93.70 ± 0.31

Fig. 9: (a) Accuracy curve for each model during training. (b) Loss curve for each model during training.

4.5 Comparative Study

GCTNet is a combined model that integrates a GCN module and a Transformer module. The outputs of the two modules are fused by the final MFC layer to improve accuracy on the task. To assess the model, we compare it against the Transformer, the Transformer-based InFormer, and the most advanced GCN method, FC-STGCN. We also include other state-of-the-art approaches, namely the conventional DeepConvLSTM and the Transformer-based AutoFormer. As shown in Table 2, our model achieves an accuracy of 96.06% and an MF1 score of 93.70%. FC-STGCN is slightly less accurate at 95.62%, although its MF1 score of 94.37% is higher than ours. The Transformer model is approximately 6% less accurate than GCTNet. The InFormer model is slightly more accurate than the Transformer baseline but still about 5% below our model. The AutoFormer model is poorly suited to this task and performs the worst. The DeepConvLSTM model achieves an accuracy of 94.51%, about 1.5% lower than our model. Fig. 9 shows that the accuracy and loss curves of all models except the LSTM are close to the ideal curves.
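The accuracy gaps quoted above follow directly from the mean accuracies in Table 2, as the short calculation below shows. The gaps are differences in percentage points between mean values; the dictionary name is illustrative and the standard deviations are not taken into account.

```python
# Mean accuracies (%) copied from Table 2.
table2_accuracy = {
    "FC-STGCN": 95.62,
    "Transformer": 90.22,
    "LSTM": 88.63,
    "FCN": 92.23,
    "DeepConvLSTM": 94.51,
    "AutoFormer": 54.72,
    "InFormer": 91.23,
    "GCTNet": 96.06,
}

# Gap of each baseline to GCTNet, in percentage points.
gaps = {
    name: round(table2_accuracy["GCTNet"] - acc, 2)
    for name, acc in table2_accuracy.items()
    if name != "GCTNet"
}

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{name}: {gap:.2f} points below GCTNet")
```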

5 Conclusion

This paper introduces GCTNet, a dual-backbone network that combines Graph Convolutional Networks (GCNs) and Transformers for Human Activity Recognition (HAR). By leveraging graph construction and convolution for local spatio-temporal features and a self-attention mechanism for global dependencies, GCTNet achieves higher accuracy than state-of-the-art methods such as FC-STGCN on the UCI-HAR dataset, with a comparable MF1 score. The model's design choices are supported by the experiments and ablation studies. Future research can further optimize GCTNet and extend its application to larger and more diverse datasets, with the aim of enhancing HAR systems across healthcare, fitness, and beyond.

Acknowledgments. This work was supported by the Xuzhou Key Research and Development Program (Social Development) (Grant No. KC213004) and the National Training Program of Innovation and Entrepreneurship for Undergraduates (Grant No. 202310290201Y).

Disclosure of Interests. The authors have no competing interests to declare that are relevant to the content of this article.

References

[1] Wang Y, Xu Y, Yang J, et al. Fully-Connected Spatial-Temporal Graph for Multivariate Time Series Data[C]. AAAI Conference on Artificial Intelligence. 2023.

[2] Anguita D, Ghio A, Oneto L, et al. A Public Domain Dataset for Human Activity Recognition using Smartphones[C]. European Symposium on Artificial Neural Networks. 2013.

[3] Vaswani A, Shazeer N, Parmar N, et al. Attention is All You Need[C]. Neural Information Processing Systems (NeurIPS). 2017.

[4] Zhou H, et al. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting[C]. AAAI Conference on Artificial Intelligence. 2021, vol. 35, no. 12, Track: AAAI Technical Track on Machine Learning V.

[5] Ordóñez F J, Roggen D. Deep Convolutional and LSTM Recurrent Neural Networks for Multimodal Wearable Activity Recognition[J]. Sensors (Basel, Switzerland), 2016, 16(1): 115.

[6] Wu H, Xu J, Wang J, et al. Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting[C]. Neural Information Processing Systems (NeurIPS). 2021.

[7] Sun Z, Liu J, Ke Q, et al. Human Action Recognition From Various Data Modalities: A Review[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 45: 3200-3225.

[8] Du S, Li T, Yang Y, et al. Multivariate time series forecasting via attention-based encoder-decoder framework[J]. Neurocomputing, 2019, 388: 269-279.

[9] Pletnev A, Rivera-Castro R, Burnaev E. Graph Neural Networks for Model Recommendation using Time Series Data[C]. 19th IEEE International Conference on Machine Learning and Applications (ICMLA). 2020, pp. 1534-1541.

[10] Jaberi M, Ravanmehr R. Human activity recognition via wearable devices using enhanced ternary weight convolutional neural network[J]. Pervasive Mob. Comput., 2022, 83: 101620.

[11] Jin M, et al. Multivariate Time Series Forecasting With Dynamic Graph Neural ODEs[J]. IEEE Transactions on Knowledge and Data Engineering, 2022, 35: 9168-9180.

[12] Yang G R, Wang X J. Artificial Neural Networks for Neuroscientists: A Primer[J]. Neuron, 2020, 107(6): 1048-1070.

[13] Zhang X, et al. TapNet: Multivariate Time Series Classification with Attentional Prototypical Network[C]. AAAI Conference on Artificial Intelligence. 2020, vol. 34, no. 04, pp. 6845-6852.

[14] Li J, Li X, He D. A Directed Acyclic Graph Network Combined With CNN and LSTM for Remaining Useful Life Prediction[J]. IEEE Access, 2019, 7: 75464-75475.

[15] Caramalau R, Bhattarai B, Kim T. Sequential Graph Convolutional Network for Active Learning[C]. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2021, pp. 9578-9587.

[16] Peng Z, Huang W, Gu S, et al. Conformer: Local Features Coupling Global Representations for Visual Recognition[C]. IEEE/CVF International Conference on Computer Vision (ICCV). 2021, pp. 357-366.

[17] Li T, Zhao Z, Sun C, et al. Hierarchical attention graph convolutional network to fuse multi-sensor signals for remaining useful life prediction[J]. Reliab. Eng. Syst. Saf., 2021, 215: 107878.

[18] Jin M, Koh H, Wen Q, et al. A Survey on Graph Neural Networks for Time Series: Forecasting, Classification, Imputation, and Anomaly Detection[R]. arXiv preprint arXiv:2307.03759, 2023.

[19] Hochreiter S, Schmidhuber J. Long Short-Term Memory[J]. Neural Computation, 1997, 9: 1735-1780.

[20] Shelhamer E, Long J, Darrell T. Fully convolutional networks for semantic segmentation[C]. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015, pp. 3431-3440.

[21] Zhou T, Ma Z, Wen Q, et al. FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting[C]. 39th International Conference on Machine Learning. 2022, vol. 162, pp. 27268-27286.

[22] Bano N S, Khalid S. BERT-based Extractive Text Summarization of Scholarly Articles: A Novel Architecture[C]. 2022 International Conference on Artificial Intelligence of Things (ICAIoT). 2022, pp. 1-5.

[23] Lovanshi M, Tiwari V. Human skeleton pose and spatio-temporal feature-based activity recognition using ST-GCN[J]. Multimedia Tools and Applications, 2023, 83: 12705-12730.
