Research on Optimization Modeling Method for Eye Tracking in Solfeggio Cognitive Simulation

doi:10.16182/j.issn1004731x.joss.25-1237

Abstract

To address fixation offset caused by head movement in music solfeggio teaching simulation, and the lack of system-level simulation validation in existing methods, this paper proposes a fixation accuracy optimization method that integrates image semantic understanding, temporal trajectory modeling, and solfeggio cognitive simulation. The method is built on a Vision Transformer. After preprocessing with Mahalanobis distance, sliding windows, and regions of interest, it introduces position offset perception, offset residual regression, and dual-pathway fusion to model and correct offsets without labeled data. Simulation results show that the method reduces the error by 43.9% relative to the raw (uncorrected) error; removing any module markedly increases the average Euclidean distance, by up to 48.6%; in cross-dataset experiments, the correction rate stays at around 40% across datasets; and across task scenarios the offset error is reduced by 36.6% to 40.9% on average. The method improves the reliability of eye-tracking data and provides technical support for solfeggio cognitive assessment and human-computer interaction simulation systems.

Key words: eye tracking, music solfeggio, head movement offset, fixation accuracy optimization, Vision Transformer model
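
The abstract names Mahalanobis distance as one preprocessing step but does not give its details. A minimal sketch of how such a step is commonly used to reject outlying gaze samples is shown below; the function name `mahalanobis_filter`, the threshold of 3.0, and the synthetic data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def mahalanobis_filter(points, threshold=3.0):
    """Return a boolean mask that keeps samples whose Mahalanobis
    distance to the sample mean is at most `threshold`.
    The threshold value is a hypothetical choice for illustration."""
    points = np.asarray(points, dtype=float)
    mean = points.mean(axis=0)
    cov = np.cov(points, rowvar=False)       # 2x2 covariance of (x, y)
    inv_cov = np.linalg.pinv(cov)            # pseudo-inverse guards against a singular covariance
    diff = points - mean
    # Squared Mahalanobis distance for every row: diff_i^T * inv_cov * diff_i
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)
    return np.sqrt(d2) <= threshold

# Synthetic gaze samples (pixels) clustered around one note, plus one gross outlier.
rng = np.random.default_rng(0)
gaze = rng.normal(loc=[400.0, 300.0], scale=[10.0, 8.0], size=(200, 2))
gaze[-1] = [900.0, 50.0]

mask = mahalanobis_filter(gaze)
clean = gaze[mask]
print(f"kept {mask.sum()} of {len(gaze)} samples")
```

Because the covariance is estimated from the same samples that are being filtered, a single extreme point inflates it; a robust covariance estimate would be a natural refinement if many outliers are expected.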

CLC number: TP391.7

Zhang Kun, Qian Jiajie, Ma Shuhong, et al. Research on Optimization Modeling Method for Eye Tracking in Solfeggio Cognitive Simulation[J]. Journal of System Simulation, 2026, 38(6): 1749-1760.

Figures and tables (10): Figures 1-5, Tables 1-5

References (25)

[1]	Debevc Matjaž, Weiss Jernej, Šorgo Andrej, et al. Solfeggio Learning and the Influence of a Mobile Application Based on Visual, Auditory and Tactile Modalities[J]. British Journal of Educational Technology, 2020, 51(1): 177-193.
[2]	Meva Bayrak Karsli, Demirel Turgay, Kurşun Engin. Examination of Different Reading Strategies with Eye Tracking Measures in Paragraph Questions[J]. Hacettepe University Journal of Education, 2020, 35(1): 92-106.
[3]	Nigrelli E, Carender W, Sienko K H, et al. Eye Tracking Reveals Physical Therapist Decision Making While Evaluating Standing Balance Performance[J]. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 2025, 33: 2587-2596.
[4]	Iacono Paolo, Khan Naimul. Multi-modal Emotion Recognition Using EEG and Eye Tracking Features[C]//2024 46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC). Piscataway: IEEE, 2024: 1-5.
[5]	Seitz Matthias, Tallon Miles, Gotthardt Karina, et al. Application to Combine Questioning, Stimulus Presentation and External Measurements in a Real-life Music Eye Tracking Experiment[J]. MethodsX, 2024, 13: 102825.
[6]	Sheridan H, Maturi K S, Kleinsmith A L. Eye Movements During Music Reading: Toward a Unified Understanding of Visual Expertise[M]//Federmeier K D, Schotter E R. Psychology of Learning and Motivation. New York: Academic Press, 2020: 119-156.
[7]	Dai Lihong, Liu Jinguo, Ju Zhaojie. Attention Mechanism and Bidirectional Long Short-term Memory-based Real-time Gaze Tracking[J]. Electronics, 2024, 13(23): 4599.
[8]	Lü Jiaqi, Wang Changyuan. 3D Gaze Estimation by Bidirectional Fusion of CNN and Transformer[J]. Computer Systems & Applications, 2024, 33(10): 66-74. (in Chinese)
[9]	Liu Jiyuan, Qi Hanwen, Liu Zhicheng, et al. A Precise Attention Tracking System Based on Computer Vision[J]. Journal of System Simulation, 2023, 35(10): 2087-2100. (in Chinese)
[10]	Zhao Xiaoqiang, Liu Yongyong, Hui Yongyong, et al. Batch Process Quality Prediction Model Using Improved Time-domain Convolutional Network with Multi-head Self-attention Mechanism[J]. Journal of Computer Applications, 2025, 45(7): 2245-2252. (in Chinese)
[11]	Huang Jianglong, Hong Chaoqun, Xie Rongsheng, et al. A Simple and Efficient Channel MLP on Token for Human Pose Estimation[J]. International Journal of Machine Learning and Cybernetics, 2025, 16(5): 3809-3817.
[12]	Wang Jing, Yu Long, Tian Shengwei. Cross-attention Interaction Learning Network for Multi-model Image Fusion via Transformer[J]. Engineering Applications of Artificial Intelligence, 2025, 139, Part A: 109583.
[13]	Dhamale Prashant S, Kashikar Akanksha S. Outlier Detection in Cylindrical Data Based on Mahalanobis Distance[J]. Communications in Statistics-Simulation and Computation, 2025, 54(2): 331-341.
[14]	Dong Haoyu, Zhang Yifan, Gu Hanxue, et al. SWSSL: Sliding Window-based Self-supervised Learning for Anomaly Detection in High-resolution Images[J]. IEEE Transactions on Medical Imaging, 2023, 42(12): 3860-3870.
[15]	Li Yonghui, Zhao Yao, Jia Xiaohong, et al. Multi-modal Edge Detection Network Based on Collaboration of CNN and Transformer[J]. Computer Engineering and Applications, 2025, 61(14): 195-205. (in Chinese)
[16]	Xu Taotao, Yao Lijian, Xu Lijun, et al. Image Segmentation of Cucumber Seedlings Based on Genetic Algorithm[J]. Sustainability, 2023, 15(4): 3089.
[17]	Saigo Hiroshi. Mean Square Error and Variance Estimation of the Sample Ratio Under Lahiri's Design[J]. Statistics & Probability Letters, 2025, 216: 110277.
[18]	Mubarak Auwalu Saleh, Zubaida Said Ameen, Altrjman Chadi, et al. Computer-vision-based Statue Detection with Gaussian Smoothing Filter and EfficientDet[J]. Sustainability, 2022, 14(18): 11413.
[19]	Kang Yun, Wu Chongyan, Yu Bin. Time Series Clustering Based on Polynomial Fitting and Multi-order Trend Features[J]. Information Sciences, 2024, 678: 120939.
[20]	Gomolka Zbigniew, Zeslawska Ewa, Czuba Barbara, et al. Diagnosing Dyslexia in Early School-aged Children Using the LSTM Network and Eye Tracking Technology[J]. Applied Sciences, 2024, 14(17): 8004.
[21]	Wang Yuling, Lu Xiaofeng, Song Haiyang. Gaze Estimation for Stroke Patient Rehabilitation Training Based on Multi-modal CNN[J]. Industrial Control Computer, 2025, 38(10): 103-104, 107. (in Chinese)
[22]	Peng Huangguo, Chen Liang. Hybrid Transformer Gaze Estimation Model Combining Multi-scale Fusion and Attention Mechanism[J]. Science Technology and Engineering, 2025, 25(29): 12579-12585. (in Chinese)
[23]	Zhang Xucong, Sugano Yusuke, Fritz Mario, et al. MPIIGaze: Real-world Dataset and Deep Appearance-based Gaze Estimation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 41(1): 162-175.
[24]	Gunawardena Nishan, Jeewani Anupama Ginige, Javadi Bahman, et al. Performance Analysis of CNN Models for Mobile Device Eye Tracking with Edge Computing[J]. Procedia Computer Science, 2022, 207: 2291-2300.
[25]	Adebayo S, Dessing J C, McLoone Seán. SLYKLatent: A Learning Framework for Gaze Estimation Using Deep Facial Feature Learning[J]. IEEE Transactions on Human-Machine Systems, 2025, 55(3): 333-346.

Table 1
Algorithm	AED/pixel	MSE/pixel²	ROI-Hit@30/%	ROI-Hit@50/%	OCR/%	TS/pixel
Raw	24.6	742.1	58.3	31.2	0	43.2
G-Smooth [18]	22.7	659.4	60.8	33.0	7.7	39.4
Poly-Align [19]	20.3	621.0	64.2	38.9	17.5	36.2
LSTM-Gaze [20]	17.1	518.7	71.4	45.6	30.5	32.1
SOTA-CNN-Gaze [21]	18.9	561.3	69.2	44.1	26.7	34.8
SOTA-Transformer-Gaze [22]	17.8	534.6	72.6	47.9	31.4	31.6
IViT-STFGC	13.8	398.5	83.2	61.3	43.9	25.7

Table 2
Ablation variant	AED/pixel	MSE/pixel²	ROI-Hit@30/%	ROI-Hit@50/%	OCR/%	TS/pixel
Full	13.8	398.5	83.2	61.3	43.9	25.7
w/o PAA	17.6	522.3	75.1	50.2	29.4	32.8
w/o ORR	16.9	509.8	76.3	51.0	31.2	31.7
w/o DSF	15.8	438.6	80.1	57.4	38.5	28.9
Image-only	20.5	612.4	68.7	44.5	22.3	35.2
Temporal-only	18.3	541.7	70.4	47.2	26.9	33.5
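
The abstract reports a maximum increase of 48.6% in average Euclidean distance (AED) when parts of the model are removed. The arithmetic can be checked against the AED column of the ablation results above, assuming the figure is the relative AED increase over the full model; the variable names below are illustrative only.

```python
# AED values (pixels) copied from the ablation results above.
aed = {
    "Full": 13.8,
    "w/o PAA": 17.6,
    "w/o ORR": 16.9,
    "w/o DSF": 15.8,
    "Image-only": 20.5,
    "Temporal-only": 18.3,
}

full = aed["Full"]
# Relative AED increase of each variant over the full model, in percent.
increase = {
    name: round(100.0 * (value - full) / full, 1)
    for name, value in aed.items()
    if name != "Full"
}
worst = max(increase, key=increase.get)

for name, pct in increase.items():
    print(f"{name}: +{pct}%")
print("largest increase:", worst)
```

Under this assumption the largest increase belongs to the Image-only variant, which matches the 48.6% stated in the abstract.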

Table 3
Dataset	AED/pixel	MSE/pixel²	ROI-Hit@30/%	ROI-Hit@50/%	OCR/%	TS/pixel
MPIIGaze	15.2	420.3	78.1	59.4	39.5	27.9
GazeCapture	17.4	462.1	74.8	55.2	35.1	29.4
ETH-XGaze	14.7	410.2	80.5	60.7	41.2	26.3
MY-Solfeggio	13.8	398.5	83.2	61.3	43.9	25.7

Table 4
Algorithm	No occlusion	Blurred sliding occlusion	Covering sliding occlusion
Raw	23.5	28.3	31.7
G-Smooth	21.1	25.2	28.4
Poly-Align	18.9	22.5	25.8
LSTM-Gaze	16.7	20.6	23.3
IViT-STFGC	13.9	17.2	20.1
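
The range of 36.6% to 40.9% quoted in the abstract for different task scenarios can be recovered from the Raw and IViT-STFGC rows above, assuming the entries are offset errors in a common unit and the reduction is measured relative to Raw. The short check below is a sketch under that assumption; the dictionary names are illustrative.

```python
# Errors copied from the Raw and IViT-STFGC rows of the scenario results above.
raw = {
    "No occlusion": 23.5,
    "Blurred sliding occlusion": 28.3,
    "Covering sliding occlusion": 31.7,
}
corrected = {
    "No occlusion": 13.9,
    "Blurred sliding occlusion": 17.2,
    "Covering sliding occlusion": 20.1,
}

# Relative error reduction per scenario, in percent.
reduction = {
    scenario: round(100.0 * (raw[scenario] - corrected[scenario]) / raw[scenario], 1)
    for scenario in raw
}

for scenario, pct in reduction.items():
    print(f"{scenario}: -{pct}%")
print("range:", min(reduction.values()), "to", max(reduction.values()))
```

The smallest reduction occurs under covering sliding occlusion and the largest with no occlusion, so the correction degrades gradually as occlusion becomes more severe.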

Table 5
Method	LC	IJR	ILS/pixel	AFR
Raw	0.59	0.21	18.7	0.17
IViT-STFGC	0.82	0.09	11.4	0.06