Automatic online recognition of surgical phases can provide insight that helps surgical teams make better decisions, potentially leading to better surgical outcomes. Current state-of-the-art AI approaches for surgical phase recognition utilize both spatial and temporal information to learn context awareness in surgical videos. We propose the use of EfficientNetV2 for spatial feature extraction, and we design a Cross-Enhancement Causal Transformer (C-ECT) by modifying previous transformer-based architectures for temporal modeling to achieve online surgical phase recognition. Additionally, we propose Cross-Attention Feature Fusion (CAFF) to better integrate the global and local information in our C-ECT. We show that our approach achieves 94.9% accuracy on the Cholec80 dataset, improving upon the current state-of-the-art method by approximately 3% in accuracy and precision, 2% in recall, and 4% in Jaccard score.
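To illustrate the general idea behind cross-attention fusion of global and local temporal features, here is a minimal NumPy sketch. It is not the authors' CAFF implementation; the function name, shapes, and residual-style fusion are illustrative assumptions, and a causal mask is included to reflect the online (no-future-frames) setting.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(local_feats, global_feats):
    """Illustrative cross-attention fusion (not the paper's exact CAFF).

    local_feats:  (T, D) per-frame (local) features, used as queries
    global_feats: (T, D) long-range (global) features, used as keys/values
    Returns (T, D) fused features: local features plus attended global context.
    """
    T, D = local_feats.shape
    # Scaled dot-product attention scores between local queries and global keys
    scores = local_feats @ global_feats.T / np.sqrt(D)   # (T, T)
    # Causal mask: frame t may only attend to frames <= t (online setting)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    attn = softmax(scores, axis=-1)                      # (T, T)
    context = attn @ global_feats                        # (T, D)
    # Residual-style fusion of local features with global context
    return local_feats + context

# Toy usage with random features for T=4 frames of dimension D=8
rng = np.random.default_rng(0)
fused = cross_attention_fuse(rng.standard_normal((4, 8)),
                             rng.standard_normal((4, 8)))
print(fused.shape)
```

The causal mask is what distinguishes the online setting from offline phase recognition: each frame's fused representation depends only on past and present frames, so the model can run during surgery rather than after it.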