SIGN LANGUAGE RECOGNITION VIA DEFORMABLE 3D CONVOLUTIONS AND MODULATED GRAPH CONVOLUTIONAL NETWORKS
Katerina Papadimitriou (University of Thessaly); Gerasimos Potamianos (ECE, University of Thessaly)
Automatic sign language recognition (SLR) remains challenging,
especially when employing RGB video alone (i.e., with no depth
or special glove-based input) and under a signer-independent (SI)
framework, due to inter-personal signing variation. In this paper, we
address SI isolated SLR from RGB video, proposing an innovative
deep-learning framework that leverages multi-modal appearance-
and skeleton-based information. Specifically, we introduce three
components to SLR for the first time: (i) a modified ResNet(2+1)D
network that captures signing appearance, where the spatial and
temporal convolutions are replaced by their deformable counterparts,
combining strong spatial modeling capacity with motion-aware
adaptability; (ii) a novel
spatio-temporal graph convolutional network (ST-GCN) that
integrates a GCN variant with weight and affinity modulation,
modeling correlations between body joints beyond the physical
human skeleton structure, followed by a self-attention layer and a
temporal convolution; and (iii) the "PIXIE" 3D human pose and
shape regressor, which generates the 3D joint-rotation
parameterization used for ST-GCN graph construction. Both appearance- and
skeleton-based streams are combined in an ensemble in the proposed
system, which is evaluated on two datasets of isolated signs, one in
Turkish and one in Greek. Our system outperforms the state-of-the-art
on the Greek set, yielding a 53% relative error-rate reduction (2.45%
absolute), while performing on par with the best reported system on
the Turkish set.
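
To make component (ii) concrete, below is a minimal PyTorch sketch of a single modulated graph convolution layer as we read it from the abstract: a shared linear transform whose output is scaled per joint (weight modulation), then aggregated over a skeleton adjacency augmented by a learnable offset (affinity modulation). Class and parameter names are hypothetical; adjacency normalization, the self-attention layer, and the temporal convolution are omitted for brevity.

import torch
import torch.nn as nn

class ModulatedGraphConv(nn.Module):
    # Sketch: H = (A + dA) @ ((X @ W) * M), where A is the fixed skeleton
    # adjacency, dA a learnable affinity offset, and M holds per-joint
    # weight-modulation vectors.
    def __init__(self, in_dim, out_dim, adj):
        super().__init__()
        num_joints = adj.size(0)
        self.register_buffer("adj", adj)  # physical skeleton (J x J)
        self.delta_adj = nn.Parameter(torch.zeros(num_joints, num_joints))  # affinity modulation
        self.weight = nn.Parameter(torch.empty(in_dim, out_dim))  # shared transform
        self.modulation = nn.Parameter(torch.ones(num_joints, out_dim))  # weight modulation
        nn.init.xavier_uniform_(self.weight)

    def forward(self, x):
        # x: (batch, joints, in_dim)
        h = (x @ self.weight) * self.modulation  # transform, then modulate per joint
        a = self.adj + self.delta_adj  # learned links beyond the physical skeleton
        return a @ h  # aggregate over (learned) joint neighborhoods

# Toy usage: 5-joint chain skeleton with 6-D joint features
# (e.g., a 3D joint-rotation parameterization).
J = 5
adj = torch.eye(J)
for i in range(J - 1):
    adj[i, i + 1] = adj[i + 1, i] = 1.0
layer = ModulatedGraphConv(in_dim=6, out_dim=16, adj=adj)
out = layer(torch.randn(2, J, 6))  # -> (2, 5, 16)

In the full system, a stack of such layers over PIXIE-derived joint rotations would form the skeleton stream, whose per-class scores are then fused (e.g., by weighted averaging) with those of the deformable appearance stream.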