Less is more: A unified architecture for device-directed speech detection with multiple invocation types
Ognjen Rudovic (Apple); Wonil Chang (Apple); Vineet Garg (Apple); Pranay Dighe (Apple); Pramod Jaya Simha (Apple Inc); John Berkowitz (Apple); Ahmed Hussen Abdelaziz (Apple); Erik Marchi (Apple); Sachin Kajarekar (Apple); Saurabh Adya (Apple)
SPS
Suppressing unintended device invocations, whether caused by speech that sounds like the wake word or by accidental button presses, is critical for a good user experience and is referred to as False-Trigger-Mitigation (FTM). When multiple invocation options are available, the traditional approach to FTM is to use either invocation-specific models or a single model for all invocations. Both approaches are sub-optimal: the memory cost of the former grows linearly with the number of invocation options, which is prohibitive for on-device deployment, and it does not take advantage of shared training data; the latter is unable to accurately capture acoustic differences across invocation types. To this end, we propose a Unified Acoustic Detector (UAD) for FTM when multiple invocation options are available on device. The proposed UAD is trained using a multi-task learning framework, in which a jointly trained acoustic encoder is augmented with invocation-specific classification layers. In the context of the FTM task, we show for the first time that by sharing the model architecture across invocations (thus keeping the model size similar to that of a monolithic model used for a single invocation type), we can not only match but substantially improve on the accuracy of the invocation-specific models. In particular, in the challenging case of touch-based invocation, we obtain 50% and 35% relative improvements in false positive rate at 99% true positive rate compared with a single-output model for both invocations and with separate models per invocation, respectively. Furthermore, we propose streaming and non-streaming variants of the UAD and show that both outperform a traditional ASR-based approach to FTM.
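The shared-encoder design described above can be illustrated with a minimal sketch. It is not the authors' implementation: a single dense layer stands in for the acoustic encoder, the weights are random rather than trained, and the class name `UnifiedDetectorSketch`, the feature and hidden sizes, and the invocation labels `"voice"` and `"touch"` are all hypothetical. The sketch only shows how one encoder feeds a small classification head per invocation type, and why the parameter count stays close to that of a single model.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Return (weights, biases) for a dense layer with small random weights."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(layer, x):
    """Apply a dense layer to the input vector x."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def count_params(layer):
    """Count the weights and biases in one dense layer."""
    weights, biases = layer
    return sum(len(row) for row in weights) + len(biases)


class UnifiedDetectorSketch:
    """Hypothetical stand-in: one shared encoder, one output head per invocation type."""

    def __init__(self, n_features, n_hidden, invocation_types, seed=0):
        rng = random.Random(seed)
        # Shared across all invocation types (assumed here to be one dense layer).
        self.encoder = make_layer(n_features, n_hidden, rng)
        # One small classification layer per invocation type.
        self.heads = {name: make_layer(n_hidden, 1, rng) for name in invocation_types}

    def score(self, features, invocation_type):
        """Return a device-directedness score in (0, 1) from the matching head."""
        hidden = [math.tanh(h) for h in dense(self.encoder, features)]
        logit = dense(self.heads[invocation_type], hidden)[0]
        return 1.0 / (1.0 + math.exp(-logit))

    def num_params(self):
        """Total parameters: shared encoder plus all heads."""
        return count_params(self.encoder) + sum(
            count_params(head) for head in self.heads.values()
        )


# Illustrative sizes only; none of these values come from the paper.
model = UnifiedDetectorSketch(n_features=40, n_hidden=64, invocation_types=["voice", "touch"])
single = UnifiedDetectorSketch(n_features=40, n_hidden=64, invocation_types=["voice"])

unified_params = model.num_params()        # shared encoder + 2 heads
separate_params = 2 * single.num_params()  # two full invocation-specific models
touch_score = model.score([0.0] * 40, "touch")
```

With these toy sizes the unified model has 2754 parameters against 5378 for two separate models, so each added invocation type costs only one extra head. In a multi-task training setup of this kind, each training example would typically update the shared encoder and only the head for its own invocation type; training is omitted from the sketch.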