Domain Robust, Fast, and Compact Neural Language Models
Kazuki Irie, Alexander Gerstenberger, Pavel Golik, Eugen Beck, Hermann Ney
Despite advances in neural language modeling, obtaining a good model on a large-scale multi-domain dataset remains a difficult task. We propose training methods for building neural language models for such a task that are not only domain robust, but also reasonable in model size and fast to evaluate. We combine knowledge distillation from pre-trained domain expert language models with the noise contrastive estimation (NCE) loss. Knowledge distillation allows us to train a single student model that is both compact and domain robust, while the NCE loss makes the model self-normalized, which enables fast evaluation. We conduct experiments on a large English multi-domain speech recognition dataset provided by AppTek. The resulting student model is the size of a single domain expert, yet it gives perplexities similar to those of the various teacher models on their respective expert domains; it is self-normalized, allowing for 30% faster first-pass decoding than naive models that require the full softmax computation; and it gives improvements of more than 8% relative in word error rate over a large multi-domain 4-gram count model trained on more than 10B words.
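For readers unfamiliar with how the two ingredients fit together, the sketch below shows one plausible way to combine a knowledge distillation term (a KL divergence against a teacher's soft targets) with an NCE term that encourages the student's unnormalized output scores to be self-normalized. This is a minimal illustration in PyTorch, not the authors' implementation: the interpolation weight `alpha`, the noise sampling scheme, and all function names are assumptions.

```python
# Minimal sketch (not the authors' formulation) of a distillation + NCE loss
# for training a self-normalized student language model.
import torch
import torch.nn.functional as F


def nce_loss(logits, targets, noise_dist, k=100):
    """Binary NCE: discriminate the true next word from k noise samples.

    logits:     (N, V) unnormalized student scores for N prediction positions
    targets:    (N,)   true next-word indices
    noise_dist: (V,)   noise (e.g. unigram) distribution used for sampling
    Treating exp(score) as an approximately normalized probability pushes the
    model toward self-normalization, so no full softmax is needed at test time.
    """
    n = logits.size(0)
    noise = torch.multinomial(noise_dist, n * k, replacement=True).view(n, k)
    s_true = logits.gather(1, targets.unsqueeze(1)).squeeze(1)   # scores of true words
    s_noise = logits.gather(1, noise)                            # scores of noise words
    log_kq_true = torch.log(k * noise_dist[targets] + 1e-10)
    log_kq_noise = torch.log(k * noise_dist[noise] + 1e-10)
    # P(data | w) = sigmoid(score - log(k * q(w))); true words labeled 1, noise 0.
    loss_true = F.binary_cross_entropy_with_logits(
        s_true - log_kq_true, torch.ones_like(s_true), reduction="sum")
    loss_noise = F.binary_cross_entropy_with_logits(
        s_noise - log_kq_noise, torch.zeros_like(s_noise), reduction="sum")
    return (loss_true + loss_noise) / n


def distill_nce_loss(student_logits, teacher_probs, targets, noise_dist,
                     k=100, alpha=0.5):
    """Interpolate distillation (KL to the teacher's soft targets, here a
    domain expert or a mixture of experts) with the NCE term above.
    The interpolation is only one possible combination, chosen for clarity."""
    kd = F.kl_div(F.log_softmax(student_logits, dim=-1), teacher_probs,
                  reduction="batchmean")
    return alpha * kd + (1.0 - alpha) * nce_loss(student_logits, targets,
                                                 noise_dist, k)
```

In this sketch the distillation term still requires a softmax over the student output during training, while the NCE term is what makes the trained scores usable without normalization at decoding time; how the two terms are actually balanced and sampled in the paper may differ.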