  • SPS Members: Free
  • IEEE Members: $11.00
  • Non-members: $15.00
  • Length: 14:42
04 May 2020

Internet of Things (IoT) applications typically require a large number of heterogeneous devices distributed in the environment, which can generate large amounts of data for wireless transmission, affecting the energy requirements and lifetime of the devices. One strategy is computing at the very edge: performing advanced processing directly on the IoT node reduces the amount of transmitted data and the associated power consumption. Thanks to recent improvements in embedded technology, commercial microcontrollers with power consumption in the milliwatt range and enough computational power to enable Artificial Intelligence (AI) at the "thing level" are now available. Sound event detection (SED) is an example of an emerging IoT-based application, driven by growing interest in sensing technologies for smart cities. The recent release of new datasets and challenges (UrbanSound8K, AudioSet, ESC-50, and DCASE) has led to substantial advances in accuracy and robustness. Unfortunately, state-of-the-art algorithms employ very large neural networks that are increasingly hungry for computational power and memory, preventing their deployment on energy-neutral, low-cost IoT devices. This show & tell presents our implementation of state-of-the-art SED at the very edge, obtained by optimizing deep learning techniques for very low-cost, low-power embedded platforms with severe constraints on memory footprint and computational power. Using a student-teacher approach, we make a state-of-the-art neural network for sound event detection (based on VGGish) fit on current commercial microcontrollers by achieving extreme compression factors (from 70 million to 20 thousand parameters). We implement our model on an ARM Cortex-M4 using the CMSIS-NN library, adopting an efficient layer-wise 8-bit quantization of buffers and weights. Our real-time embedded implementation achieves 68% accuracy on UrbanSound8K, with an inference time of 125 ms for each second of audio and a power consumption of 5.5 mW, in just 34.3 kB of RAM.
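The student-teacher compression mentioned above follows the general knowledge-distillation recipe: a small student network is trained to match the softened output distribution of a large teacher (here, a VGGish-based model) in addition to the ground-truth labels. The PyTorch sketch below illustrates that recipe under standard Hinton-style assumptions; the function name, temperature T, and mixing weight alpha are illustrative and not taken from the video.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=3.0, alpha=0.5):
        # Softened teacher distribution: a higher T exposes inter-class structure.
        soft_targets = F.softmax(teacher_logits / T, dim=1)
        log_probs = F.log_softmax(student_logits / T, dim=1)
        # KL term scaled by T^2 so its gradient magnitude stays comparable
        # to the hard-label term as T changes.
        kd = F.kl_div(log_probs, soft_targets, reduction="batchmean") * (T * T)
        # Standard cross-entropy against the ground-truth class labels.
        ce = F.cross_entropy(student_logits, labels)
        return alpha * kd + (1.0 - alpha) * ce

In training, teacher_logits come from a frozen forward pass of the large network on the same batch, so only the student's parameters receive gradients.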
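The layer-wise 8-bit quantization can likewise be sketched under one common assumption: CMSIS-NN's q7 kernels work on signed Qm.n fixed-point values with power-of-two scales, so each layer can choose the number of fractional bits that just fits its largest weight. The helper below is a hypothetical illustration of that per-layer scheme, not the authors' tooling; calibration of activation ranges is omitted for brevity.

    import numpy as np

    def quantize_layer_q7(weights):
        # Integer bits needed for the largest magnitude in this layer.
        max_abs = float(np.max(np.abs(weights)))
        int_bits = max(0, int(np.ceil(np.log2(max_abs)))) if max_abs > 0 else 0
        frac_bits = 7 - int_bits      # signed 8-bit: 1 sign bit + 7 value bits
        scale = 2.0 ** frac_bits      # power-of-two scale -> bit shifts at runtime
        q = np.clip(np.round(weights * scale), -128, 127).astype(np.int8)
        return q, frac_bits           # frac_bits determines the kernel's shift amount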
