  • SPS
    Members: Free
    IEEE Members: $11.00
    Non-members: $15.00
    Length: 00:12:45
09 May 2022

Previous work on audio-visual (AV) alignment mainly focuses on frame-level synchronization and neglects clip-wise matching. We focus on AV parsing of fully unconstrained data, where audio and visual events do not necessarily co-occur. We provide a video-enhanced Audioset dataset covering 376 events to investigate parsing in this mismatched setting. To our knowledge, this is the first time that AV event parsing and detection have been studied in a clip-wise matching scenario. Experiments show that our proposed method substantially improves video parsing accuracy on tagging and detection. Further, a parsing model pretrained on our dataset can help accurately locate the time spans in which audio and video are in sync.
