
AV-SepFormer: Cross-Attention SepFormer for Audio-Visual Target Speaker Extraction

Accepted by ICASSP 2023

Jiuxin Lin, Xinyu Cai, Heinrich Dinkel, Jun Chen, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Zhiyong Wu, Yujun Wang, Helen Meng

Abstract

Visual information can serve as an effective cue for target speaker extraction (TSE) and is vital to improving extraction performance. In this paper, we propose AV-SepFormer, a SepFormer-based dual-scale attention model that utilizes cross- and self-attention to fuse and model features from the audio and visual modalities. AV-SepFormer splits the audio feature into a number of chunks equal to the length of the visual feature; self- and cross-attention are then employed to model the multi-modal features. Furthermore, we use a novel 2D positional encoding that introduces positional information both between and within chunks, providing significant gains over the traditional positional encoding. Our model has two key advantages: the time granularity of the chunked audio features is synchronized with the visual features, which alleviates the harm caused by the mismatch between audio and video sampling rates; and by combining self- and cross-attention, the feature fusion and speech extraction processes are unified within one attention paradigm. The experimental results show that AV-SepFormer significantly outperforms other existing methods.
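To make the chunking and 2D positional encoding concrete, here is a minimal NumPy sketch. The shapes (50 Hz audio features, 25 visual frames, 64 channels) and the half-and-half channel split between inter-chunk and intra-chunk encodings are illustrative assumptions, not the paper's exact configuration.

```python
import math
import numpy as np

def chunk_audio(audio_feat, num_chunks):
    """Split audio features [T, D] into [num_chunks, chunk_len, D] so the
    chunk axis matches the number of visual frames (padding the tail)."""
    T, D = audio_feat.shape
    chunk_len = math.ceil(T / num_chunks)
    pad = num_chunks * chunk_len - T
    padded = np.pad(audio_feat, ((0, pad), (0, 0)))
    return padded.reshape(num_chunks, chunk_len, D)

def sinusoid(positions, dim):
    """Standard sinusoidal encoding: integer positions [N] -> [N, dim]."""
    i = np.arange(dim // 2)
    freq = np.exp(-math.log(10000.0) * (2 * i / dim))
    angles = positions[:, None] * freq[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def positional_encoding_2d(num_chunks, chunk_len, dim):
    """2D positional encoding: one half of the channels encodes the chunk
    index (inter-chunk position), the other half the index within a chunk."""
    inter = sinusoid(np.arange(num_chunks), dim // 2)  # [C, dim/2]
    intra = sinusoid(np.arange(chunk_len), dim // 2)   # [L, dim/2]
    return np.concatenate([
        np.broadcast_to(inter[:, None, :], (num_chunks, chunk_len, dim // 2)),
        np.broadcast_to(intra[None, :, :], (num_chunks, chunk_len, dim // 2)),
    ], axis=-1)                                        # [C, L, dim]

# 1 s of audio features at 50 Hz, synced to 25 visual frames (25 fps video)
audio = np.random.default_rng(0).standard_normal((50, 64))
chunks = chunk_audio(audio, num_chunks=25)             # [25, 2, 64]
chunks = chunks + positional_encoding_2d(25, chunks.shape[1], 64)
```

Because consecutive audio frames are grouped into one chunk per visual frame, the chunk index lines up one-to-one with the video timeline, which is what lets cross-attention operate at a shared time granularity.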

Compare with MuSE [1] and AV-ConvTasnet [2]

(Each example is a two-speaker mixture; the original page embeds the mixture video, the ground-truth (GT) clips, and the separated audio for every system. Only the SI-SDR values are reproduced here.)

Example pair 1:
  System                        Speaker 1    Speaker 2
  Mixture (input)                2.93 dB     -3.23 dB
  Proposed                      13.18 dB     15.95 dB
  MuSE                          10.11 dB     12.89 dB
  AV-ConvTasnet                  7.41 dB     11.15 dB

Example pair 2:
  System                        Speaker 1    Speaker 2
  Mixture (input)                1.15 dB     -1.63 dB
  Proposed                      19.81 dB     15.42 dB
  MuSE                          16.84 dB     12.60 dB
  AV-ConvTasnet                 12.84 dB     11.57 dB

Example pair 3:
  System                        Speaker 1    Speaker 2
  Mixture (input)               -6.42 dB      6.11 dB
  Proposed                      13.78 dB     16.02 dB
  MuSE                          12.20 dB     13.76 dB
  AV-ConvTasnet                 10.45 dB     12.37 dB
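All numbers on this page are SI-SDR (scale-invariant signal-to-distortion ratio) in dB, higher is better. A minimal NumPy sketch of the metric, following the standard definition and not tied to the authors' evaluation code:

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB: project the estimate onto the reference,
    then measure the energy ratio of the projection to the residual."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference       # component aligned with the reference
    noise = estimate - target        # everything else counts as distortion
    return 10 * np.log10((target ** 2).sum() / ((noise ** 2).sum() + eps))

# Example: adding white noise at half the signal's amplitude gives ~6 dB
ref = np.random.default_rng(0).standard_normal(16000)
noisy = ref + 0.5 * np.random.default_rng(1).standard_normal(16000)
score = si_sdr(noisy, ref)
```

The projection step is what makes the metric scale-invariant: rescaling the estimate by any nonzero constant leaves the target/noise decomposition direction unchanged.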

Ablation study

Example pair 1:
  System                        Speaker 1    Speaker 2
  Mixture (input)               -7.13 dB      7.01 dB
  Proposed                      21.18 dB     14.08 dB
  w/o cross-attention           17.56 dB      7.27 dB
  w/o 2D positional encoding    17.83 dB     12.96 dB

Example pair 2:
  System                        Speaker 1    Speaker 2
  Mixture (input)                1.18 dB     -1.54 dB
  Proposed                      12.37 dB     20.76 dB
  w/o cross-attention            8.21 dB     16.84 dB
  w/o 2D positional encoding    11.44 dB     20.01 dB

Example pair 3:
  System                        Speaker 1    Speaker 2
  Mixture (input)                4.98 dB     -5.15 dB
  Proposed                      18.23 dB     10.78 dB
  w/o cross-attention           16.56 dB      9.43 dB
  w/o 2D positional encoding    17.11 dB      9.77 dB
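The ablation shows that removing cross-attention costs the most. For intuition, here is a single-head cross-attention sketch where visual frames act as queries over audio frames; the shapes and weight matrices are hypothetical, not the model's actual parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, Wq, Wk, Wv):
    """Single-head cross-attention: one modality queries the other, so the
    fused output stays aligned with the query modality's time axis."""
    Q = queries @ Wq                                   # [Tq, d]
    K = keys_values @ Wk                               # [Tk, d]
    V = keys_values @ Wv                               # [Tk, d]
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))     # [Tq, Tk]
    return attn @ V                                    # [Tq, d]

rng = np.random.default_rng(0)
d = 64
visual = rng.standard_normal((25, d))   # 25 visual frames (queries)
audio = rng.standard_normal((50, d))    # 50 audio frames (keys/values)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
fused = cross_attention(visual, audio, Wq, Wk, Wv)     # [25, 64]
```

Each fused visual-frame vector is a convex combination of audio-frame values, which is the mechanism that lets the attention weights select the target speaker's time-frequency content conditioned on the lip movements.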

[1] Z. Pan, R. Tao, C. Xu, et al., "MuSE: Multi-modal target speaker extraction with visual cues," in Proc. ICASSP, IEEE, 2021, pp. 6678–6682.

[2] J. Wu, Y. Xu, S. X. Zhang, et al., "Time domain audio visual speech separation," in Proc. ASRU, IEEE, 2019, pp. 667–673.