AV-SepFormer: Cross-Attention SepFormer for Audio-Visual Target Speaker Extraction
Accepted by ICASSP 2023
Jiuxin Lin, Xinyu Cai, Heinrich Dinkel, Jun Chen, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Zhiyong Wu, Yujun Wang, Helen Meng
Abstract
Visual information can serve as an effective cue for target speaker extraction (TSE) and is vital to improving extraction performance. In this paper, we propose AV-SepFormer, a SepFormer-based dual-scale attention model that utilizes cross- and self-attention to fuse and model features from the audio and visual modalities.
AV-SepFormer splits the audio features into a number of chunks equal to the length of the visual feature sequence; self- and cross-attention are then employed to model the multi-modal features.
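As an illustration of this chunking step, the sketch below splits an encoded audio feature sequence into as many chunks as there are visual frames. The function name, the trimming strategy, and the shapes are assumptions for illustration; the paper's exact segmentation and padding scheme may differ.

```python
import numpy as np

def chunk_audio_to_visual(audio_feat, num_visual_frames):
    """Split audio features of shape [T_a, D] into num_visual_frames chunks.

    Illustrative sketch: each chunk has length T_a // num_visual_frames,
    so one chunk of audio frames corresponds to one visual frame.
    Any leftover audio frames are trimmed (a real model might pad instead).
    """
    t_a, d = audio_feat.shape
    chunk_len = t_a // num_visual_frames
    trimmed = audio_feat[: chunk_len * num_visual_frames]
    # Result shape: [num_chunks, chunk_len, D]
    return trimmed.reshape(num_visual_frames, chunk_len, d)
```

For example, 100 audio frames aligned to 25 visual frames yields 25 chunks of 4 audio frames each, so attention across chunks operates at the visual frame rate.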
Furthermore, we use a novel 2D positional encoding that introduces positional information both between and within chunks, providing significant gains over the traditional positional encoding.
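The idea behind the 2D positional encoding can be sketched as the sum of a standard sinusoidal encoding over chunk indices (inter-chunk) and one over positions within a chunk (intra-chunk). This minimal version assumes both encodings share the model dimension and are simply added and broadcast, which may differ from the paper's exact formulation.

```python
import numpy as np

def sinusoidal_pe(length, dim):
    """Standard sinusoidal positional encoding, shape [length, dim] (dim even)."""
    pos = np.arange(length)[:, None]
    i = np.arange(dim // 2)[None, :]
    angle = pos / np.power(10000.0, 2 * i / dim)
    pe = np.zeros((length, dim))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

def pe_2d(num_chunks, chunk_len, dim):
    """Illustrative 2D PE: inter-chunk + intra-chunk, shape [C, K, D]."""
    inter = sinusoidal_pe(num_chunks, dim)[:, None, :]  # varies across chunks
    intra = sinusoidal_pe(chunk_len, dim)[None, :, :]   # varies within a chunk
    return inter + intra
```

Every position thus receives a distinct encoding that tells the model both which chunk it belongs to and where it sits inside that chunk.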
Our model has two key advantages: the time granularity of the chunked audio features is synchronized with that of the visual features, which alleviates the harm caused by the mismatch between audio and video sampling rates; and by combining self- and cross-attention, the feature fusion and speech extraction processes are unified within a single attention paradigm.
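The cross-attention used for fusion can be illustrated with a single-head scaled dot-product sketch; as an assumption for illustration, audio chunk features act as queries while visual features serve as keys and values.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key_value, d_k):
    """Single-head scaled dot-product cross-attention (illustrative).

    query:     [T_q, D] audio-side features (assumed role)
    key_value: [T_kv, D] visual-side features (assumed role)
    Returns:   [T_q, D] visually-informed audio features.
    """
    scores = query @ key_value.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ key_value
```

In this view, fusion is not a separate concatenation step: each audio chunk directly attends to the visual sequence, so fusion and extraction share the same attention machinery.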
The experimental results show that AV-SepFormer significantly outperforms existing methods.

Comparison with MuSE [1] and AV-ConvTasnet [2]
Demo group 1 (SI-SDR in dB for the two clips in this group; mixture videos and ground-truth (GT) audio are playable on the original demo page):

Method                 Clip 1    Clip 2
Mixture (input)          2.93     -3.23
Proposed (separated)    13.18     15.95
MuSE (separated)        10.11     12.89
AV-ConvTasnet (sep.)     7.41     11.15
Demo group 2:

Method                 Clip 1    Clip 2
Mixture (input)          1.15     -1.63
Proposed (separated)    19.81     15.42
MuSE (separated)        16.84     12.60
AV-ConvTasnet (sep.)    12.84     11.57
Demo group 3:

Method                 Clip 1    Clip 2
Mixture (input)         -6.42      6.11
Proposed (separated)    13.78     16.02
MuSE (separated)        12.20     13.76
AV-ConvTasnet (sep.)    10.45     12.37
Ablation study
Demo group 1 (SI-SDR in dB for the two clips in this group; mixture videos and ground-truth (GT) audio are playable on the original demo page):

Method                        Clip 1    Clip 2
Mixture (input)                -7.13      7.01
Proposed (separated)           21.18     14.08
w/o cross-attention (sep.)     17.56      7.27
w/o 2D pos. encoding (sep.)    17.83     12.96
Demo group 2:

Method                        Clip 1    Clip 2
Mixture (input)                 1.18     -1.54
Proposed (separated)           12.37     20.76
w/o cross-attention (sep.)      8.21     16.84
w/o 2D pos. encoding (sep.)    11.44     20.01
Demo group 3:

Method                        Clip 1    Clip 2
Mixture (input)                 4.98     -5.15
Proposed (separated)           18.23     10.78
w/o cross-attention (sep.)     16.56      9.43
w/o 2D pos. encoding (sep.)    17.11      9.77
[1] Z. Pan, R. Tao, C. Xu, et al., "MuSE: Multi-modal target speaker extraction with visual cues," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6678-6682.
[2] J. Wu, Y. Xu, S. X. Zhang, et al., "Time domain audio visual speech separation," in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2019, pp. 667-673.