TY - JOUR
AU - Lemercier, Jean-Marie
AU - Richter, Julius
AU - Welker, Simon
AU - Gerkmann, Timo
TI - StoRM: A Diffusion-Based Stochastic Regeneration Model for Speech Enhancement and Dereverberation
JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing
VL - 31
IS - arXiv:2212.11851
SN - 2329-9290
CY - New York, NY
PB - IEEE
M1 - PUBDB-2024-00134
M1 - arXiv:2212.11851
SP - 2724
EP - 2737
PY - 2023
AB - Diffusion models have shown great ability to bridge the performance gap between predictive and generative approaches for speech enhancement. We have shown that they may even outperform their predictive counterparts for non-additive corruption types or when they are evaluated on mismatched conditions. However, diffusion models suffer from a high computational burden, mainly because they require running a neural network for each reverse diffusion step, whereas predictive approaches require only one pass. As diffusion models are generative approaches, they may also produce vocalizing and breathing artifacts in adverse conditions. In comparison, in such difficult scenarios, predictive models typically do not produce such artifacts but tend to distort the target speech instead, thereby degrading the speech quality. In this work, we present a stochastic regeneration approach where an estimate given by a predictive model is provided as a guide for further diffusion. We show that the proposed approach uses the predictive model to remove the vocalizing and breathing artifacts while producing very high-quality samples thanks to the diffusion model, even in adverse conditions. We further show that this approach enables the use of lighter sampling schemes with fewer diffusion steps without sacrificing quality, thus reducing the computational burden by an order of magnitude. Source code and audio examples are available online.
LB - PUB:(DE-HGF)16
AN - WOS:001037791600002
DO - 10.1109/TASLP.2023.3294692
UR - https://bib-pubdb1.desy.de/record/601120
ER -