This work addresses the problem of block-online processing for multi-channel speech enhancement. We consider several variants of a system that performs beamforming supported by DNN-based Voice Activity Detection (VAD), followed by post-filtering. The speaker is targeted by estimating relative transfer functions (RTFs) between microphones. Each block of the input signals is processed independently so that the method remains applicable in highly dynamic environments. The performance loss caused by the short length of the processing block is studied and compared with the results achieved when each recording is processed as a single block (batch processing).
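To make the block-wise idea concrete, the following is a minimal NumPy sketch under stated assumptions, not the method evaluated in this work. It assumes the input is a complex STFT of shape (frames, frequencies, microphones) and that per-frame speech-presence values `vad` in [0, 1] are already available (here they are simply given, whereas the system above obtains them from a DNN). The RTF is approximated by the principal eigenvector of the speech-weighted covariance, and an MVDR beamformer stands in for the beamforming stage; the post-filter is omitted. All function names and parameters are hypothetical.

```python
import numpy as np

def split_into_blocks(X, block_len):
    """Cut an STFT tensor X (frames, freqs, mics) into consecutive blocks."""
    return [X[s:s + block_len] for s in range(0, X.shape[0], block_len)]

def weighted_cov(X, w, eps=1e-10):
    """Frame-weighted spatial covariance per frequency: (freqs, mics, mics)."""
    return np.einsum('t,tfm,tfn->fmn', w, X, X.conj()) / (w.sum() + eps)

def estimate_rtf(C_speech, ref=0):
    """Assumed estimator: principal eigenvector, scaled so mic `ref` equals 1."""
    _, vecs = np.linalg.eigh(C_speech)   # eigenvalues in ascending order
    v = vecs[..., -1]                    # (freqs, mics)
    return v / v[:, [ref]]

def mvdr_weights(C_noise, h, loading=1e-6):
    """MVDR weights w = C^-1 h / (h^H C^-1 h), with light diagonal loading."""
    M = h.shape[-1]
    tr = np.trace(C_noise, axis1=-2, axis2=-1).real / M
    C = C_noise + (loading * tr[:, None, None] + 1e-12) * np.eye(M)
    Ci_h = np.linalg.solve(C, h[..., None])[..., 0]
    denom = np.einsum('fm,fm->f', h.conj(), Ci_h)
    return Ci_h / denom[:, None]

def enhance_block(X_blk, vad, ref=0):
    """Process one block using only statistics from that block.

    A practical system would need a fallback for blocks with no detected
    speech or no noise-only frames; this sketch does not handle that case.
    """
    C_s = weighted_cov(X_blk, vad)        # speech-dominated frames
    C_n = weighted_cov(X_blk, 1.0 - vad)  # noise-dominated frames
    h = estimate_rtf(C_s, ref)
    w = mvdr_weights(C_n, h)
    return np.einsum('fm,tfm->tf', w.conj(), X_blk)
```

Block-online operation then amounts to calling `enhance_block` on each element of `split_into_blocks(X, block_len)` together with the matching slice of `vad`, while batch processing corresponds to a single call on the whole recording.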
The experimental evaluation of the proposed method is performed on the large CHiME-4 datasets and on another dataset featuring a moving target speaker. The results are assessed in terms of objective criteria and the Word Error Rate (WER) achieved by a baseline Automatic Speech Recognition system, for which the enhancement method serves as a front-end. The results indicate that the proposed method is robust with respect to the length of the processing block and yields a significant WER improvement even for a short block length of 250 ms.
J. Málek, Z. Koldovský, and M. Boháč, “Block-Online Multi-Channel Speech Enhancement Using DNN-Supported Relative Transfer Function Estimates”, arXiv:1905.03632 [cs.SD], May 2019.