Blind Extraction of Target Speech Source Guided by Piloting and Deflation


This page presents a novel, robust procedure for extracting a speaker of interest (SOI) from a mixture of audio sources. The SOI is estimated blindly via independent vector extraction. The recently proposed constant separating vector (CSV) model is employed, which improves the estimation of moving sources. The blind algorithm is guided towards the SOI by a process called piloting, which is based on pre-trained frame-wise speaker identification. On challenging data, the limitations of this guidance may cause an incorrect speaker to be extracted. To identify such cases, a criterion that non-intrusively assesses the quality of the estimated SOI is proposed. It uses the same model as the speaker identification, so no additional training is required. Based on this criterion, a “deflation” approach to extraction is presented: if an incorrect source is estimated, it is subtracted from the mixture and the extraction of the SOI is repeated on the reduced mixture.
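The extract, check, and deflate loop described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the authors' implementation: `extract` and `is_target` are hypothetical stand-ins for the guided CSV-AuxIVE extractor and the non-intrusive quality criterion, and the subtraction is assumed to be a per-channel least-squares projection on time-domain signals, which the page does not specify.

```python
import numpy as np

def deflate(mixture, source):
    """Remove the least-squares contribution of `source` from every channel.

    mixture: (channels, samples) array; source: (samples,) array.
    Assumption: subtraction is a per-channel projection, chosen for illustration.
    """
    gains = mixture @ source / (source @ source)  # per-channel projection gain
    return mixture - np.outer(gains, source)

def extract_with_deflation(mixture, extract, is_target, max_rounds=2):
    """Extract a source, check it, and deflate once per rejected estimate.

    `extract` and `is_target` are hypothetical callables supplied by the caller.
    """
    current = mixture
    for _ in range(max_rounds):
        estimate = extract(current)
        if is_target(estimate):
            return estimate
        current = deflate(current, estimate)  # remove the wrong source, retry
    return None  # no estimate passed the quality check

# Toy usage with synthetic signals (not real speech).
rng = np.random.default_rng(0)
male = rng.standard_normal(1000)
female = rng.standard_normal(1000)
mixing = np.array([[1.0, 0.3], [0.8, 0.5]])  # male source is dominant
mixture = mixing @ np.stack([male, female])
candidates = [male, female]

def toy_extract(x):
    # Stand-in extractor: returns the candidate most present in the mixture.
    scores = [np.linalg.norm(x @ c) / np.linalg.norm(c) for c in candidates]
    return candidates[int(np.argmax(scores))]

def toy_is_target(estimate):
    # Stand-in criterion: accept only estimates matching the female source.
    return abs(np.corrcoef(estimate, female)[0, 1]) > 0.9

result = extract_with_deflation(mixture, toy_extract, toy_is_target)
```

In this toy setup the first pass returns the dominant male source, the check rejects it, and the second pass on the deflated mixture returns the female source, mirroring Case 2 below.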

Examples: Target speaker extraction from mixtures of two speakers (a female and a male voice)

Case 1: The guided CSV-AuxIVE successfully extracts both target sources using piloting alone; deflation is not necessary

Audio samples: mixture; extracted female speech; extracted male speech

Case 2: The guided CSV-AuxIVE with piloting successfully extracts only the male voice and fails to extract the female voice (it extracts the male voice again instead)

This failure is caused by the short duration of the mixture (4 s). The frame-wise speaker identification does not find enough frames corresponding to the female speaker to identify her. However, the error is detected automatically by the non-intrusive criterion for assessing the extraction quality. To correct it, deflation is applied to the mixture, after which the female speaker is extracted successfully.
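The role of frame counts can be illustrated with a small sketch. It assumes, purely for illustration, that the speaker-identification model outputs a posterior per frame and per speaker, and that an estimate is accepted when a sufficient fraction of its frames is attributed to the SOI. The function names, the thresholds, and the decision rule itself are hypothetical and are not taken from the paper.

```python
import numpy as np

def soi_frame_ratio(posteriors, soi_index, threshold=0.5):
    """Fraction of frames whose SOI posterior exceeds `threshold`.

    posteriors: (frames, speakers) array of frame-wise speaker-ID scores.
    Assumption: this simple ratio stands in for the actual quality criterion.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    return float(np.mean(posteriors[:, soi_index] > threshold))

def accept_estimate(posteriors, soi_index, min_ratio=0.5):
    """Accept the estimate only if enough frames are attributed to the SOI."""
    return soi_frame_ratio(posteriors, soi_index) >= min_ratio

# Toy posteriors for four frames and two speakers (column 0 and column 1).
toy_posteriors = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
ratio_speaker_1 = soi_frame_ratio(toy_posteriors, soi_index=1)
accepted_speaker_1 = accept_estimate(toy_posteriors, soi_index=1)
accepted_speaker_0 = accept_estimate(toy_posteriors, soi_index=0)
```

With only a handful of frames, a single misattributed frame shifts the ratio substantially, which is one way to picture why a 4 s mixture gives the guidance little to work with.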

Audio samples: mixture; failed extraction of female speech (male speech is erroneously extracted); extracted female speech after deflation; extracted male speech