Privacy and Utility of X-Vector Based Speaker Anonymization

Bibliographic Details
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 30, pp. 2383-2395
Main Authors: Srivastava, Brij Mohan Lal, Maouche, Mohamed, Sahidullah, Md, Vincent, Emmanuel, Bellet, Aurelien, Tommasi, Marc, Tomashenko, Natalia, Wang, Xin, Yamagishi, Junichi
Format: Journal Article
Language: English
Published: Piscataway: IEEE, 01.01.2022
ISSN: 2329-9290, 2329-9304
Description
Summary: We study the scenario where individuals (speakers) contribute to the publication of an anonymized speech corpus. Data users leverage this public corpus for downstream tasks, e.g., training an automatic speech recognition (ASR) system, while attackers may attempt to de-anonymize it using auxiliary knowledge. Motivated by this scenario, speaker anonymization aims to conceal speaker identity while preserving the quality and usefulness of speech data. In this article, we study x-vector based speaker anonymization, the leading approach in the VoicePrivacy Challenge, which converts the speaker's voice into that of a random pseudo-speaker. We show that the strength of anonymization varies significantly depending on how the pseudo-speaker is chosen. We explore four design choices for this step: the distance metric between speakers, the region of the speaker space where the pseudo-speaker is picked, its gender, and whether to assign it to one or all utterances of the original speaker. We assess the quality of anonymization from the perspective of the three actors involved in our threat model, namely the speaker, the user, and the attacker. To measure privacy and utility, we use the linkability score achieved by the attacker and the decoding word error rate achieved by an ASR model trained on the anonymized data, respectively. Experiments on LibriSpeech show that the best combination of design choices yields state-of-the-art performance in terms of both privacy and utility. Experiments on Mozilla Common Voice further show that it guarantees the same anonymization level against re-identification attacks among 50 speakers as original speech does among 20,000 speakers.
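To make the pseudo-speaker selection step described in the summary more concrete, the sketch below shows one plausible way to implement the four design choices (distance metric, region of the speaker space, gender, and per-speaker vs. per-utterance assignment). It assumes an external pool of x-vectors with gender labels, uses cosine distance, and averages a random subset of candidates; the function names, the averaging heuristic, and all parameter values are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance between two x-vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def select_pseudo_xvector(src_xvec, pool_xvecs, pool_genders,
                          target_gender, region="far",
                          n_candidates=100, rng=None):
    """Pick a pseudo-speaker x-vector for one source x-vector.

    pool_xvecs   : (N, D) array of x-vectors from an external pool
    pool_genders : length-N list of 'm'/'f' labels for the pool
    target_gender: gender constraint for the pseudo-speaker
    region       : 'far' keeps the most distant candidates,
                   'near' the closest, 'random' ignores distance
    """
    if rng is None:
        rng = np.random.default_rng()
    # Restrict the pool to the requested gender.
    idx = [i for i, g in enumerate(pool_genders) if g == target_gender]
    dists = np.array([cosine_distance(src_xvec, pool_xvecs[i]) for i in idx])
    order = np.argsort(dists)
    if region == "far":
        cand = order[-n_candidates:]
    elif region == "near":
        cand = order[:n_candidates]
    else:  # 'random'
        cand = np.arange(len(idx))
    # Average a random subset of the candidates to form the pseudo x-vector.
    chosen = rng.choice(cand, size=min(10, len(cand)), replace=False)
    return pool_xvecs[[idx[c] for c in chosen]].mean(axis=0)
```

Under these assumptions, speaker-level versus utterance-level assignment amounts to calling this function once per speaker and reusing the result for all of that speaker's utterances, or calling it independently for every utterance.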
DOI: 10.1109/TASLP.2022.3190741