Abstract

Phonetic information is one of the most essential components of a speech signal, playing an important role in many speech processing tasks. However, it is difficult to integrate phonetic information into speaker verification systems, since phonetic information resides primarily at the frame level while speaker characteristics typically reside at the segment level. In deep neural network-based speaker verification, existing methods apply phonetic information only to frame-wise trained speaker embeddings. To address this weakness, this paper proposes phonetic adaptation and hybrid multi-task learning, and further combines them into the c-vector and simplified c-vector architectures. Experiments on the National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) 2010 show that the four proposed speaker embeddings outperform the baseline. The c-vector system performs best, providing over 30% and 15% relative improvements in equal error rate (EER) for the core-extended and 10 s–10 s conditions, respectively. On the NIST SRE 2016, 2018, and VoxCeleb datasets, the proposed c-vector approach improves performance even when there is a language mismatch within the training sets or between the training and evaluation sets. Extensive experimental results demonstrate the effectiveness and robustness of the proposed methods.
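
To make the multi-task idea in the abstract concrete, the following is a minimal PyTorch sketch of a network with a shared frame-level encoder, an auxiliary frame-level phonetic classification head, and a segment-level speaker head after statistics pooling. It illustrates only the general technique; the layer sizes, the loss weight alpha, and all names here are hypothetical, not the paper's exact c-vector architecture (see the authors' GitHub repository under Related Content for the actual implementation).

import torch
import torch.nn as nn

class MultiTaskSpeakerNet(nn.Module):
    """Illustrative multi-task model: shared frame-level encoder,
    a frame-level phonetic head, and a segment-level speaker head.
    A sketch of the general idea, not the paper's exact model."""
    def __init__(self, feat_dim=40, hidden_dim=512,
                 num_phones=40, num_speakers=1000):
        super().__init__()
        # Shared frame-level encoder (stand-in for the paper's frame-level layers).
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Frame-level phonetic classifier (auxiliary task).
        self.phone_head = nn.Linear(hidden_dim, num_phones)
        # Segment-level speaker embedding and classifier after statistics pooling.
        self.embedding = nn.Linear(2 * hidden_dim, hidden_dim)
        self.speaker_head = nn.Linear(hidden_dim, num_speakers)

    def forward(self, x):
        # x: (batch, frames, feat_dim)
        h = self.encoder(x)                       # frame-level representations
        phone_logits = self.phone_head(h)         # per-frame phonetic predictions
        # Statistics pooling: mean and standard deviation over the time axis.
        stats = torch.cat([h.mean(dim=1), h.std(dim=1)], dim=-1)
        emb = self.embedding(stats)               # segment-level speaker embedding
        speaker_logits = self.speaker_head(emb)
        return phone_logits, speaker_logits, emb

def hybrid_loss(phone_logits, speaker_logits, phone_labels, speaker_labels,
                alpha=0.3):
    # Weighted sum of the segment-level speaker loss and the frame-level
    # phonetic loss; alpha is a hypothetical weight, not a value from the paper.
    spk_loss = nn.functional.cross_entropy(speaker_logits, speaker_labels)
    phn_loss = nn.functional.cross_entropy(
        phone_logits.reshape(-1, phone_logits.size(-1)),
        phone_labels.reshape(-1))
    return spk_loss + alpha * phn_loss

At test time the speaker head would be discarded and the pooled embedding (emb) used for verification scoring; the auxiliary phonetic task serves only to inject frame-level phonetic information during training.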

Document Type

Article

Publication Date

12-5-2019

Notes/Citation Information

Published in EURASIP Journal on Audio, Speech, and Music Processing, 2019, article no. 19, pp. 1-17.

© The Author(s). 2019

This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Digital Object Identifier (DOI)

https://doi.org/10.1186/s13636-019-0166-8

Funding Information

This work was supported by the National Natural Science Foundation of China under grant nos. 61403224 and U1836219.

Related Content

The Switchboard, Fisher English, and all NIST SRE datasets are available from the Linguistic Data Consortium (LDC, https://www.ldc.upenn.edu/).

The VoxCeleb 1 and 2 corpora are available from http://www.robots.ox.ac.uk/~vgg/data/voxceleb/.

The LibriSpeech corpus and the noise and reverberation datasets (MUSAN and RIR) can be downloaded from http://www.openslr.org/resources.php.

The source code is available at https://github.com/mycrazycracy/speaker-embedding-with-phonetic-information.
