Using Phonetic Posteriorgram Based Frame Pairing for Segmental Accent Conversion

Guanlong Zhao and Ricardo Gutierrez-Osuna

Department of Computer Science and Engineering, Texas A&M University, USA

Accent Conversion Audio Samples

Systems:

Baseline 1 (AC-SIM): pairs speech frames by acoustic similarity; this is the accent conversion (AC) baseline system
Baseline 2 (AC-DTW): aligns speech frames with Dynamic Time Warping; this is the voice conversion (VC) baseline system
Posteriorgram (AC-PPG): pairs speech frames by phonetic similarity; this is the proposed AC system
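To illustrate what pairing frames by phonetic similarity can look like, here is a minimal sketch. It assumes each frame is represented by a phonetic posteriorgram (a probability vector over phonetic classes) and that similarity is measured with a symmetric KL divergence. The distance measure, the pairing direction, the function names, and the toy values are all illustrative assumptions, not necessarily the exact procedure used in the AC-PPG system.

```python
import math

EPS = 1e-10  # assumed floor to avoid log(0); not a value from the paper


def symmetric_kl(p, q):
    """Symmetric KL divergence, KL(p||q) + KL(q||p), between two posterior vectors."""
    total = 0.0
    for pi, qi in zip(p, q):
        pi = max(pi, EPS)
        qi = max(qi, EPS)
        total += (pi - qi) * math.log(pi / qi)
    return total


def pair_frames(query_ppg, candidate_ppg):
    """For each query frame, return the index of the phonetically closest candidate frame."""
    pairs = []
    for frame in query_ppg:
        best = min(
            range(len(candidate_ppg)),
            key=lambda j: symmetric_kl(frame, candidate_ppg[j]),
        )
        pairs.append(best)
    return pairs


# Toy posteriorgrams over three hypothetical phonetic classes.
l1_ppg = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
l2_ppg = [[0.2, 0.7, 0.1], [0.7, 0.2, 0.1], [0.05, 0.15, 0.8], [0.6, 0.3, 0.1]]

pairs = pair_frames(l1_ppg, l2_ppg)
print(pairs)  # [1, 0, 2]
```

Each L1 frame is matched to the L2 frame with the most similar phonetic content, regardless of where the frames occur in time. This differs from the AC-DTW baseline, which aligns two recordings of the same sentence along the time axis.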

Notes:

The L1 speakers speak with a General American accent
ABA's native language is Arabic
HKK's native language is Korean
TNI's native language is Hindi
The L1 reference audio clips were resynthesized from their mel-cepstral coefficients (MCEPs) to match the acoustic quality of the other audio clips
Dataset (L2-ARCTIC corpus): https://psi.engr.tamu.edu/l2-arctic-corpus/

L2 speaker	L1 reference speech	L2 speech	Baseline 1	Baseline 2	Posteriorgram
ABA
HKK
TNI