I am a speech researcher and engineer. My research interests broadly focus on the intersection of speech modification and speech perception.
I am also interested in speech synthesis (e.g., TTS) and speech recognition (E2E models in particular).
Specifically, I spent five years working on topics in accent and voice conversion during my Ph.D. training.
If you would like to reach out to me, my email address is zhao at aggienetwork dot com.
How do I pronounce my name? In Pinyin, it is written as Guàn-Lóng Zhào; the tones are fourth, second, and fourth. Mapped to American English phonemes, it sounds roughly like Guan-Loan Chao. 🌈 Cheers!
Software Engineer @ Google (Speech), August 2020–Present
Research and Teaching Assistant @ Texas A&M University (Department of Computer Science and Engineering), September 2015–May 2020
Software Engineering Intern @ Google (Geo Machine Perception), May–August 2019
Software Engineering Intern @ Google (Speech), June–August 2018
- S. Ding, G. Zhao, and R. Gutierrez-Osuna, "Accentron: Foreign accent conversion to arbitrary non-native speakers using zero-shot learning,"
Computer Speech & Language, vol. 72, 2022.
- G. Zhao, S. Ding, and R. Gutierrez-Osuna, "Converting foreign accent speech without a reference," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 2367–2381, 2021.
- I. Lučić Rehman, A. Silpachai, J. Levis, G. Zhao, and R. Gutierrez-Osuna, "The English pronunciation of Arabic speakers: A data-driven approach to segmental error identification," Language Teaching Research,
2020.
- G. Zhao and R. Gutierrez-Osuna, "Using phonetic posteriorgram based frame pairing for segmental accent conversion," IEEE/ACM Transactions on Audio,
Speech, and Language Processing, vol. 27, no. 10, pp. 1649–1660, 2019.
- S. Ding, G. Zhao, C. Liberatore, and R. Gutierrez-Osuna, "Learning structured sparse representations for voice conversion," IEEE/ACM Transactions
on Audio, Speech, and Language Processing, vol. 28, pp. 343–354, 2019.
- S. Ding, C. Liberatore, S. Sonsaat, I. Lučić Rehman, A. Silpachai, G. Zhao, E. Chukharev-Hudilainen, J. Levis, and R. Gutierrez-Osuna, "Golden speaker
builder – An interactive tool for pronunciation training," Speech Communication, vol. 115, pp. 51–66, 2019.
- A. Hair, G. Zhao, B. Ahmed, K. Ballard, and R. Gutierrez-Osuna, "Assessing posterior-based mispronunciation detection on field-collected recordings from child speech therapy sessions,"
in Interspeech, 2021, pp. 2936–2940.
- A. Silpachai, I. Lučić Rehman, T. A. Barriuso, J. Levis, E. Chukharev-Khudilaynen, G. Zhao, and R. Gutierrez-Osuna, "Effects of voice type and task on L2 learners' awareness of pronunciation errors,"
in Interspeech, 2021, pp. 1952–1956.
- S. Ding, G. Zhao, and R. Gutierrez-Osuna, "Improving the speaker identity of non-parallel many-to-many voice conversion with adversarial speaker recognition," in Interspeech, 2020, pp. 776–780.
- A. Das, G. Zhao, J. Levis, E. Chukharev-Hudilainen, and R. Gutierrez-Osuna, "Understanding the effect of voice quality and accent on talker similarity," in Interspeech, 2020, pp. 1763–1767.
- G. Zhao, S. Ding, and R. Gutierrez-Osuna, "Foreign accent conversion by synthesizing speech from phonetic posteriorgrams,"
in Interspeech, 2019, pp. 2843–2847.
- G. Zhao, S. Sonsaat, A. Silpachai, I. Lučić Rehman, E. Chukharev-Hudilainen, J. Levis, and R. Gutierrez-Osuna, "L2-ARCTIC: A non-native English
speech corpus," in Interspeech, 2018, pp. 2783–2787.
- S. Ding, G. Zhao, C. Liberatore, and R. Gutierrez-Osuna, "Improving sparse representations in exemplar-based voice conversion with a
phoneme-selective objective function," in Interspeech, 2018, pp. 476–480.
- C. Liberatore, G. Zhao, and R. Gutierrez-Osuna, "Voice conversion through residual warping in a sparse, anchor-based representation of speech,"
in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018, pp. 5284–5288.
- G. Zhao, S. Sonsaat, J. Levis, E. Chukharev-Hudilainen, and R. Gutierrez-Osuna, "Accent conversion using phonetic posteriorgrams,"
in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018, pp. 5314–5318.
- G. Angello, A. B. Manam, G. Zhao, and R. Gutierrez-Osuna, "Training behavior of successful tacton-phoneme learners,"
in IEEE Haptics Symposium (WIP), 2018.
- G. Zhao and R. Gutierrez-Osuna, "Exemplar selection methods in voice conversion," in IEEE International Conference on Acoustics, Speech, and
Signal Processing (ICASSP), 2017, pp. 5525–5529.
- Y. Liu, G. Zhao, B. Gong, Y. Li, R. Raj, N. Goel, S. Kesav, S. Gottimukkala, Z. Wang, W. Ren, and D. Tao, "Chapter 10 — Image dehazing: Improved techniques,"
in Deep Learning through Sparse and Low-Rank Modeling, Elsevier, 2019, pp. 251–262.
- G. Zhao, Q. Wang, H. Lu, Y. Huang, and I. Lopez Moreno, "Augmenting transformer-transducer based speaker change detection with token-level training loss,"
arXiv preprint arXiv:2211.06482, 2022.
- B. Labrador*, G. Zhao*, I. Lopez Moreno*, A. Scorza Scarpati, L. Fowl, and Q. Wang, "Exploring sequence-to-sequence transformer-transducer models for keyword spotting,"
arXiv preprint arXiv:2211.06478, 2022. *Equal contribution.
- Q. Wang, Y. Huang, H. Lu, G. Zhao, and I. Lopez Moreno, "Highly efficient real-time streaming and fully on-device speaker diarization with multi-stage clustering,"
arXiv preprint arXiv:2210.13690, 2022.
- Y. Liu, G. Zhao, B. Gong, Y. Li, R. Raj, N. Goel, S. Kesav, S. Gottimukkala, Z. Wang, W. Ren, and D. Tao, "Improved techniques for learning to dehaze and
beyond: A collective study," arXiv preprint arXiv:1807.00202, 2018.
- Y. Liu and G. Zhao, "PAD-Net: A perception-aided single image dehazing network," arXiv preprint arXiv:1805.03146, 2018.
- A. Datta, G. Zhao, B. Ramabhadran, and E. Weinstein, "LSTM acoustic models learn to align and pronounce with graphemes," arXiv preprint arXiv:2008.06121, 2020.
(Work done as an intern at Google NYC during summer 2018.)
- I. Lučić Rehman, A. Silpachai, J. Levis, G. Zhao, and R. Gutierrez-Osuna, "Pronunciation errors — A systematic approach to diagnosis,"
in L2 Pronunciation Research Workshop: Bridging the Gap between Research and Practice, 2019, pp. 23–24.
- S. Sonsaat, E. Chukharev-Hudilainen, I. Lučić Rehman, A. Silpachai, J. Levis, G. Zhao, S. Ding, C. Liberatore, and R. Gutierrez-Osuna, "Golden Speaker Builder, an interactive tool for pronunciation training: User
studies," in 6th International Conference on English Pronunciation: Issues & Practices (EPIP6), 2019, p. 72.
- S. Ding, C. Liberatore, G. Zhao, S. Sonsaat, E. Chukharev-Hudilainen, J. Levis, and R. Gutierrez-Osuna, "Golden Speaker Builder: an interactive online tool for L2 learners to build pronunciation models,"
in Pronunciation in Second Language Learning and Teaching (PSLLT), 2017, pp. 25–26.
- IEEE Transactions on Image Processing
- Language Learning & Technology
- Computer Speech & Language
- Mathematical Biosciences and Engineering
- IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP): 2020, 2022, 2023
- International Conference on Advances in Signal, Image and Video Processing (SIGNAL): 2020, 2021, 2022, 2023
- Annual Conference of the International Speech Communication Association (Interspeech): 2019, 2021, 2022
- IEEE Automatic Speech Recognition and Understanding Workshop (ASRU): 2021
- Graduate Student Travel Award (for Interspeech'19), Department of Computer Science and Engineering, Texas A&M University, 2019
- Graduate Student Presentation Grant (for ICASSP'17), Office of Graduate and Professional Studies, Texas A&M University, 2017
- Outstanding Graduate Award, University of Science and Technology of China, 2015
- Outstanding Undergraduate Student Scholarship, University of Science and Technology of China, 2011–2014
- Second Prize @ Chinese Chemistry Olympiad (Provincial Level), Chinese Chemical Society, 2010
The L2-ARCTIC corpus is a multi-purpose non-native English speech dataset. I took a leading role in this project, designing the data collection
schemes and the annotation standards. I also spent a lot of time manually cleaning the raw speech recordings and performing quality control to ensure that the speech data and annotations were consistent and of high quality. The
recordings were collected at Iowa State University (ISU) in an effort led by Dr. John Levis and his students in the Department of English. Most of the annotations were
done by Dr. Alif Silpachai and Dr. Ivana Lučić Rehman.
The project spanned around two years. We released the first version at Interspeech 2018 and continued to add more data to the corpus afterward. Its most recent version is almost 2.4x the size of the initial release.
We initially designed the corpus for the accent conversion task, which is why we chose the CMU-ARCTIC prompts in the first place. Along the way, we were also working on projects related to mispronunciation
detection (MPD) and realized that open-source resources for MPD were limited. We noticed that many of the CMU-ARCTIC sentences were hard for the participants to speak, which, on the one hand, made the recording
sessions difficult but, on the other hand, elicited rich pronunciation errors in non-native speech productions. As a result, we decided to annotate part of the corpus for phonetic errors. All the sentences we annotated were carefully
selected by Dr. Levis to reflect the pronunciation issues likely to arise given the speakers' native languages.
I use this corpus in all my publications on accent conversion. I find it well suited for the task because it allows me to test the algorithms on speakers with different accents, fluency levels, ages, and genders. I also use
this corpus for MPD research. To the best of my knowledge, this is probably the largest open-source annotated MPD corpus. If you are interested in using the corpus for your projects, you can find access guidelines on
its official project site. I would be happy to see it used in more projects. If you have any questions about downloading or using the corpus, please
feel free to drop me an email.