Guanlong Zhao, Ph.D. 赵冠龙

About

As a speech researcher and engineer, my work encompasses a diverse portfolio of research areas, including voice conversion, accent conversion, speaker change detection, speaker diarization, automatic speech recognition (ASR), keyword spotting, and multimodal large language models (LLMs).

For professional correspondence, you may reach me via email at FirstNameLastName at PersonalEmailServiceByGoogle dot com. How do I pronounce my name? In PinYin, it is written as Guàn-Lóng Zhào; the tones are fourth, second, and fourth. Mapping to American English phonemes, it roughly sounds like Guan-Loan Chao. 🌈 Cheers!

Education

Ph.D. in Computer Science, Texas A&M University

Dissertation: Foreign accent conversion with neural acoustic modeling
Advisor: Dr. Ricardo Gutierrez-Osuna

B.S. in Applied Physics (minor in Computer Science), University of Science and Technology of China

Publications

Journal Articles

S. Ding, G. Zhao, and R. Gutierrez-Osuna, "Accentron: Foreign accent conversion to arbitrary non-native speakers using zero-shot learning," Computer Speech & Language, vol. 72, 2022. pdf demo
G. Zhao, S. Ding, and R. Gutierrez-Osuna, "Converting foreign accent speech without a reference," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 2367–2381, 2021. pdf demo
I. Lučić Rehman, A. Silpachai, J. Levis, G. Zhao, and R. Gutierrez-Osuna, "The English pronunciation of Arabic speakers: A data-driven approach to segmental error identification," Language Teaching Research, 2020. pdf summary
G. Zhao and R. Gutierrez-Osuna, "Using phonetic posteriorgram based frame pairing for segmental accent conversion," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 10, pp. 1649–1660, 2019. pdf code demo
S. Ding, G. Zhao, C. Liberatore, and R. Gutierrez-Osuna, "Learning structured sparse representations for voice conversion," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 343–354, 2019. pdf demo
S. Ding, C. Liberatore, S. Sonsaat, I. Lučić Rehman, A. Silpachai, G. Zhao, E. Chukharev-Hudilainen, J. Levis, and R. Gutierrez-Osuna, "Golden speaker builder–An interactive tool for pronunciation training," Speech Communication, vol. 115, pp. 51–66, 2019. pdf code demo

Conference Proceedings

B. Labrador, P. Zhu, G. Zhao, A. S. Scarpati, Q. Wang, A. Lozano-Diez, and I. L. Moreno, "Personalizing keyword spotting with speaker information," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025. pdf
Q. Wang^*, Y. Huang^*, G. Zhao^*, E. Clark, W. Xia, and H. Liao, "DiarizationLM: Speaker diarization post-processing with large language models," in Interspeech, 2024. pp. 3754–3758. ^*Equal contribution. pdf code
Y. Huang, W. Wang, G. Zhao, H. Liao, W. Xia, and Q. Wang, "On the success and limitations of auxiliary network based word-level end-to-end neural speaker diarization," in Interspeech, 2024. pp. 32–36. pdf
G. Zhao, Y. Wang, J. Pelecanos, Y. Zhang, H. Liao, Y. Huang, H. Lu, and Q. Wang, "USM-SCD: Multilingual speaker change detection based on large pretrained foundation models," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024. pp. 11801–11805. pdf poster
G. Zhao, Q. Wang, H. Lu, Y. Huang, and I. L. Moreno, "Augmenting transformer-transducer based speaker change detection with token-level training loss," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023. pdf poster resources
B. Labrador^*, G. Zhao^*, I. L. Moreno^*, A. S. Scarpati, L. Fowl, and Quan Wang, "Exploring sequence-to-sequence transformer-transducer models for keyword spotting," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023. ^*Equal contribution. pdf
A. Hair, G. Zhao, B. Ahmed, K. Ballard, and R. Gutierrez-Osuna, "Assessing posterior-based mispronunciation detection on field-collected recordings from child speech therapy sessions," in Interspeech, 2021. pp. 2936–2940. pdf
A. Silpachai, I. Lučić Rehman, T. A. Barriuso, J. Levis, E. Chukharev-Khudilaynen, G. Zhao, and R. Gutierrez-Osuna, "Effects of voice type and task on L2 learners' awareness of pronunciation errors," in Interspeech, 2021. pp. 1952–1956. pdf
S. Ding, G. Zhao, and R. Gutierrez-Osuna, "Improving the speaker identity of non-parallel many-to-many voice conversion with adversarial speaker recognition," in Interspeech, 2020. pp. 776–780. pdf code demo video
A. Das, G. Zhao, J. Levis, E. Chukharev-Hudilainen, and R. Gutierrez-Osuna, "Understanding the effect of voice quality and accent on talker similarity," in Interspeech, 2020. pp. 1763–1767. pdf video
G. Zhao, S. Ding, and R. Gutierrez-Osuna, "Foreign accent conversion by synthesizing speech from phonetic posteriorgrams," in Interspeech, 2019, pp. 2843–2847. pdf code demo slides
G. Zhao, S. Sonsaat, A. Silpachai, I. Lučić Rehman, E. Chukharev-Hudilainen, J. Levis, and R. Gutierrez-Osuna, "L2-ARCTIC: A non-native English speech corpus," in Interspeech, 2018, pp. 2783–2787. pdf data code slides
S. Ding, G. Zhao, C. Liberatore, and R. Gutierrez-Osuna, "Improving sparse representations in exemplar-based voice conversion with a phoneme-selective objective function," in Interspeech, 2018, pp. 476–480. pdf
C. Liberatore, G. Zhao, and R. Gutierrez-Osuna, "Voice conversion through residual warping in a sparse, anchor-based representation of speech," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5284–5288. pdf poster
G. Zhao, S. Sonsaat, J. Levis, E. Chukharev-Hudilainen, and R. Gutierrez-Osuna, "Accent conversion using phonetic posteriorgrams," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018, pp. 5314–5318. pdf code demo poster
G. Angello, A. B. Manam, G. Zhao, and R. Gutierrez-Osuna, "Training behavior of successful tacton-phoneme learners," in IEEE Haptics Symposium (WIP), 2018. pdf
G. Zhao and R. Gutierrez-Osuna, "Exemplar selection methods in voice conversion," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2017, pp. 5525–5529. pdf demo poster

Book Chapter

Y. Liu, G. Zhao, B. Gong, Y. Li, R. Raj, N. Goel, S. Kesav, S. Gottimukkala, Z. Wang, W. Ren, and D. Tao, "Chapter 10 — Image dehazing: Improved techniques," in Deep Learning through Sparse and Low-Rank Modeling: Elsevier, 2019, pp. 251–262. link code

Preprints

Y. Huang, W. Wang, G. Zhao, H. Liao, W. Xia, and Q. Wang, "Towards word-level end-to-end neural speaker diarization with auxiliary network," arXiv preprint arXiv:2309.08489, 2023. pdf
Q. Wang, Y. Huang, H. Lu, G. Zhao, and I. L. Moreno, "Highly efficient real-time streaming and fully on-device speaker diarization with multi-stage clustering," arXiv preprint arXiv:2210.13690, 2022. pdf code
Y. Liu, G. Zhao, B. Gong, Y. Li, R. Raj, N. Goel, S. Kesav, S. Gottimukkala, Z. Wang, W. Ren, and D. Tao, "Improved techniques for learning to dehaze and beyond: A collective study," arXiv preprint arXiv:1807.00202, 2018. pdf code
Y. Liu and G. Zhao, "PAD-Net: A perception-aided single image dehazing network," arXiv preprint arXiv:1805.03146, 2018. pdf code
A. Datta, G. Zhao, B. Ramabhadran, E. Weinstein, "LSTM acoustic models learn to align and pronounce with graphemes," arXiv preprint arXiv:2008.06121, 2020. (Work done as an intern at Google NYC during summer 2018.) pdf

Abstracts

I. Lučić Rehman, A. Silpachai, J. Levis, G. Zhao, and R. Gutierrez-Osuna, "Pronunciation errors — A systematic approach to diagnosis," in L2 Pronunciation Research Workshop: Bridging the Gap between Research and Practice, 2019, pp. 23–24. pdf
S. Sonsaat, E. Chukharev-Hudilainen, I. Lučić Rehman, A. Silpachai, J. Levis, G. Zhao, S. Ding, C. Liberatore, and R. Gutierrez-Osuna, "Golden Speaker Builder, an interactive tool for pronunciation training: User studies," in 6th International Conference on English Pronunciation: Issues & Practices (EPIP6), 2019, p. 72. pdf
S. Ding, C. Liberatore, G. Zhao, S. Sonsaat, E. Chukharev-Hudilainen, J. Levis, and R. Gutierrez-Osuna, "Golden Speaker Builder: an interactive online tool for L2 learners to build pronunciation models," in Pronunciation in Second Language Learning and Teaching (PSLLT), 2017, pp. 25–26. pdf

Professional Service

Reviewer for:

IEEE/ACM Transactions on Audio, Speech, and Language Processing
Computer Speech & Language
Speech Communication
IEEE Transactions on Image Processing
IEEE Transactions on Information Forensics and Security
IEEE Transactions on Computational Social Systems
Language Learning & Technology
Language Resources and Evaluation
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP): 2020, 2022, 2023, 2024, 2025
Annual Conference of the International Speech Communication Association (Interspeech): 2019, 2021, 2022, 2023, 2024, 2025
IEEE Spoken Language Technology Workshop (SLT): 2024
IEEE Automatic Speech Recognition and Understanding Workshop (ASRU): 2021, 2023
IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA): 2023
International Joint Conference on Neural Networks (IJCNN): 2025

Students mentored:

Yu-Neng Chuang, Research Intern @ Google DeepMind, 2025
Anastasia Kuznetsova, Research Intern @ Google, 2023. First Employment: Applied Research Scientist, Rev
Beltrán Labrador, Research Intern @ Google, 2022 & 2023. First Employment: Machine Learning Engineer, Google DeepMind

L2-ARCTIC Corpus

The L2-ARCTIC corpus is a multi-purpose non-native English speech dataset. I took a leading role in this project, where I designed the data collection schemes and the annotation standards. I also spent a lot of time manually cleaning the raw speech recordings and performing quality control to ensure that the speech data and annotations were consistent and high-quality. The recordings were collected at Iowa State University (ISU), led by Dr. John Levis and his students in the Department of English. The annotations were mostly done by Dr. Alif Silpachai and Dr. Ivana Lučić Rehman. The project spanned around two years. We released the first version at Interspeech 2018, and we continued to add more data to the corpus. Its most recent version is almost 2.4x the size of the initial release.

We initially designed the corpus for the accent conversion task, and that was why we chose to use the CMU-ARCTIC prompts in the first place. Along the way, we were also working on some projects related to mispronunciation detection (MPD) and realized that there were limited open-source resources for MPD. We noticed that many of the CMU-ARCTIC sentences were hard for the participants to speak, which, on the one hand, made the recording sessions difficult, on the other hand, elicited rich pronunciation errors in non-native speech productions. As a result, we decided to annotate part of the corpus for phonetic errors. All the sentences we annotated were carefully selected by Dr. Levis to reflect the pronunciation issues that might happen given the speakers' native languages.

I use this corpus in all my publications on accent conversion. I found it is well-suited for the task because it allows me to test the algorithms on speakers with different accents, fluency level, age, and gender. I also use this corpus for MPD research. To the best of my knowledge, this was probably the largest open-source annotated MPD corpus at the time (around 2018). If you are interested in using the corpus for your projects, you can find access guidelines on its official project site. I would be happy to see it being used in more projects. If you have any questions regarding downloading/using the corpus, please feel free to drop me an email.