Guanlong Zhao, Ph.D. 赵冠龙

About

As a speech researcher and engineer, I work on voice conversion, accent conversion, speaker change detection, speaker diarization, automatic speech recognition (ASR), keyword spotting, and multimodal large language models (LLMs).

For professional correspondence, you can reach me via email at FirstNameLastName at PersonalEmailServiceByGoogle dot com. How do I pronounce my name? In Pinyin, it is written Guàn-Lóng Zhào; the tones are fourth, second, and fourth. Mapped to American English phonemes, it roughly sounds like Guan-Loan Chao. 🌈 Cheers!

Education

Ph.D. in Computer Science, Texas A&M University

B.S. in Applied Physics (minor in Computer Science), University of Science and Technology of China

Publications

Journal Articles

  1. S. Ding, G. Zhao, and R. Gutierrez-Osuna, "Accentron: Foreign accent conversion to arbitrary non-native speakers using zero-shot learning," Computer Speech & Language, vol. 72, 2022. pdf demo
  2. G. Zhao, S. Ding, and R. Gutierrez-Osuna, "Converting foreign accent speech without a reference," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 2367–2381, 2021. pdf demo
  3. I. Lučić Rehman, A. Silpachai, J. Levis, G. Zhao, and R. Gutierrez-Osuna, "The English pronunciation of Arabic speakers: A data-driven approach to segmental error identification," Language Teaching Research, 2020. pdf summary
  4. G. Zhao and R. Gutierrez-Osuna, "Using phonetic posteriorgram based frame pairing for segmental accent conversion," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 10, pp. 1649–1660, 2019. pdf code demo
  5. S. Ding, G. Zhao, C. Liberatore, and R. Gutierrez-Osuna, "Learning structured sparse representations for voice conversion," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 343–354, 2019. pdf demo
  6. S. Ding, C. Liberatore, S. Sonsaat, I. Lučić Rehman, A. Silpachai, G. Zhao, E. Chukharev-Hudilainen, J. Levis, and R. Gutierrez-Osuna, "Golden speaker builder–An interactive tool for pronunciation training," Speech Communication, vol. 115, pp. 51–66, 2019. pdf code demo

Conference Proceedings

  1. B. Labrador, P. Zhu, G. Zhao, A. S. Scarpati, Q. Wang, A. Lozano-Diez, and I. L. Moreno, "Personalizing keyword spotting with speaker information," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025. pdf
  2. Q. Wang*, Y. Huang*, G. Zhao*, E. Clark, W. Xia, and H. Liao, "DiarizationLM: Speaker diarization post-processing with large language models," in Interspeech, 2024, pp. 3754–3758. *Equal contribution. pdf code
  3. Y. Huang, W. Wang, G. Zhao, H. Liao, W. Xia, and Q. Wang, "On the success and limitations of auxiliary network based word-level end-to-end neural speaker diarization," in Interspeech, 2024, pp. 32–36. pdf
  4. G. Zhao, Y. Wang, J. Pelecanos, Y. Zhang, H. Liao, Y. Huang, H. Lu, and Q. Wang, "USM-SCD: Multilingual speaker change detection based on large pretrained foundation models," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024, pp. 11801–11805. pdf poster
  5. G. Zhao, Q. Wang, H. Lu, Y. Huang, and I. L. Moreno, "Augmenting transformer-transducer based speaker change detection with token-level training loss," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023. pdf poster resources
  6. B. Labrador*, G. Zhao*, I. L. Moreno*, A. S. Scarpati, L. Fowl, and Q. Wang, "Exploring sequence-to-sequence transformer-transducer models for keyword spotting," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023. *Equal contribution. pdf
  7. A. Hair, G. Zhao, B. Ahmed, K. Ballard, and R. Gutierrez-Osuna, "Assessing posterior-based mispronunciation detection on field-collected recordings from child speech therapy sessions," in Interspeech, 2021, pp. 2936–2940. pdf
  8. A. Silpachai, I. Lučić Rehman, T. A. Barriuso, J. Levis, E. Chukharev-Hudilainen, G. Zhao, and R. Gutierrez-Osuna, "Effects of voice type and task on L2 learners' awareness of pronunciation errors," in Interspeech, 2021, pp. 1952–1956. pdf
  9. S. Ding, G. Zhao, and R. Gutierrez-Osuna, "Improving the speaker identity of non-parallel many-to-many voice conversion with adversarial speaker recognition," in Interspeech, 2020, pp. 776–780. pdf code demo video
  10. A. Das, G. Zhao, J. Levis, E. Chukharev-Hudilainen, and R. Gutierrez-Osuna, "Understanding the effect of voice quality and accent on talker similarity," in Interspeech, 2020, pp. 1763–1767. pdf video
  11. G. Zhao, S. Ding, and R. Gutierrez-Osuna, "Foreign accent conversion by synthesizing speech from phonetic posteriorgrams," in Interspeech, 2019, pp. 2843–2847. pdf code demo slides
  12. G. Zhao, S. Sonsaat, A. Silpachai, I. Lučić Rehman, E. Chukharev-Hudilainen, J. Levis, and R. Gutierrez-Osuna, "L2-ARCTIC: A non-native English speech corpus," in Interspeech, 2018, pp. 2783–2787. pdf data code slides
  13. S. Ding, G. Zhao, C. Liberatore, and R. Gutierrez-Osuna, "Improving sparse representations in exemplar-based voice conversion with a phoneme-selective objective function," in Interspeech, 2018, pp. 476–480. pdf
  14. C. Liberatore, G. Zhao, and R. Gutierrez-Osuna, "Voice conversion through residual warping in a sparse, anchor-based representation of speech," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018, pp. 5284–5288. pdf poster
  15. G. Zhao, S. Sonsaat, J. Levis, E. Chukharev-Hudilainen, and R. Gutierrez-Osuna, "Accent conversion using phonetic posteriorgrams," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018, pp. 5314–5318. pdf code demo poster
  16. G. Angello, A. B. Manam, G. Zhao, and R. Gutierrez-Osuna, "Training behavior of successful tacton-phoneme learners," in IEEE Haptics Symposium (WIP), 2018. pdf
  17. G. Zhao and R. Gutierrez-Osuna, "Exemplar selection methods in voice conversion," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2017, pp. 5525–5529. pdf demo poster

Book Chapter

  1. Y. Liu, G. Zhao, B. Gong, Y. Li, R. Raj, N. Goel, S. Kesav, S. Gottimukkala, Z. Wang, W. Ren, and D. Tao, "Chapter 10 — Image dehazing: Improved techniques," in Deep Learning through Sparse and Low-Rank Modeling, Elsevier, 2019, pp. 251–262. link code

Preprints

  1. Y. Huang, W. Wang, G. Zhao, H. Liao, W. Xia, and Q. Wang, "Towards word-level end-to-end neural speaker diarization with auxiliary network," arXiv preprint arXiv:2309.08489, 2023. pdf
  2. Q. Wang, Y. Huang, H. Lu, G. Zhao, and I. L. Moreno, "Highly efficient real-time streaming and fully on-device speaker diarization with multi-stage clustering," arXiv preprint arXiv:2210.13690, 2022. pdf code
  3. Y. Liu, G. Zhao, B. Gong, Y. Li, R. Raj, N. Goel, S. Kesav, S. Gottimukkala, Z. Wang, W. Ren, and D. Tao, "Improved techniques for learning to dehaze and beyond: A collective study," arXiv preprint arXiv:1807.00202, 2018. pdf code
  4. Y. Liu and G. Zhao, "PAD-Net: A perception-aided single image dehazing network," arXiv preprint arXiv:1805.03146, 2018. pdf code
  5. A. Datta, G. Zhao, B. Ramabhadran, and E. Weinstein, "LSTM acoustic models learn to align and pronounce with graphemes," arXiv preprint arXiv:2008.06121, 2020. (Work done as an intern at Google NYC during summer 2018.) pdf

Abstracts

  1. I. Lučić Rehman, A. Silpachai, J. Levis, G. Zhao, and R. Gutierrez-Osuna, "Pronunciation errors — A systematic approach to diagnosis," in L2 Pronunciation Research Workshop: Bridging the Gap between Research and Practice, 2019, pp. 23–24. pdf
  2. S. Sonsaat, E. Chukharev-Hudilainen, I. Lučić Rehman, A. Silpachai, J. Levis, G. Zhao, S. Ding, C. Liberatore, and R. Gutierrez-Osuna, "Golden Speaker Builder, an interactive tool for pronunciation training: User studies," in 6th International Conference on English Pronunciation: Issues & Practices (EPIP6), 2019, p. 72. pdf
  3. S. Ding, C. Liberatore, G. Zhao, S. Sonsaat, E. Chukharev-Hudilainen, J. Levis, and R. Gutierrez-Osuna, "Golden Speaker Builder: an interactive online tool for L2 learners to build pronunciation models," in Pronunciation in Second Language Learning and Teaching (PSLLT), 2017, pp. 25–26. pdf

Professional Service

Students mentored:

  • Yu-Neng Chuang, Research Intern @ Google DeepMind, 2025
  • Anastasia Kuznetsova, Research Intern @ Google, 2023. First Employment: Applied Research Scientist, Rev
  • Beltrán Labrador, Research Intern @ Google, 2022 & 2023. First Employment: Machine Learning Engineer, Google DeepMind

L2-ARCTIC Corpus

The L2-ARCTIC corpus is a comprehensive, multi-purpose dataset of non-native English speech. As a lead researcher on this two-year project, I designed the data collection protocols and annotation standards. I also oversaw extensive manual processing and rigorous quality control to ensure high-fidelity recordings and consistent annotations. Data collection was conducted at Iowa State University (ISU) under the direction of Dr. John Levis and his team in the Department of English, with primary annotations completed by Dr. Alif Silpachai and Dr. Ivana Lučić Rehman. Following our initial release at Interspeech 2018, we continuously expanded the dataset, growing the current version to nearly 2.4 times its original size.

Initially developed for accent conversion tasks—which motivated our use of the CMU-ARCTIC prompts—the project quickly evolved to address the scarcity of open-source resources for mispronunciation detection (MPD). We observed that the phonetic complexity of the CMU-ARCTIC sentences naturally elicited a rich variety of pronunciation errors from non-native speakers. Consequently, we expanded our scope to include phonetic error annotations. All annotated subsets were carefully curated by Dr. Levis to target anticipated pronunciation challenges based on the speakers' native languages.

This corpus serves as the foundational dataset for my publications on accent conversion, proving highly effective for evaluating algorithms across diverse accents, fluency levels, ages, and genders. It has also been instrumental in my MPD research; at the time of its 2018 release, it was among the largest open-source annotated MPD corpora available. Access guidelines for integrating the dataset into your own research can be found on the official project site. I strongly encourage its broader application within the research community and welcome any inquiries regarding dataset access or utilization via email.
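If you are integrating the corpus into your own pipeline, a common first step is building an index that pairs each recording with its transcript and annotation. The sketch below is illustrative only: the subdirectory names (wav/, transcript/, annotation/), the file extensions, and the speaker ID are assumptions for this example, so please consult the documentation that ships with the corpus for the actual layout.

```python
# A minimal, hypothetical sketch of indexing one speaker's files in an
# L2-ARCTIC-style layout. Directory names are assumptions, not a spec.
from pathlib import Path
import tempfile

def index_speaker(speaker_dir: Path) -> dict:
    """Map each utterance ID to its wav, transcript, and annotation paths."""
    utterances = {}
    for wav in sorted((speaker_dir / "wav").glob("*.wav")):
        utt_id = wav.stem  # e.g. "arctic_a0001"
        utterances[utt_id] = {
            "wav": wav,
            "transcript": speaker_dir / "transcript" / f"{utt_id}.txt",
            "annotation": speaker_dir / "annotation" / f"{utt_id}.TextGrid",
        }
    return utterances

# Demo on a throwaway directory mimicking one speaker's folder.
with tempfile.TemporaryDirectory() as tmp:
    spk = Path(tmp) / "SPEAKER01"
    for sub in ("wav", "transcript", "annotation"):
        (spk / sub).mkdir(parents=True)
    (spk / "wav" / "arctic_a0001.wav").touch()
    (spk / "transcript" / "arctic_a0001.txt").write_text("Author of the danger trail, Philip Steels, etc.")
    index = index_speaker(spk)
    print(sorted(index))  # ['arctic_a0001']
```

From such an index, you can iterate over utterances and load the TextGrid annotations with your tool of choice (e.g., Praat or a TextGrid parsing library).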