Guanlong Zhao, Ph.D. 赵冠龙

About

I am a speech researcher and engineer. My research interests center on the intersection of speech modification and speech perception. I am also interested in speech synthesis (e.g., TTS) and speech recognition (E2E models in particular). During my Ph.D. training, I spent five years working on accent and voice conversion.

If you would like to reach out to me, my email address is FirstNameLastName at PersonalEmailServiceByGoogle dot com. How do I pronounce my name? In Pinyin, it is written as Guàn-Lóng Zhào; the tones are fourth, second, and fourth. Mapped to American English phonemes, it roughly sounds like Guan-Loan Chao. 🌈 Cheers!

Education

Ph.D. in Computer Science, Texas A&M University, 2020

B.S. in Applied Physics (minor in Computer Science), University of Science and Technology of China, 2015

Work Experience

Senior Software Engineer @ Google (Speech), November 2023–Present

Software Engineer @ Google (Speech), August 2020–October 2023

Research and Teaching Assistant @ Texas A&M University (Department of Computer Science and Engineering), September 2015–May 2020

Software Engineering Intern @ Google (Geo Machine Perception), May–August 2019

Software Engineering Intern @ Google (Speech), June–August 2018

Publications

Journal Articles

  1. S. Ding, G. Zhao, and R. Gutierrez-Osuna, "Accentron: Foreign accent conversion to arbitrary non-native speakers using zero-shot learning," Computer Speech & Language, vol. 72, 2022. pdf demo
  2. G. Zhao, S. Ding, and R. Gutierrez-Osuna, "Converting foreign accent speech without a reference," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 2367–2381, 2021. pdf demo
  3. I. Lučić Rehman, A. Silpachai, J. Levis, G. Zhao, and R. Gutierrez-Osuna, "The English pronunciation of Arabic speakers: A data-driven approach to segmental error identification," Language Teaching Research, 2020. pdf summary
  4. G. Zhao and R. Gutierrez-Osuna, "Using phonetic posteriorgram based frame pairing for segmental accent conversion," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 10, pp. 1649–1660, 2019. pdf code demo
  5. S. Ding, G. Zhao, C. Liberatore, and R. Gutierrez-Osuna, "Learning structured sparse representations for voice conversion," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 343–354, 2020. pdf demo
  6. S. Ding, C. Liberatore, S. Sonsaat, I. Lučić Rehman, A. Silpachai, G. Zhao, E. Chukharev-Hudilainen, J. Levis, and R. Gutierrez-Osuna, "Golden speaker builder – An interactive tool for pronunciation training," Speech Communication, vol. 115, pp. 51–66, 2019. pdf code demo

Conference Proceedings

  1. Q. Wang*, Y. Huang*, G. Zhao*, E. Clark, W. Xia, and H. Liao, "DiarizationLM: Speaker diarization post-processing with large language models," in Interspeech, 2024, pp. 3754–3758. *Equal contribution. pdf code
  2. Y. Huang, W. Wang, G. Zhao, H. Liao, W. Xia, and Q. Wang, "On the success and limitations of auxiliary network based word-level end-to-end neural speaker diarization," in Interspeech, 2024, pp. 32–36. pdf
  3. G. Zhao, Y. Wang, J. Pelecanos, Y. Zhang, H. Liao, Y. Huang, H. Lu, and Q. Wang, "USM-SCD: Multilingual speaker change detection based on large pretrained foundation models," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024, pp. 11801–11805. pdf poster
  4. G. Zhao, Q. Wang, H. Lu, Y. Huang, and I. L. Moreno, "Augmenting transformer-transducer based speaker change detection with token-level training loss," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023. pdf poster resources
  5. B. Labrador*, G. Zhao*, I. L. Moreno*, A. S. Scarpati, L. Fowl, and Q. Wang, "Exploring sequence-to-sequence transformer-transducer models for keyword spotting," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023. *Equal contribution. pdf
  6. A. Hair, G. Zhao, B. Ahmed, K. Ballard, and R. Gutierrez-Osuna, "Assessing posterior-based mispronunciation detection on field-collected recordings from child speech therapy sessions," in Interspeech, 2021, pp. 2936–2940. pdf
  7. A. Silpachai, I. Lučić Rehman, T. A. Barriuso, J. Levis, E. Chukharev-Hudilainen, G. Zhao, and R. Gutierrez-Osuna, "Effects of voice type and task on L2 learners' awareness of pronunciation errors," in Interspeech, 2021, pp. 1952–1956. pdf
  8. S. Ding, G. Zhao, and R. Gutierrez-Osuna, "Improving the speaker identity of non-parallel many-to-many voice conversion with adversarial speaker recognition," in Interspeech, 2020, pp. 776–780. pdf code demo video
  9. A. Das, G. Zhao, J. Levis, E. Chukharev-Hudilainen, and R. Gutierrez-Osuna, "Understanding the effect of voice quality and accent on talker similarity," in Interspeech, 2020, pp. 1763–1767. pdf video
  10. G. Zhao, S. Ding, and R. Gutierrez-Osuna, "Foreign accent conversion by synthesizing speech from phonetic posteriorgrams," in Interspeech, 2019, pp. 2843–2847. pdf code demo slides
  11. G. Zhao, S. Sonsaat, A. Silpachai, I. Lučić Rehman, E. Chukharev-Hudilainen, J. Levis, and R. Gutierrez-Osuna, "L2-ARCTIC: A non-native English speech corpus," in Interspeech, 2018, pp. 2783–2787. pdf data code slides
  12. S. Ding, G. Zhao, C. Liberatore, and R. Gutierrez-Osuna, "Improving sparse representations in exemplar-based voice conversion with a phoneme-selective objective function," in Interspeech, 2018, pp. 476–480. pdf
  13. C. Liberatore, G. Zhao, and R. Gutierrez-Osuna, "Voice conversion through residual warping in a sparse, anchor-based representation of speech," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018, pp. 5284–5288. pdf poster
  14. G. Zhao, S. Sonsaat, J. Levis, E. Chukharev-Hudilainen, and R. Gutierrez-Osuna, "Accent conversion using phonetic posteriorgrams," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018, pp. 5314–5318. pdf code demo poster
  15. G. Angello, A. B. Manam, G. Zhao, and R. Gutierrez-Osuna, "Training behavior of successful tacton-phoneme learners," in IEEE Haptics Symposium (WIP), 2018. pdf
  16. G. Zhao and R. Gutierrez-Osuna, "Exemplar selection methods in voice conversion," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2017, pp. 5525–5529. pdf demo poster

Book Chapter

  1. Y. Liu, G. Zhao, B. Gong, Y. Li, R. Raj, N. Goel, S. Kesav, S. Gottimukkala, Z. Wang, W. Ren, and D. Tao, "Chapter 10 — Image dehazing: Improved techniques," in Deep Learning Through Sparse and Low-Rank Modeling, Elsevier, 2019, pp. 251–262. link code

Preprints

  1. B. Labrador, P. Zhu, G. Zhao, A. S. Scarpati, Q. Wang, A. Lozano-Diez, A. Park, and I. L. Moreno, "Personalizing keyword spotting with speaker information," arXiv preprint arXiv:2311.03419, 2023. pdf
  2. Y. Huang, W. Wang, G. Zhao, H. Liao, W. Xia, and Q. Wang, "Towards word-level end-to-end neural speaker diarization with auxiliary network," arXiv preprint arXiv:2309.08489, 2023. pdf
  3. Q. Wang, Y. Huang, H. Lu, G. Zhao, and I. L. Moreno, "Highly efficient real-time streaming and fully on-device speaker diarization with multi-stage clustering," arXiv preprint arXiv:2210.13690, 2022. pdf code
  4. Y. Liu, G. Zhao, B. Gong, Y. Li, R. Raj, N. Goel, S. Kesav, S. Gottimukkala, Z. Wang, W. Ren, and D. Tao, "Improved techniques for learning to dehaze and beyond: A collective study," arXiv preprint arXiv:1807.00202, 2018. pdf code
  5. Y. Liu and G. Zhao, "PAD-Net: A perception-aided single image dehazing network," arXiv preprint arXiv:1805.03146, 2018. pdf code
  6. A. Datta, G. Zhao, B. Ramabhadran, and E. Weinstein, "LSTM acoustic models learn to align and pronounce with graphemes," arXiv preprint arXiv:2008.06121, 2020. (Work done as an intern at Google NYC during summer 2018.) pdf

Abstracts

  1. I. Lučić Rehman, A. Silpachai, J. Levis, G. Zhao, and R. Gutierrez-Osuna, "Pronunciation errors — A systematic approach to diagnosis," in L2 Pronunciation Research Workshop: Bridging the Gap between Research and Practice, 2019, pp. 23–24. pdf
  2. S. Sonsaat, E. Chukharev-Hudilainen, I. Lučić Rehman, A. Silpachai, J. Levis, G. Zhao, S. Ding, C. Liberatore, and R. Gutierrez-Osuna, "Golden Speaker Builder, an interactive tool for pronunciation training: User studies," in 6th International Conference on English Pronunciation: Issues & Practices (EPIP6), 2019, p. 72. pdf
  3. S. Ding, C. Liberatore, G. Zhao, S. Sonsaat, E. Chukharev-Hudilainen, J. Levis, and R. Gutierrez-Osuna, "Golden Speaker Builder: an interactive online tool for L2 learners to build pronunciation models," in Pronunciation in Second Language Learning and Teaching (PSLLT), 2017, pp. 25–26. pdf

Professional Service

Reviewer for:

Students mentored:

Teaching

Teaching Assistant: CSCE 482: Senior Capstone Design (Spring 2016)

Honors

  • Outstanding Reviewer Recognition, Organizing Committee of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
  • Graduate Student Travel Award (for Interspeech'19), Department of Computer Science and Engineering, Texas A&M University, 2019
  • Graduate Student Presentation Grant (for ICASSP'17), Office of Graduate and Professional Studies, Texas A&M University, 2017
  • Outstanding Graduate Award, University of Science and Technology of China, 2015
  • Outstanding Undergraduate Student Scholarship, University of Science and Technology of China, 2011–2014
  • Second Prize @ Chinese Chemistry Olympiad (Provincial Level), Chinese Chemical Society, 2010

L2-ARCTIC Corpus

The L2-ARCTIC corpus is a multi-purpose non-native English speech dataset. I took a leading role in this project: I designed the data collection scheme and the annotation standards, and I spent a lot of time manually cleaning the raw speech recordings and performing quality control to ensure that the speech data and annotations were consistent and of high quality. The recordings were collected at Iowa State University (ISU) in an effort led by Dr. John Levis and his students in the Department of English. The annotations were mostly done by Dr. Alif Silpachai and Dr. Ivana Lučić Rehman. The project spanned around two years. We released the first version at Interspeech 2018 and continued to add more data to the corpus; the most recent version is almost 2.4x the size of the initial release.

We initially designed the corpus for the accent conversion task, which is why we chose the CMU-ARCTIC prompts in the first place. Along the way, we were also working on projects related to mispronunciation detection (MPD) and realized that open-source resources for MPD were limited. We noticed that many of the CMU-ARCTIC sentences were hard for the participants to speak, which, on the one hand, made the recording sessions difficult but, on the other hand, elicited rich pronunciation errors in the non-native speech productions. As a result, we decided to annotate part of the corpus for phonetic errors. All the annotated sentences were carefully selected by Dr. Levis to reflect the pronunciation issues that were likely to occur given the speakers' native languages.

I use this corpus in all my publications on accent conversion. I have found it well-suited for the task because it allows me to test algorithms on speakers with different accents, fluency levels, ages, and genders. I also use it for MPD research; to the best of my knowledge, it is the largest open-source annotated MPD corpus. If you are interested in using the corpus for your projects, you can find access guidelines on its official project site; a small sketch of how one might read the annotations follows below. I would be happy to see it used in more projects. If you have any questions about downloading or using the corpus, please feel free to drop me an email.
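For illustration, here is a minimal Python sketch of how one might walk the corpus and tally the annotated pronunciation errors. The directory layout (per-speaker folders with an "annotation" subfolder of Praat TextGrid files), the "phones" tier name, and the comma-separated error-tag convention are assumptions based on my recollection of the release, not an official API; please check the corpus documentation before relying on them. The sketch uses the third-party textgrid package (pip install textgrid).

    from pathlib import Path

    import textgrid  # https://github.com/kylebgorman/textgrid

    # Hypothetical local path to an extracted copy of the corpus.
    CORPUS_ROOT = Path("l2arctic_release")

    # Assumed layout: <speaker>/annotation/<utterance>.TextGrid holds the
    # manually verified phone-level annotations for that speaker.
    for tg_path in sorted(CORPUS_ROOT.glob("*/annotation/*.TextGrid")):
        tg = textgrid.TextGrid.fromFile(str(tg_path))
        for tier in tg:
            if tier.name != "phones":  # assumed tier name
                continue
            # Assumption: mispronounced phones carry comma-separated tags
            # (canonical phone, perceived phone, error type), while
            # correctly produced phones have a plain label.
            errors = [iv.mark for iv in tier if "," in iv.mark]
            print(f"{tg_path.parent.parent.name}/{tg_path.stem}: "
                  f"{len(errors)} annotated errors")

From a count like this, you can get a quick per-speaker picture of error density before committing to a full MPD experiment.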