About
I am a speech researcher and engineer. My research interests lie broadly at the intersection of speech modification and speech perception.
I am also interested in speech synthesis (e.g., TTS) and speech recognition (end-to-end models in particular).
Specifically, I spent five years working on topics in accent and voice conversion during my Ph.D. training.
If you would like to reach out, my email address is FirstNameLastName at PersonalEmailServiceByGoogle dot com. How do I pronounce my name? In Pinyin, it is written as Guàn-Lóng Zhào; the tones
are fourth, second, and fourth. Mapped to American English phonemes, it roughly sounds like Guan-Loan Chao. 🌈 Cheers!
Work Experience
Senior Software Engineer @ Google (Speech), November 2023–Present
Software Engineer @ Google (Speech), August 2020–October 2023
Research and Teaching Assistant @ Texas A&M University (Department of Computer Science and Engineering), September 2015–May 2020
Software Engineering Intern @ Google (Geo Machine Perception), May–August 2019
Software Engineering Intern @ Google (Speech), June–August 2018
Publications
Journal Articles
- S. Ding, G. Zhao, and R. Gutierrez-Osuna, "Accentron: Foreign accent conversion to arbitrary non-native speakers using zero-shot learning,"
Computer Speech & Language, vol. 72, 2022.
pdf
demo
- G. Zhao, S. Ding, and R. Gutierrez-Osuna, "Converting foreign accent speech without a reference," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 2367–2381, 2021. pdf demo
- I. Lučić Rehman, A. Silpachai, J. Levis, G. Zhao, and R. Gutierrez-Osuna, "The English pronunciation of Arabic speakers: A data-driven approach to segmental error identification," Language Teaching Research,
2020. pdf summary
- G. Zhao and R. Gutierrez-Osuna, "Using phonetic posteriorgram based frame pairing for segmental accent conversion," IEEE/ACM Transactions on Audio,
Speech, and Language Processing, vol. 27, no. 10, pp. 1649–1660, 2019. pdf code demo
- S. Ding, G. Zhao, C. Liberatore, and R. Gutierrez-Osuna, "Learning structured sparse representations for voice conversion," IEEE/ACM Transactions
on Audio, Speech, and Language Processing, vol. 28, pp. 343–354, 2019. pdf demo
- S. Ding, C. Liberatore, S. Sonsaat, I. Lučić Rehman, A. Silpachai, G. Zhao, E. Chukharev-Hudilainen, J. Levis, and R. Gutierrez-Osuna, "Golden speaker builder–An interactive tool for pronunciation training," Speech Communication, vol. 115, pp. 51–66, 2019. pdf code demo
Conference Proceedings
- Q. Wang*, Y. Huang*, G. Zhao*, E. Clark, W. Xia, and H. Liao, "DiarizationLM: Speaker diarization post-processing with large language models," in Interspeech, 2024, pp. 3754–3758. *Equal contribution. pdf code
- Y. Huang, W. Wang, G. Zhao, H. Liao, W. Xia, and Q. Wang, "On the success and limitations of auxiliary network based word-level end-to-end neural speaker diarization," in Interspeech, 2024, pp. 32–36. pdf
- G. Zhao, Y. Wang, J. Pelecanos, Y. Zhang, H. Liao, Y. Huang, H. Lu, and Q. Wang, "USM-SCD: Multilingual speaker change detection based on large pretrained foundation models," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024, pp. 11801–11805. pdf poster
- G. Zhao, Q. Wang, H. Lu, Y. Huang, and I. L. Moreno, "Augmenting transformer-transducer based speaker change detection with token-level training loss," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023. pdf poster resources
- B. Labrador*, G. Zhao*, I. L. Moreno*, A. S. Scarpati, L. Fowl, and Q. Wang, "Exploring sequence-to-sequence transformer-transducer models for keyword spotting," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023. *Equal contribution. pdf
- A. Hair, G. Zhao, B. Ahmed, K. Ballard, and R. Gutierrez-Osuna, "Assessing posterior-based mispronunciation detection on field-collected recordings from child speech therapy sessions," in Interspeech, 2021, pp. 2936–2940. pdf
- A. Silpachai, I. Lučić Rehman, T. A. Barriuso, J. Levis, E. Chukharev-Hudilainen, G. Zhao, and R. Gutierrez-Osuna, "Effects of voice type and task on L2 learners' awareness of pronunciation errors," in Interspeech, 2021, pp. 1952–1956. pdf
- S. Ding, G. Zhao, and R. Gutierrez-Osuna, "Improving the speaker identity of non-parallel many-to-many voice conversion with adversarial speaker recognition," in Interspeech, 2020. pp. 776–780.
pdf code
demo
video
- A. Das, G. Zhao, J. Levis, E. Chukharev-Hudilainen, and R. Gutierrez-Osuna, "Understanding the effect of voice quality and accent on talker similarity," in Interspeech, 2020, pp. 1763–1767. pdf video
- G. Zhao, S. Ding, and R. Gutierrez-Osuna, "Foreign accent conversion by synthesizing speech from phonetic posteriorgrams,"
in Interspeech, 2019, pp. 2843–2847. pdf
code
demo slides
- G. Zhao, S. Sonsaat, A. Silpachai, I. Lučić Rehman, E. Chukharev-Hudilainen, J. Levis, and R. Gutierrez-Osuna, "L2-ARCTIC: A non-native English speech corpus," in Interspeech, 2018, pp. 2783–2787. pdf data code slides
- S. Ding, G. Zhao, C. Liberatore, and R. Gutierrez-Osuna, "Improving sparse representations in exemplar-based voice conversion with a
phoneme-selective objective function," in Interspeech, 2018, pp. 476–480. pdf
- C. Liberatore, G. Zhao, and R. Gutierrez-Osuna, "Voice conversion through residual warping in a sparse, anchor-based representation of speech," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018, pp. 5284–5288. pdf poster
- G. Zhao, S. Sonsaat, J. Levis, E. Chukharev-Hudilainen, and R. Gutierrez-Osuna, "Accent conversion using phonetic posteriorgrams," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018, pp. 5314–5318. pdf code demo poster
- G. Angello, A. B. Manam, G. Zhao, and R. Gutierrez-Osuna, "Training behavior of successful tacton-phoneme learners,"
in IEEE Haptics Symposium (WIP), 2018. pdf
- G. Zhao and R. Gutierrez-Osuna, "Exemplar selection methods in voice conversion," in IEEE International Conference on Acoustics, Speech, and
Signal Processing (ICASSP), 2017, pp. 5525–5529. pdf
demo
poster
Book Chapter
- Y. Liu, G. Zhao, B. Gong, Y. Li, R. Raj, N. Goel, S. Kesav, S. Gottimukkala, Z. Wang, W. Ren, and D. Tao, "Chapter 10 — Image dehazing: Improved techniques," in Deep Learning through Sparse and Low-Rank Modeling, Elsevier, 2019, pp. 251–262. link code
Preprints
- B. Labrador, P. Zhu, G. Zhao, A. S. Scarpati, Q. Wang, A. Lozano-Diez, A. Park, and I. L. Moreno, "Personalizing keyword spotting with speaker information,"
arXiv preprint arXiv:2311.03419, 2023. pdf
- Y. Huang, W. Wang, G. Zhao, H. Liao, W. Xia, and Q. Wang, "Towards word-level end-to-end neural speaker diarization with auxiliary network,"
arXiv preprint arXiv:2309.08489, 2023. pdf
- Q. Wang, Y. Huang, H. Lu, G. Zhao, and I. L. Moreno, "Highly efficient real-time streaming and fully on-device speaker diarization with multi-stage clustering," arXiv preprint arXiv:2210.13690, 2022. pdf code
- A. Datta, G. Zhao, B. Ramabhadran, and E. Weinstein, "LSTM acoustic models learn to align and pronounce with graphemes," arXiv preprint arXiv:2008.06121, 2020. (Work done as an intern at Google NYC during summer 2018.) pdf
- Y. Liu, G. Zhao, B. Gong, Y. Li, R. Raj, N. Goel, S. Kesav, S. Gottimukkala, Z. Wang, W. Ren, and D. Tao, "Improved techniques for learning to dehaze and beyond: A collective study," arXiv preprint arXiv:1807.00202, 2018. pdf code
- Y. Liu and G. Zhao, "PAD-Net: A perception-aided single image dehazing network," arXiv preprint arXiv:1805.03146, 2018. pdf code
Abstracts
- I. Lučić Rehman, A. Silpachai, J. Levis, G. Zhao, and R. Gutierrez-Osuna, "Pronunciation errors — A systematic approach to diagnosis,"
in L2 Pronunciation Research Workshop: Bridging the Gap between Research and Practice, 2019, pp. 23–24. pdf
- S. Sonsaat, E. Chukharev-Hudilainen, I. Lučić Rehman, A. Silpachai, J. Levis, G. Zhao, S. Ding, C. Liberatore, and R. Gutierrez-Osuna, "Golden Speaker Builder, an interactive tool for pronunciation training: User
studies," in 6th International Conference on English Pronunciation: Issues & Practices (EPIP6), 2019, p. 72. pdf
- S. Ding, C. Liberatore, G. Zhao, S. Sonsaat, E. Chukharev-Hudilainen, J. Levis, and R. Gutierrez-Osuna, "Golden Speaker Builder: an interactive online tool for L2 learners to build pronunciation models,"
in Pronunciation in Second Language Learning and Teaching (PSLLT), 2017, pp. 25–26. pdf
Professional Service
Reviewer for:
- IEEE/ACM Transactions on Audio, Speech, and Language Processing
- Computer Speech & Language
- Speech Communication
- IEEE Transactions on Image Processing
- IEEE Transactions on Information Forensics and Security
- IEEE Transactions on Computational Social Systems
- Heliyon
- Language Learning & Technology
- IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP): 2020, 2022, 2023, 2024
- Annual Conference of the International Speech Communication Association (Interspeech): 2019, 2021, 2022, 2023, 2024
- IEEE Spoken Language Technology Workshop (SLT): 2024
- IEEE Automatic Speech Recognition and Understanding Workshop (ASRU): 2021, 2023
- IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA): 2023
Honors
- Outstanding Reviewer Recognition, Organizing Committee of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
- Graduate Student Travel Award (for Interspeech'19), Department of Computer Science and Engineering, Texas A&M University, 2019
- Graduate Student Presentation Grant (for ICASSP'17), Office of Graduate and Professional Studies, Texas A&M University, 2017
- Outstanding Graduate Award, University of Science and Technology of China, 2015
- Outstanding Undergraduate Student Scholarship, University of Science and Technology of China, 2011–2014
- Second Prize @ Chinese Chemistry Olympiad (Provincial Level), Chinese Chemical Society, 2010
L2-ARCTIC Corpus
The L2-ARCTIC corpus is a multi-purpose non-native English speech dataset. I took a leading role in this project: I designed the data collection schemes and the annotation standards, and I spent a lot of time manually cleaning the raw speech recordings and performing quality control to keep the speech data and annotations consistent and of high quality. The recordings were collected at Iowa State University (ISU) by a team in the Department of English led by Dr. John Levis. The annotations were mostly done by Dr. Alif Silpachai and Dr. Ivana Lučić Rehman.
The project spanned around two years. We released the first version at Interspeech 2018 and continued to add data afterward; the most recent version is almost 2.4x the size of the initial release.
We initially designed the corpus for the accent conversion task, which is why we chose the CMU-ARCTIC prompts in the first place. Along the way, we were also working on projects related to mispronunciation detection (MPD) and realized that open-source resources for MPD were limited. We noticed that many of the CMU-ARCTIC sentences were hard for the participants to speak, which, on the one hand, made the recording sessions difficult and, on the other, elicited rich pronunciation errors in non-native speech. As a result, we decided to annotate part of the corpus for phonetic errors. All the annotated sentences were carefully selected by Dr. Levis to reflect the pronunciation issues likely to arise given the speakers' native languages.
I use this corpus in all my publications on accent conversion. I found it well-suited for the task because it allows me to test algorithms on speakers with different accents, fluency levels, ages, and genders. I also use this corpus for MPD research; to the best of my knowledge, it is the largest open-source annotated MPD corpus. If you are interested in using the corpus in your projects, you can find access guidelines on its official project site. I would be happy to see it used more widely. If you have any questions about downloading or using the corpus, please feel free to drop me an email.
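To give a flavor of what working with the corpus looks like, here is a minimal Python sketch for inspecting a single annotation file. It is only a sketch under a few assumptions: that your copy of the corpus stores annotations as Praat TextGrid files with a phone-level tier named "phones", and that the file name below is a placeholder you replace with a path from your local download. It uses the third-party textgrid package (pip install textgrid).

import textgrid

# Load one annotation file; the path below is a placeholder.
tg = textgrid.TextGrid.fromFile("arctic_a0001.TextGrid")

# List the annotation tiers present in this file.
print([tier.name for tier in tg])

# Print every labeled interval on the (assumed) phone tier,
# skipping empty intervals, which typically mark silence.
for interval in tg.getFirst("phones"):
    if interval.mark:
        print(f"{interval.minTime:.2f}-{interval.maxTime:.2f}s  {interval.mark}")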