Kai Zhen

Applied Scientist @ Amazon AGI
Ph.D. in Computer Science and Cognitive Science at Indiana University Bloomington
Curriculum vitae

News
Dec 14, 2023: Is there a better way than minimum word error rate (MWER) for training ASR Transducers? Our paper (accepted for publication in IEEE ICASSP'24) introduces the Max-Margin Transducer (MMT) loss, which does a better job of separating positive from negative samples among the n-best hypotheses (see the margin-loss sketch after this list).
May 17, 2023: Is self-attention really needed in Conformer? Our paper (accepted for publication in Interspeech'23) argues that "when utterances are short and causality is required (as in streaming applications), self-attention is not very effective".
Jan 9, 2023: Attending IEEE SLT'22 remotely to present our work on the low-bit-depth Conformer [demo]. Send me an email with any related questions.
Oct 3, 2022: Our paper on a 4-bit quantized causal Conformer was accepted for publication at IEEE SLT'22.
Jun 15, 2022: Check out our paper (accepted for publication in Interspeech'22) highlighting Alexa's recent efforts on Sub-8-Bit quantization for on-device ASR!
Apr 26, 2021: I received the Outstanding Research Award from the Cognitive Science program for my recent dissertation research.
Apr 19, 2021: Joining Amazon Alexa Speech as an applied scientist!
Apr 6, 2021: I successfully defended my Ph.D. dissertation, entitled “Neural Waveform Coding: Scalability, Efficiency and Psychoacoustic Calibration.”
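
For intuition on the MMT item above, here is a minimal sketch of a generic large-margin loss over n-best hypothesis scores, assuming simple hinge margins against the reference transcription. The function and its signature are illustrative, not the exact formulation in the ICASSP'24 paper.

```python
import torch

def max_margin_nbest_loss(ref_score: torch.Tensor,
                          nbest_scores: torch.Tensor,
                          margin: float = 1.0) -> torch.Tensor:
    """Hinge-style margin loss over n-best hypothesis scores.

    ref_score:    (batch,) model score of the reference transcription
    nbest_scores: (batch, n) model scores of competing n-best hypotheses
    The loss pushes the reference to outscore every competitor by `margin`.
    """
    gaps = margin - (ref_score.unsqueeze(1) - nbest_scores)  # (batch, n)
    return torch.clamp(gaps, min=0.0).mean()
```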

Research & Development
Since joining Amazon Alexa, I have driven several neural efficiency projects for core ASR components, covering both on-the-edge and over-the-air scenarios, to improve runtime efficiency and lower training cost while preserving predictive performance.

Our neural efficiency innovations have shipped in Amazon's voice-controlled devices, such as the Echo, Echo Dot, and Echo Show. Overall, our methods reduced memory footprint and user-perceived latency while improving recognition accuracy, and millions of customers use them.

Some of our innovations have been published in conference proceedings and patented as well.
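
As a flavor of what sub-8-bit quantization-aware training involves, below is a minimal sketch of fake quantization with a straight-through estimator. The symmetric uniform scheme, the fake_quantize name, and the 4-bit default are illustrative assumptions, not the production recipe behind the papers that follow.

```python
import torch

def fake_quantize(w: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    """Symmetric uniform fake quantization with a straight-through estimator.

    The forward pass rounds weights onto a (2**num_bits)-level grid; the
    backward pass lets gradients flow through unchanged, so quantization
    stays in the training loop without blocking backpropagation.
    """
    qmax = 2 ** (num_bits - 1) - 1                  # e.g. 7 for 4-bit signed
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_q = torch.round(w / scale).clamp(-qmax - 1, qmax) * scale
    return w + (w_q - w).detach()                   # forward: w_q; backward: identity
```

In quantization-aware training, a layer would pass its weights through fake_quantize on every forward pass, so the training loss reflects the quantized behavior the model will have at inference time.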

C-009 Rupak Vignesh Swaminathan, Grant Strimel, Ariya Rastrow, Harish Mallidi, Kai Zhen, Hieu Nguyen, Nathan Susanj, Athanasios Mouchtaris, "Max-Margin Transducer Loss: Improving Sequence-Discriminative Training Using a Large-Margin Learning Strategy," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Seoul, Korea, April 14-19, 2024.

C-008 Martin Radfar, Paulina Lyskawa, Brandon Trujillo, Yi Xie, Kai Zhen, Jahn Heymann, Denis Filimonov, Grant Strimel, Nathan Susanj, Athanasios Mouchtaris, "Conmer: Streaming Conformer with No Self-Attention for Interactive Voice Assistants," In Proc. Annual Conference of the International Speech Communication Association (Interspeech), Dublin, Ireland, August 21-24, 2023.

C-007 Kai Zhen, Martin Radfar, Hieu Nguyen, Grant Strimel, Nathan Susanj, Athanasios Mouchtaris, "Sub-8-Bit Quantization for On-Device Speech Recognition: A Regularization-Free Approach," in Proceedings of the 2022 IEEE Spoken Language Technology Workshop (SLT 2022), Doha, Qatar, January 9-12, 2023.
[pdf]

C-006 Kai Zhen, Hieu Duy Nguyen, Raviteja Chinta, Nathan Susanj, Athanasios Mouchtaris, Tariq Afzal, and Ariya Rastrow, "Sub-8-Bit Quantization Aware Training for 8-Bit Neural Network Accelerator with On-Device Speech Recognition," In Proc. Annual Conference of the International Speech Communication Association (Interspeech), Incheon, Korea, September 18-22, 2022.
[pdf]

C-005 Kai Zhen, Hieu Duy Nguyen, Feng-Ju (Claire) Chang, Athanasios Mouchtaris, and Ariya Rastrow, "Sparsification via Compressed Sensing for Automatic Speech Recognition," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Toronto, ON, Canada, June 6-12, 2021.
[pdf]

In the course of my Ph.D. studies, I conducted research on neural speech and audio waveform coding, supervised by Prof. Minje Kim. The problem is to compress an acoustic waveform into a very compact representation from which it can be reconstructed with little to no quality degradation. From the perspective of neural network quantization, this amounts to quantizing the activations of one specific (usually the bottleneck) layer.
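
To make that concrete, here is a minimal sketch of the idea: a toy 1-D convolutional autoencoder whose bottleneck activations are rounded onto a uniform grid, with a straight-through estimator so the rounding does not block gradients. The layer sizes and the 32-level quantizer are illustrative assumptions, not the architectures from the publications below.

```python
import torch
import torch.nn as nn

class TinyWaveformCodec(nn.Module):
    """Toy autoencoder: the transmitted 'code' is the quantized bottleneck."""

    def __init__(self, levels: int = 32):
        super().__init__()
        self.enc = nn.Conv1d(1, 16, kernel_size=9, stride=4, padding=4)
        self.bottleneck = nn.Conv1d(16, 4, kernel_size=9, padding=4)
        self.dec = nn.ConvTranspose1d(4, 1, kernel_size=8, stride=4, padding=2)
        self.levels = levels

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, 1, samples)
        h = torch.tanh(self.bottleneck(torch.relu(self.enc(x))))
        # Round the bottleneck to `levels` uniform steps in [-1, 1]; the
        # straight-through estimator keeps the codec trainable end to end.
        step = 2.0 / (self.levels - 1)
        h_q = h + (torch.round(h / step) * step - h).detach()
        return self.dec(h_q)  # reconstructed waveform
```

Running TinyWaveformCodec()(torch.randn(1, 1, 16000)) returns a same-length reconstruction; training would minimize a reconstruction loss between input and output, with the bitrate governed by the bottleneck width and the number of quantization levels.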

Of course, the data-driven paradigm has built a better ladder, but a better ladder does not always "get you to the moon". The stronger solution usually comes from marrying modern computational frameworks with conventional domain-specific knowledge. To that end, we proposed ways to incorporate residual coding, linear predictive coding, and psychoacoustics into an end-to-end neural waveform codec.
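
Of the three, residual coding is the easiest to sketch: a second module spends its bits only on what the first module failed to reconstruct, and the decoded signals are summed. The codec1 and codec2 callables below are hypothetical stand-ins for the cascaded modules in the papers that follow.

```python
import torch

def cascaded_residual_coding(x: torch.Tensor, codec1, codec2) -> torch.Tensor:
    """Two-stage residual coding: codec2 codes what codec1 missed.

    Each codec maps a waveform to its quantized reconstruction; the final
    output sums the decoded signals, so the second stage refines the first.
    """
    x1_hat = codec1(x)           # first-pass reconstruction
    residual = x - x1_hat        # the first stage's reconstruction error
    x2_hat = codec2(residual)    # second stage codes only the residual
    return x1_hat + x2_hat
```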

Some of the related publications are:

J-002 Kai Zhen, Jongmo Sung, Mi Suk Lee, Seungkwon Beack, Minje Kim, "Scalable and Efficient Neural Speech Coding: A Hybrid Design," IEEE/ACM Transactions on Audio, Speech, and Language Processing (IEEE/ACM TASLP), 30 (2021): 12-25.
[pdf]

J-001 Kai Zhen, Mi Suk Lee, Jongmo Sung, Seungkwon Beack, and Minje Kim, "Psychoacoustic Calibration of Loss Functions for Efficient End-to-End Neural Audio Coding," IEEE Signal Processing Letters (SPL), vol. 27, pp. 2159-2163, 2020, doi: 10.1109/LSP.2020.3039765. (Also presented at ICASSP 2022)
[demo] [pdf] [code]

C-004 Haici Yang, Kai Zhen, Seungkwon Beack, Minje Kim, "Source-Aware Neural Speech Coding for Noisy Speech Compression," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Toronto, ON, Canada, June 6-12, 2021.
[pdf]

C-003 Kai Zhen, Mi Suk Lee, Jongmo Sung, Seungkwon Beack, and Minje Kim, "Efficient and Scalable Neural Residual Waveform Coding with Collaborative Quantization," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Barcelona, Spain, May 4-8, 2020.
[demo] [pdf] [code]

C-001 Kai Zhen, Jongmo Sung, Mi Suk Lee, Seungkwon Beack, and Minje Kim, "Cascaded Cross-Module Residual Learning towards Lightweight End-to-End Speech Coding," In Proc. Annual Conference of the International Speech Communication Association (Interspeech), Graz, Austria, September 15-19, 2019.
[demo] [pdf]

P-005 Mi Suk Lee, Seung Kwon Beack, Jongmo Sung, Tae Jin Lee, Jin Soo Choi, Minje Kim, Kai Zhen, "Method and apparatus for processing audio signal," U.S. Patent Application US20210233547A1.

P-004 Minje Kim, Kai Zhen, Mi Suk Lee, Seung Kwon Beack, Jongmo Sung, Tae Jin Lee, Jin Soo Choi, "Residual Coding Method of Linear Prediction Coding Coefficient Based on Collaborative Quantization, and Computing Device for Performing the Method," U.S. Patent Application No. 17/098,090.

P-002 Minje Kim, Kai Zhen, Seungkwon Beack, et al., "Audio Signal Encoding Method and Audio Signal Decoding Method, and Encoder and Decoder Performing the Same," U.S. Patent Application US20200135220A1.

Find a complete list of my publications on my Google Scholar profile.

Professional Activities
Conference Reviewer  
  • IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) - 2019 to 2024
  • ISCA Interspeech - 2022 to 2024
  • EURASIP European Signal Processing Conference (EUSIPCO) - 2022, 2023
  • IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) - 2021, 2023
  • IEEE International Conference on Data Mining (ICDM) - 2020
  • Association for the Advancement of Artificial Intelligence (AAAI) - 2017, 2018

Journal Reviewer
  • European Association for Signal Processing (EURASIP) Journal on Audio, Speech, and Music Processing
  • Speech Communication
  • IEEE MultiMedia



