A Hybrid RWHISYMP Speech-to-Text Noise Suppression Model: Integration of the Whisper Base Model, RNNoise, and SympSpell Algorithms

Mariann F. Bragas; Laurence D. Ganda; Leonila R. Juanatas, MIT; Charisse S. Ronquillo, MIT

doi:10.54536/ajsts.v5i1.7326

Authors

Mariann F. Bragas, College of Engineering and Technology Education, Holy Trinity College of General Santos City, General Santos City, Philippines
Laurence D. Ganda, College of Engineering and Technology Education, Holy Trinity College of General Santos City, General Santos City, Philippines
Leonila R. Juanatas, MIT, College of Engineering and Technology Education, Holy Trinity College of General Santos City, General Santos City, Philippines
Charisse S. Ronquillo, MIT, College of Engineering and Technology Education, Holy Trinity College of General Santos City, General Santos City, Philippines

DOI:

https://doi.org/10.54536/ajsts.v5i1.7326

Keywords:

RNNoise, RWhiSymp, Speech-to-Text, SymSpell, Whisper Base Model

Abstract

Deaf and Hard-of-Hearing (DHH) individuals face difficulty accessing spoken information without an interpreter, and alternatives such as lip reading and writing are inadequate. Automatic Speech Recognition (ASR) technologies offer real-time transcription, but noise interference is a prevalent issue that can lead to transcription inaccuracies. This study introduces RWhiSymp, a hybrid speech-to-text noise suppression model that integrates three components: RNNoise for noise suppression, the Whisper Base Model for ASR, and SymSpell for spelling correction. The integrated system is designed to minimize the Word Error Rate (WER) by suppressing background noise and correcting misspelled words, leading to improved accuracy. The evaluation results show that RWhiSymp reduced WER by 2.66% in high-noise conditions (60-80 dB) and by 2.17% in low-noise conditions (10-30 dB). A spectrogram of the audio processed with RNNoise shows its effectiveness in reducing noise while preserving speech clarity, and misspelled words are corrected using SymSpell. A user evaluation conducted with DHH participants reported high satisfaction across effectiveness, productivity, and accessibility, with overall ratings interpreted as Very Satisfied. The findings indicate that RWhiSymp offers a practical, real-time, and accessible solution that empowers DHH individuals by enhancing their ability to engage in spoken communication. This research highlights the value of hybrid ASR pipelines in assistive technologies and provides a foundation for future work in speech recognition, noise suppression, and natural language processing for accessibility.
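
The abstract reports its results as Word Error Rate. As an illustration only, the sketch below assumes the conventional definition of WER: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the ASR output, divided by the number of reference words. The function name `word_error_rate` and the lowercase, whitespace-based tokenization are assumptions made for this example; this is not the authors' evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: text is lowercased and split on whitespace; a real
    evaluation may normalize punctuation and numerals differently.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"),
    # measured against a six-word reference.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Under this definition, a pipeline stage that removes noise or repairs a misspelled word lowers WER only when it changes a hypothesis word into an exact match with the reference, which is why spelling correction can affect the metric at all.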
