Quick Summary
The introduction of ViSQA, the first benchmark dataset for Vietnamese Spoken Question Answering (SQA), marks a significant advancement in the field. This dataset includes over 13,000 question-answer pairs and highlights the impact of ASR errors on model performance.
Key Details
- Dataset: Over 13,000 question-answer pairs aligned with spoken inputs
- Features used: Clean and noise-degraded audio variants (see the SNR-mixing sketch after this list)
- Technology: Transformer-based models, including ViT5
- Performance: ViT5 EM: 62.04% (clean) vs. 36.30% (noisy)
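The noise-degraded variants invite a quick illustration. The paper's exact noise sources and SNR levels are not given in this summary, so the helper below is only a minimal sketch: it mixes an arbitrary noise waveform into clean speech at a chosen signal-to-noise ratio.

```python
import numpy as np

def add_noise_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix noise into speech at a target SNR (hypothetical helper;
    ViSQA's actual degradation procedure may differ)."""
    # Tile or trim the noise so it covers the whole utterance.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    # Scale noise so that 10 * log10(P_speech / P_noise) equals snr_db.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # guard against silent noise
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Lower `snr_db` values produce harder transcription conditions, which is the knob driving the EM drop reported above.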
Key Takeaways
- ViSQA is the first benchmark for Vietnamese SQA, addressing a significant gap in the field.
- The dataset extends the UIT-ViQuAD corpus using a reproducible text-to-speech and ASR pipeline.
- ASR errors were found to substantially degrade model performance.
- Training on spoken transcriptions improved robustness, raising ViT5's EM from 36.30% to 50.70%.
- ViSQA enables systematic evaluation of SQA systems in Vietnamese, paving the way for future research.
- Five transformer-based models were tested, with performance varying widely between clean and noisy conditions.
- The performance drop caused by ASR errors underscores the need for improved transcription quality.

Background
Spoken Question Answering (SQA) is an emerging area that extends traditional machine reading comprehension to spoken content. While many high-resource languages have established benchmarks, languages like Vietnamese have been largely overlooked. The lack of standardized datasets has hindered the development of effective SQA systems, making the introduction of ViSQA a crucial step forward.
Study
The study, conducted by Minh LT and colleagues, aimed to create a comprehensive benchmark for Vietnamese SQA. By extending the UIT-ViQuAD corpus through a reproducible text-to-speech and ASR pipeline, the researchers generated a dataset that includes both clean and noise-degraded audio. This allows a thorough evaluation of how ASR errors affect downstream language understanding.
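To make that round trip concrete, here is a minimal sketch of such a pipeline. The `synthesize` and `transcribe` callables are placeholders for whatever TTS and ASR systems are plugged in; the paper's concrete choices are not named in this summary.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class SpokenExample:
    context: str     # original UIT-ViQuAD passage
    question: str
    answer: str
    transcript: str  # ASR output that stands in for the spoken context

def build_spoken_example(
    item: dict,
    synthesize: Callable,   # text -> waveform (placeholder TTS)
    transcribe: Callable,   # waveform -> text (placeholder ASR)
    degrade: Optional[Callable] = None,  # e.g. add_noise_at_snr(...)
) -> SpokenExample:
    """Run one QA pair through the TTS -> (optional noise) -> ASR loop."""
    audio = synthesize(item["context"])
    if degrade is not None:
        audio = degrade(audio)  # produce the noise-degraded variant
    transcript = transcribe(audio)
    return SpokenExample(item["context"], item["question"],
                         item["answer"], transcript)
```

Because the transcript replaces the clean passage, any ASR error propagates directly into the reading-comprehension step, which is exactly the effect the benchmark is designed to measure.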
Results
The experiments revealed that ASR errors significantly impacted model performance. For instance, the ViT5 model achieved an Exact Match (EM) score of 62.04% on clean audio, which dropped to 36.30% when noise was introduced. However, training on spoken transcriptions improved the model’s robustness, increasing the EM score to 50.70%.
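For reference, Exact Match scores like those above are conventionally computed by normalizing both strings and counting identical pairs. The sketch below assumes SQuAD-style normalization plus Unicode NFC composition for Vietnamese diacritics; ViSQA's exact rules may differ.

```python
import re
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace (SQuAD-style)."""
    text = unicodedata.normalize("NFC", text).lower()  # compose diacritics
    text = re.sub(r"[^\w\s]", " ", text)               # drop punctuation
    return " ".join(text.split())

def exact_match(predictions: list[str], references: list[str]) -> float:
    """Percentage of predictions identical to their reference answer."""
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)
```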
Impact and Implications
The ViSQA benchmark has the potential to transform the landscape of Vietnamese SQA research. By providing a rigorous framework for evaluation, it enables researchers to systematically analyze the effects of ASR errors on reasoning capabilities. This could lead to the development of more resilient SQA systems, ultimately enhancing the accessibility of information for Vietnamese speakers.
Conclusion
The introduction of ViSQA represents a significant milestone in the field of Vietnamese Spoken Question Answering. By addressing the challenges posed by ASR errors and providing a comprehensive dataset, this research opens new avenues for exploration and improvement in SQA systems. The future of Vietnamese language processing looks promising, and further research in this area is highly encouraged!
Your comments
What are your thoughts on the development of the ViSQA benchmark? How do you think it will influence future research in Vietnamese SQA? Let's discuss! Leave your thoughts in the comments below or connect with us on social media.
ViSQA: A benchmark dataset and baseline models for Vietnamese spoken question answering.
Abstract
Spoken Question Answering (SQA) extends machine reading comprehension to spoken content and requires models to handle both automatic speech recognition (ASR) errors and downstream language understanding. Although large-scale SQA benchmarks exist for high-resource languages, Vietnamese remains underexplored due to the lack of standardized datasets. This paper introduces ViSQA, the first benchmark for Vietnamese Spoken Question Answering. ViSQA extends the UIT-ViQuAD corpus using a reproducible text-to-speech and ASR pipeline, resulting in over 13,000 question-answer pairs aligned with spoken inputs. The dataset includes clean and noise-degraded audio variants to enable systematic evaluation under varying transcription quality. Experiments with five transformer-based models show that ASR errors substantially degrade performance (e.g., ViT5 EM: 62.04% → 36.30%), while training on spoken transcriptions improves robustness (ViT5 EM: 36.30% → 50.70%). ViSQA provides a rigorous benchmark for evaluating Vietnamese SQA systems and enables systematic analysis of the impact of ASR errors on downstream reasoning.
Authors: Minh LT, Thinh ND, Loc NKT, Quan LV, Tam ND, Son LH
Journal: PLoS One
Citation: Minh LT, et al. ViSQA: A benchmark dataset and baseline models for Vietnamese spoken question answering. PLoS One. 2026; 21:e0340771. doi: 10.1371/journal.pone.0340771