Ringg Parrot STT V1.0 🦜
🎤 Record & Transcribe (WebSocket)
Click the microphone to start recording, then click stop when finished to get the transcription.
The entire recording is transcribed via the WebSocket `on_final` endpoint, with TensorRT acceleration.
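For concreteness, here is a minimal Python client sketch; the endpoint URL, chunk size, and `on_final` message shape are assumptions for illustration, not the documented API.

```python
# Hypothetical sketch: stream a recorded WAV file over WebSocket and
# print the final transcription. URL and message schema are placeholders.
import asyncio
import json

import websockets  # pip install websockets

WS_URL = "wss://example.com/stt/v1"  # placeholder endpoint

async def transcribe(path: str) -> None:
    async with websockets.connect(WS_URL) as ws:
        # Send the recording as fixed-size binary chunks.
        with open(path, "rb") as f:
            while chunk := f.read(32_000):
                await ws.send(chunk)
        await ws.send(json.dumps({"event": "stop"}))  # signal end of audio

        # Wait for the server's final transcription message.
        async for raw in ws:
            msg = json.loads(raw)
            if msg.get("event") == "on_final":
                print(msg.get("text", ""))
                break

asyncio.run(transcribe("recording.wav"))
```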
📁 Upload an audio file for transcription
Supports WAV, MP3, FLAC, M4A, and more.
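If an input needs converting to a common format first, a small helper along these lines can normalize it; the 16 kHz mono target is an assumption about the model's expected input, and `pydub` requires ffmpeg on the PATH.

```python
# Hypothetical helper: convert any supported input (MP3, FLAC, M4A, ...)
# to 16 kHz mono WAV before upload. The target rate and channel count
# are assumptions, not documented requirements.
from pydub import AudioSegment  # pip install pydub

def to_wav(src: str, dst: str = "upload.wav") -> str:
    audio = AudioSegment.from_file(src)  # format inferred by ffmpeg
    audio = audio.set_frame_rate(16_000).set_channels(1)
    audio.export(dst, format="wav")
    return dst

to_wav("meeting.m4a")
```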
🎯 Performance Benchmarks
Ringg Parrot STT V1 ranks 1st among top models.

| Model | WER | WER |
|-------|-----|-----|
| Parrot STT (Ringg AI) | 15.00% | 15.92% |
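For reference, WER figures like the ones above are word-level edit distance (substitutions + insertions + deletions) divided by the reference word count; below is a minimal check with the open-source `jiwer` library, using made-up sentences rather than benchmark data.

```python
# Illustrative WER computation with jiwer; sentences are examples only.
from jiwer import wer  # pip install jiwer

reference = "मेरा नाम रवि है"   # ground-truth transcript (4 words)
hypothesis = "मेरा नाम रवी है"  # model output with one substitution
print(wer(reference, hypothesis))  # 1 error / 4 words = 0.25
```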
📊 Benchmarking Strategy
Our model was trained on approximately 3.5k hours of diverse, multi-domain Hindi speech data to ensure robust performance across various acoustic conditions and use cases.
Data Preprocessing & Sanity
Prior to training and evaluation, all transcript text was processed through AI4Bharat's Cadence punctuation restoration model. Cadence is a state-of-the-art multilingual punctuation model based on Gemma-3-1B that supports English and 22 Indic languages (a usage sketch follows the list below). This preprocessing step ensured:
- Consistent punctuation across training data
- Normalized text formatting for better model convergence
- Data sanity and quality assurance
- Improved downstream ASR performance
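A hedged sketch of this step is below; the model ID, pipeline task, and prompt format are assumptions, so consult AI4Bharat's Cadence release for the actual usage.

```python
# Sketch: restore punctuation in raw transcripts before training/eval.
# "ai4bharat/Cadence" is a placeholder ID, and the text-generation
# interface is an assumption about this Gemma-3-1B-based model.
from transformers import pipeline

punct = pipeline("text-generation", model="ai4bharat/Cadence")

def restore(texts: list[str]) -> list[str]:
    cleaned = []
    for t in texts:
        out = punct(t, max_new_tokens=128, return_full_text=False)
        cleaned.append(out[0]["generated_text"].strip())
    return cleaned

print(restore(["mera naam ravi hai aap kaise hain"]))
```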
Training Data Composition (a weighted-sampling sketch follows the list):
- 40% Telephony Proprietary Data - Call center conversations, customer support calls, and telephonic interactions
- 30% Graamvani Data - Rural and grassroots community voice recordings
- 30% Other Sources - Including:
  - HuggingFace Shrutilipi dataset
  - Additional Graamvani samples
  - Internet-sourced narration, conversations, and text readings
  - TTS (Text-to-Speech) model-generated data
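As a rough illustration of how such a mix can be enforced at training time, the sketch below samples sources at the stated 40/30/30 ratio; the source keys and sampling mechanism are illustrative assumptions, not Ringg's actual pipeline.

```python
# Illustrative only: draw training examples according to the 40/30/30
# source mix described above.
import random
from collections import Counter

SOURCE_WEIGHTS = {
    "telephony_proprietary": 0.40,
    "graamvani": 0.30,
    "other": 0.30,  # Shrutilipi, internet narration, TTS-generated, ...
}

def pick_source() -> str:
    sources, weights = zip(*SOURCE_WEIGHTS.items())
    return random.choices(sources, weights=weights, k=1)[0]

# Empirical check: counts approach the 40/30/30 target.
print(Counter(pick_source() for _ in range(10_000)))
```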
Why This Strategy?
ASR models are particularly susceptible to two critical phenomena:
- Accent Deafening - Performance degradation when encountering accents not represented in training data
- Frequency Deafening - Reduced accuracy on audio with sampling rates or frequency characteristics different from training conditions
By combining multiple data sources spanning diverse domains, accents, recording conditions, and sampling rates, we build domain-invariant models that generalize better to real-world variability. As demonstrated in Narayanan et al. (2018), training on large-scale multi-domain data enables models to achieve robustness comparable to domain-specific models while maintaining superior generalization to unseen conditions.
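One common mitigation for frequency deafening, shown here as a hedged sketch rather than the authors' actual pipeline, is to augment wideband training audio by round-tripping it through the 8 kHz telephony band:

```python
# Simulate telephony-band audio from 16 kHz recordings with torchaudio.
import torchaudio
import torchaudio.functional as F

def telephony_augment(wav, sr: int = 16_000):
    narrow = F.resample(wav, orig_freq=sr, new_freq=8_000)   # downsample to 8 kHz
    return F.resample(narrow, orig_freq=8_000, new_freq=sr)  # back to 16 kHz

wav, sr = torchaudio.load("clip.wav")  # any 16 kHz training clip
augmented = telephony_augment(wav, sr)
```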
Our training data specifically includes:
- Internet Data (narration, conversation, people reading text, telephony samples)
- TTS Model Data (synthetic speech for augmentation)
- Telephony Data (real-world call recordings with varied codecs and noise)
This multi-domain approach ensures Parrot STT performs reliably across call centers, voice assistants, mobile apps, and other telephonic/streaming applications.
🙏 Acknowledgements
- Built with NVIDIA NeMo models
- Research inspired by *Toward Domain-Invariant Speech Recognition via Large Scale Training* (Narayanan et al., 2018)