Image credit: Meta
Meta has taken significant strides in language accessibility by developing and sharing models and code for speech recognition, speech synthesis, and more for over 1,100 languages through its Massively Multilingual Speech (MMS) project.
Speech technology can make information accessible to many more people, especially those who primarily rely on voice. The barrier to building it, however, is the need for substantial labeled data – thousands of hours of audio with transcriptions. This challenge is amplified for languages with fewer speakers, leaving thousands of the world's spoken languages without speech recognition models.
To tackle this, Meta employed wav2vec 2.0 and a new dataset that provides labeled data for over 1,100 languages and unlabeled data for nearly 4,000 languages in the MMS project. Leveraging religious texts like the Bible that have been translated into various languages, Meta created a dataset of readings of the New Testament in over 1,100 languages, offering an average of 32 hours of data per language.
The new models were trained on about 500,000 hours of speech data in over 1,400 languages, far surpassing any known previous work. They significantly outperformed existing models while covering 10 times as many languages. For instance, on the FLEURS benchmark, the MMS models achieved half the word error rate of OpenAI's Whisper, despite covering 11 times more languages.
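Word error rate, the metric behind the Whisper comparison above, is the word-level edit distance between a reference transcript and a model's hypothesis, divided by the number of reference words. A minimal sketch of the standard computation (not Meta's evaluation code):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER: word-level edit distance divided by reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the first i-1 reference
    # words and the first j hypothesis words (classic Levenshtein DP).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, 1):
            curr[j] = min(prev[j] + 1,             # deletion
                          curr[j - 1] + 1,         # insertion
                          prev[j - 1] + (r != h))  # substitution
        prev = curr
    return prev[len(hyp)] / len(ref)

# One dropped word out of six reference words -> WER of about 0.167
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

So "half the word error rate" means that, on average, MMS transcripts contain half as many word-level insertions, deletions, and substitutions per reference word as Whisper's.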
Meta has also trained a language identification model covering over 4,000 languages and built text-to-speech systems for over 1,100 languages. Despite the training data containing few distinct speakers for many languages, the text-to-speech systems proved capable of producing good quality speech.
Publicly sharing their models and code, Meta aims to enable others in the research community to build upon their work, contribute to preserving global language diversity, and envision a future where technology encourages language preservation by enabling access to information and technology use in preferred languages.
Ultimately, the goal is a single model that can handle multiple speech tasks across all languages, an advance that would boost overall performance and broaden global access to speech technology.