Advances in machine learning and speech recognition technologies have made information more accessible, especially for people who rely on voice-based access. However, the absence of labeled data for many languages remains a major obstacle to building high-quality machine learning models.
The Meta-led Massively Multilingual Speech (MMS) project has addressed this issue by enhancing the performance of speech recognition and synthesis models and broadening language coverage.
By combining self-supervised learning techniques with a diverse dataset of religious readings, the MMS project has achieved something remarkable: speech recognition models that now support more than 1,100 languages.
Overcoming Linguistic Obstacles
To address the lack of labeled data for most languages, the MMS project turned to religious texts, such as the Bible, which have been translated into many languages.
Publicly available audio recordings of people reading these translations made it possible to compile a dataset of New Testament readings in more than 1,100 languages.
By adding unlabeled recordings of readings from other religious texts, the project extended its language coverage to nearly 4,000 languages.
Despite the dataset's narrow domain and its preponderance of male speakers, the models performed equally well for male and female voices, and Meta adds that no religious bias was introduced.
Overcoming Obstacles With Self-Supervised Learning
Only about 32 hours of data per language is far too little to train conventional supervised speech recognition models.
To overcome this limitation, the MMS project leveraged wav2vec 2.0, a self-supervised speech representation learning technique.
By training self-supervised models on almost 500,000 hours of speech across 1,400 languages, the project greatly reduced its dependence on labeled data.
The resulting models were then fine-tuned for specific speech tasks, such as multilingual speech recognition and language identification.
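The idea behind wav2vec 2.0's self-supervised objective is to mask spans of latent speech features and train the model to pick out the true latent for each masked timestep from a set of distractors via a contrastive loss. The toy NumPy sketch below illustrates only that contrastive step; it is not Meta's implementation, and all shapes, names, and sample data are invented for the example.

```python
import numpy as np

def contrastive_loss(context, targets, mask_idx, num_negatives=5,
                     temperature=0.1, rng=None):
    """wav2vec 2.0-style contrastive objective (toy version).

    For each masked timestep, the context vector must identify the true
    latent among `num_negatives` distractors drawn from other timesteps
    of the same utterance.
    """
    rng = rng or np.random.default_rng(0)
    losses = []
    T = targets.shape[0]
    for t in mask_idx:
        # Candidate latents: the true one (index 0) plus sampled negatives.
        neg_idx = rng.choice([i for i in range(T) if i != t],
                             size=num_negatives, replace=False)
        candidates = np.vstack([targets[t], targets[neg_idx]])  # (1+K, D)
        # Cosine similarity between the context vector and each candidate.
        sims = candidates @ context[t] / (
            np.linalg.norm(candidates, axis=1)
            * np.linalg.norm(context[t]) + 1e-9
        )
        # Softmax cross-entropy with the true latent at index 0.
        logits = sims / temperature
        losses.append(np.log(np.exp(logits).sum()) - logits[0])
    return float(np.mean(losses))

# Toy data: 50 timesteps of 16-dim latent features, 4 masked positions.
rng = np.random.default_rng(42)
latents = rng.normal(size=(50, 16))
masked = [3, 10, 27, 41]

# Context vectors that match the true latents give a low loss;
# random context vectors give a much higher one.
good = contrastive_loss(latents, latents, masked,
                        rng=np.random.default_rng(1))
bad = contrastive_loss(rng.normal(size=(50, 16)), latents, masked,
                       rng=np.random.default_rng(1))
print(f"aligned context loss: {good:.3f}, random context loss: {bad:.3f}")
```

In the real model the context vectors come from a Transformer over the masked features and the targets are quantized latents, but the contrastive comparison works the same way.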
Evaluations of the models trained on the MMS data produced remarkable results: compared with OpenAI's Whisper, the MMS models achieved half the word error rate while covering 11 times as many languages.
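Word error rate, the metric behind the Whisper comparison, is the word-level edit distance between the hypothesis and reference transcripts divided by the number of reference words. A minimal sketch using the standard Levenshtein dynamic program (the sample sentences are invented):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between the first i reference words
    # and the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return dp[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words -> WER of 1/6.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"WER: {wer:.3f}")
```

Halving this metric means the transcripts contain half as many word-level mistakes relative to the reference.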
In addition, the MMS project successfully built text-to-speech systems for more than 1,100 languages. Although many languages had recordings from only a small number of distinct speakers, the synthesized speech was of notably high quality.
While the MMS models show encouraging results, it is important to recognize that they are not flawless. The speech-to-text system can mistranscribe or misinterpret audio, producing inaccurate or even inappropriate words. To reduce these risks, the MMS project places a strong emphasis on collaboration within the AI community.