Overview Of Speech Recognition

Speech recognition, also called speech-to-text, is the first step in any pipeline of algorithms that works with voice input from a user. Virtual assistants such as Siri, Alexa, and Google Assistant are examples of this technology in use. It converts sound, whether live or recorded, into text that other algorithms can then process. The meaning of the text produced by the speech recognition algorithm can subsequently be extracted with Natural Language Processing.

Speech recognition algorithms are trained by supervised learning: a large collection of audio clips, each paired with its text transcription, is fed into the algorithm. Through extensive training on this data, the system learns rules for matching audio to words. Many speech-to-text data sets are publicly available for training; nonetheless, to produce a strong model, a speech recognition system needs to be trained on the words and phrases its users will actually speak. For more generic speech recognition algorithms, the training data must instead cover the phonemes the system will encounter, rather than a specific set of words that users would speak.
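One way to check whether a training set covers what users will say is to compare the vocabulary of the training transcripts against a list of expected user phrases. The sketch below is a minimal illustration under stated assumptions: the transcripts and phrases are made up, the function names `tokenize` and `out_of_vocabulary` are hypothetical, and a real system would also need to check the audio side of each pair, not only the text.

```python
import re

def tokenize(text):
    """Lowercase a transcript and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def out_of_vocabulary(training_transcripts, expected_phrases):
    """Return the expected words that never appear in the training transcripts."""
    known = set()
    for transcript in training_transcripts:
        known.update(tokenize(transcript))
    missing = set()
    for phrase in expected_phrases:
        missing.update(word for word in tokenize(phrase) if word not in known)
    return sorted(missing)

# Hypothetical transcripts; in a real data set each one is paired with an audio clip.
training_transcripts = [
    "what is my account balance",
    "transfer money to my savings account",
]
# Hypothetical phrases we expect users to speak.
expected_phrases = ["what is my mortgage rate", "transfer money to savings"]

missing_words = out_of_vocabulary(training_transcripts, expected_phrases)
print(missing_words)  # ['mortgage', 'rate']
```

Words that show up in the result are ones the model would never have heard during training, which suggests where more recordings may be needed.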

Traditional speech recognition methods built on Hidden Markov Models (HMMs) still have plenty of uses, but newer methods built on Deep Neural Networks are generally more efficient. Traditional methods remain in use today because they are more effective at encoding longer speech segments. A later post will go deeper into the inner workings of these algorithms.

Speech recognition algorithms have several obstacles to overcome. The first is input quality. When audio is recorded in a noisy location or arrives over a poor connection (for example, audio from the other end of a phone line), accuracy drops sharply. The audio signal must therefore be cleaned up by pre-processing algorithms before it is passed to the speech recognition algorithm.
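As a rough illustration of what pre-processing can look like, the sketch below applies three simple steps to a list of audio samples: removing a constant offset, scaling to a fixed peak level, and trimming quiet samples from both ends. This is only a sketch under stated assumptions. The function names, the threshold value, and the sample data are hypothetical, and real noise reduction typically relies on more sophisticated signal-processing techniques than these.

```python
def remove_dc_offset(samples):
    """Subtract the mean so the waveform is centered on zero."""
    if not samples:
        return []
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]

def peak_normalize(samples, target_peak=1.0):
    """Scale the samples so the loudest one has magnitude target_peak."""
    if not samples:
        return []
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # pure silence: nothing to scale
    scale = target_peak / peak
    return [s * scale for s in samples]

def trim_silence(samples, threshold=0.05):
    """Drop leading and trailing samples whose magnitude is below threshold."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []
    return samples[loud[0]:loud[-1] + 1]

def preprocess(samples):
    """Run the three clean-up steps in order."""
    return trim_silence(peak_normalize(remove_dc_offset(samples)))

# Hypothetical recording: a short burst of signal riding on a constant offset of 0.5.
raw = [0.5, 0.5, 1.5, -0.5, 0.5, 0.5]
cleaned = preprocess(raw)
print(cleaned)
```

The order matters here: the offset is removed first so that the peak measurement and the silence threshold both operate on a waveform centered on zero.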

Variability in speech patterns is another problem. People speak in different languages and with different accents. To recognize speech correctly across languages and accents, an algorithm must be trained on a data set that contains them. When a speaker uses a different language or accent, the algorithm must be able to recognize this and apply the rules it has learned for encoding that language or accent.

Domain is the final problem that speech recognition algorithms must deal with. If your virtual assistant is designed to provide financial advice, you can expect customers to talk mostly about money matters. Because you can restrict the data set to that particular domain, you need less data to train your virtual assistant. We refer to this as a “closed domain problem.” The “open domain problem” is what Siri, Alexa, and Google Assistant are attempting to solve. Because their users can discuss almost any topic and use almost any phrase, substantially larger data sets are required, and context becomes increasingly important. In speech recognition, homophones like “your” and “you’re” sound the same and can only be told apart by context. To capture context, Deep Neural Networks (in particular bidirectional Recurrent Neural Networks) can read a sentence both forward and backward.
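To make the role of context concrete, the toy sketch below picks between “your” and “you’re” by looking at the word before and the word after the ambiguous position. It is not a neural network: it uses simple neighbor counts as a stand-in for the idea of reading context in both directions. The training sentences and the function names are hypothetical.

```python
from collections import Counter

def train_context_counts(sentences, candidates):
    """Count the words seen immediately before and after each candidate word."""
    left = {c: Counter() for c in candidates}
    right = {c: Counter() for c in candidates}
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            if word in left:
                if i > 0:
                    left[word][words[i - 1]] += 1
                if i + 1 < len(words):
                    right[word][words[i + 1]] += 1
    return left, right

def pick_homophone(prev_word, next_word, candidates, left, right):
    """Choose the candidate whose neighbors best match the training counts."""
    def score(candidate):
        return left[candidate][prev_word] + right[candidate][next_word]
    return max(candidates, key=score)

# Hypothetical training sentences containing both spellings.
sentences = [
    "is this your account",
    "check your balance please",
    "i think you're right",
    "you're welcome to call",
    "tell me when you're ready",
]
candidates = ["your", "you're"]
left, right = train_context_counts(sentences, candidates)

print(pick_homophone("check", "balance", candidates, left, right))  # your
print(pick_homophone("say", "welcome", candidates, left, right))    # you're
```

In the second call the preceding word was never seen in training, so only the following word decides the outcome, which is the kind of case where looking in both directions helps.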

Speech recognition is only the first stage of deriving meaning from spoken words. Once the speech has been encoded as text, the text is sent to Natural Language Processing algorithms for interpretation. You can read more about Natural Language Processing here.