In this blog post we cover the practical usability of Estonian-language speech-to-text and text-to-speech technologies. At AlphaBlues we are focused on building high-end virtual assistants on our own AI platform. Recently we have been integrating voice capabilities into our virtual assistants. As we tested Estonian-language solutions, we decided to share our experience and give anyone interested in using these technologies for practical purposes a quick overview.
The use of voice-controlled devices is growing. Juniper Research estimates that there will be 8 billion digital voice assistants in use by 2023. Today, AI-based virtual assistants are already capable of resolving customer enquiries over text. The natural next step in their evolution will most likely be the implementation of voice recognition technology. Where does voice recognition technology in Estonia stand today from a practical standpoint?
First, what we mean by Voice AI. For the AI to answer a question asked by voice, the incoming enquiry has to be transcribed from audio to text (speech-to-text, STT for short). The AI can then analyze the text and trigger the appropriate answer, which is converted back to audio (text-to-speech, TTS). We did some work with voice assistants and looked at the solutions available for STT and TTS. For natural language understanding (i.e. understanding the meaning of the asked question, or intent detection for short) we use AlphaAI, an intent-detection solution developed by us at AlphaBlues. We were looking for a third-party STT solution to integrate with AlphaAI: transcribe speech to text, feed that text into our own intent detection, and then use TTS to convert the text answer back into speech.
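The pipeline described above can be sketched in a few lines of Python. This is a minimal illustration, not the actual AlphaAI or vendor API: the `transcribe`, `detect_intent` and `synthesize` names are placeholders for whichever STT, NLU and TTS backends you plug in.

```python
from typing import Callable

def answer_voice_query(
    audio: bytes,
    transcribe: Callable[[bytes], str],    # STT backend: audio -> text
    detect_intent: Callable[[str], str],   # NLU backend: question -> answer text
    synthesize: Callable[[str], bytes],    # TTS backend: text -> audio
) -> bytes:
    """Run one voice enquiry through STT -> intent detection -> TTS."""
    text = transcribe(audio)       # speech-to-text
    answer = detect_intent(text)   # pick the appropriate answer
    return synthesize(answer)      # text-to-speech
```

Keeping each stage behind a plain callable makes it easy to swap out one synthesizer or recognizer for another during evaluation, which is exactly what we did in the tests below.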
For TTS, speech synthesizers are needed. Currently, the Institute of the Estonian Language (EKI) offers four different speech synthesizers: DNN, Ossian, „Üksuste valiku hääled“ (unit selection voices) and HTS. In addition to EKI's contributions, we tested the Neurokõne and Google Cloud Speech speech synthesizers.
Unfortunately, all of the mentioned speech synthesizers fall short in some respect today. For DNN and Ossian, numbers proved difficult. For example, „100. anniversary“ was read as „one zero zero anniversary“. In addition, loading times were quite slow, which did not help deliver a smooth user experience. HTS provided better response times and more accurate number handling, but improvements could still be made. EKI's „Üksuste valiku hääled“ also provided quite nimble response times and handled numbers correctly, although the sound quality could be a little better and the flow of speech smoother.
Google’s technology and capability are understandably in another league compared to local players, but as of today there is no support for Estonian. With Cloud Speech, the closest we got to Estonian was Finnish with the voice type set to WaveNet. Responses were immediate and number recognition accurate, but the overall tone of the speech and its nuances were Finnish by nature.
Neurokõne, a prototype by the University of Tartu NLP research group, handled numbers correctly, and response times were reasonable. The only downside of Neurokõne is that the input is limited to a maximum of 200 characters, which is not ideal but manageable for virtual assistants. Other than that, Neurokõne, with its active development team and regular updates, seems to be the best speech synthesizer currently available. Another cool thing about Neurokõne is that they offer custom-made speech synthesis models for your projects and can even use your own voice for a truly unique experience.
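One practical way to live with the 200-character limit is to split the answer at word boundaries, synthesize each chunk separately, and concatenate the resulting audio. The helper below is a hypothetical sketch of the splitting step (it is not part of any Neurokõne client library):

```python
def split_for_tts(text: str, limit: int = 200) -> list:
    """Split text into chunks no longer than `limit` characters,
    breaking at word boundaries so each chunk stays natural to read."""
    chunks, current = [], ""
    for word in text.split():
        candidate = (current + " " + word).strip()
        if len(candidate) <= limit:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # a single word longer than the limit is truncated as a fallback
            current = word[:limit]
    if current:
        chunks.append(current)
    return chunks
```

Breaking at sentence boundaries instead of word boundaries would sound even more natural, but word-level splitting already keeps every request under the limit.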
For Estonian STT one can use the real-time full-duplex speech recognition server based on the Kaldi toolkit and the GStreamer framework. It can also be tested in the browser. From a practical development point of view, STT processing speed should be considered when developing Estonian voice-enabled AI solutions. In our test, STT processing time depended quite heavily on the number of input words (see table below).
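As a rough sketch, such a server can be called over its HTTP interface: POST raw audio, get back a JSON body with recognition hypotheses. The endpoint path and JSON shape below follow the kaldi-gstreamer-server project's documented interface; the URL is an assumption and must point at your own deployment.

```python
import json
import urllib.request

def parse_recognition_response(body: str) -> str:
    """Extract the best transcript from the server's JSON response,
    e.g. {"status": 0, "hypotheses": [{"utterance": "..."}]}."""
    data = json.loads(body)
    if data.get("status") != 0 or not data.get("hypotheses"):
        raise RuntimeError("recognition failed: " + body)
    return data["hypotheses"][0]["utterance"]

def transcribe_wav(wav_bytes: bytes,
                   url: str = "http://localhost:8888/client/dynamic/recognize") -> str:
    """POST WAV audio to the recognition server and return the transcript."""
    req = urllib.request.Request(url, data=wav_bytes, method="POST")
    with urllib.request.urlopen(req) as resp:
        return parse_recognition_response(resp.read().decode("utf-8"))
```

For truly real-time use the server's WebSocket interface, which streams partial results while the user is still speaking, is the better fit; the HTTP call above returns only after the whole utterance is processed.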
Phrases of fewer than 10 words are still quite viable if a reasonably smooth user experience and interaction between the virtual assistant and a human is desired. However, longer phrases already need more than 4 seconds of processing time, which is by no means immediate and may end up being restrictive for real-time chat.
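Measuring this for your own setup is straightforward: time the STT round trip per phrase and compare against your latency budget. A small backend-agnostic helper along these lines (the `transcribe` callable stands for any STT backend) is all it takes:

```python
import time
from typing import Callable, Tuple

def timed_transcription(transcribe: Callable[[bytes], str],
                        audio: bytes) -> Tuple[str, float]:
    """Return the transcript and the elapsed wall-clock seconds."""
    start = time.monotonic()          # monotonic clock is safe for intervals
    text = transcribe(audio)
    elapsed = time.monotonic() - start
    return text, elapsed
```

Running this over phrases of increasing length is how a table like the one above can be reproduced against any recognizer.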
In conclusion, voice recognition and conversational AI technology have evolved significantly over the last few years. As voice recognition capabilities increase, use cases where conversational AI and voice recognition are unified will also multiply. However, for anyone wanting to develop a fully fledged conversational voice AI with support for smaller languages such as Estonian today, the options are quite limited. Big tech companies and their technologies have an edge in English-based applications, but these advancements are of little use for smaller languages. Based on our experience, the main practical challenges with current Estonian Voice AI are correct number recognition, data processing speed and limited input length. However, as these qualities improve over time, automated voice communication in Estonian will also reach wider audiences.