The last few decades have been generous to the development of AI, machine learning and all things computer-related. Hardware capabilities are commonly said to double approximately every two years, and recent findings suggest that the pace is even quicker and still speeding up. But what about the software side?
We have been in the conversational AI business and research for five years now. With the rise of AI utilization, the field of machine learning algorithms has grown more diverse. To see how the algorithmic side of AI has evolved, we decided to test the modern learning algorithms available today. This post covers the background: the theory and the methodology. In the second post, we will look into the results and try to give practical tips for AI projects.
Conversational AI today is essentially an ensemble of learning algorithms that try to identify and learn various patterns and to make decisions or predictions on their own, relying on data instances and, at times, on human input.
We can categorize learning algorithms into two types: classifiers and language models. For a classifier to work (i.e. to learn and establish connections), it needs an existing collection of phrases that have already been sorted into groups, or classes. This method requires human input. A common example of a classifier is the spam detection system inside your mailbox. There are two classes, "spam" and "not spam", which are regularly filled with new inputs that users have labeled as either "spam" or "not spam".
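To make the spam example concrete, here is a minimal sketch of a classifier that learns from labeled phrases. It uses a naive Bayes word-count approach, which is our choice for illustration and not necessarily what any real mailbox uses. The training phrases and the function names `train` and `classify` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled phrases; in practice the labels come from users.
TRAINING_DATA = [
    ("win a free prize now", "spam"),
    ("free money click here", "spam"),
    ("claim your free prize", "spam"),
    ("meeting moved to monday", "not spam"),
    ("see you at lunch tomorrow", "not spam"),
    ("project report attached", "not spam"),
]


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, label_counts


def classify(text, word_counts, label_counts):
    """Return the label with the highest log-probability (add-one smoothing)."""
    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_examples in label_counts.items():
        # Start from the prior: how common the label is overall.
        score = math.log(n_examples / total_examples)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            count = word_counts[label][word] + 1  # add-one smoothing
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


word_counts, label_counts = train(TRAINING_DATA)
print(classify("free prize inside", word_counts, label_counts))        # spam
print(classify("lunch meeting on monday", word_counts, label_counts))  # not spam
```

The point to notice is that nothing works until someone has supplied the labels in `TRAINING_DATA`; the classifier only generalizes from what humans have already categorized.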
Language models, on the other hand, can do their work of learning and predicting possible next words on raw text without the assistance of humans. When a machine learning project starts out, there are usually not many categorized phrases available, but there is plenty of raw text data. Learning algorithms based on language models can start to make connections about how text is constructed in the given language, which in turn is useful later, when the classifications need to be formed. Because of this, most modern learning algorithms today are language models. In a way, language models behave like a grammar: the available data directs the model to choose the most fitting next word. These kinds of models are used in a wide variety of use cases, for example conversational AI intent prediction, machine translation, speech recognition, sentiment analysis and many others.
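As a rough illustration of predicting the next word from raw text, the sketch below builds a bigram model: it simply counts which word follows which. This is a deliberately tiny stand-in for modern language models, and the sample text and the names `train_bigram_model` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical raw, unlabeled text; no human categorization is needed.
RAW_TEXT = (
    "i want to order a pizza . "
    "i want to book a table . "
    "i want to order a drink . "
    "please book a table for two ."
)


def train_bigram_model(text):
    """Count which word follows which in the raw text."""
    tokens = text.split()
    following = defaultdict(Counter)
    for current_word, next_word in zip(tokens, tokens[1:]):
        following[current_word][next_word] += 1
    return following


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    if word not in model:
        return None
    return model[word].most_common(1)[0][0]


model = train_bigram_model(RAW_TEXT)
print(predict_next(model, "want"))  # to
print(predict_next(model, "to"))    # order
print(predict_next(model, "a"))     # table
```

Unlike the classifier case, no labels were supplied here. The model picked up the structure of the text from the text itself.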
To compare how various learning algorithms might perform in a conversational AI use case, we ran a comprehensive benchmark. The main goal of the experiment was to find out how model size and the volume and quality of the data affect the overall performance of the models (accuracy of intent matching, phrase learning). The models were tested on Estonian, English and Latvian.
The general idea behind evaluating a learning model is to train and test it on different data points. This shows how the model handles input it has not seen before. Think of it as a school exam, except that the participants are computer programs. It does not make much sense to show students the exact answers before the test, and the same applies to machine learning algorithms.
In our experiments we used training:test ratios of 90:10 and 10:90. In the first case, 90% of the data was used for training and the remaining 10% for evaluation; in the second case, the proportions were reversed. These ratios were picked to simulate real-life scenarios for virtual assistants. With new customers, for example, there might be very little data available, so the 10:90 ratio represents the early stages of a virtual assistant. The 90:10 results should reflect the situation in which the virtual assistant has completed a lot of conversations and the models behind it have processed the data.
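The two splits can be sketched as follows. This is a minimal illustration and not the exact procedure used in the experiments: the random shuffle with a fixed seed, the function name `split_data` and the generated dataset are all assumptions.

```python
import random


def split_data(examples, train_fraction, seed=0):
    """Shuffle examples and split them into disjoint training and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical dataset of 100 labeled phrases.
examples = [(f"phrase {i}", f"intent {i % 5}") for i in range(100)]

# 90:10 simulates a mature assistant, 10:90 a newly launched one.
mature_train, mature_test = split_data(examples, train_fraction=0.9)
early_train, early_test = split_data(examples, train_fraction=0.1)

print(len(mature_train), len(mature_test))  # 90 10
print(len(early_train), len(early_test))    # 10 90
```

Whatever the ratio, the key property is that no phrase appears in both the training and the test set.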
We also tried out some extreme cases, for example using ordinary tokenization, and eliminating the "other" intent together with its training data.
To classify raw text, a dictionary of the language must first be formed. This dictionary must include either the stems of words or words sliced into smaller parts. In English, the former is usually the way to go, as the root of a word generally does not change. In some languages, however, the root may change (for example in declensions) or stay the same in compound words that have completely different meanings. Estonian is one such language, and there the words need to be tokenized into smaller parts.
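One simple way to slice words into parts is a greedy longest-match over a vocabulary of known pieces. The sketch below only illustrates the idea and is not the tokenizer used in the experiments. The vocabulary is hypothetical and uses English examples for readability; real systems typically learn their vocabulary from data.

```python
# Hypothetical subword vocabulary; real systems learn this from data.
VOCABULARY = {"play", "walk", "ing", "ed", "er", "s"}


def tokenize_word(word, vocabulary):
    """Split a word into the longest known pieces, left to right.

    Characters not covered by any vocabulary entry become
    single-character pieces.
    """
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first and shrink until a match is found.
        for end in range(len(word), start, -1):
            if word[start:end] in vocabulary:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No known piece starts here; fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


print(tokenize_word("playing", VOCABULARY))  # ['play', 'ing']
print(tokenize_word("walkers", VOCABULARY))  # ['walk', 'er', 's']
```

Because unknown material falls back to single characters, every word can be represented, even one whose root has changed form and no longer matches a dictionary stem.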
By eliminating the "other" intent and its training data, we were looking for an answer to the question of how much data quality, and the decisions that AI trainers make, affect the overall performance of a virtual assistant. When conversational AI trainers do their work, they assign incoming phrases to appropriate topics. When they think that a phrase does not match any of the topics, is miscellaneous or is generally not helpful for the AI, they link it to the "other" intent. One could argue that every classifier and the phrase examples linked to it need special attention, but in our opinion the analysis of "other" is a good place to start.
Stay tuned for Part II, where we look into the test results.