This is part II of our previous post, where we covered the methods and the theory behind the comparison of learning algorithms. In this post, we dive into our experiment results and conclusions.
As most of the models are built and optimised for English, we started our testing with English data. In the English experiment, we tested BERT, DistilBERT, RoBERTa, XLM, XLNet, ULMFiT and MLP. When we increased the share of training data (i.e. a 90:10 instead of a 10:90 training:test ratio), the prediction accuracy improved by 20% on average.
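The split methodology can be sketched as follows. This is a minimal, illustrative example: the toy dataset and the naive word-overlap classifier are hypothetical stand-ins for the real datasets and transformer models used in the experiment.

```python
import random

# Hypothetical labelled utterances (phrase, intent) -- illustrative only.
DATA = [
    ("what time do you open", "opening_hours"),
    ("when are you open today", "opening_hours"),
    ("your opening hours on saturday", "opening_hours"),
    ("how much does delivery cost", "pricing"),
    ("what is the price of shipping", "pricing"),
    ("delivery price to my address", "pricing"),
    ("cancel my order", "cancellation"),
    ("i want to cancel the order", "cancellation"),
    ("please cancel this order now", "cancellation"),
    ("tell me something interesting", "other"),
]

def split(rows, train_ratio, seed=0):
    """Shuffle and cut the data into train/test at the given ratio."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_ratio)
    return rows[:cut], rows[cut:]

def predict(phrase, train):
    """1-nearest-neighbour on shared-word count: a toy intent matcher."""
    best = max(train, key=lambda r: len(set(phrase.split()) & set(r[0].split())))
    return best[1]

def accuracy(train, test):
    hits = sum(predict(p, train) == intent for p, intent in test)
    return hits / len(test)

for ratio in (0.9, 0.1):  # 90:10 vs 10:90 training:test ratios
    train, test = split(DATA, ratio)
    print(f"{int(ratio * 100)}:{100 - int(ratio * 100)} split -> "
          f"accuracy {accuracy(train, test):.2f}")
```

With a dataset this small the numbers are noisy, but the harness mirrors the comparison we ran: the same model family evaluated at the two opposite split ratios.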
The most accurate models, RoBERTa and XLM, also stood out for having the largest model sizes (1.4 GB and 2.5 GB respectively). The MLP model, which was inferior in accuracy, ran at a considerably smaller size of 15 MB. Even though the main variable we examined in our test was intent prediction accuracy, the other aspects of the models should not be ignored. A lightweight solution that is more nimble in real-time use cases and has the capacity to absorb growth in the user base can be beneficial for conversational AI projects.
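A simple way to compare footprints is to serialize each trained model and check its on-disk size. The sketch below uses a toy dictionary as a hypothetical stand-in for real model weights; the same check applies to any serialized model file.

```python
import os
import pickle
import tempfile

# Toy "model" (intent -> example phrases), standing in for real weights.
model = {
    "opening_hours": ["when are you open"] * 1000,
    "pricing": ["how much does it cost"] * 1000,
}

# Serialize to a temporary file and measure its size on disk.
with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as f:
    pickle.dump(model, f)
    path = f.name

size_kb = os.path.getsize(path) / 1024
print(f"serialized model size: {size_kb:.1f} KB")
os.remove(path)
```

For the transformer models in our test, the equivalent numbers come from the size of the saved checkpoint files, which is what the 1.4 GB / 2.5 GB / 15 MB figures above refer to.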
In the Estonian language environment we tested the ULMFiT, BERT and MLP models. As training the models in different languages is a very compute-intensive task and some of the models are language-specific, we went with the models that had already proven themselves in the English test. In the Estonian test we saw, on average, a 17% improvement in accuracy when increasing the training data share to 90%.
In the Estonian language environment we also tried out some extreme cases. We started by running the algorithms on tokenized words. With tokenized words as input and a 90:10 training:test ratio, the accuracy only matched the ordinary results obtained with 10% training and 90% test data. These results show that one should keep in mind the complexity and nuances of each language: methods that work in English use cases may not be the optimal choice for other languages.
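One plausible reason token-level input hurts more in Estonian is its rich inflection: a single lemma surfaces as many distinct word forms, which an exact-token comparison treats as unrelated words. The sketch below illustrates this with real case forms of one Estonian noun; the fixed-length "stem" heuristic is purely illustrative, not a real stemmer.

```python
# Surface forms of the Estonian noun "maja" (house) in different cases.
forms = ["maja", "majas", "majast", "majale", "majade", "majadele"]

# Naive tokenisation keeps every inflected form as a distinct token:
print(len(set(forms)))  # 6 distinct tokens for a single lemma

# A (very) crude heuristic -- assume a fixed 4-character stem -- collapses
# them back to one lexical unit, for illustration only:
stems = {f[:4] for f in forms}
print(stems)  # {'maja'}
```

English, with its sparse inflection, suffers far less from this vocabulary blow-up, which is consistent with the accuracy gap we observed.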
In addition, we eliminated the "other" intent and its training data and ran the algorithms with a 90:10 training:test ratio. With that, the models improved their accuracy by a further 10%. The removal of the "other" training data (i.e. phrases not related to any of the specified topics) reflects the importance of having quality training data and carefully picked phrases for each intent. Some of the models, and ultimately the virtual assistant, will run into trouble when the training data is packed with phrases assigned to the wrong intents, or when single intents are saturated with miscellaneous phrases. This means that AI trainers should be extra cautious and, when in doubt, think twice when linking example phrases to intents.
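The preprocessing step is a straightforward filter, sketched below on a hypothetical training set (phrases and intent names are illustrative). A quick count of phrases per intent is also a cheap way to spot a catch-all intent that has become saturated with miscellaneous phrases.

```python
from collections import Counter

# Hypothetical training rows (phrase, intent) -- illustrative only.
rows = [
    ("when are you open", "opening_hours"),
    ("what are your opening hours", "opening_hours"),
    ("cancel my order", "cancellation"),
    ("nice weather today", "other"),
    ("tell me a joke", "other"),
    ("what is the meaning of life", "other"),
]

# Drop the catch-all "other" intent and its phrases, as in the experiment.
filtered = [(p, i) for p, i in rows if i != "other"]

# Phrase counts per intent flag imbalance before training.
counts = Counter(i for _, i in rows)
print(counts.most_common())  # "other" dominates this toy set
print(len(filtered))         # 3 rows remain after removal
```

Running the same balance check on real training data before each retraining round is an easy guard against the mislabelling problems described above.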
In the Latvian language environment we tested, as in the Estonian experiment, the ULMFiT, BERT and MLP models. Prediction accuracy in the Latvian environment rose by 15% on average when we increased the training data ratio.
The main goal of the experiment was to find out how the volume and quality of the input data affect the overall performance (accuracy of intent matching, phrase learning) of the learning models. To do so, we ran the algorithms on English, Estonian and Latvian datasets, using 90:10 and 10:90 training:test data ratios. Our results showed that intent prediction accuracy behaved similarly across all models and languages. With the increase of training data (relative to test data), the models in the English environment became about 20% more accurate. The same models on Estonian and Latvian data improved by 17% and 15% respectively.
In addition, we compared the sizes of the models and how size might correlate with intent prediction accuracy. The most accurate models, RoBERTa and XLM, were indeed also significantly larger (1.4 GB and 2.5 GB respectively). On the other hand, the MLP model, which was somewhat less accurate, ran at a considerably smaller size of 15 MB.
Finally, we tested the models in extreme cases (e.g. input data consisting of tokenized words; the elimination of the "other" intent). With tokenized words as input and a 90:10 training:test ratio, intent prediction accuracy only matched the ordinary results obtained with 10% training and 90% test data. With the removal of the "other" intent and its training data, accuracy improved by 10% on average. While it is important to have accurate algorithms, the real difference may well emerge from attention to other aspects, such as model size, language specificity, and the quality and proper classification of the training data.