One paper has been accepted at ECAI 2024

Congratulations to our senior project student Fahim Ahmed and research assistant Md Fahim on getting their paper accepted at the CORE rank A conference, the European Conference on Artificial Intelligence (ECAI) https://www.ecai2024.eu/. The acceptance rate for ECAI 2024 was highly competitive at 24%. The title of the paper is "Improving the Performance of Transformer-based Models Over Classical Baselines in Multiple Transliterated Languages".

Here is a short description of the paper:

Online discourse, by its very nature, is rife with transliterated text along with code-mixing and code-switching. Transliteration features heavily because romanized text is far easier to input with standard keyboards than native scripts. Given its ubiquity, ensuring that NLP models handle such text well in real-world scenarios is a critical area of study.

In this paper, we analyze the performance of various language models on the classification of romanized/transliterated social media text. We chose the tasks of sentiment analysis and offensive language identification, and carried out experiments on three languages, namely Bangla, Hindi, and Arabic, across six datasets. To our surprise, we discovered across multiple datasets that classical machine learning methods (Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), and XGBoost) perform very competitively with fine-tuned transformer-based mono-/multilingual language models (BanglishBERT, HingBERT, DarijaBERT, XLM-RoBERTa, mBERT, and mDeBERTa), tiny LLMs (Gemma-2B and TinyLLaMa), and ChatGPT on classification tasks over transliterated text. Additionally, we investigated various mitigation strategies, such as translation and augmentation via ChatGPT, as well as dataset-specific pretraining of the language models via Masked Language Modelling (MLM). Depending on the dataset and language, employing these mitigation techniques yields a further 2-3% improvement in accuracy and macro-F1 over the baselines.
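For readers curious about the MLM-based mitigation mentioned above, the sketch below shows what dataset-specific continued pretraining can look like with the Hugging Face transformers library. The model name, file path, and hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Illustrative sketch only: continued (dataset-specific) pretraining of a
# multilingual encoder with Masked Language Modelling, in the spirit of the
# mitigation strategy described above. Model name, file path, and
# hyperparameters are assumptions, not the paper's actual setup.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "xlm-roberta-base"  # one of the multilingual models mentioned above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Unlabelled transliterated text from the target dataset, one example per line
# (hypothetical file name).
raw = load_dataset("text", data_files={"train": "transliterated_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens; the model learns to reconstruct them.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

args = TrainingArguments(
    output_dir="mlm-adapted",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
)

Trainer(
    model=model, args=args, train_dataset=tokenized, data_collator=collator
).train()
# The adapted checkpoint in "mlm-adapted" can then be fine-tuned for
# sentiment analysis or offensive language identification as usual.
```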

We demonstrate that TF-IDF and BoW-based classifiers achieve performance within around 3% of fine-tuned LMs and could thus be considered strong baselines for NLP tasks on transliterated text.
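To give a concrete picture of what such a baseline looks like, here is a minimal scikit-learn sketch of a TF-IDF + Logistic Regression classifier. The toy data, labels, and feature settings are placeholders for illustration only and are not the paper's datasets or exact configuration.

```python
# Minimal sketch of a TF-IDF + Logistic Regression baseline of the kind
# compared against fine-tuned language models above. The corpus and settings
# are placeholders; the paper's actual preprocessing may differ.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Placeholder corpus of romanized/transliterated posts with binary sentiment labels.
texts = ["khub bhalo laglo", "eta ekdom baje", "bohut accha hai", "ye bilkul bekar hai"]
labels = [1, 0, 1, 0]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=42, stratify=labels
)

# Character n-grams tend to be robust to the spelling variation typical of
# transliterated text; word n-grams give a BoW-style setup instead.
clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5)),
    LogisticRegression(max_iter=1000),
)
clf.fit(X_train, y_train)

preds = clf.predict(X_test)
print("accuracy:", accuracy_score(y_test, preds))
print("macro-F1:", f1_score(y_test, preds, average="macro"))
```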