Developing a Multi-Agent Framework for Multimodal Multi-Task Learning

This project focuses on enhancing the capabilities of large multimodal models. Multimodal learning is an area of machine learning in which models are designed to process and correlate information from multiple input modalities, such as text, images, and audio. In this project, we are developing a multi-agent framework in which each agent specializes in understanding a specific modality and task. These agents work in tandem: the framework dynamically incorporates the agents specialized for the tasks at hand, enabling the system to handle multiple tasks simultaneously. By integrating these multi-agent ideas into large multimodal models, our project aims to significantly improve performance in multi-task learning and generalization to new tasks.
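A minimal sketch of the dynamic-incorporation idea described above: a registry of modality- and task-specialized agents, with a dispatcher that routes each incoming task to a capable agent. The class names, the `Task` fields, and the routing rule are illustrative assumptions, not the project's actual interfaces.

```python
from dataclasses import dataclass

@dataclass
class Task:
    modality: str   # e.g. "text", "image", "audio"
    name: str       # e.g. "captioning", "summarization"
    payload: object

class Agent:
    """An agent specialized in one modality and a set of tasks."""
    def __init__(self, modality, tasks):
        self.modality = modality
        self.tasks = set(tasks)

    def can_handle(self, task):
        return task.modality == self.modality and task.name in self.tasks

    def run(self, task):
        # A real agent would invoke its specialized model here.
        return f"[{self.modality}:{task.name}] processed {task.payload!r}"

class Framework:
    """Dynamically dispatches tasks to whichever registered agent can handle them."""
    def __init__(self):
        self.agents = []

    def register(self, agent):
        self.agents.append(agent)

    def dispatch(self, task):
        for agent in self.agents:
            if agent.can_handle(task):
                return agent.run(task)
        raise ValueError(f"no agent for {task.modality}/{task.name}")

fw = Framework()
fw.register(Agent("text", {"summarization"}))
fw.register(Agent("image", {"captioning"}))
print(fw.dispatch(Task("image", "captioning", "photo.jpg")))
```

In a full system the dispatcher itself could be an LLM-based planner that decomposes a multimodal query into per-agent subtasks before routing.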

Related publications:

  1. Large Multimodal Agents: A Survey
    Xie, J., Chen, Z., Zhang, R., Wan, X., & Li, G. (2024). Large Multimodal Agents: A Survey. arXiv:2402.15116. 
  2. AgentLite: A Lightweight Library for Building and Advancing Task-Oriented LLM Agent System
    Liu, Z., Yao, W., Zhang, J., Yang, L., Liu, Z., Tan, J., Choubey, P. K., Lan, T., Wu, J., Wang, H., Heinecke, S., Xiong, C., & Savarese, S. (2024). AgentLite: A Lightweight Library for Building and Advancing Task-Oriented LLM Agent System. arXiv:2402.15538.
  3. MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion
    Li, S., Wang, R., Hsieh, C.-J., Cheng, M., & Zhou, T. (2024). MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion. arXiv:2402.12741.

Non-Rigid Distortion Removal via Coordinate Based Image Representation

Imaging through a turbulent refractive medium (e.g., hot air, inhomogeneous gas, fluid flow) is challenging, since the non-linear light transport through the medium (e.g., refraction and scattering) causes non-rigid distortions in the perceived images. However, most computer vision algorithms rely on sharp, distortion-free images to achieve their expected performance. Removing these non-rigid image distortions is therefore critical and beneficial for many vision applications, from segmentation to recognition. To resolve the distortion and blur introduced by air turbulence, conventional turbulence restoration methods leverage optical flow, region fusion, and blind deconvolution to recover images. One underexplored avenue for this problem is the use of coordinate-based image representations. These methods represent an image as the parameters of a neural network, and they can be used to deform the image grid itself to account for turbulence. In this research, we aim to extend this idea to unseen images via meta-learning, so that both air and water distortions can be removed without much customization.
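The core idea above, an image as a function of coordinates, evaluated through a deformable grid, can be sketched in a few lines. The weights here are random and the warp is a fixed sine field standing in for the learned turbulence offsets; a real model (e.g. a SIREN-style MLP) would be trained to fit the distorted frames, then evaluated on the undeformed grid to render the restored image.

```python
import numpy as np

rng = np.random.default_rng(0)

def coord_grid(h, w):
    """Normalized (y, x) coordinate grid in [-1, 1], shape (h, w, 2)."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, h), np.linspace(-1, 1, w), indexing="ij")
    return np.stack([ys, xs], axis=-1)

# Tiny coordinate network: (..., 2) coords -> (..., 3) RGB.
W1 = rng.normal(size=(2, 32)); b1 = np.zeros(32)
W2 = rng.normal(size=(32, 3)); b2 = np.zeros(3)

def image_fn(coords):
    h = np.sin(coords @ W1 + b1)   # sine activation, SIREN-style
    return h @ W2 + b2

def deformation(coords):
    # Smooth, small-magnitude warp standing in for the turbulence offsets.
    return 0.05 * np.sin(3.0 * coords[..., ::-1])

grid = coord_grid(16, 16)
distorted = image_fn(grid + deformation(grid))  # render through the warp
restored = image_fn(grid)                       # evaluate on the clean grid
print(distorted.shape, restored.shape)
```

During optimization, the warp and the image network are fit jointly to the observed distorted frames; the meta-learning extension would initialize both so that only a few gradient steps are needed on a new, unseen image.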

Related publications:

  1. Unsupervised Non-Rigid Image Distortion Removal via Grid Deformation, ICCV 2021

Adaptive LLM-based Tutor for Personalized Python Learning

Because of their varied backgrounds and skill levels, students in programming education frequently confront a wide range of difficulties. Traditional learning platforms typically do not support personalized learning, which reduces their efficacy. Our goal is to construct an intelligent tutoring system based on LLMs that can reason and solve problems in order to provide students with tutor-like guidance. Additionally, we want to establish engaging interactions between students and tutors, and during these exchanges we would like to learn as much as possible about the tutors' internal decision-making process. Furthermore, in order to deliver a more approachable and natural experience that is in line with the learner's needs and the curriculum objectives, the system will need to recognize and monitor, as far as possible, the individual preferences and mental state of each learner.

LLMs in the context of Code-Switching for Banglish Texts

In our increasingly interconnected global society, communication transcends linguistic boundaries, leading to a phenomenon known as code-switching. Code-switching refers to the practice of alternating between two or more languages or language varieties within a single discourse. In recent years, the advent of Large Language Models (LLMs) has revolutionized the way we interact with and understand languages. While LLMs perform quite well on monolingual tasks such as question answering, sentiment analysis, and summarization, their performance degrades in code-switching scenarios. In this work, we focus on enhancing LLMs' performance in the context of code-switching between Bangla and English.

Related publications:

  1. Contextual Bangla Neural Stemmer: Finding Contextualized Root-Word Representations for Bangla Words, 1st Workshop on Bangla Language Processing in conjunction with EMNLP, Association for Computational Linguistics, Singapore, Dec 2023.
  2. Investigating the Effectiveness of Graph-based Algorithm for Bangla Text Classification, 1st Workshop on Bangla Language Processing in conjunction with EMNLP, Association for Computational Linguistics, Singapore, Dec 2023.
  3. BaTEClaCor: A Novel Dataset for Bangla Text Error Classification and Correction, 1st Workshop on Bangla Language Processing in conjunction with EMNLP, Association for Computational Linguistics, Singapore, Dec 2023.

Knowledge Graph and LLMs based QA System

The emergence of advanced large language models (LLMs), such as GPT-4 and LLaMA, marks a significant shift in information retrieval and Question Answering (QA) systems. Unlike traditional keyword-focused searches, these models can generate text that is more intuitive and human-like. Trained on huge amounts of data, these models apparently "understand" the subtleties of language, context, and user intent. However, LLMs have a few significant limitations: the models may "hallucinate", and they have limited domain knowledge, common sense, etc. Knowledge Graphs (KGs) can help overcome some of these challenges by providing a structured representation of domain knowledge. A KG is a database that stores information in the form of a graph, with nodes representing entities and edges representing relationships between them. KGs can enhance the reasoning ability of LLMs in QA systems by providing context and domain knowledge related to the questions. In this research, we focus on extracting the domain-specific knowledge sub-graph and enhancing its representation using graph neural networks for solving QA tasks with LLMs.
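A minimal sketch of the sub-graph extraction step described above: starting from the entities mentioned in a question, collect the triples within k hops and serialize them as textual context for the LLM prompt. The toy triples, the seed entity, and the hop limit are illustrative assumptions; the project additionally re-encodes the sub-graph with a graph neural network, which is omitted here.

```python
# Toy knowledge graph as (head, relation, tail) triples.
TRIPLES = [
    ("aspirin", "treats", "headache"),
    ("aspirin", "interacts_with", "warfarin"),
    ("warfarin", "is_a", "anticoagulant"),
    ("ibuprofen", "treats", "headache"),
]

def khop_subgraph(seeds, triples, k=2):
    """Collect all triples reachable within k hops of the seed entities."""
    frontier, seen, sub = set(seeds), set(seeds), []
    for _ in range(k):
        nxt = set()
        for h, r, t in triples:
            if h in frontier or t in frontier:
                if (h, r, t) not in sub:
                    sub.append((h, r, t))
                nxt.update({h, t} - seen)
        seen |= nxt
        frontier = nxt
    return sub

def as_context(sub):
    """Serialize triples as plain-text facts for an LLM prompt."""
    return "\n".join(f"{h} {r.replace('_', ' ')} {t}" for h, r, t in sub)

# Entities linked from the question "Does aspirin interact with anything?"
sub = khop_subgraph({"aspirin"}, TRIPLES, k=1)
print(as_context(sub))
```

The serialized facts would be prepended to the question in the prompt, grounding the LLM's answer in the KG rather than in its parametric memory alone.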

Few-Shot Human Activity Recognition from Wearable Sensors

We stand at the forefront of transforming remote healthcare by pioneering sensor-based human activity recognition (HAR). Our primary objective is to develop state-of-the-art ML models specifically designed for deployment on remote devices, enabling continuous monitoring of patients and elderly individuals who require ongoing support. A significant challenge in this endeavor is the scarcity of labeled data for many activity classes, which makes training traditional models difficult. To address this, we are actively working on the few-shot learning problem, so that our models can adapt with minimal labeled examples. This work builds on our earlier work on self-attention-based HAR and on the assessment of rehabilitation exercises from sensor data.
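One standard approach to the few-shot problem above is a prototypical-network episode: embed the few labeled support windows, average them into per-class prototypes, and classify query windows by nearest prototype. The random "embeddings" below stand in for the output of a trained encoder (e.g. a self-attention model over accelerometer windows); the episode sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def prototypes(support, labels):
    """Mean embedding per class: support (n, d), labels (n,)."""
    classes = np.unique(labels)
    return classes, np.stack([support[labels == c].mean(axis=0) for c in classes])

def classify(query, classes, protos):
    """Assign each query embedding to its nearest class prototype."""
    d = ((query[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    return classes[d.argmin(axis=1)]

# 2-way 3-shot toy episode: class 0 clustered near 0, class 1 near 2.
support = np.concatenate([rng.normal(0, 0.1, (3, 8)), rng.normal(2, 0.1, (3, 8))])
labels = np.array([0, 0, 0, 1, 1, 1])
classes, protos = prototypes(support, labels)

query = rng.normal(2, 0.1, (2, 8))       # windows from the class-1 activity
print(classify(query, classes, protos))  # expected: [1 1]
```

Because adaptation to a new activity class only requires averaging a handful of embeddings, this scheme suits on-device deployment where retraining is impractical.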

Related publications:

  1. Hierarchical Self Attention Based Autoencoder for Open-Set Human Activity Recognition, 25th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-2021), Springer, May 11-14, 2021, Delhi, India. [arXiv]
  2. Human Activity Recognition from Wearable Sensor Data using Self-Attention, in the proceedings of the 24th European Conference on Artificial Intelligence (ECAI), Spain, 2020. [pdf]
  3. Assessment of Rehabilitation Exercises from Depth Sensor Data, International Conference on Computer and Information Technology (ICCIT), Dhaka, Bangladesh, December 18-20, 2021. [pdf]
  4. An Integrated System for Stroke Rehabilitation Exercise Assessment using Kinect v2 and Machine Learning, International Conference on Intelligent Human Computer Interaction, Proceedings of LNCS, Springer, Nov, 2023. [link]