Developing a Multi-Agent Framework for Multimodal Multi-Task Learning

This project focuses on enhancing the capabilities of large multimodal models. Multimodal learning is an area of machine learning in which models are designed to process and correlate information from several input modalities, such as text, images, and audio. We are developing a multi-agent framework in which each agent specializes in a specific modality and task. The agents work in tandem: the framework dynamically brings in the agents specialized for the tasks at hand, enabling the system to handle multiple tasks simultaneously. By integrating these multi-agent ideas into large multimodal models, the project aims to improve both multi-task learning performance and generalization to new tasks.
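As a rough illustration of the dynamic selection described above, the following minimal sketch registers specialist agents by modality and task and routes each incoming request to the matching one. It is not the project's implementation: the names `Agent`, `AgentRegistry`, and `dispatch` are hypothetical, the agents are stand-in functions rather than models, and requests are processed one after another rather than simultaneously.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Agent:
    """Hypothetical specialist that handles one (modality, task) pair."""
    modality: str
    task: str
    run: Callable[[str], str]  # stand-in for a real model call


class AgentRegistry:
    """Hypothetical registry that picks a specialist per request."""

    def __init__(self) -> None:
        self._agents: Dict[Tuple[str, str], Agent] = {}

    def register(self, agent: Agent) -> None:
        # Index each agent by the modality and task it specializes in.
        self._agents[(agent.modality, agent.task)] = agent

    def dispatch(self, requests: List[Tuple[str, str, str]]) -> List[str]:
        """Route each (modality, task, payload) request to its specialist."""
        results = []
        for modality, task, payload in requests:
            agent = self._agents.get((modality, task))
            if agent is None:
                # No specialist registered for this combination.
                results.append(f"no agent for {modality}/{task}")
            else:
                results.append(agent.run(payload))
        return results


registry = AgentRegistry()
registry.register(Agent("text", "summarize", lambda x: f"summary({x})"))
registry.register(Agent("image", "caption", lambda x: f"caption({x})"))

# A mixed batch: two tasks with a registered specialist and one without.
outputs = registry.dispatch([
    ("image", "caption", "photo.png"),
    ("text", "summarize", "report.txt"),
    ("audio", "transcribe", "clip.wav"),
])
print(outputs)
```

Keying the registry on the (modality, task) pair keeps selection a simple lookup; a real system would presumably replace it with a learned or model-driven router.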

Related publications:

  1. Large Multimodal Agents: A Survey
    Xie, J., Chen, Z., Zhang, R., Wan, X., & Li, G. (2024). Large Multimodal Agents: A Survey. arXiv:2402.15116. https://doi.org/10.48550/arXiv.2402.15116 
  2. AgentLite: A Lightweight Library for Building and Advancing Task-Oriented LLM Agent System
    Liu, Z., Yao, W., Zhang, J., Yang, L., Liu, Z., Tan, J., Choubey, P. K., Lan, T., Wu, J., Wang, H., Heinecke, S., Xiong, C., & Savarese, S. (2024). AgentLite: A Lightweight Library for Building and Advancing Task-Oriented LLM Agent System. arXiv:2402.15538. https://doi.org/10.48550/arXiv.2402.15538
  3. MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion
    Li, S., Wang, R., Hsieh, C.-J., Cheng, M., & Zhou, T. (2024). MuLan: Multimodal-LLM Agent for Progressive Multi-Object Diffusion. arXiv:2402.12741. https://doi.org/10.48550/arXiv.2402.12741