Our project on Mixture-of-Experts (MoE) aims to enhance the performance of large multimodal vision-language models. Multimodal learning is a subfield of machine learning in which models are trained to process and relate information from multiple input modalities, such as text, images, and audio. The MoE approach is a machine learning technique in which a set of specialized sub-models (the ‘experts’) is coordinated by a gating network that decides which experts to consult for each input. Our project is at the forefront of this field, seeking to incorporate the latest advances in multimodal MoE into large multimodal model frameworks. This integration is expected to significantly improve the efficiency, scalability, and accuracy of large multimodal models, opening up new possibilities for complex data analysis and prediction.
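
To make the gating mechanism concrete, here is a minimal sketch of a sparse top-k MoE layer in PyTorch. It is illustrative only: the names (SparseMoE, num_experts, top_k) are our own and are not taken from the MoE-LLaVA or Mixtral codebases.

```python
# Minimal sparse Mixture-of-Experts layer: a gating network scores the experts
# for each token, and only the top-k experts are evaluated and combined.
# All names are illustrative, not from the cited papers' implementations.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoE(nn.Module):
    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Gating network: maps each token embedding to one score per expert.
        self.gate = nn.Linear(dim, num_experts)
        # Experts: independent feed-forward networks of identical shape.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) token embeddings (text or image tokens alike).
        scores = self.gate(x)                                   # (B, S, num_experts)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)                # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[..., slot] == e                 # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out


if __name__ == "__main__":
    layer = SparseMoE(dim=64)
    tokens = torch.randn(2, 10, 64)   # toy batch: 2 sequences of 10 tokens
    print(layer(tokens).shape)        # torch.Size([2, 10, 64])
```

Because each token is processed by only top_k of the experts (Mixtral, for instance, routes each token to 2 of 8 experts), the number of active parameters per token, and hence the compute cost, stays well below that of a dense model with the same total parameter count.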

Related publications:

  1. Visual Instruction Tuning
    Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual Instruction Tuning. arXiv:2304.08485. https://doi.org/10.48550/arXiv.2304.08485
  2. Mixtral of Experts – Mistral
    Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., & El Sayed, W. (2024). Mixtral of Experts. arXiv:2401.04088. https://doi.org/10.48550/arXiv.2401.04088
  3. MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
    Lin, B., Tang, Z., Ye, Y., Cui, J., Zhu, B., Jin, P., Huang, J., Zhang, J., Ning, M., & Yuan, L. (2024). MoE-LLaVA: Mixture of Experts for Large Vision-Language Models. arXiv:2401.15947. https://doi.org/10.48550/arXiv.2401.15947