Overview
The vast amount of multimodal data available today, spanning language, vision, motion, and sensor signals, has enabled the rise of powerful AI models capable of impressive perception and reasoning. Yet, these systems remain constrained by high computational demands, closed-world assumptions, and limited adaptability outside curated benchmarks.
Drawing inspiration from the human brain’s ability to efficiently integrate perception and language, our group investigates how artificial systems can acquire knowledge across modalities in a scalable and adaptive way. We explore large multimodal models for tasks such as open-world image classification, zero-shot action recognition, and temporal action localization, while developing methods for adversarial knowledge distillation, efficient diffusion-based generation, 3D motion understanding, and automated benchmarking of multimodal models.
Our research connects computer vision, motion understanding, and computational linguistics, reflecting the department’s mission of studying intelligence at the intersection of neuroscience and language. Ultimately, our goal is to design models that can learn continuously and flexibly across modalities, mirroring the robustness of human cognition and communication, and contributing to a deeper understanding of both artificial and biological intelligence.
Research directions
The key topics of interest are:
- Vision and Language
- Video Understanding
- 3D Motion Understanding
- Open-vocabulary Recognition
Members
- Paolo Rota, Principal Investigator
- Benedetta Liberatori, PhD Student (co-supervised with Prof. Elisa Ricci)
- Shiyao Xu, PhD Student
- Yan Shu, PhD Student (co-supervised with Prof. Niculae Sebe)
- Jiaqi Liu, PhD Student (co-supervised with Prof. Niculae Sebe)
- Ester Riccardi, PhD Student (co-supervised with Prof. Roberto Bottini)
Publications
For a complete list, see Paolo Rota's Google Scholar profile.
