Course Information

This will be a seminar- and project-based class, with a reading list covering some of the fundamentals (Transformers, etc.), multi-modal models (LLaVA, BLIP, etc.), and emerging areas including Open-Vocabulary Perception, Vision-Language Reasoning, Multi-Modal Decision-Making Agents, and Embodied AI. Participation is required, since the course is largely discussion-driven.

Recommended Background

We will assume that you already have some background in machine learning, deep learning/neural networks, and potentially some of the modalities (e.g., computer vision, natural language processing). Note that we do not expect deep experience in all of the modalities; it is fine to be strong in one modality and weaker in another. We will also assume that you are comfortable executing an interesting project in this area.

Instructor


Schedule Note: The topics are still being determined! Feel free to email suggestions of important papers in these areas.

Week #  Date  Topic                                                                          Presenters
1             Introduction to Vision-Language Models
2             Deep Dive into Transformers
3             Vision Transformers
4             Vision-Language Models: BLIP(-2), etc.
5             Vision-Language Models: LLaVA 1, 1.5, 1.6
6             Open-Vocabulary Classification, Detection, Segmentation
7             General architectures (Unified-IO, etc.) and multi-modal models (audio, etc.)
8             Multi-Modal Reasoning and Question-Answering
9             VLMs for Decision-Making: Web and GUI Agents
10            VLMs for Embodied AI

Logistics

Logistics are still to be determined. Grading will include paper reviews and presentations, participation, and a project (comprising a proposal, report, and presentation/video).

CS 8803 VLM
Vision-Language Foundation Models
