CS 7476 Advanced Computer Vision

Spring 2024, TR 12:30 to 1:45, Molecular Sciences and Engineering G011
Instructor: James Hays
TA: Akshay Krishnan

Course Description

This course covers advanced research topics in computer vision. Building on the introductory materials in CS 4476/6476 (Computer Vision), this class will prepare graduate students in both the theoretical foundations of computer vision as well as the practical approaches to building real Computer Vision systems. This course investigates current research topics in computer vision. We will examine data sources and learning methods useful for understanding and manipulating visual data. This year, special emphasis will be placed on research at the intersection of computer vision and robotics. Class topics will be pursued through independent reading, class discussion and presentations, and research projects.

The goal of this course is to give students the background and skills necessary to perform research in computer vision and its application domains such as robotics, VR/AR, healthcare, and graphics. Students should understand the strengths and weaknesses of current approaches to research problems and identify interesting open questions and future research directions. Students will hopefully improve their critical reading and communication skills, as well.

Course Requirements

Reading and Discussion Topics

Students will be expected to read one paper for each class. For each assigned paper, students must identify at least one question or topic of interest for class discussion. Interesting topics for discussion could relate to strengths and weaknesses of the paper, possible future directions, connections to other research, uncertainty about the conclusions of the experiments, etc. Questions / Discussion topics must be posted to the course Canvas discussions tab by 11:59pm the day before each class. Feel free to reply to other comments and help each other understanding confusing aspects of the papers. The Canvas discussion will be the starting point for the class discussion. If you are presenting you don't need to post a question to Canvas.

Class participation

Attendence is required for this course. All students are expected to take part in class discussions. If you do not fully understand a paper that is OK. We can work through the unclear aspects of a paper together in class. If you are unable to attend a specific class please let me know ahead of time. Don't come to class if you are sick.

Presentation(s)

Each student will lead the presentation of one paper during the semester (possibly as part of a pair of students). Ideally, students would implement some aspect of the presented material and perform experiments that help us to understand the algorithms. Presentations and all supplemental material should be ready one week before the presentation date so that students can meet with the instructor, go over the presentation, and possibly iterate before the in-class discussion. For the presentations it is fine to use slides and code from outside sources (for example, the paper authors) but be sure to give credit.

Semester group projects

Students will work alone or in pairs to complete a state-of-the-art research project on a topic relevant to the course. Students will propose a research topic early in the semester. After a project topic is finalized, students will meet occasionally with the instructor or TA to discuss progress. Students will report their progress on their semester project twice during the course and the course will end with final project presentations. Students will also produce a conference-formatted write-up of their project. Projects will be published on the this web page. The ideal project is something with a clear enough direction to be completed in a couple of months, and enough novelty such that it could be published in a peer-reviewed venue with some refinement and extension.

Prerequisites

Strong mathematical skills (linear algebra, calculus, probability and statistics) are needed. It is strongly recommended that students have taken one of the following courses (or equivalent courses at other institutions):

Computer Vision CS 4476 / 6476
Machine Learning
Deep Learning
Computational Photography

If you aren't sure whether you have the background needed for the course, you can try reading some of the papers below or you can simply attend lecture during the first weeks.

Textbook

We will not rely on a textbook, although the free, online textbook "Computer Vision: Algorithms and Applications, 2nd edition" by Richard Szeliski is a helpful resource.

Grading

Your final grade will be made up from

20% Reading summaries and questions posted to Canvas
10% Classroom participation and attendance
15% Leading discussion for particular research paper
20% Semester project updates
35% Final semester project presentation and writeup

Office Hours:

James Hays, immediately after lecture on Tuesdays
Akshay Krishnan, Mondays 2:30 - 4:30 pm at Coda

Tentative Schedule

Date	Paper	Paper, Project page	Presenter
Tue, Jan 9	Course overview, paper scheduling		James
Thu, Jan 11	[ViT] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby	arXiv	James
Tue, Jan 16	[CLIP] Learning Transferable Visual Models From Natural Language Supervision. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever	arXiv	James
Thu, Jan 18	[MAE] Masked Autoencoders Are Scalable Vision Learners. Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollar, Ross Girshick	arXiv	James
Tue, Jan 23	[SimCLR] A Simple Framework for Contrastive Learning of Visual Representations. Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton	arXiv	Vivek
Thu, Jan 25	The Curious Robot: Learning Visual Representations via Physical Interactions. Lerrel Pinto, Dhiraj Gandhi, Yuanfeng Han, Yong-Lae Park, Abhinav Gupta	arXiv	Ethan
Tue, Jan 30	Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks. Michelle A. Lee, Yuke Zhu, Krishnan Srinivasan, Parth Shah, Silvio Savarese, Li Fei-Fei, Animesh Garg, Jeannette Bohg	arXiv	Bowen
Thu, Feb 1	[ConvNext] A ConvNet for the 2020s. Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, Saining Xie	arXiv	Kevin and Saba
	ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders. Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, Saining Xie	arXiv	Kevin and Saba
	ConvNets Match Vision Transformers at Scale. Samuel L. Smith, Andrew Brock, Leonard Berrada, Soham De	arXiv	Kevin and Saba
Tue, Feb 6	SENTRY: Selective Entropy Optimization via Committee Consistency for Unsupervised Domain Adaptation. Viraj Prabhu, Shivam Khare, Deeksha Kartik, Judy Hoffman	arXiv	Ziyan
Thu, Feb 8	Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo	arXiv	Katherine
Tue, Feb 13	[Dino] Emerging Properties in Self-Supervised Vision Transformers. Mathilde Caron, Hugo Touvron, Ishan Misra, Herve Jegou, Julien Mairal, Piotr Bojanowski, Armand Joulin	arXiv	Abhijith and Huili
	DINOv2: Learning Robust Visual Features without Supervision. Oquab et al.	arXiv	Abhijith and Huili
Thu, Feb 15	The Unsurprising Effectiveness of Pre-Trained Vision Models for Control. Simone Parisi, Aravind Rajeswaran, Senthil Purushwalkam, Abhinav Gupta	arXiv	Manthan
Tue, Feb 20	PressureVision: Estimating Hand Pressure from a Single RGB Image. Patrick Grady, Chengcheng Tang, Samarth Brahmbhatt, Christopher D. Twigg, Chengde Wan, James Hays, Charles C. Kemp	arXiv	Gabriela
Thu, Feb 22	Omni3D: A Large Benchmark and Model for 3D Object Detection in the Wild. Garrick Brazil, Abhinav Kumar, Julian Straub, Nikhila Ravi, Justin Johnson, Georgia Gkioxari	arXiv	Haotian
Tue, Feb 27	High-Resolution Image Synthesis with Latent Diffusion Models. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, Bjorn Ommer	arXiv	Dipanwita
Thu, Feb 29	RT-1: Robotics Transformer for Real-World Control at Scale. Brohan et al.	arXiv	Dhruv
	RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. Brohan et al.	arXiv	Dhruv
Tue, Mar 5	[ControlNet] Adding Conditional Control to Text-to-Image Diffusion Models. Lvmin Zhang, Anyi Rao, Maneesh Agrawala	arXiv	Avinash
Thu, Mar 7	TerrainNet: Visual Modeling of Complex Terrain for High-speed, Off-road Navigation. Xiangyun Meng, Nathan Hatch, Alexander Lambert, Anqi Li, Nolan Wagener, Matthew Schmittle, JoonHo Lee, Wentao Yuan, Zoey Chen, Samuel Deng, Greg Okopal, Dieter Fox, Byron Boots, Amirreza Shaban	arXiv	Changxuan
Tue, Mar 12	OneFormer: One Transformer to Rule Universal Image Segmentation. Jitesh Jain, Jiachen Li, MangTik Chiu, Ali Hassani, Nikita Orlov, Humphrey Shi	arXiv	Kartik
Thu, Mar 15	Segment Anything. Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, Ross Girshick	arXiv	Prateek
Tue, Mar 19	No Classes, Institute Holiday
Thu, Mar 21	No Classes, Institute Holiday
Tue, Mar 26	Robot Learning with Sensorimotor Pre-training. Ilija Radosavovic, Baifeng Shi, Letian Fu, Ken Goldberg, Trevor Darrell, Jitendra Malik	arXiv	?
Thu, Mar 28	NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, Ren Ng	arXiv	Zulfiqar and Kshitij
Tue, Apr 2	3D Gaussian Splatting for Real-Time Radiance Field Rendering. Bernhard Kerbl, Georgios Kopanas, Thomas Leimkuhler, George Drettakis	arXiv	Pranay
Thu, Apr 4	Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation. William Shen, Ge Yang, Alan Yu, Jansen Wong, Leslie Pack Kaelbling, Phillip Isola	arXiv	Jesse
Tue, Apr 9	Lifelong Robot Learning with Human Assisted Language Planners. Meenal Parakh, Alisha Fong, Anthony Simeonov, Tao Chen, Abhishek Gupta, Pulkit Agrawal	arXiv	Marcus and Mohamed
Thu, Apr 11	The Un-Kidnappable Robot: Acoustic Localization of Sneaking People. Mengyu Yang, Patrick Grady, Samarth Brahmbhatt, Arun Balajee Vasudevan, Charles C. Kemp, James Hays	arXiv	Yilong
Tue, Apr 16	Learning to See Physical Properties with Active Sensing Motor Policies. Gabriel B. Margolis, Xiang Fu, Yandong Ji, Pulkit Agrawal	arXiv	Desiree
Thu, Apr 18	Sequential Modeling Enables Scalable Learning for Large Vision Models. Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan Yuille, Trevor Darrell, Jitendra Malik, Alexei A Efros	arXiv	Manushree and Mehrdad
Tue, Apr 23	AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents. Ahn et al.	Project page	Navya and Isha
Final Exam Slot Thursday, May 2, 11:20 to 2:10	Final Project Presentations		Everyone
Saturday, May 4	Final Report due		Everyone