Ken Nakahara, Cong Wang, Zdravko Dugonjic, Robin Koch, Johannes Busch, Niklas Leukroth and Max Haufe
Learning, Adaptive Systems, and Robotics (LASR) Lab
Research Seminar
v1.0
Papers:
Dudek, Piotr, et al. "Sensor-level computer vision with pixel processor arrays for agile robots."
Bose, Laurie, et al. "A camera that CNNs: Towards embedded neural networks on pixel processor arrays."
Bose, Laurie, et al. "Fully embedding fast convolutional networks on pixel processor arrays."
So, Haley M., et al. "PixelRNN: in-pixel recurrent neural networks for end-to-end-optimized perception with neural sensors."
Goal:
Give an overview of pixel processor arrays and highlight their capabilities for in-sensor ML.
Source: Dudek, et al.
Papers:
Verhelst, Marian, Man Shi, and Linyan Mei. "ML processors are going multi-core: A performance dream or a scheduling nightmare?."
Mei, Linyan, et al. "DeFiNES: Enabling fast exploration of the depth-first scheduling space for DNN accelerators through analytical modeling."
Goal:
Give an overview of various techniques for mapping DNNs/CNNs onto multi-core systems and compare them (a minimal layer-by-layer vs. depth-first scheduling sketch follows below).
Source: Assran, et al.
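To ground the comparison, a minimal, hypothetical Python sketch of the contrast between layer-by-layer and depth-first (layer-fused) execution; `layers` and `tiles` are placeholders, and real schedulers such as DeFiNES additionally handle tile overlaps, memory hierarchies, and multi-core assignment:

def layer_by_layer(x, layers):
    # Baseline schedule: every layer processes the full feature map, so each
    # intermediate feature map must be materialized (often in off-chip memory).
    for layer in layers:
        x = layer(x)
    return x

def depth_first(tiles, layers):
    # Depth-first / layer-fused schedule: push one spatial tile through all
    # layers before starting the next tile, so only tile-sized intermediates
    # need to stay on-chip (tile halo/overlap handling is omitted here).
    outputs = []
    for tile in tiles:
        for layer in layers:
            tile = layer(tile)
        outputs.append(tile)
    return outputs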
Overview:
Transformers are computationally expensive at inference due to quadratic complexity in sequence length. State space models (SSMs) offer an efficient alternative.
Goal: Provide an overview of SSMs and their different representations (e.g., RNN and CNN) [1]. Discuss these representations and the improvements in Structured SSMs (S4) [2] and Mamba [3]. Compare Mamba's performance against Transformers on sequence modeling tasks (a minimal sketch of the two SSM views follows below).
Papers:
[1] Gu, Albert, et al. "Combining recurrent, convolutional, and continuous-time models with linear state space layers."
[2] Gu, Albert, Karan Goel, and Christopher Ré. "Efficiently modeling long sequences with structured state spaces."
[3] Gu, Albert, and Tri Dao. "Mamba: Linear-time sequence modeling with selective state spaces."
Source: [1]
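As a common reference for the discussion, a minimal sketch of the two equivalent views of a discretized linear state space layer (notation loosely follows [1] and [2]; bars denote the discretized parameters):

Recurrent (RNN-like) view, constant-cost state update per step:
    x_k = \bar{A} x_{k-1} + \bar{B} u_k, \qquad y_k = C x_k
Convolutional (CNN-like) view, parallelizable over the whole sequence:
    y = \bar{K} * u, \qquad \bar{K} = (C\bar{B},\ C\bar{A}\bar{B},\ C\bar{A}^{2}\bar{B},\ \dots)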
Papers:
Gu, Albert, et al. "Combining recurrent, convolutional, and continuous-time models with linear state space layers."
Zubic, Nikola, Mathias Gehrig, and Davide Scaramuzza. "State space models for event cameras."
Schöne, Mark, et al. "Scalable event-by-event processing of neuromorphic sensory signals with deep state-space models."
Soydan, Taylan, et al. "S7: Selective and simplified state space layers for sequence modeling."
Goal:
Highlight the evolution of state space models through the lens of processing event-based data.
Source: Zubic, et al.
Papers:
Assran, Mahmoud, et al. "Self-supervised learning from images with a joint-embedding predictive architecture."
Bardes, Adrien, et al. "Revisiting feature prediction for learning visual representations from video."
Balestriero, Randall, and Yann LeCun. "LeJEPA: Provable and scalable self-supervised learning without the heuristics."
Goal:
Explain the concept behind self-supervised learning via joint-embedding predictive architectures and showcase their evolution (a minimal training-step sketch follows below).
Source: Assran, et al.
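As a concrete anchor for the concept, a minimal, hypothetical PyTorch-style sketch of a joint-embedding predictive training step; the encoders, predictor, and masking strategy are placeholders rather than the exact I-JEPA or V-JEPA architectures:

import torch
import torch.nn.functional as F

def jepa_step(context_encoder, target_encoder, predictor, x_context, x_target):
    # Embed the visible context (e.g. unmasked image patches).
    z_context = context_encoder(x_context)
    # Embed the target region with a frozen/EMA copy of the encoder: no gradients
    # flow into the target branch, which helps avoid representational collapse.
    with torch.no_grad():
        z_target = target_encoder(x_target)
    # Predict the target embedding from the context embedding and regress in
    # representation space; no pixel-level reconstruction is involved.
    z_pred = predictor(z_context)
    return F.smooth_l1_loss(z_pred, z_target)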
Papers:
J. Lee and N. Fazeli, “ViTaSCOPE: Visuo-tactile Implicit Representation for In-hand Pose and Extrinsic Contact Estimation,”
M. Comi, Y. Lin, A. Church, A. Tonioni, L. Aitchison, and N. F. Lepora, “TouchSDF: A DeepSDF Approach for 3D Shape Reconstruction Using Vision-Based Tactile Sensing,”
Goal:
Present and compare the approaches used for tactile pose and shape estimation.
Source: J. Lee and N. Fazeli, “ViTaSCOPE: Visuo-tactile Implicit Representation for In-hand Pose and Extrinsic Contact Estimation,”
Papers:
Hafner, et al. - Dream to Control: Learning Behaviors by Latent Imagination
Hafner, et al. - Mastering Atari with Discrete World Models
Hafner, et al. - Mastering Diverse Domains through World Models
Hafner, et al. - Training Agents Inside of Scalable World Models
Goal:
Show the evolution of Latent World Models. Pay special attention to which features contribute to a suitable representation for latent imagination (a minimal latent-imagination sketch follows below).
Source: Hafner, et al.
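As a shared reference for "latent imagination", a minimal, hypothetical sketch of rolling out a policy purely in latent space; the transition model, actor, and reward head are placeholder torch modules, not Dreamer's exact RSSM:

import torch

def imagine(transition_model, actor, reward_head, z0, horizon=15):
    # Starting from a latent state z0 inferred from a real observation, roll the
    # learned dynamics forward without ever decoding back to pixels.
    z, rewards = z0, []
    for _ in range(horizon):
        a = actor(z)                    # action chosen from the latent state
        z = transition_model(z, a)      # predicted next latent state
        rewards.append(reward_head(z))  # predicted reward along the imagined trajectory
    # The actor/critic are trained on these imagined returns, so the quality of the
    # latent representation directly determines how useful imagination is.
    return torch.stack(rewards)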
Papers:
Ho, et al. - Denoising Diffusion Probabilistic Models
Song, et al. - Denoising Diffusion Implicit Models
Song, et al. - Consistency Models
Karras, et al. - Elucidating the Design Space of Diffusion-Based Generative Models
Goal:
Explain the different mathematical formulations of diffusion models and show how they can be unified. This can be framed as a tutorial on diffusion models (a minimal DDPM sketch follows below).
Source: Sohl-Dickstein, et al.
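A minimal sketch of the DDPM formulation as a shared starting point (notation follows Ho et al.; DDIM, consistency models, and the EDM design space largely keep such a trained denoiser and mainly change how it is parameterized and sampled):

Forward (noising) process and its closed form, with \bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s):
    q(x_t \mid x_{t-1}) = \mathcal{N}(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I), \qquad q(x_t \mid x_0) = \mathcal{N}(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t) I)
Simplified training objective (noise prediction):
    L_{\text{simple}}(\theta) = \mathbb{E}_{t, x_0, \epsilon}\big[\, \| \epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon,\ t) \|^2 \,\big]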
Papers:
Chi, et al. - Diffusion Policy: Visuomotor Policy Learning via Action Diffusion.
Wang, et al. - Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning.
Ajay et al. - Is Conditional Generative Modeling all you need for Decision-Making?
Lu et al. - What makes a good diffusion planner?
Goal:
Explain and contrast different ways in which diffusion models can be used to model distributions of actions (diffusion policies) or state dynamics (diffusion planners), and outline their respective strengths and weaknesses (a minimal policy-sampling sketch follows below).
Source: Zhu, et al.
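To make the policy/planner distinction tangible, a minimal, hypothetical sketch of a diffusion policy at inference time; the denoiser and the update rule are simplified placeholders (a diffusion planner would instead denoise a full state or state-action trajectory):

import torch

def sample_action_chunk(denoiser, obs, chunk_shape, num_steps=16):
    # Start from Gaussian noise over a short sequence ("chunk") of future actions.
    actions = torch.randn(chunk_shape)
    for t in reversed(range(num_steps)):
        # Predict the injected noise, conditioned on the current observation and step index.
        eps = denoiser(actions, obs, t)
        # Simplified denoising update; real samplers use the DDPM/DDIM posterior step.
        actions = actions - eps / num_steps
    # The first few denoised actions are executed, then the loop repeats (receding horizon).
    return actions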
Papers:
Reuss et al. - Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals
Liang et al. - SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution
Zhang et al. - Language Control Diffusion: Efficiently Scaling Through Space, Time, and Tasks
Peschl et al. - From Code to Action: Hierarchical Learning of Diffusion-VLM Policies
Goal:
Compare different approaches to condition diffusion-based imitation learning pipelines on language queries. Pay special attention to the distinction between diffusion policies and diffusion planning.
Source: language-play.github.io
Papers:
Thellman et al. - Mental State Attribution to Robots: A Systematic Review of Conceptions, Methods, and Findings
Thellman et al. - Do You See what I See? Tracking the Perceptual Beliefs of Robots
Thellman et al. - Does the Robot Know It Is Being Distracted? Attitudinal and Behavioral Consequences of Second-Order Mental State Attribution in HRI
Koban et al. - It feels, therefore it is: Associations between mind perception and mind ascription for social robots
Source: Thellman et al.
Papers:
Pascher et al. - How to Communicate Robot Motion Intent: A Scoping Review
Bodden et al. - A flexible optimization-based method for synthesizing intent-expressive robot arm motion
Yi et al. - Your Way Or My Way: Improving Human-Robot Co-Navigation Through Robot Intent and Pedestrian Prediction Visualisations
Abe et al. - Human Understanding and Perception of Unanticipated Robot Action in the Context of Physical Interaction
Source: Bodden et al.
Papers:
Belardinelli - Gaze-Based Intention Estimation: Principles, Methodologies, and Applications in HRI
Belardinelli et al. - Intention estimation from gaze and motion features for human-robot shared-control object manipulation
Arreghini et al. - Predicting the Intention to Interact with a Service Robot: the Role of Gaze Cues
Wang et al. - Gaze-Based Shared Autonomy Framework With Real-Time Action Primitive Recognition for Robot Manipulators
Source: Wang et al.
Papers:
Beetz et al. - KnowRob 2.0 — A 2nd Generation Knowledge Processing Framework for Cognition-Enabled Robotic Agents
Hughes et al. - Foundations of spatial perception for robotics: Hierarchical representations and real-time systems
Paulius et al. - A Survey of Knowledge Representation in Service Robotics
Source: Paulius et al.
Papers:
Zhang, et al. - RobustDexGrasp: Robust Dexterous Grasping of General Objects from Single-view Perception
Zhong, et al. - DexGrasp Anything: Towards Universal Robotic Dexterous Grasping with Physics Awareness
Zhao, et al. - Embedding high-resolution touch across robotic hands enables adaptive human-like grasping
Source: Zhao, et al.
Papers:
Lin, et al. - Learning Visuotactile Skills with Two Multifingered Hands
Chen, et al. - Bi-DexHands: Towards Human-Level Bimanual Dexterous Manipulation
Shaw, et al. - Bimanual Dexterity for Complex Tasks
Source: Shaw, et al.
Papers:
He, et al. - OmniH2O: Universal and Dexterous Human-to-Humanoid Whole-Body Teleoperation and Learning
Li, et al. - OKAMI: Teaching Humanoid Robots Manipulation Skills through Single Video Imitation
Qiu, et al. - Humanoid Policy ∼ Human Policy
Source: He, et al.
Papers:
Kawaharazuka, et al., Vision-Language-Action Models for Robotics: A Review Towards Real-World Applications
Physical Intelligence - π0.5: a Vision-Language-Action Model with Open-World Generalization
NVIDIA - GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
Source: Physical Intelligence
Source: completechildrenshealth.com.au
Papers:
Higuera, et al. - Sparsh: Self-supervised touch representations for vision-based tactile sensing
Gupta, et al. - Sensor-Invariant Tactile Representation
Rodriguez et al. - Cross-Sensor Touch Generation
Papers:
Alonso, et al. - Diffusion for World Modeling: Visual Details Matter in Atari
Kanervisto, et al. - World and Human Action Models towards gameplay ideation
Chen, et al. - Model as a Game: On Numerical and Spatial Consistency for Generative Games
Motamed, et al. - Do generative video models understand physical principles?
Source: https://diamond-wm.github.io/
Papers:
Chalmers - Facing Up to the Problem of Consciousness
Joseph LeDoux, et al. - Consciousness beyond the human case
Farisco, et al. - Is artificial consciousness achievable? Lessons from the human brain
Goal: Discuss the concept of consciousness and its applicability to current state-of-the-art intelligent systems.
Papers:
K. Shaw and D. Pathak, “Demonstrating LEAP Hand v2: Low-Cost, Easy-to-Assemble, High-Performance Hand for Robot Learning,”
Y. Toshimitsu et al., “Getting the Ball Rolling: Learning a Dexterous Policy for a Biomimetic Tendon-Driven Hand with Rolling Contact Joints,”
C. C. Christoph et al., “ORCA: An Open-Source, Reliable, Cost-Effective, Anthropomorphic Robotic Hand for Uninterrupted Dexterous Task Learning,”
A. Zorin et al., “RUKA: Rethinking the Design of Humanoid Hands with Learning,”
Goal:
It is difficult to control tendon-driven robot hands that do not possess joint-angle encoders. This work should present and compare some of the approaches used to control such hands (a minimal learned-kinematics sketch follows below).
Source: A. Zorin et al., “RUKA: Rethinking the Design of Humanoid Hands with Learning,”
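As one concrete instance of the approaches to compare, a minimal, hypothetical sketch of learning a forward model from motor commands to fingertip positions instead of relying on joint encoders; the network size, actuator count, and data source are assumptions, not RUKA's exact pipeline:

import torch
import torch.nn as nn

# Hypothetical forward model: motor/tendon positions -> 3D fingertip positions,
# trained on pairs collected with an external tracker (e.g. motion capture or a
# teleoperation glove). The actuator count (11) is an assumed placeholder.
forward_model = nn.Sequential(
    nn.Linear(11, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 15),  # 5 fingertips x 3D position
)

def train_step(optimizer, motor_pos, fingertip_pos):
    # Plain supervised regression; controllers can then invert or search this model.
    pred = forward_model(motor_pos)
    loss = nn.functional.mse_loss(pred, fingertip_pos)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()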
Papers:
Goal: Present and compare the approaches used for lensless imaging (a minimal physics-based reconstruction sketch follows below).
Source: Y. Perron, E. Bezzam, and M. Vetterli, “A Modular and Robust Physics-Based Approach for Lensless Image Reconstruction,”
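As a simple baseline for the comparison, a minimal, hypothetical sketch of physics-based lensless reconstruction via regularized frequency-domain deconvolution; real pipelines (including the modular approach in the source paper) add cropping models, priors, or learned components on top of such a step:

import numpy as np

def wiener_reconstruct(measurement, psf, reg=1e-3):
    # Lensless imaging model: the sensor measurement is (approximately) the scene
    # convolved with the mask's point spread function (PSF). Invert it with a
    # regularized inverse filter in the frequency domain.
    H = np.fft.rfft2(psf, s=measurement.shape)
    Y = np.fft.rfft2(measurement)
    X = np.conj(H) * Y / (np.abs(H) ** 2 + reg)
    return np.fft.irfft2(X, s=measurement.shape)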