Cross-modal understanding and generation of multimodal content

Monday, April 15, 2024

Video generation consists of generating a video sequence in which an object in a source image is animated according to some external information (a conditioning label, a driving video, a piece of text). In this talk I will present some of our recent work on generating videos without using any annotation or prior information about the specific object to animate. Once trained on a set of videos depicting objects of the same category (e.g. faces, human bodies), our method can be applied to any object of this class. Building on this, I will present our framework for training game-engine-like neural models solely from monocular annotated videos. The result, a Learnable Game Engine (LGE), maintains the states of the scene and of the objects and agents in it, and enables rendering the environment from a controllable viewpoint. Like a game engine, it models the logic of the game and the underlying rules of physics, so that a user can play the game by specifying both high- and low-level action sequences. Our LGE can also unlock the director’s mode, in which the game is played by plotting behind the scenes, specifying high-level actions and goals for the agents in the form of language and desired states. This requires learning a “game AI”, encapsulated by our animation model, that can navigate the scene using high-level constraints, play against an adversary, and devise a strategy to win a point.

Post Talk Link: Click Here

Passcode: $BCM@xi7

Speaker/s

Nicu Sebe is a professor at the University of Trento, Italy, where he leads research in the areas of multimedia information retrieval and human-computer interaction in computer vision applications. He received his PhD from the University of Leiden, The Netherlands, and was previously with the University of Amsterdam, The Netherlands, and the University of Illinois at Urbana-Champaign, USA. He has been involved in organizing major conferences and workshops addressing the computer vision and human-centered aspects of multimedia information retrieval, including serving as General Co-Chair of the IEEE Automatic Face and Gesture Recognition Conference (FG 2008), the ACM International Conference on Multimedia Retrieval (ICMR) 2017, and ACM Multimedia 2013. He was a program chair of ACM Multimedia 2011 and 2007, ECCV 2016, ICCV 2017, and ICPR 2020, and a general chair of ACM Multimedia 2022. He is a fellow of ELLIS and IAPR and a senior member of ACM and IEEE.

Thursday, July 24, 2025

Related

Understanding faith in the age of AI

Formal Methods for Modern Payment Protocols

Polygenic Score Modeling to Investigate Genotype-Phenotype Associations
