Paper Digest: CVPR 2024 Papers & Highlights
Note: CVPR-2024 accepted more than 2,700 papers; this page includes only 500 of them, selected by our daily paper digest algorithm. Interested users can choose to read all 2,700 CVPR-2024 papers on a separate page.
To search or review papers within CVPR-2024 related to a specific topic, please use the search by venue (CVPR-2024), review by venue (CVPR-2024), and question answering by venue (CVPR-2024) services. To browse papers by author, here is a list of all authors (CVPR-2024). You may also like to explore our “Best Paper” Digest (CVPR), which lists the most influential CVPR papers since 1988.
This list is created by the Paper Digest Team. Experience the cutting-edge capabilities of Paper Digest, an innovative AI-powered research platform that empowers you to write, review, get answers, and more. Try us today and unlock the full potential of our services for free!
TABLE 1: Paper Digest: CVPR 2024 Papers & Highlights
1. mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration
Highlight: In this work, we introduce a versatile multi-modal large language model, mPLUG-Owl2, which effectively leverages modality collaboration to improve performance in both text and multi-modal tasks.
Authors: Qinghao Ye; Haiyang Xu; Jiabo Ye; Ming Yan; Anwen Hu; Haowei Liu; Qi Qian; Ji Zhang; Fei Huang;
2. Generating Illustrated Instructions
Highlight: We introduce a new task of generating "Illustrated Instructions", i.e., visual instructions customized to a user’s needs.
Authors: Sachit Menon; Ishan Misra; Rohit Girdhar;
3. Improved Baselines with Visual Instruction Tuning
Highlight: In this paper, we present the first systematic study to investigate the design choices of LMMs in a controlled setting under the LLaVA framework.
Authors: Haotian Liu; Chunyuan Li; Yuheng Li; Yong Jae Lee;
4. ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Highlight: Current approaches that use textual coordinates or spatial encodings often fail to provide a user-friendly interface for visual prompting. To address this challenge, we introduce a novel multimodal model capable of decoding arbitrary (free-form) visual prompts.
Authors: Mu Cai; Haotian Liu; Siva Karthik Mustikovela; Gregory P. Meyer; Yuning Chai; Dennis Park; Yong Jae Lee;
5. Edit One for All: Interactive Batch Image Editing
Highlight: With the goal of minimizing human supervision in the editing process, this paper presents a novel method for interactive batch image editing using StyleGAN as the medium.
Authors: Thao Nguyen; Utkarsh Ojha; Yuheng Li; Haotian Liu; Yong Jae Lee;
6. SplaTAM: Splat, Track & Map 3D Gaussians for Dense RGB-D SLAM
Highlight: This work introduces SplaTAM, an approach that, for the first time, leverages explicit volumetric representations, i.e., 3D Gaussians, to enable high-fidelity reconstruction from a single unposed RGB-D camera, surpassing the capabilities of existing methods.
Authors: Nikhil Keetha; Jay Karhade; Krishna Murthy Jatavallabhula; Gengshan Yang; Sebastian Scherer; Deva Ramanan; Jonathon Luiten;
7. A Unified Approach for Text- and Image-guided 4D Scene Generation
Highlight: However, the challenging problem of text-to-4D dynamic 3D scene generation with diffusion guidance remains largely unexplored. We propose Dream-in-4D, which features a novel two-stage approach for text-to-4D synthesis, leveraging (1) 3D and 2D diffusion guidance to effectively learn a high-quality static 3D asset in the first stage; (2) a deformable neural radiance field that explicitly disentangles the learned static asset from its deformation, preserving quality during motion learning; and (3) a multi-resolution feature grid for the deformation field with a displacement total variation loss to effectively learn motion with video diffusion guidance in the second stage.
Authors: Yufeng Zheng; Xueting Li; Koki Nagano; Sifei Liu; Otmar Hilliges; Shalini De Mello;
8. 4D Gaussian Splatting for Real-Time Dynamic Scene Rendering
Highlight: To achieve real-time dynamic scene rendering while also enjoying high training and storage efficiency, we propose 4D Gaussian Splatting (4D-GS) as a holistic representation for dynamic scenes, rather than applying 3D-GS to each individual frame.
Authors: Guanjun Wu; Taoran Yi; Jiemin Fang; Lingxi Xie; Xiaopeng Zhang; Wei Wei; Wenyu Liu; Qi Tian; Xinggang Wang;
9. WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion
Highlight: WHAM exploits camera angular velocity estimated from a SLAM method, together with human motion, to estimate the body’s global trajectory. We combine this with a contact-aware trajectory refinement method that lets WHAM capture human motion in diverse conditions, such as climbing stairs.
Authors: Soyong Shin; Juyong Kim; Eni Halilaj; Michael J. Black;
10. TokenHMR: Advancing Human Mesh Recovery with A Tokenized Pose Representation
Highlight: We address the problem of regressing 3D human pose and shape from a single image, with a focus on 3D accuracy.
Authors: Sai Kumar Dwivedi; Yu Sun; Priyanka Patel; Yao Feng; Michael J. Black;
11. CogAgent: A Visual Language Model for GUI Agents
Highlight: In this paper, we introduce CogAgent, an 18-billion-parameter visual language model (VLM) specializing in GUI understanding and navigation.
Authors: Wenyi Hong; Weihan Wang; Qingsong Lv; Jiazheng Xu; Wenmeng Yu; Junhui Ji; Yan Wang; Zihan Wang; Yuxiao Dong; Ming Ding; Jie Tang;
12. ChatPose: Chatting About 3D Human Pose
Highlight: We introduce ChatPose, a framework employing Large Language Models (LLMs) to understand and reason about 3D human poses from images or textual descriptions.
Authors: Yao Feng; Jing Lin; Sai Kumar Dwivedi; Yu Sun; Priyanka Patel; Michael J. Black;
13. EMAGE: Towards Unified Holistic Co-Speech Gesture Generation Via Expressive Masked Audio Gesture Modeling
Highlight: We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures, encompassing facial, local body, hand, and global movements.
Authors: Haiyang Liu; Zihao Zhu; Giorgio Becherini; Yichen Peng; Mingyang Su; You Zhou; Xuefei Zhe; Naoya Iwamoto; Bo Zheng; Michael J. Black;
14. WANDR: Intention-guided Human Motion Generation
Highlight: A primary obstacle is the scarcity of training data that combines locomotion with goal reaching. To address this, we introduce WANDR, a data-driven model that takes an avatar’s initial pose and a goal’s 3D position and generates natural human motions that place the end effector (wrist) on the goal location.
Authors: Markos Diomataris; Nikos Athanasiou; Omid Taheri; Xi Wang; Otmar Hilliges; Michael J. Black;
15. VAREN: Very Accurate and Realistic Equine Network
Highlight: We introduce VAREN, a novel 3D articulated parametric shape model learned from 3D scans of many real horses.
Authors: Silvia Zuffi; Ylva Mellbin; Ci Li; Markus Hoeschle; Hedvig Kjellström; Senya Polikovsky; Elin Hernlund; Michael J. Black;
16. SwitchLight: Co-design of Physics-driven Architecture and Pre-training Framework for Human Portrait Relighting
Highlight: We introduce a co-designed approach for human portrait relighting that combines a physics-guided architecture with a pre-training framework.
Authors: Hoon Kim; Minje Jang; Wonjun Yoon; Jisoo Lee; Donghyun Na; Sanghyun Woo;
17. MTMMC: A Large-Scale Real-World Multi-Modal Camera Tracking Benchmark
Highlight: However, due to the difficulty and cost of collecting and labeling data, existing datasets for this task are either synthetically generated or artificially constructed within a controlled camera network setting, which limits their ability to model real-world dynamics and generalize to diverse camera configurations. To address this issue, we present MTMMC, a real-world, large-scale dataset that includes long video sequences captured by 16 multi-modal cameras in two different environments (campus and factory) across various time, weather, and season conditions.
Authors: Sanghyun Woo; Kwanyong Park; Inkyu Shin; Myungchul Kim; In So Kweon;
18. Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models
Highlight: Here, we instead focus on the underexplored text-to-4D setting and synthesize dynamic, animated 3D objects using score distillation methods with an additional temporal dimension.
Authors: Huan Ling; Seung Wook Kim; Antonio Torralba; Sanja Fidler; Karsten Kreis;
19. DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing
Highlight: In our solution, we introduce image prompts in fine-grained image editing, cooperating with the text prompt to better describe the editing content.
Authors: Chong Mou; Xintao Wang; Jiechong Song; Ying Shan; Jian Zhang;
20. InstanceDiffusion: Instance-level Control for Image Generation
Highlight: We introduce InstanceDiffusion, which adds precise instance-level control to text-to-image diffusion models.
Authors: Xudong Wang; Trevor Darrell; Sai Saketh Rambhatla; Rohit Girdhar; Ishan Misra;
21. Scaling Laws of Synthetic Images for Model Training … for Now
Highlight: In this paper, we study the scaling laws of synthetic images generated by state-of-the-art text-to-image models for the training of supervised models: image classifiers with label supervision and CLIP with language supervision.
Authors: Lijie Fan; Kaifeng Chen; Dilip Krishnan; Dina Katabi; Phillip Isola; Yonglong Tian;
22. Eyes Wide Shut? Exploring The Visual Shortcomings of Multimodal LLMs
Highlight: We further evaluate various CLIP-based vision-and-language models and find a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs. As an initial effort to address these issues, we propose a Mixture of Features (MoF) approach, demonstrating that integrating vision self-supervised learning features with MLLMs can significantly enhance their visual grounding capabilities.
Authors: Shengbang Tong; Zhuang Liu; Yuexiang Zhai; Yi Ma; Yann LeCun; Saining Xie;
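
The MoF idea above is concrete enough to illustrate. Below is a minimal sketch that spatially interleaves token streams from two vision encoders (e.g., CLIP and a self-supervised model such as DINOv2) before they reach the language model; the function name and the equal-length-token assumption are ours, and the paper's exact token ordering and projection layers may differ.

```python
import torch

def interleave_features(clip_tokens: torch.Tensor,
                        dino_tokens: torch.Tensor) -> torch.Tensor:
    """Interleave two visual token streams of shape (B, N, D) into one
    (B, 2N, D) sequence for the LLM projector. Assumes both encoders
    emit the same number of tokens with the same dimension."""
    b, n, d = clip_tokens.shape
    assert dino_tokens.shape == (b, n, d), "token grids must match"
    mixed = torch.stack([clip_tokens, dino_tokens], dim=2)  # (B, N, 2, D)
    return mixed.reshape(b, 2 * n, d)
```
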
23. Emu Edit: Precise Image Editing Via Recognition and Generation Tasks
Highlight: We present Emu Edit, a multi-task image editing model which sets state-of-the-art results in instruction-based image editing.
Authors: Shelly Sheynin; Adam Polyak; Uriel Singer; Yuval Kirstain; Amit Zohar; Oron Ashual; Devi Parikh; Yaniv Taigman;
24. On The Content Bias in Frechet Video Distance
Highlight: Frechet Video Distance (FVD), a prominent metric for evaluating video generation models, is known to occasionally conflict with human perception. In this paper, we aim to explore the extent of FVD’s bias toward frame quality over temporal realism and identify its sources.
Authors: Songwei Ge; Aniruddha Mahapatra; Gaurav Parmar; Jun-Yan Zhu; Jia-Bin Huang;
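
For context, FVD is the Frechet (2-Wasserstein) distance between Gaussians fitted to features of real and generated videos, classically extracted with an I3D action-recognition network; the paper's argument is that this feature extractor skews the metric toward per-frame content. A minimal sketch of the distance itself, assuming per-video features are already computed (the feature extractor, which is where the bias lives, is out of scope here):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """Frechet distance between Gaussians fitted to two feature sets,
    each of shape (num_videos, feature_dim)."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):  # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```
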
25. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
Highlight: We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning.
Authors: Xiang Yue; Yuansheng Ni; Kai Zhang; Tianyu Zheng; Ruoqi Liu; Ge Zhang; Samuel Stevens; Dongfu Jiang; Weiming Ren; Yuxuan Sun; Cong Wei; Botao Yu; Ruibin Yuan; Renliang Sun; Ming Yin; Boyuan Zheng; Zhenzhu Yang; Yibo Liu; Wenhao Huang; Huan Sun; Yu Su; Wenhu Chen;
26. DeepCache: Accelerating Diffusion Models for Free
Highlight: In this paper, we introduce DeepCache, a novel training-free paradigm that accelerates diffusion models from the perspective of model architecture.
Authors: Xinyin Ma; Gongfan Fang; Xinchao Wang;
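
The highlight is terse, so here is our rough reading of the caching idea: high-level (deep) features of a diffusion U-Net change slowly across adjacent denoising steps, so they can be computed once and reused for several steps while only the shallow layers are re-run. The `unet_shallow`/`unet_deep`/`unet_head` split and the fixed refresh interval below are illustrative assumptions, not the paper's actual API:

```python
def denoise_with_cache(unet_shallow, unet_deep, unet_head,
                       x, timesteps, cache_interval=3):
    """Sketch of step-wise feature caching for a diffusion sampler.
    The deep features are refreshed only every `cache_interval` steps
    and reused in between, trading a little accuracy for speed."""
    deep_feat = None
    for i, t in enumerate(timesteps):
        h = unet_shallow(x, t)                 # cheap, runs every step
        if deep_feat is None or i % cache_interval == 0:
            deep_feat = unet_deep(h, t)        # expensive, cached
        x = unet_head(h, deep_feat, t)         # fuse shallow + cached deep
    return x
```
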
27. Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer
Highlight: We present a new method for text-driven motion transfer: synthesizing a video that complies with an input text prompt describing the target objects and scene, while maintaining an input video’s motion and scene layout.
Authors: Danah Yatim; Rafail Fridman; Omer Bar-Tal; Yoni Kasten; Tali Dekel;
28. HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models
Highlight: Fine-tuning each personalized model requires a considerable investment of GPU time, and storing a personalized model per subject can be demanding in terms of storage capacity. To overcome these challenges, we propose HyperDreamBooth, a hypernetwork capable of efficiently generating a small set of personalized weights from a single image of a person.
Authors: Nataniel Ruiz; Yuanzhen Li; Varun Jampani; Wei Wei; Tingbo Hou; Yael Pritch; Neal Wadhwa; Michael Rubinstein; Kfir Aberman;
29. Make-It-Vivid: Dressing Your Animatable Biped Cartoon Characters from Text
Highlight: This is challenging due to domain-specific requirements and a lack of high-quality data. To address this challenge, we propose Make-It-Vivid, the first attempt to enable high-quality texture generation from text in UV space.
Authors: Junshu Tang; Yanhong Zeng; Ke Fan; Xuheng Wang; Bo Dai; Kai Chen; Lizhuang Ma;
30. NeRFiller: Completing Scenes Via Generative 3D Inpainting
Highlight: We propose NeRFiller, an approach that completes missing portions of a 3D capture via generative 3D inpainting, using off-the-shelf 2D visual generative models.
Authors: Ethan Weber; Aleksander Holynski; Varun Jampani; Saurabh Saxena; Noah Snavely; Abhishek Kar; Angjoo Kanazawa;
31. V*: Guided Visual Search As A Core Mechanism in Multimodal LLMs
Highlight: However, the lack of this visual search mechanism in current multimodal LLMs (MLLMs) hinders their ability to focus on important visual details, especially when handling high-resolution and visually crowded images. To address this, we introduce V*, an LLM-guided visual search mechanism that employs the world knowledge in LLMs for efficient visual querying.
Authors: Penghao Wu; Saining Xie;
32. Pix2gestalt: Amodal Segmentation By Synthesizing Wholes
Highlight: We introduce pix2gestalt, a framework for zero-shot amodal segmentation, which learns to estimate the shape and appearance of whole objects that are only partially visible behind occlusions.
Authors: Ege Ozguroglu; Ruoshi Liu; Dídac Surís; Dian Chen; Achal Dave; Pavel Tokmakov; Carl Vondrick;
33. MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World
Highlight: Current multi-modal large language models, however, passively absorb sensory data as inputs, lacking the capacity to actively interact with objects in the 3D environment and dynamically collect their multisensory information. To usher in the study of this area, we propose MultiPLY, a multisensory embodied large language model that can incorporate multisensory interactive data, including visual, audio, tactile, and thermal information, into large language models, thereby establishing the correlation among words, actions, and percepts.
Authors: Yining Hong; Zishuo Zheng; Peihao Chen; Yian Wang; Junyan Li; Chuang Gan;
34. Learning Vision from Models Rivals Learning Vision from Data
Highlight: We introduce SynCLR, a novel approach for learning visual representations exclusively from synthetic images, without any real data.
Authors: Yonglong Tian; Lijie Fan; Kaifeng Chen; Dina Katabi; Dilip Krishnan; Phillip Isola;
35. GARField: Group Anything with Radiance Fields
Highlight: We propose Group Anything with Radiance Fields (GARField), an approach for decomposing 3D scenes into a hierarchy of semantically meaningful groups from posed image inputs.
Authors: Chung Min Kim; Mingxuan Wu; Justin Kerr; Ken Goldberg; Matthew Tancik; Angjoo Kanazawa;
36. Generative Proxemics: A Prior for 3D Social Interaction from Images
Highlight: Reconstructing such interaction from images presents challenges because of mutual occlusion and the limited availability of large training datasets. To address this, we present a novel approach that learns a prior over the 3D proxemics of two people in close social interaction and demonstrate its use for single-view 3D reconstruction.
Authors: Lea Müller; Vickie Ye; Georgios Pavlakos; Michael Black; Angjoo Kanazawa;
37. Readout Guidance: Learning Control from Diffusion Features
Highlight: We present Readout Guidance, a method for controlling text-to-image diffusion models with learned signals.
Authors: Grace Luo; Trevor Darrell; Oliver Wang; Dan B Goldman; Aleksander Holynski;
38. Mosaic-SDF for 3D Generative Models
Highlight: We introduce Mosaic-SDF (M-SDF): a simple 3D shape representation that approximates the Signed Distance Function (SDF) of a given shape using a set of local grids spread near the shape’s boundary.
Authors: Lior Yariv; Omri Puny; Oran Gafni; Yaron Lipman;
39. PixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction
Highlight: We introduce pixelSplat, a feed-forward model that learns to reconstruct 3D radiance fields, parameterized by 3D Gaussian primitives, from pairs of images.
Authors: David Charatan; Sizhe Lester Li; Andrea Tagliasacchi; Vincent Sitzmann;
40. InceptionNeXt: When Inception Meets ConvNeXt
Highlight: Although reducing the kernel size of ConvNeXt can improve speed, it results in significant performance degradation, which poses a challenging problem: how to speed up large-kernel-based CNN models while preserving their performance. To tackle this issue, inspired by Inception, we propose to decompose large-kernel depthwise convolution into four parallel branches along the channel dimension, i.e., a small square kernel, two orthogonal band kernels, and an identity mapping.
Authors: Weihao Yu; Pan Zhou; Shuicheng Yan; Xinchao Wang;
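
The four-branch decomposition in the InceptionNeXt highlight maps naturally to a small PyTorch module. The sketch below splits channels and applies, in parallel, a small square depthwise kernel, two orthogonal band kernels, and an identity branch; the branch ratio and kernel sizes are assumed defaults and may not match the paper's released code exactly.

```python
import torch
import torch.nn as nn

class InceptionDWConv2d(nn.Module):
    """Depthwise conv decomposed into four parallel channel groups:
    identity, k x k square, 1 x b band, and b x 1 band."""
    def __init__(self, dim: int, square_size: int = 3,
                 band_size: int = 11, branch_ratio: float = 0.125):
        super().__init__()
        gc = int(dim * branch_ratio)  # channels per conv branch
        self.dwconv_hw = nn.Conv2d(gc, gc, square_size,
                                   padding=square_size // 2, groups=gc)
        self.dwconv_w = nn.Conv2d(gc, gc, (1, band_size),
                                  padding=(0, band_size // 2), groups=gc)
        self.dwconv_h = nn.Conv2d(gc, gc, (band_size, 1),
                                  padding=(band_size // 2, 0), groups=gc)
        self.split_sizes = (dim - 3 * gc, gc, gc, gc)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_id, x_hw, x_w, x_h = torch.split(x, self.split_sizes, dim=1)
        return torch.cat((x_id, self.dwconv_hw(x_hw),
                          self.dwconv_w(x_w), self.dwconv_h(x_h)), dim=1)
```
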
41. Mitigating Object Hallucinations in Large Vision-Language Models Through Visual Contrastive Decoding
Highlight: Despite their success, LVLMs still suffer from object hallucinations, where models generate plausible yet incorrect outputs that include objects that do not exist in the images. To mitigate this issue, we introduce Visual Contrastive Decoding (VCD), a simple and training-free method that contrasts output distributions derived from original and distorted visual inputs.
Authors: Sicong Leng; Hang Zhang; Guanzheng Chen; Xin Li; Shijian Lu; Chunyan Miao; Lidong Bing;
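
VCD operates at decoding time on next-token scores. A minimal sketch, assuming you already have logits computed from the clean image and from a distorted (e.g., heavily noised) copy; the paper additionally restricts the contrast to sufficiently plausible tokens, which this sketch omits:

```python
import torch

def vcd_logits(logits_orig: torch.Tensor, logits_dist: torch.Tensor,
               alpha: float = 1.0) -> torch.Tensor:
    """Contrastive next-token scores: boost tokens grounded in the clean
    image and penalize those the model also predicts from the distorted
    copy (i.e., tokens driven by language priors rather than the image)."""
    return (1.0 + alpha) * logits_orig - alpha * logits_dist
```
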
42. MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
Highlight: However, most benchmarks predominantly assess spatial understanding in static image tasks, while overlooking temporal understanding in dynamic video tasks. To alleviate this issue, we introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench, which covers 20 challenging video tasks that cannot be effectively solved with a single frame.
Authors: Kunchang Li; Yali Wang; Yinan He; Yizhuo Li; Yi Wang; Yi Liu; Zun Wang; Jilan Xu; Guo Chen; Ping Luo; Limin Wang; Yu Qiao;
43. Image Sculpting: Precise Object Editing with 3D Geometry Control
Highlight: We present Image Sculpting, a new framework for editing 2D images by incorporating tools from 3D geometry and graphics.
Authors: Jiraphon Yenphraphai; Xichen Pan; Sainan Liu; Daniele Panozzo; Saining Xie;
44. PerAda: Parameter-Efficient Federated Learning Personalization with Generalization Guarantees
Highlight: In this paper, we propose PerAda, a parameter-efficient personalized federated learning (pFL) framework that reduces communication and computational costs and exhibits superior generalization performance, especially under test-time distribution shifts.
Authors: Chulin Xie; De-An Huang; Wenda Chu; Daguang Xu; Chaowei Xiao; Bo Li; Anima Anandkumar;
45. ShapeWalk: Compositional Shape Editing Through Language-Guided Chains
Highlight: Editing 3D shapes through natural language instructions is a challenging task that requires comprehending both language semantics and fine-grained geometric details. To bridge this gap, we introduce ShapeWalk, a carefully designed synthetic dataset intended to advance the field of language-guided shape editing.
Authors: Habib Slim; Mohamed Elhoseiny;
46. ZeroRF: Fast Sparse View 360° Reconstruction with Zero Pretraining
Highlight: We present ZeroRF, a novel per-scene optimization method addressing the challenge of sparse-view 360° reconstruction in neural field representations.
Authors: Ruoxi Shi; Xinyue Wei; Cheng Wang; Hao Su;
47. GaussianDreamer: Fast Generation from Text to 3D Gaussians By Bridging 2D and 3D Diffusion Models
Highlight: This paper attempts to bridge the power of the two types of diffusion models via the recent explicit and efficient 3D Gaussian splatting representation.
Authors: Taoran Yi; Jiemin Fang; Junjie Wang; Guanjun Wu; Lingxi Xie; Xiaopeng Zhang; Wenyu Liu; Qi Tian; Xinggang Wang;
48. Reconstructing Hands in 3D with Transformers
Highlight: We present an approach that can reconstruct hands in 3D from monocular input.
Authors: Georgios Pavlakos; Dandan Shan; Ilija Radosavovic; Angjoo Kanazawa; David Fouhey; Jitendra Malik;
49. Depth Anything: Unleashing The Power of Large-Scale Unlabeled Data
Highlight: This work presents Depth Anything, a highly practical solution for robust monocular depth estimation.
Authors: Lihe Yang; Bingyi Kang; Zilong Huang; Xiaogang Xu; Jiashi Feng; Hengshuang Zhao;
50. Overcoming Generic Knowledge Loss with Selective Parameter Update
Highlight: Leveraging the fact that foundation models have initial knowledge of various tasks and domains, we propose a novel approach that, instead of updating all parameters equally, localizes the updates to a sparse set of parameters relevant to the task being learned.
Authors: Wenxuan Zhang; Paul Janson; Rahaf Aljundi; Mohamed Elhoseiny;
51. FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects
Highlight: We present FoundationPose, a unified foundation model for 6D object pose estimation and tracking, supporting both model-based and model-free setups.
Authors: Bowen Wen; Wei Yang; Jan Kautz; Stan Birchfield;
52. Honeybee: Locality-enhanced Projector for Multimodal LLM
Highlight: In this study, we first identify two essential projector properties: (i) flexibility in managing the number of visual tokens, crucial for MLLMs’ overall efficiency, and (ii) preservation of local context from visual features, vital for spatial understanding. Based on these findings, we propose a novel projector design that is both flexible and locality-enhanced, effectively satisfying the two desirable properties.
Authors: Junbum Cha; Wooyoung Kang; Jonghwan Mun; Byungseok Roh;
53. M&M VTO: Multi-Garment Virtual Try-On and Editing
Highlight: We present M&M VTO, a mix-and-match virtual try-on method that takes as input multiple garment images, a text description of the garment layout, and an image of a person.
Authors: Luyang Zhu; Yingwei Li; Nan Liu; Hao Peng; Dawei Yang; Ira Kemelmacher-Shlizerman;
54. An Edit Friendly DDPM Noise Space: Inversion and Manipulations
Highlight: However, this native noise space does not possess a convenient structure and is thus challenging to work with in editing tasks. Here, we propose an alternative latent noise space for DDPM that enables a wide range of editing operations via simple means, and we present an inversion method for extracting these edit-friendly noise maps for any given image (real or synthetically generated).
Authors: Inbar Huberman-Spiegelglas; Vladimir Kulikov; Tomer Michaeli;
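
The inversion is simple enough to sketch. As we read the paper's recipe: sample noisy versions of the image independently at each timestep (rather than along a single diffusion trajectory), then solve the DDPM sampling equation for the noise map at every step. Here `mu_hat` is assumed to be the model's posterior-mean prediction, and the schedule tensors are assumed to be indexed 1..T (index 0 unused):

```python
import torch

def edit_friendly_inversion(x0, alphas_bar, sigmas, mu_hat):
    """Extract edit-friendly DDPM noise maps z_t for an image x0, by
    solving x_{t-1} = mu_hat(x_t, t) + sigma_t * z_t for z_t."""
    T = len(sigmas) - 1
    xs = [x0]  # xs[t] is a noisy version of x0 at step t
    for t in range(1, T + 1):
        eps = torch.randn_like(x0)  # sampled independently per step: the key trick
        xs.append(alphas_bar[t].sqrt() * x0 + (1 - alphas_bar[t]).sqrt() * eps)
    return {t: (xs[t - 1] - mu_hat(xs[t], t)) / sigmas[t]
            for t in range(T, 0, -1)}
```
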
55. Adversarial Text to Continuous Image Generation
Highlight: In this paper, we approach the text-to-image task from a different perspective, where a 2D image is represented as an implicit neural representation (INR).
Authors: Kilichbek Haydarov; Aashiq Muhamed; Xiaoqian Shen; Jovana Lazarevic; Ivan Skorokhodov; Chamuditha Jayanga Galappaththige; Mohamed Elhoseiny;
56. Style Aligned Image Generation Via Shared Attention
Highlight: In this paper, we introduce StyleAligned, a novel technique designed to establish style alignment among a series of generated images.
Authors: Amir Hertz; Andrey Voynov; Shlomi Fruchter; Daniel Cohen-Or;
57. Prompt-Free Diffusion: Taking "Text" Out of Text-to-Image Diffusion Models
Highlight: In this paper, we take a bold step forward: taking "Text" out of a pretrained T2I diffusion model to reduce the burdensome prompt engineering efforts for users.
Authors: Xingqian Xu; Jiayi Guo; Zhangyang Wang; Gao Huang; Irfan Essa; Humphrey Shi;
58. Breathing Life Into Sketches Using Text-to-Video Priors
Highlight: In this work, we present a method that automatically adds motion to a single-subject sketch (hence, "breathing life into it"), merely by providing a text prompt indicating the desired motion.
Authors: Rinon Gal; Yael Vinker; Yuval Alaluf; Amit Bermano; Daniel Cohen-Or; Ariel Shamir; Gal Chechik;
59. Wonder3D: Single Image to 3D Using Cross-Domain Diffusion
Highlight: In this work, we introduce Wonder3D, a novel method for generating high-fidelity textured meshes from single-view images with remarkable efficiency.
Authors: Xiaoxiao Long; Yuan-Chen Guo; Cheng Lin; Yuan Liu; Zhiyang Dou; Lingjie Liu; Yuexin Ma; Song-Hai Zhang; Marc Habermann; Christian Theobalt; Wenping Wang;
60. Diffusion Model Alignment Using Direct Preference Optimization
Highlight: We propose Diffusion-DPO, a method to align diffusion models to human preferences by directly optimizing on human comparison data.
Authors: Bram Wallace; Meihua Dang; Rafael Rafailov; Linqi Zhou; Aaron Lou; Senthil Purushwalkam; Stefano Ermon; Caiming Xiong; Shafiq Joty; Nikhil Naik;
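
The objective has a compact shape. Below is a simplified sketch of a DPO-style loss on diffusion denoising errors: the trained model should reduce its denoising error on the preferred image, relative to a frozen reference model, by more than it does on the rejected image. Timestep weighting and constants are folded into `beta`, so treat this as the shape of the loss rather than the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(err_w_theta: torch.Tensor, err_w_ref: torch.Tensor,
                       err_l_theta: torch.Tensor, err_l_ref: torch.Tensor,
                       beta: float = 5000.0) -> torch.Tensor:
    """err_* are per-sample denoising errors ||eps - eps_hat||^2 on the
    preferred (w) and rejected (l) images at a shared noise level, under
    the trained model (theta) and the frozen reference model (ref)."""
    margin = (err_w_theta - err_w_ref) - (err_l_theta - err_l_ref)
    return -F.logsigmoid(-beta * margin).mean()
```
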
61. PIGEON: Predicting Image Geolocations
Highlight: We present a new geolocalization system that combines semantic geocell creation, multi-task contrastive pretraining, and a novel loss function.
Authors: Lukas Haas; Michal Skreta; Silas Alberti; Chelsea Finn;
62. VSCode: General Visual Salient and Camouflaged Object Detection with 2D Prompt Learning
Highlight: We introduce VSCode, a generalist model with novel 2D prompt learning that jointly addresses four salient object detection (SOD) tasks and three camouflaged object detection (COD) tasks.
Authors: Ziyang Luo; Nian Liu; Wangbo Zhao; Xuguang Yang; Dingwen Zhang; Deng-Ping Fan; Fahad Khan; Junwei Han;
63. Jack of All Tasks, Master of Many: Designing General-Purpose Coarse-to-Fine Vision-Language Model
Highlight: In this work, we introduce VistaLLM, a powerful visual system that addresses coarse- and fine-grained VL tasks over single and multiple input images using a unified framework.
Authors: Shraman Pramanick; Guangxing Han; Rui Hou; Sayan Nag; Ser-Nam Lim; Nicolas Ballas; Qifan Wang; Rama Chellappa; Amjad Almahairi;
64. GaussianShader: 3D Gaussian Splatting with Shading Functions for Reflective Surfaces
Highlight: In this paper, we present GaussianShader, a novel method that applies a simplified shading function on 3D Gaussians to enhance neural rendering in scenes with reflective surfaces, while preserving training and rendering efficiency.
Authors: Yingwenqi Jiang; Jiadong Tu; Yuan Liu; Xifeng Gao; Xiaoxiao Long; Wenping Wang; Yuexin Ma;
65. Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Highlight: Accordingly, to establish a video dataset with high-quality captions, we propose an automatic approach leveraging multimodal inputs, such as textual video descriptions, subtitles, and individual video frames.
Authors: Tsai-Shien Chen; Aliaksandr Siarohin; Willi Menapace; Ekaterina Deyneka; Hsiang-wei Chao; Byung Eun Jeon; Yuwei Fang; Hsin-Ying Lee; Jian Ren; Ming-Hsuan Yang; Sergey Tulyakov;
66. Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications
Highlight: We introduce Deformable Convolution v4 (DCNv4), a highly efficient and effective operator designed for a broad spectrum of vision applications.
Authors: Yuwen Xiong; Zhiqi Li; Yuntao Chen; Feng Wang; Xizhou Zhu; Jiapeng Luo; Wenhai Wang; Tong Lu; Hongsheng Li; Yu Qiao; Lewei Lu; Jie Zhou; Jifeng Dai;
67. HUGS: Human Gaussian Splats
Highlight: In this work, we introduce Human Gaussian Splats (HUGS), which represents an animatable human together with the scene using 3D Gaussian Splatting (3DGS).
Authors: Muhammed Kocabas; Jen-Hao Rick Chang; James Gabriel; Oncel Tuzel; Anurag Ranjan;
68. MovieChat: From Dense Token to Sparse Memory for Long Video Understanding
Highlight: For long videos, the computational complexity, memory cost, and long-term temporal connections impose additional challenges. Taking advantage of the Atkinson-Shiffrin memory model, with tokens in Transformers employed as the carriers of memory, in combination with our specially designed memory mechanism, we propose MovieChat to overcome these challenges.
Authors: Enxin Song; Wenhao Chai; Guanhong Wang; Yucheng Zhang; Haoyang Zhou; Feiyang Wu; Haozhe Chi; Xun Guo; Tian Ye; Yanting Zhang; Yan Lu; Jenq-Neng Hwang; Gaoang Wang;
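
The highlight hints at the mechanism: a short-term buffer of frame tokens is periodically consolidated into a compact long-term memory by merging the most similar adjacent frames. The greedy merge below is our illustrative reading of that consolidation step, not the paper's exact procedure:

```python
import torch

def consolidate(frames: list, capacity: int) -> list:
    """Greedily average the most similar adjacent frame tokens until the
    buffer fits `capacity` entries (frames: list of equal-shape tensors)."""
    while len(frames) > capacity:
        sims = torch.stack([
            torch.cosine_similarity(frames[i].flatten(),
                                    frames[i + 1].flatten(), dim=0)
            for i in range(len(frames) - 1)
        ])
        i = int(sims.argmax())
        frames[i] = (frames[i] + frames[i + 1]) / 2  # merge the closest pair
        del frames[i + 1]
    return frames
```
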
69. GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians
Highlight: We introduce GaussianAvatars, a new method to create photorealistic head avatars that are fully controllable in terms of expression, pose, and viewpoint.
Authors: Shenhan Qian; Tobias Kirschstein; Liam Schoneveld; Davide Davoli; Simon Giebenhain; Matthias Nießner;
70. InternVL: Scaling Up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Highlight: In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM, using web-scale image-text data from various sources.
Authors: Zhe Chen; Jiannan Wu; Wenhai Wang; Weijie Su; Guo Chen; Sen Xing; Muyan Zhong; Qinglong Zhang; Xizhou Zhu; Lewei Lu; Bin Li; Ping Luo; Tong Lu; Yu Qiao; Jifeng Dai;
71. Grounded Text-to-Image Synthesis with Attention Refocusing
Highlight: In this paper, we reveal the potential causes in the diffusion model’s cross-attention and self-attention layers.
Authors: Quynh Phung; Songwei Ge; Jia-Bin Huang;
72. PAIR Diffusion: A Comprehensive Multimodal Object-Level Image Editor
Highlight: In this work, we tackle the task by perceiving images as an amalgamation of various objects, aiming to control the properties of each object in a fine-grained manner.
Authors: Vidit Goel; Elia Peruzzo; Yifan Jiang; Dejia Xu; Xingqian Xu; Nicu Sebe; Trevor Darrell; Zhangyang Wang; Humphrey Shi;
73. Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft
Highlight: The challenge of exploration efficiency in such environments makes it difficult for reinforcement-learning-based agents to learn complex tasks. To address this, this paper introduces an advanced learning system, named Auto MC-Reward, that leverages Large Language Models (LLMs) to automatically design dense reward functions, thereby enhancing learning efficiency.
Authors: Hao Li; Xue Yang; Zhaokai Wang; Xizhou Zhu; Jie Zhou; Yu Qiao; Xiaogang Wang; Hongsheng Li; Lewei Lu; Jifeng Dai;
74. CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation
Highlight: We present CoDi-2, a Multimodal Large Language Model (MLLM) for learning in-context interleaved multimodal representations.
Authors: Zineng Tang; Ziyi Yang; Mahmoud Khademi; Yang Liu; Chenguang Zhu; Mohit Bansal;
75. Retrieval-Augmented Egocentric Video Captioning
Highlight: Most prior approaches explore representation learning on egocentric videos only, overlooking the potential benefit of exploiting existing large-scale third-person videos. In this paper, (1) we develop EgoInstructor, a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the captioning of egocentric videos; (2) to train the cross-view retrieval module, we devise an automatic pipeline to discover ego-exo video pairs from distinct large-scale egocentric and exocentric datasets; (3) we train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions; and (4) through extensive experiments, our cross-view retrieval module demonstrates superior performance across seven benchmarks.
Authors: Jilan Xu; Yifei Huang; Junlin Hou; Guo Chen; Yuejie Zhang; Rui Feng; Weidi Xie;
76. Putting The Object Back Into Video Object Segmentation
Highlight: We present Cutie, a video object segmentation (VOS) network with object-level memory reading, which puts the object representation from memory back into the video object segmentation result.
Authors: Ho Kei Cheng; Seoung Wug Oh; Brian Price; Joon-Young Lee; Alexander Schwing;
77. MagicAnimate: Temporally Consistent Human Image Animation Using Diffusion Model
Highlight: In this work, we introduce MagicAnimate, a diffusion-based framework that aims at enhancing temporal consistency, preserving the reference image faithfully, and improving animation fidelity.
Authors: Zhongcong Xu; Jianfeng Zhang; Jun Hao Liew; Hanshu Yan; Jia-Wei Liu; Chenxu Zhang; Jiashi Feng; Mike Zheng Shou;
78. Probing The 3D Awareness of Visual Foundation Models
Highlight: In this work, we analyze the 3D awareness of visual foundation models.
Authors: Mohamed El Banani; Amit Raj; Kevis-Kokitsi Maninis; Abhishek Kar; Yuanzhen Li; Michael Rubinstein; Deqing Sun; Leonidas Guibas; Justin Johnson; Varun Jampani;
79. VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
Highlight: In this work, we explore the training scheme of video models extended from Stable Diffusion and investigate the feasibility of leveraging low-quality videos and synthesized high-quality images to obtain a high-quality video model.
Authors: Haoxin Chen; Yong Zhang; Xiaodong Cun; Menghan Xia; Xintao Wang; Chao Weng; Ying Shan;
80. Fixed Point Diffusion Models
Highlight: We introduce the Fixed Point Diffusion Model (FPDM), a novel approach to image generation that integrates the concept of fixed-point solving into the framework of diffusion-based generative modeling.
Authors: Xingjian Bai; Luke Melas-Kyriazi;
81. Gaussian Shell Maps for Efficient 3D Human Generation
Highlight: Here, we introduce Gaussian Shell Maps (GSMs) as a framework that connects SOTA generator network architectures with emerging 3D Gaussian rendering primitives, using an articulable multi-shell-based scaffold.
Authors: Rameen Abdal; Wang Yifan; Zifan Shi; Yinghao Xu; Ryan Po; Zhengfei Kuang; Qifeng Chen; Dit-Yan Yeung; Gordon Wetzstein;
82. GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis
Highlight: We present a new approach, termed GPS-Gaussian, for synthesizing novel views of a character in real time.
Authors: Shunyuan Zheng; Boyao Zhou; Ruizhi Shao; Boning Liu; Shengping Zhang; Liqiang Nie; Yebin Liu;
83. Splatter Image: Ultra-Fast Single-View 3D Reconstruction
Highlight: We introduce the Splatter Image, an ultra-efficient approach for monocular 3D object reconstruction.
Authors: Stanislaw Szymanowicz; Christian Rupprecht; Andrea Vedaldi;
84. YOLO-World: Real-Time Open-Vocabulary Object Detection
Highlight: However, their reliance on predefined and trained object categories limits their applicability in open scenarios. Addressing this limitation, we introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets.
Authors: Tianheng Cheng; Lin Song; Yixiao Ge; Wenyu Liu; Xinggang Wang; Ying Shan;
85. Generative Image Dynamics
Highlight: We present an approach to modeling an image-space prior on scene motion.
Authors: Zhengqi Li; Richard Tucker; Noah Snavely; Aleksander Holynski;
86. Sequential Modeling Enables Scalable Learning for Large Vision Models
Highlight: We introduce a novel sequential modeling approach that enables learning a Large Vision Model (LVM) without making use of any linguistic data.
Authors: Yutong Bai; Xinyang Geng; Karttikeya Mangalam; Amir Bar; Alan L. Yuille; Trevor Darrell; Jitendra Malik; Alexei A. Efros;
87. Depth-aware Test-Time Training for Zero-shot Video Object Segmentation
Highlight: In this work, we introduce a test-time training (TTT) strategy to address the problem.
Authors: Weihuang Liu; Xi Shen; Haolun Li; Xiuli Bi; Bo Liu; Chi-Man Pun; Xiaodong Cun;
88. One-step Diffusion with Distribution Matching Distillation
Highlight: We introduce Distribution Matching Distillation (DMD), a procedure to transform a diffusion model into a one-step image generator with minimal impact on image quality.
Authors: Tianwei Yin; Michaël Gharbi; Richard Zhang; Eli Shechtman; Frédo Durand; William T. Freeman; Taesung Park;
89. DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation Via Diffusion Models
Highlight: We present DreamAvatar, a text-and-shape guided framework for generating high-quality 3D human avatars with controllable poses.
Authors: Yukang Cao; Yan-Pei Cao; Kai Han; Ying Shan; Kwan-Yee K. Wong;
90. Neural Clustering Based Visual Representation Learning
Highlight: In this work, we propose feature extraction with clustering (FEC), a conceptually elegant yet surprisingly ad-hoc interpretable neural clustering framework, which views feature extraction as a process of selecting representatives from data and thus automatically captures the underlying data distribution.
Authors: Guikun Chen; Xia Li; Yi Yang; Wenguan Wang;
91. Orthogonal Adaptation for Modular Customization of Diffusion Models
Highlight: In this paper, we address a new problem called Modular Customization, with the goal of efficiently merging customized models that were fine-tuned independently for individual concepts.
Authors: Ryan Po; Guandao Yang; Kfir Aberman; Gordon Wetzstein;
92. 3DGS-Avatar: Animatable Avatars Via Deformable 3D Gaussian Splatting
Highlight: We introduce an approach that creates animatable human avatars from monocular videos using 3D Gaussian Splatting (3DGS).
Authors: Zhiyin Qian; Shaofei Wang; Marko Mihajlovic; Andreas Geiger; Siyu Tang;
93. InstructDiffusion: A Generalist Modeling Interface for Vision Tasks
Highlight: We present InstructDiffusion, a unified and generic framework for aligning computer vision tasks with human instructions.
Authors: Zigang Geng; Binxin Yang; Tiankai Hang; Chen Li; Shuyang Gu; Ting Zhang; Jianmin Bao; Zheng Zhang; Houqiang Li; Han Hu; Dong Chen; Baining Guo;
94. Beyond First-Order Tweedie: Solving Inverse Problems Using Latent Diffusion
Highlight: This paper presents the Second-order Tweedie sampler from Surrogate Loss (STSL), a novel sampler offering efficiency comparable to first-order Tweedie while enabling tractable reverse processes using a second-order approximation.
Authors: Litu Rout; Yujia Chen; Abhishek Kumar; Constantine Caramanis; Sanjay Shakkottai; Wen-Sheng Chu;
95. AssistGUI: Task-Oriented PC Graphical User Interface Automation
Highlight: This paper presents a novel benchmark, AssistGUI, to evaluate whether models are capable of manipulating the mouse and keyboard on the Windows platform in response to user-requested tasks.
Authors: Difei Gao; Lei Ji; Zechen Bai; Mingyu Ouyang; Peiran Li; Dongxing Mao; Qinchen Wu; Weichen Zhang; Peiyi Wang; Xiangwu Guo; Hengxu Wang; Luowei Zhou; Mike Zheng Shou;
96. Video-P2P: Video Editing with Cross-attention Control
Highlight: For attention control, we introduce a novel decoupled-guidance strategy, which uses different guidance strategies for the source and target prompts.
Authors: Shaoteng Liu; Yuechen Zhang; Wenbo Li; Zhe Lin; Jiaya Jia;
97. Visual Program Distillation: Distilling Tools and Programmatic Reasoning Into Vision-Language Models
Highlight: We propose Visual Program Distillation (VPD), an instruction-tuning framework that produces a vision-language model (VLM) capable of solving complex visual tasks with a single forward pass.
Authors: Yushi Hu; Otilia Stretcu; Chun-Ta Lu; Krishnamurthy Viswanathan; Kenji Hata; Enming Luo; Ranjay Krishna; Ariel Fuxman;
98. Generative Powers of Ten
Highlight: We present a method that uses a text-to-image model to generate consistent content across multiple image scales, enabling extreme semantic zooms into a scene, e.g., ranging from a wide-angle landscape view of a forest to a macro shot of an insect sitting on one of the tree branches.
Authors: Xiaojuan Wang; Janne Kontkanen; Brian Curless; Steven M. Seitz; Ira Kemelmacher-Shlizerman; Ben Mildenhall; Pratul Srinivasan; Dor Verbin; Aleksander Holynski;
99. Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding
Highlight: In this work, we introduce Language Embedded 3D Gaussians, a novel scene representation for open-vocabulary query tasks.
Authors: Jin-Chuan Shi; Miao Wang; Hao-Bin Duan; Shao-Hua Guan;
100. DiffMorpher: Unleashing The Capability of Diffusion Models for Image Morphing
Highlight: Such smooth interpolation is intriguing, as it naturally serves as a solution for the image morphing task with many applications. In this work, we address this limitation via DiffMorpher, an approach that enables smooth and natural image interpolation by harnessing the prior knowledge of a pre-trained diffusion model.
Authors: Kaiwen Zhang; Yifan Zhou; Xudong Xu; Bo Dai; Xingang Pan;
101. OMG-Seg: Is One Model Good Enough For All Segmentation?
Highlight: In this work, we address various segmentation tasks, each traditionally tackled by distinct or partially unified models.
Authors: Xiangtai Li; Haobo Yuan; Wei Li; Henghui Ding; Size Wu; Wenwei Zhang; Yining Li; Kai Chen; Chen Change Loy;
102. Visual In-Context Prompting
Highlight: In this paper, we introduce a universal visual in-context prompting framework for both tasks, as shown in Fig. 1 of the paper.
Authors: Feng Li; Qing Jiang; Hao Zhang; Tianhe Ren; Shilong Liu; Xueyan Zou; Huaizhe Xu; Hongyang Li; Jianwei Yang; Chunyuan Li; Lei Zhang; Jianfeng Gao;
103. SEED-Bench: Benchmarking Multimodal Large Language Models
Highlight: A comprehensive benchmark is imperative for investigating the progress and uncovering the limitations of current MLLMs. In this work, we categorize the capabilities of MLLMs into hierarchical levels from L_0 to L_4, based on the modalities they can accept and generate, and propose SEED-Bench, a comprehensive benchmark that evaluates the hierarchical capabilities of MLLMs.
Authors: Bohao Li; Yuying Ge; Yixiao Ge; Guangzhi Wang; Rui Wang; Ruimao Zhang; Ying Shan;
104. Tune-An-Ellipse: CLIP Has Potential to Find What You Want
Highlight: Our novel, simple yet effective approach, i.e., Differentiable Visual Prompting, enables CLIP to localize zero-shot: given an image and a text prompt describing an object, we first pick a rendered ellipse from uniformly distributed anchor ellipses on the image grid via visual prompting, then use three loss functions to tune the ellipse coefficients to gradually encapsulate the target region.
Authors: Jinheng Xie; Songhe Deng; Bing Li; Haozhe Liu; Yawen Huang; Yefeng Zheng; Jurgen Schmidhuber; Bernard Ghanem; Linlin Shen; Mike Zheng Shou;
105. Optimizing Diffusion Noise Can Serve As Universal Motion Priors
Highlight: We propose Diffusion Noise Optimization (DNO), a new method that effectively leverages existing motion diffusion models as motion priors for a wide range of motion-related tasks.
Authors: Korrawe Karunratanakul; Konpat Preechakul; Emre Aksan; Thabo Beeler; Supasorn Suwajanakorn; Siyu Tang;
106. ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering
Highlight: While recent advances in neural implicit rendering have unlocked unprecedented photorealism for digital avatars, real-time performance has mostly been demonstrated for static scenes only. To address this, we propose ASH, an animatable Gaussian splatting approach for photorealistic rendering of dynamic humans in real time.
Authors: Haokai Pang; Heming Zhu; Adam Kortylewski; Christian Theobalt; Marc Habermann;
107. Towards Language-Driven Video Inpainting Via Multimodal Large Language Models
Highlight: We introduce a new task: language-driven video inpainting, which uses natural language instructions to guide the inpainting process.
Authors: Jianzong Wu; Xiangtai Li; Chenyang Si; Shangchen Zhou; Jingkang Yang; Jiangning Zhang; Yining Li; Kai Chen; Yunhai Tong; Ziwei Liu; Chen Change Loy;
108. Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution
Highlight: Our study introduces Upscale-A-Video, a text-guided latent diffusion framework for video upscaling.
Authors: Shangchen Zhou; Peiqing Yang; Jianyi Wang; Yihang Luo; Chen Change Loy;
109. Bayes’ Rays: Uncertainty Quantification for Neural Radiance Fields
Highlight: We introduce BayesRays, a post-hoc framework to evaluate uncertainty in any pretrained NeRF without modifying the training process.
Authors: Lily Goli; Cody Reading; Silvia Sellán; Alec Jacobson; Andrea Tagliasacchi;
110. Gaussian Head Avatar: Ultra High-fidelity Head Avatar Via Dynamic Gaussians
Highlight: In this paper, we propose Gaussian Head Avatar, represented by controllable 3D Gaussians, for high-fidelity head avatar modeling.
Authors: Yuelang Xu; Benwang Chen; Zhe Li; Hongwen Zhang; Lizhen Wang; Zerong Zheng; Yebin Liu;
111 | What You See Is What You GAN: Rendering Every Pixel for High-Fidelity Geometry in 3D GANs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we propose techniques to scale neural volume rendering to the much higher resolution of native 2D images thereby resolving fine-grained 3D geometry with unprecedented detail. |
Alex Trevithick; Matthew Chan; Towaki Takikawa; Umar Iqbal; Shalini De Mello; Manmohan Chandraker; Ravi Ramamoorthi; Koki Nagano; |
112 | HybridNeRF: Efficient Neural Rendering Via Adaptive Volumetric Surfaces Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a method HybridNeRF that leverages the strengths of both representations by rendering most objects as surfaces while modeling the (typically) small fraction of challenging regions volumetrically. |
Haithem Turki; Vasu Agrawal; Samuel Rota Bulò; Lorenzo Porzi; Peter Kontschieder; Deva Ramanan; Michael Zollhöfer; Christian Richardt; |
113 | EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: While beneficial the huge computation cost of SAM model has limited its applications to wider real-world applications. To address this limitation we propose EfficientSAMs light-weight SAM models that exhibits decent performance with largely reduced complexity. |
Yunyang Xiong; Bala Varadarajan; Lemeng Wu; Xiaoyu Xiang; Fanyi Xiao; Chenchen Zhu; Xiaoliang Dai; Dilin Wang; Fei Sun; Forrest Iandola; Raghuraman Krishnamoorthi; Vikas Chandra; |
114 | RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper recognizing that the normal and depth information effectively describe scene geometry and be automatically estimated from images we propose to learn a generalizable Normal-Depth diffusion model for 3D generation. |
Lingteng Qiu; Guanying Chen; Xiaodong Gu; Qi Zuo; Mutian Xu; Yushuang Wu; Weihao Yuan; Zilong Dong; Liefeng Bo; Xiaoguang Han; |
115 | TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we propose TI2V-Zero a zero-shot tuning-free method that empowers a pretrained text-to-video (T2V) diffusion model to be conditioned on a provided image enabling TI2V generation without any optimization fine-tuning or introducing external modules. |
Haomiao Ni; Bernhard Egger; Suhas Lohit; Anoop Cherian; Ye Wang; Toshiaki Koike-Akino; Sharon X. Huang; Tim K. Marks; |
116 | OpenEQA: Embodied Question Answering in The Era of Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a modern formulation of Embodied Question Answering (EQA) as the task of understanding an environment well enough to answer questions about it in natural language. |
Arjun Majumdar; Anurag Ajay; Xiaohan Zhang; Pranav Putta; Sriram Yenamandra; Mikael Henaff; Sneha Silwal; Paul Mcvay; Oleksandr Maksymets; Sergio Arnaud; Karmesh Yadav; Qiyang Li; Ben Newman; Mohit Sharma; Vincent Berges; Shiqi Zhang; Pulkit Agrawal; Yonatan Bisk; Dhruv Batra; Mrinal Kalakrishnan; Franziska Meier; Chris Paxton; Alexander Sax; Aravind Rajeswaran; |
117 | Unified Language-driven Zero-shot Domain Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We identify the constraints in the existing language-driven zero-shot domain adaptation task particularly the requirement for domain IDs and domain-specific models which may restrict flexibility and scalability. To overcome these issues we propose a new framework for unified language-driven domain adaptation (ULDA) consisting of Hierarchical Context Alignment (HCA) Domain Consistent Representation Learning (DCRL) and Text-Driven Rectifier (TDR). |
Senqiao Yang; Zhuotao Tian; Li Jiang; Jiaya Jia; |
118 | SyncTalk: The Devil Is in The Synchronization for Talking Head Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: A lifelike talking head requires synchronized coordination of subject identity lip movements facial expressions and head poses. The absence of these synchronizations is a fundamental flaw leading to unrealistic and artificial outcomes. To address the critical issue of synchronization identified as the devil in creating realistic talking heads we introduce SyncTalk. |
Ziqiao Peng; Wentao Hu; Yue Shi; Xiangyu Zhu; Xiaomei Zhang; Hao Zhao; Jun He; Hongyan Liu; Zhaoxin Fan; |
119 | Pixel-Aligned Language Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we aim to develop a vision-language model that can take locations for example a set of points or boxes as either inputs or outputs. |
Jiarui Xu; Xingyi Zhou; Shen Yan; Xiuye Gu; Anurag Arnab; Chen Sun; Xiaolong Wang; Cordelia Schmid; |
120 | MobileCLIP: Fast Image-Text Models Through Multi-Modal Reinforced Training Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work we introduce MobileCLIP – a new family of efficient image-text models optimized for runtime performance along with a novel and efficient training approach namely multi-modal reinforced training. |
Pavan Kumar Anasosalu Vasu; Hadi Pouransari; Fartash Faghri; Raviteja Vemulapalli; Oncel Tuzel; |
121 | GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from A Single Image Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we propose a generic avatar editing approach that can be universally applied to various 3DMM driving volumetric head avatars. |
Chong Bao; Yinda Zhang; Yuan Li; Xiyu Zhang; Bangbang Yang; Hujun Bao; Marc Pollefeys; Guofeng Zhang; Zhaopeng Cui; |
122 | UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio Video Point Cloud Time-Series and Image Recognition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we contribute from two aspects. 1) We propose four architectural guidelines for designing large-kernel ConvNets the core of which is to exploit the essential characteristics of large kernels that distinguish them from small kernels – they can see wide without going deep. Following such guidelines our proposed large-kernel ConvNet shows leading performance in image recognition (ImageNet accuracy of 88.0% ADE20K mIoU of 55.6% and COCO box AP of 56.4%) demonstrating better performance and higher speed than the recent powerful competitors. 2) We discover large kernels are the key to unlocking the exceptional performance of ConvNets in domains where they were originally not proficient. |
Xiaohan Ding; Yiyuan Zhang; Yixiao Ge; Sijie Zhao; Lin Song; Xiangyu Yue; Ying Shan; |
123 | MVIP-NeRF: Multi-view 3D Inpainting on NeRF Scenes Via Diffusion Prior Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This is due to two key reasons: (i) independently inpainting constituent images results in view-inconsistent imagery and (ii) 2D inpainters struggle to ensure high-quality geometry completion and alignment with inpainted RGB images. To overcome these limitations we propose a novel approach called MVIP-NeRF that harnesses the potential of diffusion priors for NeRF inpainting addressing both appearance and geometry aspects. |
Honghua Chen; Chen Change Loy; Xingang Pan; |
124 | Hierarchical Patch Diffusion Models for High-Resolution Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we study patch diffusion models (PDMs) — a diffusion paradigm which models the distribution of patches rather than whole inputs keeping up to 0.7% of the original pixels. |
Ivan Skorokhodov; Willi Menapace; Aliaksandr Siarohin; Sergey Tulyakov; |
125 | UniDepth: Universal Monocular Metric Depth Estimation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose a new model UniDepth capable of reconstructing metric 3D scenes from solely single images across domains. |
Luigi Piccinelli; Yung-Hsu Yang; Christos Sakaridis; Mattia Segu; Siyuan Li; Luc Van Gool; Fisher Yu; |
126 | LISA: Reasoning Segmentation Via Large Language Model Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work we propose a new segmentation task — reasoning segmentation. |
Xin Lai; Zhuotao Tian; Yukang Chen; Yanwei Li; Yuhui Yuan; Shu Liu; Jiaya Jia; |
127 | NOPE: Novel Object Pose Estimation from A Single Image Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: The practicality of 3D object pose estimation remains limited for many applications due to the need for prior knowledge of a 3D model and a training period for new objects. To address this limitation we propose an approach that takes a single image of a new object as input and predicts the relative pose of this object in new images without prior knowledge of the object’s 3D model and without requiring training time for new objects and categories. |
Van Nguyen Nguyen; Thibault Groueix; Georgy Ponimatkin; Yinlin Hu; Renaud Marlet; Mathieu Salzmann; Vincent Lepetit; |
128 | GigaPose: Fast and Robust Novel Object Pose Estimation Via One Correspondence Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present GigaPose a fast robust and accurate method for CAD-based novel object pose estimation in RGB images. |
Van Nguyen Nguyen; Thibault Groueix; Mathieu Salzmann; Vincent Lepetit; |
129 | SceneTex: High-Quality Texture Synthesis for Indoor Scenes Via Diffusion Priors Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose SceneTex a novel method for effectively generating high-quality and style-consistent textures for indoor scenes using depth-to-image diffusion priors. |
Dave Zhenyu Chen; Haoxuan Li; Hsin-Ying Lee; Sergey Tulyakov; Matthias Nießner; |
130 | SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end we introduce SceneFun3D a large-scale dataset with more than 14.8k highly accurate interaction annotations for 710 high-resolution real-world 3D indoor scenes. |
Alexandros Delitzas; Ayca Takmaz; Federico Tombari; Robert Sumner; Marc Pollefeys; Francis Engelmann; |
131 | OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work we reexamine the design distinctions and test the limits of what a sparse CNN can achieve. |
Bohao Peng; Xiaoyang Wu; Li Jiang; Yukang Chen; Hengshuang Zhao; Zhuotao Tian; Jiaya Jia; |
132 | VideoCon: Robust Video-Language Alignment Via Contrast Captions Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end we introduce VideoCon a video-language alignment dataset constructed by a large language model that generates plausible contrast video captions and explanations for differences between original and contrast video captions. |
Hritik Bansal; Yonatan Bitton; Idan Szpektor; Kai-Wei Chang; Aditya Grover; |
133 | Multiplane Prior Guided Few-Shot Aerial Scene Rendering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The acquisition of dense aerial views is often prohibitive as unmanned aerial vehicles (UAVs) may encounter constraints in perspective range and energy. In this work we introduce Multiplane Prior guided NeRF (MPNeRF) a novel approach tailored for few-shot aerial scene rendering – marking a pioneering effort in this domain. |
Zihan Gao; Licheng Jiao; Lingling Li; Xu Liu; Fang Liu; Puhua Chen; Yuwei Guo; |
134 | On Scaling Up A Multilingual Vision and Language Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We explore the boundaries of scaling up a multilingual vision and language model both in terms of size of the components and the breadth of its training task mixture. |
Xi Chen; Josip Djolonga; Piotr Padlewski; Basil Mustafa; Soravit Changpinyo; Jialin Wu; Carlos Riquelme Ruiz; Sebastian Goodman; Xiao Wang; Yi Tay; Siamak Shakeri; Mostafa Dehghani; Daniel Salz; Mario Lucic; Michael Tschannen; Arsha Nagrani; Hexiang Hu; Mandar Joshi; Bo Pang; Ceslee Montgomery; Paulina Pietrzyk; Marvin Ritter; AJ Piergiovanni; Matthias Minderer; Filip Pavetic; Austin Waters; Gang Li; Ibrahim Alabdulmohsin; Lucas Beyer; Julien Amelot; Kenton Lee; Andreas Peter Steiner; Yang Li; Daniel Keysers; Anurag Arnab; Yuanzhong Xu; Keran Rong; Alexander Kolesnikov; Mojtaba Seyedhosseini; Anelia Angelova; Xiaohua Zhai; Neil Houlsby; Radu Soricut; |
135 | Learning Coupled Dictionaries from Unpaired Data for Image Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we circumvent the difficulty of image generation and propose an alternative to build the connection between unpaired images in a compact proxy space. |
Longguang Wang; Juncheng Li; Yingqian Wang; Qingyong Hu; Yulan Guo; |
136 | Feedback-Guided Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In contrast learning in humans often involves additional detailed guidance throughout the interactive learning process i.e. where feedback often via language provides detailed information as to which part of their trial was performed incorrectly or suboptimally and why. Motivated by this observation we introduce an efficient feedback-based framework for improving behavior-cloning-based training of sensorimotor driving agents. |
Jimuyang Zhang; Zanming Huang; Arijit Ray; Eshed Ohn-Bar; |
137 | Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we present Omni-SMoLA a multimodal architecture that mixes many multi-modal experts efficiently and achieves both high specialist and generalist performance. |
Jialin Wu; Xia Hu; Yaqing Wang; Bo Pang; Radu Soricut; |
138 | Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In The Wild Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce SUPIR (Scaling-UP Image Restoration) a groundbreaking image restoration method that harnesses a generative prior and the power of model scaling. |
Fanghua Yu; Jinjin Gu; Zheyuan Li; Jinfan Hu; Xiangtao Kong; Xintao Wang; Jingwen He; Yu Qiao; Chao Dong; |
139 | Self-correcting LLM-controlled Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In contrast to existing models that aim to generate images only with their best effort we introduce Self-correcting LLM-controlled Diffusion (SLD). |
Tsung-Han Wu; Long Lian; Joseph E. Gonzalez; Boyi Li; Trevor Darrell; |
140 | See Say and Segment: Teaching LMMs to Overcome False Premises Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We observe that existing methods that fine-tune an LMM to segment images significantly degrade their ability to reliably determine ("see") if an object is present and to interact naturally with humans ("say") a form of catastrophic forgetting. In this work we propose a cascading and joint training approach for LMMs to solve this task avoiding catastrophic forgetting of previous skills. |
Tsung-Han Wu; Giscard Biamby; David Chan; Lisa Dunlap; Ritwik Gupta; Xudong Wang; Joseph E. Gonzalez; Trevor Darrell; |
141 | 4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Here we introduce hybrid score distillation sampling an alternating optimization procedure that blends supervision signals from multiple pre-trained diffusion models and incorporates benefits of each for high-fidelity text-to-4D generation. |
Sherwin Bahmani; Ivan Skorokhodov; Victor Rong; Gordon Wetzstein; Leonidas Guibas; Peter Wonka; Sergey Tulyakov; Jeong Joon Park; Andrea Tagliasacchi; David B. Lindell; |
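The hybrid score distillation scheme above alternates which pretrained diffusion model supervises the shared 4D representation. A minimal sketch of that alternation follows; the three `sds_grad_*` callables stand in for score-distillation gradients from 3D-aware, text-to-image and video diffusion models, and the round-robin schedule and plain gradient step are our assumptions rather than the paper's exact recipe.

```python
def hybrid_sds_step(params, step, sds_grad_3d, sds_grad_t2i, sds_grad_video, lr=1e-2):
    """One optimization step of hybrid score distillation sampling (sketch).

    Each *_grad callable is assumed to return d(loss)/d(params) obtained by
    score distillation against one pretrained diffusion model.
    """
    sources = [sds_grad_3d, sds_grad_t2i, sds_grad_video]
    grad = sources[step % len(sources)](params)  # alternate the supervisor
    return params - lr * grad
```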
142 | Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work we introduce a semantic panel as the middleware in decoding texts to images supporting the generator to better follow instructions. |
Yutong Feng; Biao Gong; Di Chen; Yujun Shen; Yu Liu; Jingren Zhou; |
143 | OneLLM: One Framework to Align All Modalities with Language Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we present OneLLM an MLLM that aligns eight modalities to language using a unified framework. |
Jiaming Han; Kaixiong Gong; Yiyuan Zhang; Jiaqi Wang; Kaipeng Zhang; Dahua Lin; Yu Qiao; Peng Gao; Xiangyu Yue; |
144 | EvalCrafter: Benchmarking and Evaluating Large Video Generation Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Thus we propose a novel framework and pipeline for exhaustively evaluating the performance of the generated videos. |
Yaofang Liu; Xiaodong Cun; Xuebo Liu; Xintao Wang; Yong Zhang; Haoxin Chen; Yang Liu; Tieyong Zeng; Raymond Chan; Ying Shan; |
145 | Neural Implicit Representation for Building Digital Twins of Unknown Articulated Objects Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We address the problem of building digital twins of unknown articulated objects from two RGBD scans of the object at different articulation states. |
Yijia Weng; Bowen Wen; Jonathan Tremblay; Valts Blukis; Dieter Fox; Leonidas Guibas; Stan Birchfield; |
146 | Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present Unified-IO 2 a multimodal and multi-skill unified model capable of following novel instructions. |
Jiasen Lu; Christopher Clark; Sangho Lee; Zichen Zhang; Savya Khosla; Ryan Marten; Derek Hoiem; Aniruddha Kembhavi; |
147 | Diffusion Models Without Attention Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Current methods such as patchifying expedite processes in UNet and Transformer architectures but at the expense of representational capacity. Addressing this we introduce the Diffusion State Space Model (DiffuSSM) an architecture that supplants attention mechanisms with a more scalable state space model backbone. |
Jing Nathan Yan; Jiatao Gu; Alexander M. Rush; |
148 | Sieve: Multimodal Dataset Pruning Using Image Captioning Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose a pruning signal Sieve that employs synthetic captions generated by image-captioning models pretrained on small diverse and well-aligned image-text pairs to evaluate the alignment of noisy image-text pairs. |
Anas Mahmoud; Mostafa Elhoushi; Amro Abbas; Yu Yang; Newsha Ardalani; Hugh Leather; Ari S. Morcos; |
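As a rough illustration of the pruning signal just described: generate a clean synthetic caption for each image with a pretrained captioner, score how well it agrees with the noisy web text, and keep only the best-aligned pairs. Both helper callables below are hypothetical stand-ins, not Sieve's released components.

```python
def sieve_scores(pairs, generate_caption, text_similarity):
    """Alignment scores for noisy image-text pairs (sketch).

    pairs: iterable of (image, alt_text)
    generate_caption(image) -> str    # pretrained image captioner (assumed)
    text_similarity(a, b)  -> float   # text-text alignment score (assumed)
    """
    return [text_similarity(generate_caption(img), alt) for img, alt in pairs]

# Pruning then keeps the pairs whose score clears a chosen threshold, e.g.:
# kept = [p for p, s in zip(pairs, scores) if s >= threshold]
```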
149 | LangSplat: 3D Language Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper introduces LangSplat which constructs a 3D language field that enables precise and efficient open-vocabulary querying within 3D spaces. |
Minghan Qin; Wanhua Li; Jiawei Zhou; Haoqian Wang; Hanspeter Pfister; |
150 | Expandable Subspace Ensemble for Pre-Trained Model-Based Class-Incremental Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we propose ExpAndable Subspace Ensemble (EASE) for PTM-based CIL. |
Da-Wei Zhou; Hai-Long Sun; Han-Jia Ye; De-Chuan Zhan; |
151 | Digital Life Project: Autonomous 3D Characters with Social Intelligence Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we present Digital Life Project a framework utilizing language as the universal medium to build autonomous 3D characters who are capable of engaging in social interactions and expressing with articulated body motions thereby simulating life in a digital environment. |
Zhongang Cai; Jianping Jiang; Zhongfei Qing; Xinying Guo; Mingyuan Zhang; Zhengyu Lin; Haiyi Mei; Chen Wei; Ruisi Wang; Wanqi Yin; Liang Pan; Xiangyu Fan; Han Du; Peng Gao; Zhitao Yang; Yang Gao; Jiaqi Li; Tianxiang Ren; Yukun Wei; Xiaogang Wang; Chen Change Loy; Lei Yang; Ziwei Liu; |
152 | DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we introduce DetCLIPv3 a high-performing detector that excels not only at open-vocabulary object detection but also at generating hierarchical labels for detected objects. |
Lewei Yao; Renjie Pi; Jianhua Han; Xiaodan Liang; Hang Xu; Wei Zhang; Zhenguo Li; Dan Xu; |
153 | SaCo Loss: Sample-wise Affinity Consistency for Vision-Language Pre-training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We discover that the lack of consideration for sample-wise affinity consistency across modalities in existing training objectives is the central cause. To address this problem we propose a novel loss function named Sample-wise affinity Consistency (SaCo) loss which is designed to enhance such consistency by minimizing the distance between image embedding similarity and text embedding similarity for any two samples. |
Sitong Wu; Haoru Tan; Zhuotao Tian; Yukang Chen; Xiaojuan Qi; Jiaya Jia; |
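The sentence above specifies the loss completely enough to sketch in PyTorch; the use of cosine similarity for the affinities and an L1 distance between the two affinity matrices are our assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def saco_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    """Sample-wise affinity consistency (sketch): for any two samples the
    image-image similarity should match the text-text similarity.
    img_emb, txt_emb: (N, D) embeddings from the two encoders."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    sim_img = img @ img.t()  # (N, N) image-side affinity matrix
    sim_txt = txt @ txt.t()  # (N, N) text-side affinity matrix
    return (sim_img - sim_txt).abs().mean()  # minimize their distance
```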
154 | Grounded Question-Answering in Long Egocentric Videos Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we delve into open-ended question-answering (QA) in long egocentric videos which allows individuals or robots to inquire about their own past visual experiences. |
Shangzhe Di; Weidi Xie; |
155 | VBench: Comprehensive Benchmark Suite for Video Generative Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end we present VBench a comprehensive benchmark suite that dissects "video generation quality" into specific hierarchical and disentangled dimensions each with tailored prompts and evaluation methods. |
Ziqi Huang; Yinan He; Jiashuo Yu; Fan Zhang; Chenyang Si; Yuming Jiang; Yuanhan Zhang; Tianxing Wu; Qingyang Jin; Nattapol Chanpaisit; Yaohui Wang; Xinyuan Chen; Limin Wang; Dahua Lin; Yu Qiao; Ziwei Liu; |
156 | Adversarial Distillation Based on Slack Matching and Attribution Region Alignment Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: During the training process we facilitate the student model in better understanding the teacher model’s behavior by aligning the attribution region that the student model focuses on with that of the teacher model. Concurrently we relax the condition of exact matching in KL divergence and replace it with a more flexible matching criterion thereby enhancing the model’s robustness. |
Shenglin Yin; Zhen Xiao; Mingxuan Song; Jieyi Long; |
157 | LucidDreamer: Towards High-Fidelity Text-to-3D Generation Via Interval Score Matching Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper identifies a notable deficiency in SDS: it brings an inconsistent and low-quality updating direction for the 3D model causing the over-smoothing effect. To address this we propose a novel approach called Interval Score Matching (ISM). |
Yixun Liang; Xin Yang; Jiantao Lin; Haodong Li; Xiaogang Xu; Yingcong Chen; |
158 | FutureHuman3D: Forecasting Complex Long-Term 3D Human Behavior from Video Observations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a generative approach to forecast long-term future human behavior in 3D requiring only weak supervision from readily available 2D human action data. |
Christian Diller; Thomas Funkhouser; Angela Dai; |
159 | CG-HOI: Contact-Guided 3D Human-Object Interaction Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose CG-HOI the first method to address the task of generating dynamic 3D human-object interactions (HOIs) from text. |
Christian Diller; Angela Dai; |
160 | VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present VideoCutLER a simple method for unsupervised multi-instance video segmentation without using motion-based learning signals like optical flow or training on natural videos. |
Xudong Wang; Ishan Misra; Ziyun Zeng; Rohit Girdhar; Trevor Darrell; |
161 | Unsupervised Universal Image Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose an Unsupervised Universal Segmentation model (U2Seg) adept at performing various image segmentation tasks—instance semantic and panoptic—using a novel unified framework. |
Dantong Niu; Xudong Wang; Xinyang Han; Long Lian; Roei Herzig; Trevor Darrell; |
162 | VCoder: Versatile Vision Encoders for Multimodal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Working towards developing an accurate MLLM system for perception and reasoning we propose using Versatile vision enCoders (VCoder) as perception eyes for Multimodal LLMs. |
Jitesh Jain; Jianwei Yang; Humphrey Shi; |
163 | Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: While current MLLMs demonstrate primary low-level visual abilities from the identification of low-level visual attributes (e.g. clarity brightness) to the evaluation of image quality there’s still an imperative to further improve the accuracy of MLLMs to substantially alleviate human burdens. To address this we collect the first dataset consisting of human natural language feedback on low-level vision. |
Haoning Wu; Zicheng Zhang; Erli Zhang; Chaofeng Chen; Liang Liao; Annan Wang; Kaixin Xu; Chunyi Li; Jingwen Hou; Guangtao Zhai; Geng Xue; Wenxiu Sun; Qiong Yan; Weisi Lin; |
164 | Intelligent Grimm – Open-ended Visual Storytelling Via Latent Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work we focus on a novel yet challenging task of generating a coherent image sequence based on a given storyline denoted as open-ended visual storytelling. |
Chang Liu; Haoning Wu; Yujie Zhong; Xiaoyun Zhang; Yanfeng Wang; Weidi Xie; |
165 | PrPSeg: Universal Proposition Learning for Panoramic Renal Pathology Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this research we introduce a novel universal proposition learning approach called panoramic renal pathology segmentation (PrPSeg) designed to segment comprehensively panoramic structures within kidney by integrating extensive knowledge of kidney anatomy. |
Ruining Deng; Quan Liu; Can Cui; Tianyuan Yao; Jialin Yue; Juming Xiong; Lining Yu; Yifei Wu; Mengmeng Yin; Yu Wang; Shilin Zhao; Yucheng Tang; Haichun Yang; Yuankai Huo; |
166 | Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Based on this framework we propose Prompt-driven Normalization which adapts the model to different datasets with domain-specific prompts and Language-guided Categorical Alignment that decently unifies the multiple-dataset label spaces by leveraging the relationship between label text. |
Xiaoyang Wu; Zhuotao Tian; Xin Wen; Bohao Peng; Xihui Liu; Kaicheng Yu; Hengshuang Zhao; |
167 | Point Transformer V3: Simpler Faster Stronger Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Drawing inspiration from recent advances in 3D large-scale representation learning we recognize that model performance is more influenced by scale than by intricate design. Therefore we present Point Transformer V3 (PTv3) which prioritizes simplicity and efficiency over the accuracy of certain mechanisms that are minor to the overall performance after scaling such as replacing the precise neighbor search by KNN with an efficient serialized neighbor mapping of point clouds organized with specific patterns. |
Xiaoyang Wu; Li Jiang; Peng-Shuai Wang; Zhijian Liu; Xihui Liu; Yu Qiao; Wanli Ouyang; Tong He; Hengshuang Zhao; |
168 | SHINOBI: Shape and Illumination Using Neural Object Decomposition Via BRDF Optimization In-the-wild Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present SHINOBI an end-to-end framework for the reconstruction of shape material and illumination from object images captured with varying lighting pose and background. |
Andreas Engelhardt; Amit Raj; Mark Boss; Yunzhi Zhang; Abhishek Kar; Yuanzhen Li; Deqing Sun; Ricardo Martin Brualla; Jonathan T. Barron; Hendrik P. A. Lensch; Varun Jampani; |
169 | DiffusionGAN3D: Boosting Text-guided 3D Generation and Domain Adaptation By Combining 3D GANs and Diffusion Priors Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we propose a novel framework DiffusionGAN3D which boosts text-guided 3D domain adaptation and generation by combining 3D GANs and diffusion priors. |
Biwen Lei; Kai Yu; Mengyang Feng; Miaomiao Cui; Xuansong Xie; |
170 | FreeU: Free Lunch in Diffusion U-Net Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we uncover the untapped potential of diffusion U-Net which serves as a "free lunch" that substantially improves the generation quality on the fly. |
Chenyang Si; Ziqi Huang; Yuming Jiang; Ziwei Liu; |
171 | GenZI: Zero-Shot 3D Human-Scene Interaction Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose GenZI the first zero-shot approach to generating 3D human-scene interactions. |
Lei Li; Angela Dai; |
172 | Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we introduce a novel approach for single-view reconstruction that efficiently generates a 3D model from a single image via feed-forward inference. |
Zi-Xin Zou; Zhipeng Yu; Yuan-Chen Guo; Yangguang Li; Ding Liang; Yan-Pei Cao; Song-Hai Zhang; |
173 | ZeroNVS: Zero-Shot 360-Degree View Synthesis from A Single Image Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce a 3D-aware diffusion model ZeroNVS for single-image novel view synthesis for in-the-wild scenes. |
Kyle Sargent; Zizhang Li; Tanmay Shah; Charles Herrmann; Hong-Xing Yu; Yunzhi Zhang; Eric Ryan Chan; Dmitry Lagun; Li Fei-Fei; Deqing Sun; Jiajun Wu; |
174 | Three Pillars Improving Vision Foundation Model Distillation for Lidar Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work instead of focusing only on the distillation method we study the effect of three pillars for distillation: the 3D backbone the pretrained 2D backbone and the pretraining 2D+3D dataset. |
Gilles Puy; Spyros Gidaris; Alexandre Boulch; Oriane Siméoni; Corentin Sautier; Patrick Pérez; Andrei Bursuc; Renaud Marlet; |
175 | Shadows Don’t Lie and Lines Can’t Bend! Generative Models Don’t Know Projective Geometry…for Now Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper demonstrates that generated images have geometric features different from those of real images. |
Ayush Sarkar; Hanlin Mai; Amitabh Mahapatra; Svetlana Lazebnik; D.A. Forsyth; Anand Bhattad; |
176 | Prompt Highlighter: Interactive Control for Multi-Modal LLMs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: While manipulating prompt formats could improve outputs designing specific and precise prompts per task can be challenging and ineffective. To tackle this issue we introduce a novel inference method Prompt Highlighter which enables users to highlight specific prompt spans to interactively control the focus during generation. |
Yuechen Zhang; Shengju Qian; Bohao Peng; Shu Liu; Jiaya Jia; |
177 | IQ-VFI: Implicit Quadratic Motion Estimation for Video Frame Interpolation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end we propose a novel framework for implicit quadratic video frame interpolation (IQ-VFI) which explores latent acceleration information and accurate intermediate motions via knowledge distillation. |
Mengshun Hu; Kui Jiang; Zhihang Zhong; Zheng Wang; Yinqiang Zheng; |
178 | Symphonize 3D Semantic Scene Completion with Contextual Instance Queries Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we present a novel paradigm termed Symphonies (Scene-from-Insts) that delves into the integration of instance queries to orchestrate 2D-to-3D reconstruction and 3D scene modeling. |
Haoyi Jiang; Tianheng Cheng; Naiyu Gao; Haoyang Zhang; Tianwei Lin; Wenyu Liu; Xinggang Wang; |
179 | GPT4Point: A Unified Framework for Point-Language Understanding and Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Multimodal Large Language Models (MLLMs) have excelled in 2D image-text comprehension and image generation but their understanding of the 3D world is notably deficient limiting progress in 3D language understanding and generation. To solve this problem we introduce GPT4Point a groundbreaking point-language multimodal model designed specifically for unified 3D object understanding and generation within the MLLM framework. |
Zhangyang Qi; Ye Fang; Zeyi Sun; Xiaoyang Wu; Tong Wu; Jiaqi Wang; Dahua Lin; Hengshuang Zhao; |
180 | Instruct-Imagen: Image Generation with Multi-modal Instruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper presents Instruct-Imagen a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. |
Hexiang Hu; Kelvin C.K. Chan; Yu-Chuan Su; Wenhu Chen; Yandong Li; Kihyuk Sohn; Yang Zhao; Xue Ben; Boqing Gong; William Cohen; Ming-Wei Chang; Xuhui Jia; |
181 | HDRFlow: Real-Time HDR Video Reconstruction with Large Motions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However they often struggle to handle large complex motions and are computationally expensive. To address these challenges we propose a robust and efficient flow estimator tailored for real-time HDR video reconstruction named HDRFlow. |
Gangwei Xu; Yujin Wang; Jinwei Gu; Tianfan Xue; Xin Yang; |
182 | GroupContrast: Semantic-aware Self-supervised Representation Learning for 3D Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However this approach often results in semantically identical points having dissimilar representations leading to a high number of false negatives and introducing a semantic conflict problem. To address this issue we propose GroupContrast a novel approach that combines segment grouping and semantic-aware contrastive learning. |
Chengyao Wang; Li Jiang; Xiaoyang Wu; Zhuotao Tian; Bohao Peng; Hengshuang Zhao; Jiaya Jia; |
183 | Seeing The World Through Your Eyes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we reconstruct a radiance field beyond the camera’s line of sight using portrait images containing eye reflections. |
Hadi Alzayer; Kevin Zhang; Brandon Feng; Christopher A. Metzler; Jia-Bin Huang; |
184 | One-Shot Open Affordance Learning with Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce One-shot Open Affordance Learning (OOAL) where a model is trained with just one example per base object category but is expected to identify novel objects and affordances. |
Gen Li; Deqing Sun; Laura Sevilla-Lara; Varun Jampani; |
185 | FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we present FreeControl a training-free approach for controllable T2I generation that supports multiple conditions architectures and checkpoints simultaneously. |
Sicheng Mo; Fangzhou Mu; Kuan Heng Lin; Yanli Liu; Bochen Guan; Yin Li; Bolei Zhou; |
186 | Language-only Training of Zero-shot Composed Image Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However the existing ZS-CIR methods show limited backbone scalability and generalizability due to the lack of diversity of the input texts during training. We propose a novel CIR framework only using language for its training. |
Geonmo Gu; Sanghyuk Chun; Wonjae Kim; Yoohoon Kang; Sangdoo Yun; |
187 | UniMODE: Unified Monocular 3D Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However involving various scenarios of data to train models poses challenges due to their significantly different characteristics e.g. diverse geometry properties and heterogeneous domain distributions. To address these challenges we build a detector based on the bird’s-eye-view (BEV) detection paradigm where the explicit feature projection is beneficial to addressing the geometry learning ambiguity when employing multiple scenarios of data to train detectors. |
Zhuoling Li; Xiaogang Xu; SerNam Lim; Hengshuang Zhao; |
188 | FlowTrack: Revisiting Optical Flow for Long-Range Dense Tracking Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Conversely recent advancements in long-range trackers offer extended temporal coverage but at the cost of spatial sparsity. This paper introduces FlowTrack a novel framework designed to bridge this gap. |
Seokju Cho; Jiahui Huang; Seungryong Kim; Joon-Young Lee; |
189 | SPIN: Simultaneous Perception Interaction and Navigation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This causes several limitations such as compounding errors delays in decision-making and no whole-body coordination. In this work we present a reactive mobile manipulation framework that uses an active visual system to consciously perceive and react to its environment. |
Shagun Uppal; Ananye Agarwal; Haoyu Xiong; Kenneth Shaw; Deepak Pathak; |
190 | Generative Multimodal Models Are In-Context Learners Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work we demonstrate that by effectively scaling up generative multimodal models their task-agnostic in-context learning capabilities can be significantly enhanced. |
Quan Sun; Yufeng Cui; Xiaosong Zhang; Fan Zhang; Qiying Yu; Yueze Wang; Yongming Rao; Jingjing Liu; Tiejun Huang; Xinlong Wang; |
191 | UnScene3D: Unsupervised 3D Instance Segmentation for Indoor Scenes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose UnScene3D the first fully unsupervised 3D learning approach for class-agnostic 3D instance segmentation of indoor scans. |
David Rozenberszki; Or Litany; Angela Dai; |
192 | FakeInversion: Learning to Detect Images from Unseen Text-to-Image Models By Inverting Stable Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we propose a new synthetic image detector that uses features obtained by inverting an open-source pre-trained Stable Diffusion model. |
George Cazenavette; Avneesh Sud; Thomas Leung; Ben Usman; |
193 | GaussianAvatar: Towards Realistic Human Avatar Modeling from A Single Video Via Animatable 3D Gaussians Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present GaussianAvatar an efficient approach to creating realistic human avatars with dynamic 3D appearances from a single video. |
Liangxiao Hu; Hongwen Zhang; Yuxiang Zhang; Boyao Zhou; Boning Liu; Shengping Zhang; Liqiang Nie; |
194 | Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work we expose the non-smoothness of diffusion latent spaces by observing noticeable visual fluctuations resulting from minor latent variations. |
Jiayi Guo; Xingqian Xu; Yifan Pu; Zanlin Ni; Chaofei Wang; Manushree Vasu; Shiji Song; Gao Huang; Humphrey Shi; |
195 | SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Current instruction-based image editing methods such as InstructPix2Pix often fail to produce satisfactory results in complex scenarios due to their dependence on the simple CLIP text encoder in diffusion models. To rectify this, this paper introduces SmartEdit a novel approach to instruction-based image editing that leverages Multimodal Large Language Models (MLLMs) to enhance its understanding and reasoning capabilities. |
Yuzhou Huang; Liangbin Xie; Xintao Wang; Ziyang Yuan; Xiaodong Cun; Yixiao Ge; Jiantao Zhou; Chao Dong; Rui Huang; Ruimao Zhang; Ying Shan; |
196 | Mip-Splatting: Alias-free 3D Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We find that the source for this phenomenon can be attributed to the lack of 3D frequency constraints and the usage of a 2D dilation filter. To address this problem we introduce a 3D smoothing filter which constrains the size of the 3D Gaussian primitives based on the maximal sampling frequency induced by the input views. |
Zehao Yu; Anpei Chen; Binbin Huang; Torsten Sattler; Andreas Geiger; |
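The 3D smoothing filter amounts to convolving each Gaussian with an isotropic low-pass Gaussian whose width is tied to the finest sampling interval any training view induces on that primitive; since convolving Gaussians adds their covariances, a NumPy sketch is one line. The per-primitive sampling interval T and the width fraction k are assumed inputs here, not the paper's exact parameterization.

```python
import numpy as np

def smooth_covariance(cov: np.ndarray, T: float, k: float = 0.2) -> np.ndarray:
    """Low-pass a single 3D Gaussian primitive (sketch).

    cov: (3, 3) covariance of the Gaussian.
    T:   sampling interval at the primitive, i.e. the reciprocal of the
         maximal sampling frequency induced by the training views (assumed).
    k:   filter width as a fraction of T (a free hyperparameter here).
    Adding the filter covariance guarantees the primitive can never be
    rendered narrower than the sampling limit allows.
    """
    return cov + (k * T) ** 2 * np.eye(3)
```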
197 | Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work we propose to decouple video-level referring expression understanding into static and motion perception with a specific emphasis on enhancing temporal comprehension. |
Shuting He; Henghui Ding; |
198 | Morphological Prototyping for Unsupervised Slide Representation Learning in Computational Pathology Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end we introduce PANTHER a prototype-based approach rooted in the Gaussian mixture model that summarizes the set of WSI patches into a much smaller set of morphological prototypes. |
Andrew H. Song; Richard J. Chen; Tong Ding; Drew F.K. Williamson; Guillaume Jaume; Faisal Mahmood; |
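Since the highlight names the Gaussian mixture model explicitly, the summarization step can be sketched with scikit-learn; reading off the mixture parameters as the slide representation is our simplification of the paper's prototype construction.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def slide_representation(patch_emb: np.ndarray, n_prototypes: int = 16) -> np.ndarray:
    """Summarize a bag of WSI patch embeddings (P, D) into a fixed-size
    vector of GMM prototype statistics (mixture weights and means)."""
    gmm = GaussianMixture(n_components=n_prototypes,
                          covariance_type="diag", random_state=0).fit(patch_emb)
    return np.concatenate([gmm.weights_, gmm.means_.ravel()])
```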
199 | SkillDiffuser: Interpretable Hierarchical Planning Via Skill Abstractions in Diffusion-Based Task Execution Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However generating coherent trajectories from high-level instructions remains challenging especially for long-range composition tasks requiring multiple sequential skills. We propose SkillDiffuser an end-to-end hierarchical planning framework integrating interpretable skill learning with conditional diffusion planning to address this problem. |
Zhixuan Liang; Yao Mu; Hengbo Ma; Masayoshi Tomizuka; Mingyu Ding; Ping Luo; |
200 | RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose a lightweight and scalable Regional Point-Language Contrastive learning framework namely RegionPLC for open-world 3D scene understanding aiming to identify and recognize open-set objects and categories. |
Jihan Yang; Runyu Ding; Weipeng Deng; Zhe Wang; Xiaojuan Qi; |
201 | Amodal Ground Truth and Completion in The Wild Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper studies amodal image segmentation: predicting entire object segmentation masks including both visible and invisible (occluded) parts. |
Guanqi Zhan; Chuanxia Zheng; Weidi Xie; Andrew Zisserman; |
202 | VILA: On Pre-training for Visual Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work we examine the design options for VLM pre-training by augmenting LLM towards VLM through step-by-step controllable comparisons. |
Ji Lin; Hongxu Yin; Wei Ping; Pavlo Molchanov; Mohammad Shoeybi; Song Han; |
203 | Rethinking The Objectives of Vector-Quantized Tokenizers for Image Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we find that improving the reconstruction fidelity of VQ tokenizers does not necessarily improve the generation. |
Yuchao Gu; Xintao Wang; Yixiao Ge; Ying Shan; Mike Zheng Shou; |
204 | SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we propose SelfOcc to explore a self-supervised way to learn 3D occupancy using only video sequences. |
Yuanhui Huang; Wenzhao Zheng; Borui Zhang; Jie Zhou; Jiwen Lu; |
205 | G-HOP: Generative Hand-Object Prior for Interaction Reconstruction and Grasp Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose G-HOP a denoising diffusion based generative prior for hand-object interactions that allows modeling both the 3D object and a human hand conditioned on the object category. |
Yufei Ye; Abhinav Gupta; Kris Kitani; Shubham Tulsiani; |
206 | CityDreamer: Compositional Generative Model of Unbounded 3D Cities Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Additionally generating 3D cities is more complex than 3D natural scenes since buildings as objects of the same class exhibit a wider range of appearances compared to the relatively consistent appearance of objects like trees in natural scenes. To address these challenges we propose CityDreamer a compositional generative model designed specifically for unbounded 3D cities. |
Haozhe Xie; Zhaoxi Chen; Fangzhou Hong; Ziwei Liu; |
207 | Zero-Painter: Training-Free Layout Control for Text-to-Image Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present Zero-Painter a novel training-free framework for layout-conditional text-to-image synthesis that facilitates the creation of detailed and controlled imagery from textual prompts. |
Marianna Ohanyan; Hayk Manukyan; Zhangyang Wang; Shant Navasardyan; Humphrey Shi; |
208 | Brush2Prompt: Contextual Prompt Generator for Object Inpainting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we propose a prompt suggestion model to simplify the process of prompt input. |
Mang Tik Chiu; Yuqian Zhou; Lingzhi Zhang; Zhe Lin; Connelly Barnes; Sohrab Amirghodsi; Eli Shechtman; Humphrey Shi; |
209 | TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This work proposes TimeChat a time-sensitive multimodal large language model specifically designed for long video understanding. |
Shuhuai Ren; Linli Yao; Shicheng Li; Xu Sun; Lu Hou; |
210 | Multi-Scale 3D Gaussian Splatting for Anti-Aliased Rendering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The rendering is also drastically slowed down by the sequential alpha blending of more splatted Gaussians per pixel. To address these issues we propose a multi-scale 3D Gaussian splatting algorithm which maintains Gaussians at different scales to represent the same scene. |
Zhiwen Yan; Weng Fei Low; Yu Chen; Gim Hee Lee; |
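One way to read the highlight is as mipmap-style level selection over the per-scale Gaussian sets; the doubling rule below is that analogy made concrete, an assumption rather than the paper's exact criterion.

```python
import math

def select_level(distance: float, base_distance: float, n_levels: int) -> int:
    """Pick which scale level of Gaussians to splat for a view (sketch):
    the fine set up close, coarser aggregated sets as the camera zooms out,
    so fewer Gaussians are alpha-blended per pixel."""
    level = int(max(0.0, math.log2(max(distance / base_distance, 1.0))))
    return min(level, n_levels - 1)
```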
211 | End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we reduce the memory consumption for end-to-end training and manage to scale up the TAD backbone to 1 billion parameters and the input video to 1536 frames leading to significant detection performance. |
Shuming Liu; Chen-Lin Zhang; Chen Zhao; Bernard Ghanem; |
212 | Programmable Motion Generation for Open-Set Motion Control Tasks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In response to the complexity of practical motion control we propose and attempt to solve the open-set motion control problem. |
Hanchao Liu; Xiaohang Zhan; Shaoli Huang; Tai-Jiang Mu; Ying Shan; |
213 | AutoAD III: The Prequel – Back to The Pixels Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we make three contributions: (i) We propose two approaches for constructing AD datasets with aligned video data and build training and evaluation datasets using these. |
Tengda Han; Max Bain; Arsha Nagrani; Gül Varol; Weidi Xie; Andrew Zisserman; |
214 | NeuRAD: Neural Rendering for Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we propose NeuRAD a robust novel view synthesis method tailored to dynamic AD data. |
Adam Tonderski; Carl Lindström; Georg Hess; William Ljungbergh; Lennart Svensson; Christoffer Petersson; |
215 | Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we focus on the mainstream vision transformer incorporating patch features for patch-word alignment while addressing the resultant issue of visual patch redundancy and patch ambiguity for semantic alignment. |
Zheren Fu; Lei Zhang; Hou Xia; Zhendong Mao; |
216 | GeoChat: Grounded Large Vision-Language Model for Remote Sensing Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Furthermore the lack of domain-specific multimodal instruction following data as well as strong backbone models for RS makes it hard for the models to align their behavior with user queries. To address these limitations we propose GeoChat – the first versatile remote sensing VLM that offers multitask conversational capabilities with high-resolution RS images. |
Kartik Kuckreja; Muhammad Sohail Danish; Muzammal Naseer; Abhijit Das; Salman Khan; Fahad Shahbaz Khan; |
217 | Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we investigate strengthening the awareness of video dynamics for DMs for high-quality T2V generation. |
Hao Fei; Shengqiong Wu; Wei Ji; Hanwang Zhang; Tat-Seng Chua; |
218 | AnyDoor: Zero-shot Object-level Image Customization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This work presents AnyDoor a diffusion-based image generator with the power to teleport target objects to new scenes at user-specified locations with desired shapes. |
Xi Chen; Lianghua Huang; Yu Liu; Yujun Shen; Deli Zhao; Hengshuang Zhao; |
219 | RAM-Avatar: Real-time Photo-Realistic Avatar from Monocular Videos with Full-body Control Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To enable robust animation for out-of-distribution poses we propose a Motion Distribution Align module to compensate for the discrepancies between the training and testing motion distribution. |
Xiang Deng; Zerong Zheng; Yuxiang Zhang; Jingxiang Sun; Chao Xu; Xiaodong Yang; Lizhen Wang; Yebin Liu; |
220 | Control4D: Efficient 4D Portrait Editing with Text Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Control4D an innovative framework for editing dynamic 4D portraits using text instructions. |
Ruizhi Shao; Jingxiang Sun; Cheng Peng; Zerong Zheng; Boyao Zhou; Hongwen Zhang; Yebin Liu; |
221 | Can I Trust Your Answer? Visually Grounded Video Question Answering Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Experiments with different backbones demonstrate that this grounding mechanism improves both grounding and QA. With these efforts we aim to push towards trustworthy VLMs in VQA systems. |
Junbin Xiao; Angela Yao; Yicong Li; Tat-Seng Chua; |
222 | ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work we introduce generative models as a data source for synthesizing hard images that benchmark deep models’ robustness. |
Chenshuang Zhang; Fei Pan; Junmo Kim; In So Kweon; Chengzhi Mao; |
223 | Dr2Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we propose Dynamic Reversible Dual-Residual Networks or Dr2Net a novel family of network architectures that acts as a surrogate network to finetune a pretrained model with substantially reduced memory consumption. |
Chen Zhao; Shuming Liu; Karttikeya Mangalam; Guocheng Qian; Fatimah Zohra; Abdulmohsen Alghannam; Jitendra Malik; Bernard Ghanem; |
224 | A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To mitigate the memory bottleneck we systematically analyze the memory/accuracy trade-off of various efficient methods: factorized attention parameter-efficient image-to-video adaptation input masking and multi-resolution patchification. |
Pinelopi Papalampidi; Skanda Koppula; Shreya Pathak; Justin Chiu; Joe Heyward; Viorica Patraucean; Jiajun Shen; Antoine Miech; Andrew Zisserman; Aida Nematzadeh; |
225 | C3: High-Performance and Low-Complexity Neural Compression from A Single Image or Video Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Here we introduce C3 a neural compression method with strong rate-distortion (RD) performance that instead overfits a small model to each image or video separately. |
Hyunjik Kim; Matthias Bauer; Lucas Theis; Jonathan Richard Schwarz; Emilien Dupont; |
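C3's starting point, overfitting a small model to one signal, can be illustrated with a textbook coordinate-MLP in PyTorch; the architecture and plain MSE loss below are the generic version of this idea, not C3's actual model, which adds quantization and entropy coding for the rate term.

```python
import torch
import torch.nn as nn

def overfit_image(image: torch.Tensor, steps: int = 2000) -> nn.Module:
    """Overfit a tiny MLP f(x, y) -> RGB to one (H, W, 3) image in [0, 1].
    The trained weights are the compressed representation (before the
    quantization / entropy coding a real codec would add on top)."""
    H, W, _ = image.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing="ij")
    coords = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
    target = image.reshape(-1, 3)
    net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(),
                        nn.Linear(64, 64), nn.ReLU(),
                        nn.Linear(64, 3), nn.Sigmoid())
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((net(coords) - target) ** 2).mean()  # distortion only
        loss.backward()
        opt.step()
    return net
```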
226 | RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The intermittent nature of auditory signals further poses additional obstacles to inferring the goal information. To address this challenge we present the Reflective and Imaginative Language Agent (RILA). |
Zeyuan Yang; Jiageng Liu; Peihao Chen; Anoop Cherian; Tim K. Marks; Jonathan Le Roux; Chuang Gan; |
227 | ReconFusion: 3D Reconstruction with Diffusion Priors Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present ReconFusion to reconstruct real-world scenes using only a few photos. |
Rundi Wu; Ben Mildenhall; Philipp Henzler; Keunhong Park; Ruiqi Gao; Daniel Watson; Pratul P. Srinivasan; Dor Verbin; Jonathan T. Barron; Ben Poole; Aleksander Hołyński; |
228 | PhotoMaker: Customizing Realistic Human Photos Via Stacked ID Embedding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work we introduce PhotoMaker an efficient personalized text-to-image generation method which mainly encodes an arbitrary number of input ID images into a stack ID embedding for preserving ID information. |
Zhen Li; Mingdeng Cao; Xintao Wang; Zhongang Qi; Ming-Ming Cheng; Ying Shan; |
229 | Discriminative Probing and Tuning for Text-to-Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present a discriminative adapter built on T2I models to probe their discriminative abilities on two representative tasks and leverage discriminative fine-tuning to improve their text-image alignment. |
Leigang Qu; Wenjie Wang; Yongqi Li; Hanwang Zhang; Liqiang Nie; Tat-Seng Chua; |
230 | MAS: Multi-view Ancestral Sampling for 3D Motion Generation Using 2D Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Multi-view Ancestral Sampling (MAS) a method for 3D motion generation using 2D diffusion models that were trained on motions obtained from in-the-wild videos. |
Roy Kapon; Guy Tevet; Daniel Cohen-Or; Amit H. Bermano; |
231 | Pseudo Label Refinery for Unsupervised Domain Adaptation on Cross-dataset 3D Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Previous techniques mitigate this by reweighting these boxes as pseudo labels but these boxes can still poison the training process. To resolve this problem in this paper we propose a novel pseudo label refinery framework. |
Zhanwei Zhang; Minghao Chen; Shuai Xiao; Liang Peng; Hengjia Li; Binbin Lin; Ping Li; Wenxiao Wang; Boxi Wu; Deng Cai; |
232 | CapsFusion: Rethinking Image-Text Data at Scale Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To provide higher-quality and more scalable multimodal pretraining data we propose CapsFusion an advanced framework that leverages large language models to consolidate and refine information from both web-based image-text pairs and synthetic captions. |
Qiying Yu; Quan Sun; Xiaosong Zhang; Yufeng Cui; Fan Zhang; Yue Cao; Xinlong Wang; Jingjing Liu; |
233 | One-Prompt to Segment All Medical Images Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper introduces a new paradigm toward the universal medical image segmentation termed ‘One-Prompt Segmentation.’ |
Junde Wu; Min Xu; |
234 | Learning Occupancy for Monocular 3D Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we propose OccupancyM3D a method of learning occupancy for monocular 3D detection. |
Liang Peng; Junkai Xu; Haoran Cheng; Zheng Yang; Xiaopei Wu; Wei Qian; Wenxiao Wang; Boxi Wu; Deng Cai; |
235 | TCP: Textual-based Class-aware Prompt Tuning for Visual-Language Model Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However those textual tokens have a limited generalization ability regarding unseen domains as they cannot dynamically adjust to the distribution of testing classes. To tackle this issue we present a novel Textual-based Class-aware Prompt tuning (TCP) that explicitly incorporates prior knowledge about classes to enhance their discriminability.
Hantao Yao; Rui Zhang; Changsheng Xu; |
236 | Towards Automated Movie Trailer Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However the process of creating trailers can be time-consuming and expensive. To streamline this process we propose an automatic trailer generation framework that generates plausible trailers from a full movie by automating shot selection and composition. |
Dawit Mureja Argaw; Mattia Soldan; Alejandro Pardo; Chen Zhao; Fabian Caba Heilbron; Joon Son Chung; Bernard Ghanem; |
237 | Link-Context Learning for Multimodal LLMs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work we propose link-context learning (LCL) which emphasizes "reasoning from cause and effect" to augment the learning capabilities of MLLMs. |
Yan Tai; Weichen Fan; Zhao Zhang; Ziwei Liu; |
238 | CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work we introduce a novel cost-based approach to adapt vision-language foundation models notably CLIP for the intricate task of semantic segmentation. |
Seokju Cho; Heeseong Shin; Sunghwan Hong; Anurag Arnab; Paul Hongsuck Seo; Seungryong Kim; |
239 | Federated Online Adaptation for Deep Stereo Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce a novel approach for adapting deep stereo networks in a collaborative manner. |
Matteo Poggi; Fabio Tosi; |
240 | Low-Rank Approximation for Sparse Attention in Multi-Modal LLMs Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper focuses on the high computational complexity in Large Language Models (LLMs) a significant challenge in both natural language processing (NLP) and multi-modal tasks. We propose Low-Rank Approximation for Sparse Attention (LoRA-Sparse) an innovative approach that strategically reduces this complexity.
Lin Song; Yukang Chen; Shuai Yang; Xiaohan Ding; Yixiao Ge; Ying-Cong Chen; Ying Shan; |
241 | 3DSFLabelling: Boosting 3D Scene Flow Estimation By Pseudo Auto-labelling Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present a novel approach from the perspective of auto-labelling aiming to generate a large number of 3D scene flow pseudo labels for real-world LiDAR point clouds. |
Chaokang Jiang; Guangming Wang; Jiuming Liu; Hesheng Wang; Zhuang Ma; Zhenqiang Liu; Zhujin Liang; Yi Shan; Dalong Du; |
242 | SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose a method to allow precise and extremely fast mesh extraction from 3D Gaussian Splatting. |
Antoine Guédon; Vincent Lepetit; |
243 | PerceptionGPT: Effectively Fusing Visual Perception Into LLM Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we present a novel end-to-end framework named PerceptionGPT which represents the perception signals using the LLM’s dynamic token embedding.
Renjie Pi; Lewei Yao; Jiahui Gao; Jipeng Zhang; Tong Zhang; |
244 | FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we introduce FRESCO, which combines intra-frame correspondence with inter-frame correspondence to establish a more robust spatial-temporal constraint.
Shuai Yang; Yifan Zhou; Ziwei Liu; Chen Change Loy; |
245 | WonderJourney: Going from Anywhere to Everywhere Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce WonderJourney a modular framework for perpetual 3D scene generation. |
Hong-Xing Yu; Haoyi Duan; Junhwa Hur; Kyle Sargent; Michael Rubinstein; William T. Freeman; Forrester Cole; Deqing Sun; Noah Snavely; Jiajun Wu; Charles Herrmann; |
246 | SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The reasoning process is required to understand and apply situated knowledge and general knowledge for problem-solving. To create such a dataset we propose an automatic and scalable generation method to generate question-answer pairs, knowledge graphs, and rationales by instructing the combinations of LLMs and MLLMs.
Andong Wang; Bo Wu; Sunli Chen; Zhenfang Chen; Haotian Guan; Wei-Ning Lee; Li Erran Li; Chuang Gan; |
247 | HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However most existing methods for hand-object reconstruction from RGB either assume pre-scanned object templates or heavily rely on limited 3D hand-object data restricting their ability to scale and generalize to more unconstrained interaction settings. To address this we introduce HOLD — the first category-agnostic method that reconstructs an articulated hand and an object jointly from a monocular interaction video. |
Zicong Fan; Maria Parelli; Maria Eleni Kadoglou; Xu Chen; Muhammed Kocabas; Michael J. Black; Otmar Hilliges; |
248 | 3D Paintbrush: Local Stylization of 3D Shapes with Cascaded Score Distillation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present 3D Paintbrush a technique for automatically texturing local semantic regions on meshes via text descriptions. |
Dale Decatur; Itai Lang; Kfir Aberman; Rana Hanocka; |
249 | OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In particular we propose to incorporate LLMs and open-vocabulary detectors to distill key information and establish correspondence between multi-modal signals. |
Ganlong Zhao; Guanbin Li; Weikai Chen; Yizhou Yu; |
250 | LMDrive: Closed-Loop End-to-End Driving with Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end this paper introduces LMDrive a novel language-guided end-to-end closed-loop autonomous driving framework. |
Hao Shao; Yuxuan Hu; Letian Wang; Guanglu Song; Steven L. Waslander; Yu Liu; Hongsheng Li; |
251 | In-N-Out: Faithful 3D GAN Inversion with Volumetric Decomposition for Face Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our core idea is to represent the image using two individual neural radiance fields: one for the in-distribution content and the other for the out-of-distribution object. |
Yiran Xu; Zhixin Shu; Cameron Smith; Seoung Wug Oh; Jia-Bin Huang; |
252 | LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we propose LayoutLLM an LLM/MLLM based method for document understanding. |
Chuwei Luo; Yufan Shen; Zhaoqing Zhu; Qi Zheng; Zhi Yu; Cong Yao; |
253 | HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work we propose a novel SGG benchmark containing procedurally generated weather corruptions and other transformations over the Visual Genome dataset. |
Ce Zhang; Simon Stepputtis; Joseph Campbell; Katia Sycara; Yaqi Xie; |
254 | UFOGen: You Forward Once Large Scale Text-to-Image Generation Via Diffusion GANs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Text-to-image diffusion models have demonstrated remarkable capabilities in transforming text prompts into coherent images yet the computational cost of the multi-step inference remains a persistent challenge. To address this issue we present UFOGen a novel generative model designed for ultra-fast one-step text-to-image generation. |
Yanwu Xu; Yang Zhao; Zhisheng Xiao; Tingbo Hou; |
255 | Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Since for any SDE there always exists an ordinary differential equation (ODE) whose trajectory sampling can deterministically and consistently converge to the desired target point as the SDE we propose a novel and effective "Consistent3D" method that explores the ODE deterministic sampling prior for text-to-3D generation. |
Zike Wu; Pan Zhou; Xuanyu Yi; Xiaoding Yuan; Hanwang Zhang; |
256 | THRONE: An Object-based Hallucination Benchmark for The Free-form Generations of Large Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In practice we observe that a reduction in Type II hallucinations does not lead to a reduction in Type I hallucinations but rather that the two forms of hallucinations are often anti-correlated. To address this we propose THRONE a novel object-based automatic framework for quantitatively evaluating Type I hallucinations in LVLM free-form outputs. |
Prannay Kaul; Zhizhong Li; Hao Yang; Yonatan Dukler; Ashwin Swaminathan; C. J. Taylor; Stefano Soatto; |
257 | IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we propose IS-Fusion an innovative multimodal fusion framework that jointly captures the Instance- and Scene-level contextual information. |
Junbo Yin; Jianbing Shen; Runnan Chen; Wei Li; Ruigang Yang; Pascal Frossard; Wenguan Wang; |
258 | VideoBooth: Diffusion-based Video Generation with Image Prompts Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we study the task of video generation with image prompts which provide more accurate and direct content control beyond the text prompts. |
Yuming Jiang; Tianxing Wu; Shuai Yang; Chenyang Si; Dahua Lin; Yu Qiao; Chen Change Loy; Ziwei Liu; |
259 | Ungeneralizable Examples Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we extend the concept of unlearnable data to conditional data learnability and introduce UnGeneralizable Examples (UGEs). |
Jingwen Ye; Xinchao Wang; |
260 | Distilled Datamodel with Reverse Gradient Matching Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we introduce an efficient framework for assessing data impact comprising offline training and online evaluation stages. |
Jingwen Ye; Ruonan Yu; Songhua Liu; Xinchao Wang; |
261 | Privacy-Preserving Optics for Enhancing Protection in Face De-Identification Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While software-level solutions like face de-identification provide a good privacy/utility trade-off they present vulnerabilities to sniffing attacks. In this paper we propose a hardware-level face de-identification method to solve this vulnerability. |
Jhon Lopez; Carlos Hinojosa; Henry Arguello; Bernard Ghanem; |
262 | When StyleGAN Meets Stable Diffusion: A W+ Adapter for Personalized Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Text descriptions intended to guide the facial attributes of the synthesized face may fall short owing to the intricate entanglement of identity information with identity-irrelevant facial attributes derived from the reference image. To address these issues we present the novel use of the extended StyleGAN embedding space W+ to achieve enhanced identity preservation and disentanglement for diffusion models.
Xiaoming Li; Xinyu Hou; Chen Change Loy; |
263 | Taming Mode Collapse in Score Distillation for Text-to-3D Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we reveal that the existing score distillation-based text-to-3D generation frameworks degenerate to maximal likelihood seeking on each view independently and thus suffer from the mode collapse problem manifesting as the Janus artifact in practice. |
Peihao Wang; Dejia Xu; Zhiwen Fan; Dilin Wang; Sreyas Mohan; Forrest Iandola; Rakesh Ranjan; Yilei Li; Qiang Liu; Zhangyang Wang; Vikas Chandra; |
264 | Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite great promise video diffusion models are difficult to control, hindering users from applying their creativity rather than amplifying it. To address this challenge we present a novel approach that combines the controllability of dynamic 3D meshes with the expressivity and editability of emerging diffusion models.
Shengqu Cai; Duygu Ceylan; Matheus Gadelha; Chun-Hao Paul Huang; Tuanfeng Yang Wang; Gordon Wetzstein; |
265 | Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However owing to the ill-posed nature of this problem there has been no solution that can provide consistent high-quality novel views from camera positions that are significantly different from the training views. In this work we introduce Neural Parametric Gaussians (NPGs) to take on this challenge by imposing a two-stage approach: first we fit a low-rank neural deformation model which then is used as regularization for non-rigid reconstruction in the second stage. |
Devikalyan Das; Christopher Wewer; Raza Yunus; Eddy Ilg; Jan Eric Lenssen; |
266 | Accelerating Diffusion Sampling with Optimized Time Steps Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: While this is a significant development most sampling methods still employ uniform time steps which is not optimal when using a small number of steps. To address this issue we propose a general framework for designing an optimization problem that seeks more appropriate time steps for a specific numerical ODE solver for DPMs. |
Shuchen Xue; Zhaoqiang Liu; Fei Chen; Shifeng Zhang; Tianyang Hu; Enze Xie; Zhenguo Li; |
267 | VideoLLM-online: Online Video Large Language Model for Streaming Video Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we propose a novel Learning-In-Video-Stream (LIVE) framework which enables temporally aligned long-context and real-time dialogue within a continuous video stream. |
Joya Chen; Zhaoyang Lv; Shiwei Wu; Kevin Qinghong Lin; Chenan Song; Difei Gao; Jia-Wei Liu; Ziteng Gao; Dongxing Mao; Mike Zheng Shou; |
268 | SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Fortunately the recent Segment Anything Model (SAM) has showcased remarkable zero-shot transfer performance which provides a promising solution to tackle this task. Motivated by this we introduce SAM-6D a novel framework designed to realize the task through two steps including instance segmentation and pose estimation. |
Jiehong Lin; Lihua Liu; Dekun Lu; Kui Jia; |
269 | RegionGPT: Towards Region Understanding Vision Language Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Vision language models (VLMs) have experienced rapid advancements through the integration of large language models (LLMs) with image-text pairs yet they struggle with detailed regional visual understanding due to limited spatial awareness of the vision encoder and the use of coarse-grained training data that lacks detailed region-specific captions. To address this we introduce RegionGPT (RGPT for short) a novel framework designed for complex region-level captioning and understanding.
Qiushan Guo; Shalini De Mello; Hongxu Yin; Wonmin Byeon; Ka Chun Cheung; Yizhou Yu; Ping Luo; Sifei Liu; |
270 | Diffuse Attend and Segment: Unsupervised Zero-Shot Segmentation Using Stable Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However constructing a model capable of segmenting anything in a zero-shot manner without any annotations is still challenging. In this paper we propose to utilize the self-attention layers in stable diffusion models to achieve this goal because the pre-trained stable diffusion model has learned inherent concepts of objects within its attention layers. |
Junjiao Tian; Lavisha Aggarwal; Andrea Colaco; Zsolt Kira; Mar Gonzalez-Franco; |
271 | Language Models As Black-Box Optimizers for Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However many VLMs rely on proprietary data and are not open-source which restricts the use of white-box approaches for fine-tuning. As such we aim to develop a black-box approach to optimize VLMs through natural language prompts thereby avoiding the need to access model parameters feature embeddings or even output logits. |
Shihong Liu; Samuel Yu; Zhiqiu Lin; Deepak Pathak; Deva Ramanan; |
272 | NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To support interactions with scarcely available data we propose an automated synthetic data pipeline. |
Nilesh Kulkarni; Davis Rempe; Kyle Genova; Abhijit Kundu; Justin Johnson; David Fouhey; Leonidas Guibas; |
273 | Posterior Distillation Sampling Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Posterior Distillation Sampling (PDS) a novel optimization method for parametric image editing based on diffusion models. |
Juil Koo; Chanho Park; Minhyuk Sung; |
274 | PLGSLAM: Progressive Neural Scene Representation with Local to Global Bundle Adjustment Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end we introduce PLGSLAM a neural visual SLAM system capable of high-fidelity surface reconstruction and robust camera tracking in real-time.
Tianchen Deng; Guole Shen; Tong Qin; Jianyu Wang; Wentao Zhao; Jingchuan Wang; Danwei Wang; Weidong Chen; |
275 | Gaussian-Flow: 4D Reconstruction with Dynamic 3D Gaussian Particle Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Gaussian-Flow a novel point-based approach for fast dynamic scene reconstruction and real-time rendering from both multi-view and monocular videos. |
Youtian Lin; Zuozhuo Dai; Siyu Zhu; Yao Yao; |
276 | HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we propose an efficient yet effective framework HumanGaussian that generates high-quality 3D humans with fine-grained geometry and realistic appearance. |
Xian Liu; Xiaohang Zhan; Jiaxiang Tang; Ying Shan; Gang Zeng; Dahua Lin; Xihui Liu; Ziwei Liu; |
277 | Rethinking The Up-Sampling Operations in CNN-based Generative Network for Generalizable Deepfake Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In particular the local interdependence among image pixels caused by upsampling operators is significantly demonstrated in synthetic images generated by GAN or diffusion. Building upon this observation we introduce the concept of Neighboring Pixel Relationships (NPR) as a means to capture and characterize the generalized structural artifacts stemming from up-sampling operations.
Chuangchuang Tan; Yao Zhao; Shikui Wei; Guanghua Gu; Ping Liu; Yunchao Wei; |
278 | Unsupervised Keypoints from Pretrained Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Our core idea is to find text embeddings that would cause the generative model to consistently attend to compact regions in images (i.e. keypoints). |
Eric Hedlin; Gopal Sharma; Shweta Mahajan; Xingzhe He; Hossam Isack; Abhishek Kar; Helge Rhodin; Andrea Tagliasacchi; Kwang Moo Yi; |
279 | HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce "HallusionBench" a comprehensive benchmark designed for the evaluation of image-context reasoning. |
Tianrui Guan; Fuxiao Liu; Xiyang Wu; Ruiqi Xian; Zongxia Li; Xiaoyu Liu; Xijun Wang; Lichang Chen; Furong Huang; Yaser Yacoob; Dinesh Manocha; Tianyi Zhou; |
280 | MetaCloak: Preventing Unauthorized Subject-driven Text-to-image Diffusion-based Synthesis Via Meta-learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We identify two limitations of these defending approaches: i) sub-optimal due to the hand-crafted heuristics for solving the intractable bilevel optimization and ii) lack of robustness against simple data transformations like Gaussian filtering. To solve these challenges we propose MetaCloak which solves the bi-level poisoning problem via a meta-learning framework with an additional transformation sampling process to craft transferable and robust perturbations.
Yixin Liu; Chenrui Fan; Yutong Dai; Xun Chen; Pan Zhou; Lichao Sun; |
281 | Traffic Scene Parsing Through The TSP6K Dataset Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However little effort has been put into improving the traffic monitoring scene understanding mainly due to the lack of specific datasets. To fill this gap we introduce a specialized traffic monitoring dataset termed TSP6K containing images from the traffic monitoring scenario with high-quality pixel-level and instance-level annotations. |
Peng-Tao Jiang; Yuqi Yang; Yang Cao; Qibin Hou; Ming-Ming Cheng; Chunhua Shen; |
282 | SODA: Bottleneck Diffusion Models for Representation Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce SODA a self-supervised diffusion model designed for representation learning. |
Drew A. Hudson; Daniel Zoran; Mateusz Malinowski; Andrew K. Lampinen; Andrew Jaegle; James L. McClelland; Loic Matthey; Felix Hill; Alexander Lerchner; |
283 | Classes Are Not Equal: An Empirical Study on Image Recognition Fairness Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we present an empirical study on image recognition unfairness i.e. extreme class accuracy disparity on balanced data like ImageNet. |
Jiequan Cui; Beier Zhu; Xin Wen; Xiaojuan Qi; Bei Yu; Hanwang Zhang; |
284 | DITTO: Dual and Integrated Latent Topologies for Implicit 3D Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose a novel concept of dual and integrated latent topologies (DITTO in short) for implicit 3D reconstruction from noisy and sparse point clouds. |
Jaehyeok Shim; Kyungdon Joo; |
285 | Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we propose a novel approach pursuing Spatial Adaptation and Temporal Coherence (SATeCo) for video super-resolution. |
Zhikai Chen; Fuchen Long; Zhaofan Qiu; Ting Yao; Wengang Zhou; Jiebo Luo; Tao Mei; |
286 | Loopy-SLAM: Dense Neural SLAM with Loop Closures Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In response we introduce Loopy-SLAM that globally optimizes poses and the dense 3D model. |
Lorenzo Liso; Erik Sandström; Vladimir Yugay; Luc Van Gool; Martin R. Oswald; |
287 | Rethinking Inductive Biases for Surface Normal Estimation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we discuss the inductive biases needed for surface normal estimation and propose to (1) utilize the per-pixel ray direction and (2) encode the relationship between neighboring surface normals by learning their relative rotation. |
Gwangbin Bae; Andrew J. Davison; |
288 | Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose to improve transformers of a specific modality with irrelevant data from other modalities e.g. improve an ImageNet model with audio or point cloud datasets. |
Yiyuan Zhang; Xiaohan Ding; Kaixiong Gong; Yixiao Ge; Ying Shan; Xiangyu Yue; |
289 | SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: As a result the content of the reproduced high-resolution image may contain semantic errors, deteriorating super-resolution performance. To address this issue we present a semantics-aware approach to better preserve the semantic fidelity of generative real-world image super-resolution.
Rongyuan Wu; Tao Yang; Lingchen Sun; Zhengqiang Zhang; Shuai Li; Lei Zhang; |
290 | Federated Generalized Category Discovery Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To this end we propose a novel Associated Gaussian Contrastive Learning (AGCL) framework based on learnable GMMs which consists of a Client Semantics Association (CSA) and a global-local GMM Contrastive Learning (GCL). |
Nan Pu; Wenjing Li; Xingyuan Ji; Yalan Qin; Nicu Sebe; Zhun Zhong; |
291 | Structure-Aware Sparse-View X-ray 3D Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we propose a framework Structure-Aware X-ray Neural Radiodensity Fields (SAX-NeRF) for sparse-view X-ray 3D reconstruction. |
Yuanhao Cai; Jiahao Wang; Alan Yuille; Zongwei Zhou; Angtian Wang; |
292 | Compositional Chain-of-Thought Prompting for Large Multimodal Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Moreover finetuning an LMM based on SG data can lead to catastrophic forgetting of the pretraining objective. To overcome this inspired by chain-of-thought methods we propose Compositional Chain-of-Thought (CCoT) a novel zero-shot Chain-of-Thought prompting method that utilizes SG representations in order to extract compositional knowledge from an LMM. |
Chancharik Mitra; Brandon Huang; Trevor Darrell; Roei Herzig; |
293 | One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we present One-2-3-45++ an innovative method that transforms a single image into a detailed 3D textured mesh in approximately one minute. |
Minghua Liu; Ruoxi Shi; Linghao Chen; Zhuoyang Zhang; Chao Xu; Xinyue Wei; Hansheng Chen; Chong Zeng; Jiayuan Gu; Hao Su; |
294 | Towards Accurate Post-training Quantization for Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we propose an accurate post-training quantization framework of diffusion models (APQ-DM) for efficient image generation. |
Changyuan Wang; Ziwei Wang; Xiuwei Xu; Yansong Tang; Jie Zhou; Jiwen Lu; |
295 | Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present Ego-Exo4D a diverse large-scale multimodal multiview video dataset and benchmark challenge. |
Kristen Grauman; Andrew Westbury; Lorenzo Torresani; Kris Kitani; Jitendra Malik; Triantafyllos Afouras; Kumar Ashutosh; Vijay Baiyya; Siddhant Bansal; Bikram Boote; Eugene Byrne; Zach Chavis; Joya Chen; Feng Cheng; Fu-Jen Chu; Sean Crane; Avijit Dasgupta; Jing Dong; Maria Escobar; Cristhian Forigua; Abrham Gebreselasie; Sanjay Haresh; Jing Huang; Md Mohaiminul Islam; Suyog Jain; Rawal Khirodkar; Devansh Kukreja; Kevin J Liang; Jia-Wei Liu; Sagnik Majumder; Yongsen Mao; Miguel Martin; Effrosyni Mavroudi; Tushar Nagarajan; Francesco Ragusa; Santhosh Kumar Ramakrishnan; Luigi Seminara; Arjun Somayazulu; Yale Song; Shan Su; Zihui Xue; Edward Zhang; Jinxu Zhang; Angela Castillo; Changan Chen; Xinzhu Fu; Ryosuke Furuta; Cristina Gonzalez; Prince Gupta; Jiabo Hu; Yifei Huang; Yiming Huang; Weslie Khoo; Anush Kumar; Robert Kuo; Sach Lakhavani; Miao Liu; Mi Luo; Zhengyi Luo; Brighid Meredith; Austin Miller; Oluwatumininu Oguntola; Xiaqing Pan; Penny Peng; Shraman Pramanick; Merey Ramazanova; Fiona Ryan; Wei Shan; Kiran Somasundaram; Chenan Song; Audrey Southerland; Masatoshi Tateno; Huiyu Wang; Yuchen Wang; Takuma Yagi; Mingfei Yan; Xitong Yang; Zecheng Yu; Shengxin Cindy Zha; Chen Zhao; Ziwei Zhao; Zhifan Zhu; Jeff Zhuo; Pablo Arbelaez; Gedas Bertasius; Dima Damen; Jakob Engel; Giovanni Maria Farinella; Antonino Furnari; Bernard Ghanem; Judy Hoffman; C.V. Jawahar; Richard Newcombe; Hyun Soo Park; James M. Rehg; Yoichi Sato; Manolis Savva; Jianbo Shi; Mike Zheng Shou; Michael Wray; |
296 | DiffusionAvatars: Deferred Diffusion for High-fidelity 3D Head Avatars Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose a diffusion-based neural renderer that leverages generic 2D priors to produce compelling images of faces. |
Tobias Kirschstein; Simon Giebenhain; Matthias Nießner; |
297 | PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce PhysGaussian a new method that seamlessly integrates physically grounded Newtonian dynamics within 3D Gaussians to achieve high-quality novel motion synthesis. |
Tianyi Xie; Zeshun Zong; Yuxing Qiu; Xuan Li; Yutao Feng; Yin Yang; Chenfanfu Jiang; |
298 | Learning Inclusion Matching for Animation Paint Bucket Colorization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work we introduce a new learning-based inclusion matching pipeline which directs the network to comprehend the inclusion relationships between segments rather than relying solely on direct visual correspondences. |
Yuekun Dai; Shangchen Zhou; Qinyue Li; Chongyi Li; Chen Change Loy; |
299 | Class Tokens Infusion for Weakly Supervised Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This work proposes a novel WSSS framework with Class Token Infusion (CTI). |
Sung-Hoon Yoon; Hoyong Kwon; Hyeonseong Kim; Kuk-Jin Yoon; |
300 | DreamPropeller: Supercharge Text-to-3D Generation with Parallel Sampling Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However the long generation time of such algorithms significantly degrades the user experience. To tackle this problem we propose DreamPropeller a drop-in acceleration algorithm that can be wrapped around any existing text-to-3D generation pipeline based on score distillation. |
Linqi Zhou; Andy Shih; Chenlin Meng; Stefano Ermon; |
301 | MarkovGen: Structured Prediction for Efficient Text-to-Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This iterative process is needed to ensure that different regions of the image are not only aligned with the text prompt but also compatible with each other. In this work we propose a light-weight approach to achieving this compatibility between different regions of an image using a Markov Random Field (MRF) model. |
Sadeep Jayasumana; Daniel Glasner; Srikumar Ramalingam; Andreas Veit; Ayan Chakrabarti; Sanjiv Kumar; |
302 | Rethinking FID: Towards A Better Evaluation Metric for Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Through extensive experiments and analysis we demonstrate that FID-based evaluations of text-to-image models may be unreliable and that CMMD offers a more robust and reliable assessment of image quality. |
Sadeep Jayasumana; Srikumar Ramalingam; Andreas Veit; Daniel Glasner; Ayan Chakrabarti; Sanjiv Kumar; |
303 | Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Existing fusion methods are typically helpless in dealing with degradations in low-quality source images and are non-interactive to multiple subjective and objective needs. To solve these issues we introduce Text-IF, a novel approach that leverages semantic text guidance in an image fusion model for degradation-aware and interactive image fusion.
Xunpeng Yi; Han Xu; Hao Zhang; Linfeng Tang; Jiayi Ma; |
304 | Cache Me If You Can: Accelerating Diffusion Models Through Block Caching Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we investigate the behavior of the layers within the network and find that 1) the layers’ output changes smoothly over time, 2) the layers show distinct patterns of change, and 3) the change from step to step is often very small.
Felix Wimbauer; Bichen Wu; Edgar Schoenfeld; Xiaoliang Dai; Ji Hou; Zijian He; Artsiom Sanakoyeu; Peizhao Zhang; Sam Tsai; Jonas Kohler; Christian Rupprecht; Daniel Cremers; Peter Vajda; Jialiang Wang; |
305 | OpenESS: Event-based Semantic Scene Understanding with Open Vocabularies Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work for the first time we synergize information from image, text, and event-data domains and introduce OpenESS to enable scalable ESS in an open-world annotation-efficient manner.
Lingdong Kong; Youquan Liu; Lai Xing Ng; Benoit R. Cottereau; Wei Tsang Ooi; |
306 | UniPAD: A Universal Pre-training Paradigm for Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we present UniPAD a novel self-supervised learning paradigm applying 3D volumetric differentiable rendering. |
Honghui Yang; Sha Zhang; Di Huang; Xiaoyang Wu; Haoyi Zhu; Tong He; Shixiang Tang; Hengshuang Zhao; Qibo Qiu; Binbin Lin; Xiaofei He; Wanli Ouyang; |
307 | Referring Image Editing: Object-level Image Editing Via Referring Expressions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In response to this challenge we introduce an object-level generative task called Referring Image Editing (RIE) which enables the identification and editing of specific source objects in an image using text prompts. To tackle this task effectively we propose a tailored framework called ReferDiffusion. |
Chang Liu; Xiangtai Li; Henghui Ding; |
308 | ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However the methods used by existing frameworks to curate such multimodal data, in particular language descriptions for 3D shapes, are not scalable and the collected language descriptions are not diverse. To address this we introduce ULIP-2 a simple yet effective tri-modal pretraining framework that leverages large multimodal models to automatically generate holistic language descriptions for 3D shapes.
Le Xue; Ning Yu; Shu Zhang; Artemis Panagopoulou; Junnan Li; Roberto Martín-Martín; Jiajun Wu; Caiming Xiong; Ran Xu; Juan Carlos Niebles; Silvio Savarese; |
309 | GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: With the progress achieved so far it is time to move towards universal navigation models capable of handling various goal types enabling more effective user interaction with robots. To facilitate this goal we propose GOAT-Bench a benchmark for the universal navigation task referred to as GO to AnyThing (GOAT). |
Mukul Khanna; Ram Ramrakhya; Gunjan Chhablani; Sriram Yenamandra; Theophile Gervet; Matthew Chang; Zsolt Kira; Devendra Singh Chaplot; Dhruv Batra; Roozbeh Mottaghi; |
310 | Holistic Autonomous Driving Understanding By Bird’s-Eye-View Injected Multi-Modal Large Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To obtain NuInstruct we propose a novel SQL-based method to generate instruction-response pairs automatically which is inspired by the driving logical progression of humans. |
Xinpeng Ding; Jianhua Han; Hang Xu; Xiaodan Liang; Wei Zhang; Xiaomeng Li; |
311 | Neural Lineage Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we introduce a novel task known as neural lineage detection aiming at discovering lineage relationships between parent and child models. |
Runpeng Yu; Xinchao Wang; |
312 | EMCAD: Efficient Multi-scale Convolutional Attention Decoding for Medical Image Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However these decoding mechanisms usually come with high computational costs. To address this concern we introduce EMCAD a new efficient multi-scale convolutional attention decoder designed to optimize both performance and computational efficiency. |
Md Mostafijur Rahman; Mustafa Munir; Radu Marculescu; |
313 | VMC: Video Motion Customization Using Temporal Attention Adaption for Text-to-Video Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: For example straightforward extensions of static image customization methods to video often lead to intricate entanglements of appearance and motion data. To tackle this here we present the Video Motion Customization (VMC) framework a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models. |
Hyeonho Jeong; Geon Yeong Park; Jong Chul Ye; |
314 | RLHF-V: Towards Trustworthy MLLMs Via Behavior Alignment from Fine-grained Correctional Human Feedback Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To address the challenge we present RLHF-V which enhances MLLM trustworthiness via behavior alignment from fine-grained correctional human feedback. |
Tianyu Yu; Yuan Yao; Haoye Zhang; Taiwen He; Yifeng Han; Ganqu Cui; Jinyi Hu; Zhiyuan Liu; Hai-Tao Zheng; Maosong Sun; Tat-Seng Chua; |
315 | Addressing Background Context Bias in Few-Shot Segmentation Through Iterative Modulation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This phenomenon known as background context bias can hinder the effectiveness of support prototypes in guiding query image segmentation. In this work we propose a novel framework with an iterative structure to address this problem. |
Lanyun Zhu; Tianrun Chen; Jianxiong Yin; Simon See; Jun Liu; |
316 | LLaFS: When Large Language Models Meet Few-Shot Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes LLaFS the first attempt to leverage large language models (LLMs) in few-shot segmentation. |
Lanyun Zhu; Tianrun Chen; Deyi Ji; Jieping Ye; Jun Liu; |
317 | GPT-4V(ision) Is A Human-Aligned Evaluator for Text-to-3D Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper presents an automatic, versatile, and human-aligned evaluation metric for text-to-3D generative models.
Tong Wu; Guandao Yang; Zhibing Li; Kai Zhang; Ziwei Liu; Leonidas Guibas; Dahua Lin; Gordon Wetzstein; |
318 | SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Building upon this technique we propose a new representation that explicitly decomposes the motion and appearance of dynamic scenes into sparse control points and dense Gaussians respectively. |
Yi-Hua Huang; Yang-Tian Sun; Ziyi Yang; Xiaoyang Lyu; Yan-Pei Cao; Xiaojuan Qi; |
319 | SurMo: Surface-based 4D Motion Modeling for Dynamic Human Rendering Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we propose a new 4D motion modeling paradigm SurMo that jointly models the temporal dynamics and human appearances in a unified framework with three key designs: 1) Surface-based motion encoding that models 4D human motions with an efficient compact surface-based triplane. |
Tao Hu; Fangzhou Hong; Ziwei Liu; |
320 | GauHuman: Articulated Gaussian Splatting from Monocular Human Videos Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present GauHuman a 3D human model with Gaussian Splatting for both fast training (1-2 minutes) and real-time rendering (up to 189 FPS) compared with existing NeRF-based implicit representation modelling frameworks demanding hours of training and seconds of rendering per frame.
Shoukang Hu; Tao Hu; Ziwei Liu; |
321 | Living Scenes: Multi-object Relocalization and Reconstruction in Changing 3D Environments Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We address this gap with MoRE a novel approach for multi-object relocalization and reconstruction in evolving environments. We view these environments as Living Scenes and consider the problem of transforming scans taken at different points in time into a 3D reconstruction of the object instances whose accuracy and completeness increase over time. |
Liyuan Zhu; Shengyu Huang; Konrad Schindler; Iro Armeni; |
322 | MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce MeshGPT a new approach for generating triangle meshes that reflects the compactness typical of artist-created meshes in contrast to dense triangle meshes extracted by iso-surfacing methods from neural fields. |
Yawar Siddiqui; Antonio Alliegro; Alexey Artemov; Tatiana Tommasi; Daniele Sirigatti; Vladislav Rosov; Angela Dai; Matthias Nießner; |
323 | ViTamin: Designing Scalable Vision Models in The Vision-Language Era Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we aim at building an evaluation protocol of vision models in the vision-language era under the contrastive language-image pretraining (CLIP) framework. |
Jieneng Chen; Qihang Yu; Xiaohui Shen; Alan Yuille; Liang-Chieh Chen; |
324 | AlignSAM: Aligning Segment Anything Model to Open Context Via Reinforcement Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we propose a novel framework termed AlignSAM designed for automatic prompting for aligning SAM to an open context through reinforcement learning. |
Duojun Huang; Xinyu Xiong; Jie Ma; Jichang Li; Zequn Jie; Lin Ma; Guanbin Li; |
325 | GSVA: Generalized Segmentation Via Multimodal Large Language Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However existing solutions to GRES remain unsatisfactory since current segmentation MLLMs cannot correctly handle the cases where users might reference multiple subjects in a singular prompt or provide descriptions incongruent with any image target. In this paper we propose Generalized Segmentation Vision Assistant (GSVA) to address this gap. |
Zhuofan Xia; Dongchen Han; Yizeng Han; Xuran Pan; Shiji Song; Gao Huang; |
326 | LowRankOcc: Tensor Decomposition and Low-Rank Recovery for Vision-based 3D Semantic Occupancy Prediction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we present a tensor decomposition and low-rank recovery approach (LowRankOcc) for vision-based 3D semantic occupancy prediction. |
Linqing Zhao; Xiuwei Xu; Ziwei Wang; Yunpeng Zhang; Borui Zhang; Wenzhao Zheng; Dalong Du; Jie Zhou; Jiwen Lu; |
327 | A Vision Check-up for Language Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: As language models lack the ability to consume or output visual information as pixels we use code to represent images in our study. |
Pratyusha Sharma; Tamar Rott Shaham; Manel Baradad; Stephanie Fu; Adrian Rodriguez-Munoz; Shivam Duggal; Phillip Isola; Antonio Torralba; |
328 | SimAC: A Simple Anti-Customization Method for Protecting Face Privacy Against Text-to-Image Synthesis of Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Unfortunately most of these methods adopt straightforward designs such as end-to-end optimization with a focus on adversarially maximizing the original training loss thereby neglecting nuanced internal properties intrinsic to the diffusion model and even leading to ineffective optimization in some diffusion time steps. In this paper we strive to bridge this gap by undertaking a comprehensive exploration of these inherent properties to boost the performance of current anti-customization approaches. |
Feifei Wang; Zhentao Tan; Tianyi Wei; Yue Wu; Qidong Huang; |
329 | MirageRoom: 3D Scene Segmentation with 2D Pre-trained Models By Mirage Projection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we argue that the crux of the matter resides in the basic premise of existing projection strategies: that the medium is homogeneous, so projection rays propagate along straight lines and objects behind are occluded by those in front.
Haowen Sun; Yueqi Duan; Juncheng Yan; Yifan Liu; Jiwen Lu; |
330 | OPERA: Alleviating Hallucination in Multi-Modal Large Language Models Via Over-Trust Penalty and Retrospection-Allocation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we present OPERA a novel MLLM decoding method grounded in an Over-trust Penalty and a Retrospection-Allocation strategy serving as a nearly free lunch to alleviate the hallucination issue without additional data knowledge or training. |
Qidong Huang; Xiaoyi Dong; Pan Zhang; Bin Wang; Conghui He; Jiaqi Wang; Dahua Lin; Weiming Zhang; Nenghai Yu; |
331 | SPOC: Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in The Real World Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we show that imitating shortest-path planners in simulation produces agents that given a language instruction can proficiently navigate, explore, and manipulate objects in both simulation and in the real world using only RGB sensors (no depth map or GPS coordinates).
Kiana Ehsani; Tanmay Gupta; Rose Hendrix; Jordi Salvador; Luca Weihs; Kuo-Hao Zeng; Kunal Pratap Singh; Yejin Kim; Winson Han; Alvaro Herrasti; Ranjay Krishna; Dustin Schwenk; Eli VanderBilt; Aniruddha Kembhavi; |
332 | ConsistDreamer: 3D-Consistent 2D Diffusion for High-Fidelity Scene Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes ConsistDreamer – a novel framework that lifts 2D diffusion models with 3D awareness and 3D consistency thus enabling high-fidelity instruction-guided scene editing. |
Jun-Kun Chen; Samuel Rota Bulò; Norman Müller; Lorenzo Porzi; Peter Kontschieder; Yu-Xiong Wang; |
333 | UDiFF: Generating Conditional Unsigned Distance Fields with Optimal Wavelet Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we present UDiFF a 3D diffusion model for unsigned distance fields (UDFs) which is capable of generating textured 3D shapes with open surfaces from text conditions or unconditionally.
Junsheng Zhou; Weiqi Zhang; Baorui Ma; Kanle Shi; Yu-Shen Liu; Zhizhong Han; |
334 | PixelRNN: In-pixel Recurrent Neural Networks for End-to-end-optimized Perception with Neural Sensors Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Fueled by innovations in stacked image sensor fabrication emerging sensor–processors offer programmability and processing capabilities directly on the sensor. We exploit these capabilities by developing an efficient recurrent neural network architecture PixelRNN that encodes spatio-temporal features on the sensor using purely binary operations. |
Haley M. So; Laurie Bose; Piotr Dudek; Gordon Wetzstein; |
335 | EASE-DETR: Easing The Competition Among Object Queries Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To help the leading query stand out this paper proposes EASE-DETR which eases the competition by introducing bias that favours the leading one.
Yulu Gao; Yifan Sun; Xudong Ding; Chuyang Zhao; Si Liu; |
336 | TextureDreamer: Image-Guided Texture Synthesis Through Geometry-Aware Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present TextureDreamer a novel image-guided texture synthesis method to transfer relightable textures from a small number of input images (3 to 5) to target 3D shapes across arbitrary categories. |
Yu-Ying Yeh; Jia-Bin Huang; Changil Kim; Lei Xiao; Thu Nguyen-Phuoc; Numair Khan; Cheng Zhang; Manmohan Chandraker; Carl S Marshall; Zhao Dong; Zhengqin Li; |
337 | Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Since video content is highly redundant we argue that naively bringing advances of image models to the video generation domain reduces motion fidelity and visual quality and impairs scalability. In this work we build Snap Video a video-first model that systematically addresses these challenges.
Willi Menapace; Aliaksandr Siarohin; Ivan Skorokhodov; Ekaterina Deyneka; Tsai-Shien Chen; Anil Kag; Yuwei Fang; Aleksei Stoliar; Elisa Ricci; Jian Ren; Sergey Tulyakov; |
338 | Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we aim at filling the gap with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation. |
Yazhou Xing; Yingqing He; Zeyue Tian; Xintao Wang; Qifeng Chen; |
339 | MindBridge: A Cross-Subject Brain Decoding Framework Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we present a novel approach MindBridge that achieves cross-subject brain decoding by employing only one model. |
Shizun Wang; Songhua Liu; Zhenxiong Tan; Xinchao Wang; |
340 | Physical Backdoor: Towards Temperature-based Backdoor Attacks in The Physical World Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper our team is the first to investigate the security vulnerabilities associated with TIOD in the context of backdoor attacks spanning both the digital and physical realms. |
Wen Yin; Jian Lou; Pan Zhou; Yulai Xie; Dan Feng; Yuhua Sun; Tailai Zhang; Lichao Sun; |
341 | EscherNet: A Generative Model for Scalable View Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce EscherNet a multi-view conditioned diffusion model for view synthesis. |
Xin Kong; Shikun Liu; Xiaoyang Lyu; Marwan Taher; Xiaojuan Qi; Andrew J. Davison; |
342 | Describing Differences in Image Sets with Natural Language Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We outline a two-stage approach that first proposes candidate difference descriptions from image sets and then re-ranks the candidates by checking how well they can differentiate the two sets. We introduce VisDiff which first captions the images and prompts a language model to propose candidate descriptions, then re-ranks these descriptions using CLIP.
Lisa Dunlap; Yuhui Zhang; Xiaohan Wang; Ruiqi Zhong; Trevor Darrell; Jacob Steinhardt; Joseph E. Gonzalez; Serena Yeung-Levy; |
343 | FedAS: Bridging Inconsistency in Personalized Federated Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This results in their under-trained personalized models and impedes the collaborative training stage for other clients. In this paper we present a novel PFL framework named FedAS which uses Federated Parameter-Alignment and Client-Synchronization to overcome the above challenges.
Xiyuan Yang; Wenke Huang; Mang Ye; |
344 | Text-to-3D Using Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In response this paper proposes GSGEN a novel method that adopts Gaussian Splatting a recent state-of-the-art representation to text-to-3D generation. |
Zilong Chen; Feng Wang; Yikai Wang; Huaping Liu; |
345 | Desigen: A Pipeline for Controllable Design Template Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we present Desigen an automatic template creation pipeline which generates background images as well as harmonious layout elements over the background. |
Haohan Weng; Danqing Huang; Yu Qiao; Zheng Hu; Chin-Yew Lin; Tong Zhang; C. L. Philip Chen; |
346 | From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present a framework for generating full-bodied photorealistic avatars that gesture according to the conversational dynamics of a dyadic interaction. |
Evonne Ng; Javier Romero; Timur Bagautdinov; Shaojie Bai; Trevor Darrell; Angjoo Kanazawa; Alexander Richard; |
347 | ViT-Lens: Towards Omni-modal Representations Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However the success of data-driven vision and language models is costly or even infeasible to be reproduced for rare modalities. In this paper we present ViT-Lens that facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space. |
Weixian Lei; Yixiao Ge; Kun Yi; Jianfeng Zhang; Difei Gao; Dylan Sun; Yuying Ge; Ying Shan; Mike Zheng Shou; |
348 | 4D-DRESS: A 4D Dataset of Real-World Human Clothing With Semantic Annotations Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: While easy to collect synthetic data often fall short in realism and fail to capture authentic clothing dynamics. Addressing this gap we introduce 4D-DRESS the first real-world 4D dataset advancing human clothing research with its high-quality 4D textured scans and garment meshes. |
Wenbo Wang; Hsuan-I Ho; Chen Guo; Boxiang Rong; Artur Grigorev; Jie Song; Juan Jose Zarate; Otmar Hilliges; |
349 | PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we propose the adaPtive LAyout-semantiC fusion modulE (PLACE) that harnesses pre-trained models to alleviate the aforementioned issues. |
Zhengyao Lv; Yuxiang Wei; Wangmeng Zuo; Kwan-Yee K. Wong; |
350 | Memory-based Adapters for Online 3D Scene Perception Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we propose a new framework for online 3D scene perception. |
Xiuwei Xu; Chong Xia; Ziwei Wang; Linqing Zhao; Yueqi Duan; Jie Zhou; Jiwen Lu; |
351 | XCube: Large-Scale 3D Generative Modeling Using Sparse Voxel Hierarchies Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present XCube a novel generative model for high-resolution sparse 3D voxel grids with arbitrary attributes. |
Xuanchi Ren; Jiahui Huang; Xiaohui Zeng; Ken Museth; Sanja Fidler; Francis Williams; |
352 | It’s All About Your Sketch: Democratising Sketch Control in Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: A pilot study underscores the necessity revealing that deformities in existing models stem from spatial-conditioning. To rectify this we propose an abstraction-aware framework utilising a sketch adapter adaptive time-step sampling and discriminative guidance from a pre-trained fine-grained sketch-based image retrieval model working synergistically to reinforce fine-grained sketch-photo association. |
Subhadeep Koley; Ayan Kumar Bhunia; Deeptanshu Sekhri; Aneeshan Sain; Pinaki Nath Chowdhury; Tao Xiang; Yi-Zhe Song; |
353 | Text-to-Image Diffusion Models Are Great Sketch-Photo Matchmakers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In order to harness pre-trained diffusion models effectively we introduce a straightforward yet powerful strategy focused on two key aspects: selecting optimal feature layers and utilising visual and textual prompts. |
Subhadeep Koley; Ayan Kumar Bhunia; Aneeshan Sain; Pinaki Nath Chowdhury; Tao Xiang; Yi-Zhe Song; |
354 | How to Handle Sketch-Abstraction in Sketch-Based Image Retrieval? Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we propose a novel abstraction-aware sketch-based image retrieval framework capable of handling sketch abstraction at varied levels. |
Subhadeep Koley; Ayan Kumar Bhunia; Aneeshan Sain; Pinaki Nath Chowdhury; Tao Xiang; Yi-Zhe Song; |
355 | You’ll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we question the reliance on sketches alone for fine-grained image retrieval by simultaneously exploring the fine-grained representation capabilities of both sketch and text orchestrating a duet between the two. |
Subhadeep Koley; Ayan Kumar Bhunia; Aneeshan Sain; Pinaki Nath Chowdhury; Tao Xiang; Yi-Zhe Song; |
356 | Few-shot Learner Parameterization By Diffusion Time-steps Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end we find an inductive bias that the time-steps of a Diffusion Model (DM) can isolate the nuanced class attributes i.e. as the forward diffusion adds noise to an image at each time-step nuanced attributes are usually lost at an earlier time-step than the spurious attributes that are visually prominent. Building on this we propose Time-step Few-shot (TiF) learner. |
Zhongqi Yue; Pan Zhou; Richang Hong; Hanwang Zhang; Qianru Sun; |
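The inductive bias behind the TiF learner can be seen directly in the standard DDPM forward process. The following self-contained sketch (an illustration, not the paper's code) prints how the signal-to-noise ratio collapses as the time-step grows, which is why nuanced attributes vanish before visually prominent ones:

```python
# Self-contained illustration (not the TiF code) of the inductive bias: under
# the DDPM forward process x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps,
# the signal-to-noise ratio abar_t / (1 - abar_t) decays monotonically in t,
# so fine-grained attributes are lost at earlier time-steps than coarse ones.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # standard linear noise schedule
abar = torch.cumprod(1.0 - betas, dim=0)     # cumulative product alpha-bar_t

def noise_at(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) at integer time-step t."""
    eps = torch.randn_like(x0)
    return abar[t].sqrt() * x0 + (1.0 - abar[t]).sqrt() * eps

x0 = torch.randn(1, 3, 64, 64)               # stand-in for an image
for t in (50, 250, 750):
    snr = (abar[t] / (1.0 - abar[t])).item()
    print(f"t={t:4d}  SNR={snr:8.3f}")        # SNR drops sharply as t grows
```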
357 | GLID: Pre-training A Generalist Encoder-Decoder Vision Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes a GeneraLIst encoder-Decoder (GLID) pre-training method for better handling various downstream computer vision tasks. |
Jihao Liu; Jinliang Zheng; Yu Liu; Hongsheng Li; |
358 | AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite the impressive results achieved these methods suffer from 1) loss of valuable contextual information via cropping 2) introducing distractions and 3) lacking inter-association among different persons and body parts inevitably causing performance degradation especially for crowded scenes. To address these issues we introduce a novel all-in-one-stage framework AiOS for multiple expressive human pose and shape recovery without an additional human detection step. |
Qingping Sun; Yanjun Wang; Ailing Zeng; Wanqi Yin; Chen Wei; Wenjia Wang; Haiyi Mei; Chi-Sing Leung; Ziwei Liu; Lei Yang; Zhongang Cai; |
359 | Structured Gradient-based Interpretations Via Norm-Regularized Adversarial Training Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work we propose to apply adversarial training as an in-processing scheme to train neural networks with structured simple gradient maps. |
Shizhan Gong; Qi Dou; Farzan Farnia; |
360 | Multi-Attribute Interactions Matter for 3D Visual Grounding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we propose a multi-attribute aware Transformer for 3D visual grounding learning the multi-attribute interactions to refine the intra-modal and inter-modal grounding cues. |
Can Xu; Yuehui Han; Rui Xu; Le Hui; Jin Xie; Jian Yang; |
361 | Unsegment Anything By Simulating Deformation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Foundation segmentation models while powerful pose a significant risk: they enable users to effortlessly extract any objects from any digital content with a single click potentially leading to copyright infringement or malicious misuse. To mitigate this risk we introduce a new task "Anything Unsegmentable" to grant any image "the right to be unsegmented". |
Jiahao Lu; Xingyi Yang; Xinchao Wang; |
362 | RCooper: A Real-world Large-scale Dataset for Roadside Cooperative Perception Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We hence release the first real-world large-scale RCooper dataset to bloom the research on practical roadside cooperative perception including detection and tracking. |
Ruiyang Hao; Siqi Fan; Yingru Dai; Zhenlin Zhang; Chenxi Li; Yuntian Wang; Haibao Yu; Wenxian Yang; Jirui Yuan; Zaiqing Nie; |
363 | Emotional Speech-driven 3D Body Animation Via Disentangled Latent Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Instead these methods directly output animations from speech without control over the expressed emotion. To address this limitation we present AMUSE an emotional speech-driven body animation model based on latent diffusion. |
Kiran Chhatre; Radek Daněček; Nikos Athanasiou; Giorgio Becherini; Christopher Peters; Michael J. Black; Timo Bolkart; |
364 | SCULPT: Shape-Conditioned Unpaired Learning of Pose-dependent Clothed and Textured Human Meshes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present SCULPT a novel 3D generative model for clothed and textured 3D meshes of humans. |
Soubhik Sanyal; Partha Ghosh; Jinlong Yang; Michael J. Black; Justus Thies; Timo Bolkart; |
365 | BT-Adapter: Video Conversation Is Feasible Without Video Instruction Tuning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end we propose Branching Temporal Adapter (BT-Adapter) a novel method for extending image-language pretrained models into the video domain. |
Ruyang Liu; Chen Li; Yixiao Ge; Thomas H. Li; Ying Shan; Ge Li; |
366 | CLiC: Concept Learning in Context Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: It involves acquiring a visual concept (e.g. an ornament) from a source image and subsequently applying it to an object (e.g. a chair) in a target image. Our key idea is to perform in-context concept learning acquiring the local visual concept within the broader context of the objects they belong to. |
Mehdi Safaee; Aryan Mikaeili; Or Patashnik; Daniel Cohen-Or; Ali Mahdavi-Amiri; |
367 | Visual Point Cloud Forecasting Enables Scalable Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Hence it shows superiority in various downstream tasks. To cope with this new problem we present ViDAR a general model to pre-train downstream visual encoders. |
Zetong Yang; Li Chen; Yanan Sun; Hongyang Li; |
368 | RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However video editing models have not yet reached the same level of visual quality and user control. To address this we introduce RAVE a zero-shot video editing method that leverages pre-trained text-to-image diffusion models without additional training. |
Ozgur Kara; Bariscan Kurtkaya; Hidir Yesiltepe; James M. Rehg; Pinar Yanardag; |
369 | Analyzing and Improving The Training Dynamics of Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Diffusion models currently dominate the field of data-driven image synthesis with their unparalleled scaling to large datasets. In this paper we identify and rectify several causes for uneven and ineffective training in the popular ADM diffusion model architecture without altering its high-level structure. |
Tero Karras; Miika Aittala; Jaakko Lehtinen; Janne Hellsten; Timo Aila; Samuli Laine; |
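One ingredient this line of work emphasizes is keeping activation and weight magnitudes under control during training. Below is a heavily simplified sketch of forced per-neuron weight normalization, written under our own assumptions; the paper's full recipe (magnitude-preserving convolutions, post-hoc EMA, and more) is considerably more involved:

```python
# Heavily simplified sketch (our assumptions, not the paper's full recipe) of
# forced weight normalization: each output unit's weight vector is renormalized
# to unit norm in every forward pass, so activation magnitudes cannot drift
# during training the way they do in an unconstrained network.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnitNormLinear(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight / self.weight.norm(dim=1, keepdim=True).clamp_min(1e-8)
        return F.linear(x, w)   # unit-variance output for unit-variance input

layer = UnitNormLinear(256, 256)
x = torch.randn(4096, 256)
print(x.std().item(), layer(x).std().item())  # both stay close to 1.0
```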
370 | Multi-Modal Hallucination Control By Visual Information Grounding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To reduce hallucinations we introduce Multi-Modal Mutual-Information Decoding (M3ID) a new sampling method for prompt amplification. |
Alessandro Favero; Luca Zancato; Matthew Trager; Siddharth Choudhary; Pramuditha Perera; Alessandro Achille; Ashwin Swaminathan; Stefano Soatto; |
371 | Rich Human Feedback for Text-to-Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we enrich the feedback signal by (i) marking image regions that are implausible or misaligned with the text and (ii) annotating which words in the text prompt are misrepresented or missing on the image. |
Youwei Liang; Junfeng He; Gang Li; Peizhao Li; Arseniy Klimovskiy; Nicholas Carolan; Jiao Sun; Jordi Pont-Tuset; Sarah Young; Feng Yang; Junjie Ke; Krishnamurthy Dj Dvijotham; Katherine M. Collins; Yiwen Luo; Yang Li; Kai J Kohlhoff; Deepak Ramachandran; Vidhya Navalpakkam; |
372 | NeLF-Pro: Neural Light Field Probes for Multi-Scale Novel View Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present NeLF-Pro a novel representation to model and reconstruct light fields in diverse natural scenes that vary in extent and spatial granularity. |
Zinuo You; Andreas Geiger; Anpei Chen; |
373 | Plug and Play Active Learning for Object Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However these specialized approaches are not readily adaptable to different object detectors due to the significant engineering effort required for integration. To overcome this challenge we introduce Plug and Play Active Learning (PPAL) a simple and effective AL strategy for object detection. |
Chenhongyi Yang; Lichao Huang; Elliot J. Crowley; |
374 | Mitigating Motion Blur in Neural Radiance Fields with Events and Frames Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However they face challenges in recovering accurate color content or constrain the NeRF to a set of predefined camera poses harming reconstruction quality in challenging conditions. This paper proposes a novel formulation addressing these issues by leveraging both model- and learning-based modules. |
Marco Cannici; Davide Scaramuzza; |
375 | Situational Awareness Matters in 3D Vision Language Reasoning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Being able to carry out complicated vision language reasoning tasks in 3D space represents a significant milestone in developing household robots and human-centered embodied AI. In this work we demonstrate that a critical and distinct challenge in 3D vision language reasoning is the situational awareness which incorporates two key components: (1) The autonomous agent grounds its self-location based on a language prompt. |
Yunze Man; Liang-Yan Gui; Yu-Xiong Wang; |
376 | DPMesh: Exploiting Diffusion Prior for Occluded Human Mesh Recovery Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we introduce DPMesh an innovative framework for occluded human mesh recovery that capitalizes on the profound knowledge about object structure and spatial relationships embedded in a pre-trained text-to-image diffusion model. |
Yixuan Zhu; Ao Li; Yansong Tang; Wenliang Zhao; Jie Zhou; Jiwen Lu; |
377 | FlowIE: Efficient Image Enhancement Via Rectified Flow Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In response we propose FlowIE a simple yet highly effective flow-based image enhancement framework that estimates straight-line paths from an elementary distribution to high-quality images. |
Yixuan Zhu; Wenliang Zhao; Ao Li; Yansong Tang; Jie Zhou; Jiwen Lu; |
378 | Robust Emotion Recognition in Context Debiasing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: The harmful bias forces the models to rely on spurious correlations between background contexts and emotion labels in likelihood estimation causing severe performance bottlenecks and confounding valuable context priors. In this paper we propose a counterfactual emotion inference (CLEF) framework to address the above issue. |
Dingkang Yang; Kun Yang; Mingcheng Li; Shunli Wang; Shuaibing Wang; Lihua Zhang; |
379 | Efficient Stitchable Task Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work we present a novel framework Efficient Stitchable Task Adaptation (ESTA) to efficiently produce a palette of fine-tuned models that adhere to diverse resource constraints. |
Haoyu He; Zizheng Pan; Jing Liu; Jianfei Cai; Bohan Zhuang; |
380 | Transcriptomics-guided Slide Representation Learning in Computational Pathology Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Here we leverage complementary information from gene expression profiles to guide slide representation learning using multi-modal pre-training. |
Guillaume Jaume; Lukas Oldenburg; Anurag Vaidya; Richard J. Chen; Drew F.K. Williamson; Thomas Peeters; Andrew H. Song; Faisal Mahmood; |
381 | Modeling Dense Multimodal Interactions Between Biological Pathways and Histology for Survival Prediction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However this multimodal task is particularly challenging due to the different nature of these data: WSIs represent a very high-dimensional spatial description of a tumor while bulk transcriptomics represent a global description of gene expression levels within that tumor. In this context our work aims to address two key challenges: (1) how can we tokenize transcriptomics in a semantically meaningful and interpretable way? |
Guillaume Jaume; Anurag Vaidya; Richard J. Chen; Drew F.K. Williamson; Paul Pu Liang; Faisal Mahmood; |
382 | Towards Generalizable Tumor Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However success in tumor synthesis hinges on creating visually realistic tumors that are generalizable across multiple organs and furthermore the resulting AI models being capable of detecting real tumors in images sourced from different domains (e.g. hospitals). This paper made a progressive stride toward generalizable tumor synthesis by leveraging a critical observation: early-stage tumors (< 2cm) tend to have similar imaging characteristics in computed tomography (CT) whether they originate in the liver pancreas or kidneys. |
Qi Chen; Xiaoxi Chen; Haorui Song; Zhiwei Xiong; Alan Yuille; Chen Wei; Zongwei Zhou; |
383 | Fooling Polarization-Based Vision Using Locally Controllable Polarizing Projection Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we warn the community of the vulnerability of polarization-based vision which can be more serious than RGB-based vision. |
Zhuoxiao Li; Zhihang Zhong; Shohei Nobuhara; Ko Nishino; Yinqiang Zheng; |
384 | Seeing The Unseen: Visual Common Sense for Semantic Placement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Datasets for image description are typically constructed by curating relevant images (e.g. via image search with object names) and asking humans to annotate the contents of the image; neither of those two steps are straightforward for objects not present in the image. We overcome this challenge by operating in the opposite direction: we start with an image of an object in context (which is easy to find online) and remove that object from the image via inpainting. |
Ram Ramrakhya; Aniruddha Kembhavi; Dhruv Batra; Zsolt Kira; Kuo-Hao Zeng; Luca Weihs; |
385 | SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: The main challenge lies in inferring unknown body shapes appearances and clothing details in areas not visible in the images. To address this we propose SiTH a novel pipeline that uniquely integrates an image-conditioned diffusion model into a 3D mesh reconstruction workflow. |
Hsuan-I Ho; Jie Song; Otmar Hilliges; |
386 | The Manga Whisperer: Automatically Generating Transcriptions for Comics Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Yet the inherent reliance on visual cues and illustration within manga renders it largely inaccessible to individuals with visual impairments. In this work we seek to address this substantial barrier with the aim of ensuring that manga can be appreciated and actively engaged by everyone. |
Ragav Sachdeva; Andrew Zisserman; |
387 | Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We contribute the Habitat Synthetic Scene Dataset a dataset of 211 high-quality 3D scenes and use it to test navigation agent generalization to realistic 3D environments. |
Mukul Khanna; Yongsen Mao; Hanxiao Jiang; Sanjay Haresh; Brennan Shacklett; Dhruv Batra; Alexander Clegg; Eric Undersander; Angel X. Chang; Manolis Savva; |
388 | Motion Diversification Networks Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Motion Diversification Networks a novel framework for learning to generate realistic and diverse 3D human motion. |
Hee Jae Kim; Eshed Ohn-Bar; |
389 | Compressed 3D Gaussian Splatting for Accelerated Novel View Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose a compressed 3D Gaussian splat representation that utilizes sensitivity-aware vector clustering with quantization-aware training to compress directional colors and Gaussian parameters. |
Simon Niedermayr; Josef Stumpfegger; Rüdiger Westermann; |
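To make the clustering idea tangible, here is a minimal sketch of codebook quantization of per-Gaussian color coefficients using plain k-means; the sensitivity weighting and quantization-aware training the paper adds are omitted, and all array sizes are illustrative:

```python
# Minimal codebook-quantization sketch (plain k-means; the paper additionally
# uses sensitivity-aware clustering and quantization-aware training).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
colors = rng.random((20_000, 48)).astype(np.float32)   # per-Gaussian SH coeffs

km = KMeans(n_clusters=1024, n_init=1, random_state=0).fit(colors)
codebook = km.cluster_centers_            # (1024, 48) shared float entries
indices = km.labels_.astype(np.uint16)    # one 2-byte index per Gaussian

ratio = colors.nbytes / (codebook.nbytes + indices.nbytes)
print(f"compression ratio: {ratio:.1f}x")

decoded = codebook[indices]               # lossy lookup at render time
```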
390 | Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end we introduce Animatable Gaussians a new avatar representation that leverages powerful 2D CNNs and 3D Gaussian splatting to create high-fidelity avatars. |
Zhe Li; Zerong Zheng; Lizhen Wang; Yebin Liu; |
391 | VRP-SAM: SAM with Visual Reference Prompt Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we propose a novel Visual Reference Prompt (VRP) encoder that empowers the Segment Anything Model (SAM) to utilize annotated reference images as prompts for segmentation creating the VRP-SAM model. |
Yanpeng Sun; Jiahui Chen; Shan Zhang; Xinyu Zhang; Qiang Chen; Gang Zhang; Errui Ding; Jingdong Wang; Zechao Li; |
392 | Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: A broadly received explanation is the tendency of these detectors to be overfitted to forgery-specific artifacts rather than learning features that are widely applicable across various forgeries. To address this issue we propose a simple yet effective detector called LSDA (Latent Space Data Augmentation) which is based on a heuristic idea: representations with a wider variety of forgeries should be able to learn a more generalizable decision boundary thereby mitigating the overfitting of method-specific features (see Fig. 1). |
Zhiyuan Yan; Yuhao Luo; Siwei Lyu; Qingshan Liu; Baoyuan Wu; |
393 | Bootstrapping SparseFormers from Vision Foundation Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we propose to bootstrap SparseFormers from ViT-based vision foundation models in a simple and efficient way. |
Ziteng Gao; Zhan Tong; Kevin Qinghong Lin; Joya Chen; Mike Zheng Shou; |
394 | Cinematic Behavior Transfer Via NeRF-based Differentiable Filming Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing SLAM methods face limitations in dynamic scenes and human pose estimation often focuses on 2D projections neglecting 3D statuses. To address these issues we first introduce a reverse filming behavior estimation technique. It optimizes camera trajectories by leveraging NeRF as a differentiable renderer and refining SMPL tracks. We then introduce a cinematic transfer pipeline that is able to transfer various shot types to a new 2D video or a 3D virtual environment. |
Xuekun Jiang; Anyi Rao; Jingbo Wang; Dahua Lin; Bo Dai; |
395 | Functional Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We propose functional diffusion a generative diffusion model focused on infinite-dimensional function data samples. |
Biao Zhang; Peter Wonka; |
396 | Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Large Multimodal Models (LMMs) have shown promise in vision-language tasks but struggle with high-resolution input and detailed scene understanding. Addressing these challenges we introduce Monkey to enhance LMM capabilities. |
Zhang Li; Biao Yang; Qiang Liu; Zhiyin Ma; Shuo Zhang; Jingxu Yang; Yabo Sun; Yuliang Liu; Xiang Bai; |
397 | Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce Scaffold-GS which uses anchor points to distribute local 3D Gaussians and predicts their attributes on-the-fly based on viewing direction and distance within the view frustum. |
Tao Lu; Mulin Yu; Linning Xu; Yuanbo Xiangli; Limin Wang; Dahua Lin; Bo Dai; |
398 | ControlRoom3D: Room Generation Using Semantic Proxy Rooms Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Yet many of these automatically generated 3D meshes do not adhere to typical room layouts compromising their plausibility e.g. by placing several beds in one bedroom. To address these challenges we present ControlRoom3D a novel method to generate high-quality room meshes. |
Jonas Schult; Sam Tsai; Lukas Höllein; Bichen Wu; Jialiang Wang; Chih-Yao Ma; Kunpeng Li; Xiaofang Wang; Felix Wimbauer; Zijian He; Peizhao Zhang; Bastian Leibe; Peter Vajda; Ji Hou; |
399 | MCD: Diverse Large-Scale Multi-Campus Dataset for Robot Perception Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However existing well-annotated datasets are biased towards autonomous driving scenarios while unlabelled SLAM datasets are quickly over-fitted and often lack environment and domain variations. To expand the frontier of these fields we introduce a comprehensive dataset named MCD (Multi-Campus Dataset) featuring a wide range of sensing modalities high-accuracy ground truth and diverse challenging environments across three Eurasian university campuses. |
Thien-Minh Nguyen; Shenghai Yuan; Thien Hoang Nguyen; Pengyu Yin; Haozhi Cao; Lihua Xie; Maciej Wozniak; Patric Jensfelt; Marko Thiel; Justin Ziegenbein; Noel Blunder; |
400 | Equivariant Multi-Modality Image Fusion Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However effective training of such fusion models is challenging due to the scarcity of ground truth fusion data. To tackle this issue we propose the Equivariant Multi-Modality imAge fusion (EMMA) paradigm for end-to-end self-supervised learning. |
Zixiang Zhao; Haowen Bai; Jiangshe Zhang; Yulun Zhang; Kai Zhang; Shuang Xu; Dongdong Chen; Radu Timofte; Luc Van Gool; |
401 | DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To provide consistent and controllable editing we propose the image-based video-NeRF editing pipeline with a set of innovative designs including multi-view multi-pose Score Distillation Sampling (SDS) from both the 2D personalized diffusion prior and 3D diffusion prior reconstruction losses text-guided local parts super-resolution and style transfer. |
Jia-Wei Liu; Yan-Pei Cao; Jay Zhangjie Wu; Weijia Mao; Yuchao Gu; Rui Zhao; Jussi Keppo; Ying Shan; Mike Zheng Shou; |
402 | Relightable Gaussian Codec Avatars Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we present Relightable Gaussian Codec Avatars a method to build high-fidelity relightable head avatars that can be animated to generate novel expressions. |
Shunsuke Saito; Gabriel Schwartz; Tomas Simon; Junxuan Li; Giljoo Nam; |
403 | Condition-Aware Neural Network for Controlled Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present Condition-Aware Neural Network (CAN) a new method for adding control to image generative models. |
Han Cai; Muyang Li; Qinsheng Zhang; Ming-Yu Liu; Song Han; |
404 | Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We address the problem of synthesizing multi-view optical illusions: images that change appearance upon a transformation such as a flip or rotation. We propose a simple zero-shot method for obtaining these illusions from off-the-shelf text-to-image diffusion models. |
Daniel Geng; Inbum Park; Andrew Owens; |
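The core trick, per the highlight, is to reconcile several views of the same image during sampling. A schematic sketch follows, with a stubbed noise predictor standing in for a real text-conditioned diffusion model; the transforms and prompts are illustrative:

```python
# Schematic of the multi-view sampling trick: estimate noise in every view of
# the current image, map each estimate back to the canonical frame, and
# average. `denoise` is a stub standing in for a real diffusion U-Net.
import torch

def denoise(x_t, t, prompt):                 # placeholder noise predictor
    return torch.randn_like(x_t)

views = [                                    # (forward transform, inverse)
    (lambda x: x, lambda x: x),
    (lambda x: torch.flip(x, [-2]), lambda x: torch.flip(x, [-2])),
    (lambda x: torch.rot90(x, 1, (-2, -1)), lambda x: torch.rot90(x, -1, (-2, -1))),
]
prompts = ["an oil painting of a bear",
           "an oil painting of a waterfall",
           "an oil painting of a ship"]

def combined_noise(x_t, t):
    estimates = []
    for (fwd, inv), p in zip(views, prompts):
        eps = denoise(fwd(x_t), t, p)        # noise estimate in that view
        estimates.append(inv(eps))           # back to the canonical frame
    return torch.stack(estimates).mean(0)    # one shared update for all views

eps_hat = combined_noise(torch.randn(1, 3, 64, 64), t=500)
```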
405 | Authentic Hand Avatar from A Phone Scan Via Universal Hand Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we present a universal hand model (UHM) which 1) can universally represent high-fidelity 3D hand meshes of arbitrary identities (IDs) and 2) can be adapted to each person with a short phone scan for the authentic hand avatar. |
Gyeongsik Moon; Weipeng Xu; Rohan Joshi; Chenglei Wu; Takaaki Shiratori; |
406 | Rapid 3D Model Generation with Intuitive 3D Input Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Here we propose Deep3DVRSketch the first 3D model generation network that inputs 3D VR sketches from novice users and generates highly consistent 3D models in multiple categories within seconds irrespective of the users’ drawing abilities. |
Tianrun Chen; Chaotao Ding; Shangzhan Zhang; Chunan Yu; Ying Zang; Zejian Li; Sida Peng; Lingyun Sun; |
407 | OmniViD: A Generative Framework for Universal Video Understanding Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In contrast natural language processing benefits from a unified output space i.e. text sequences which simplifies the training of powerful foundational language models such as GPT-3 with extensive training corpora. Inspired by this we seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens. |
Junke Wang; Dongdong Chen; Chong Luo; Bo He; Lu Yuan; Zuxuan Wu; Yu-Gang Jiang; |
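A Pix2Seq-style sketch of what "box tokens" can look like is below; the bin count and vocabulary offset are assumptions for illustration, not OmniViD's actual tokenizer. A box becomes four discrete ids that live in the same output vocabulary as text tokens:

```python
# Illustrative box tokenization: quantize normalized coordinates into bins and
# shift them into an assumed coordinate region of the token vocabulary.
NUM_BINS = 1000            # coordinate quantization resolution
COORD_OFFSET = 32_000      # assumed start of the coordinate vocabulary

def box_to_tokens(box, img_w, img_h):
    """(x1, y1, x2, y2) in pixels -> 4 token ids."""
    x1, y1, x2, y2 = box
    norm = (x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h)
    return [COORD_OFFSET + min(int(v * NUM_BINS), NUM_BINS - 1) for v in norm]

def tokens_to_box(tokens, img_w, img_h):
    vals = [(t - COORD_OFFSET + 0.5) / NUM_BINS for t in tokens]
    return (vals[0] * img_w, vals[1] * img_h, vals[2] * img_w, vals[3] * img_h)

toks = box_to_tokens((10, 20, 200, 160), img_w=640, img_h=480)
print(toks, tokens_to_box(toks, 640, 480))   # round-trips to within half a bin
```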
408 | TexOct: Generating Textures of 3D Models with Octree-based Diffusion Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Nevertheless achieving dense point clouds to accurately represent texture details poses a challenge due to limited computational resources. To address these challenges we propose an efficient octree-based diffusion pipeline called TexOct. |
Jialun Liu; Chenming Wu; Xinqi Liu; Xing Liu; Jinbo Wu; Haotian Peng; Chen Zhao; Haocheng Feng; Jingtuo Liu; Errui Ding; |
409 | SPAD: Spatially Aware Multi-View Diffusers Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present SPAD a novel approach for creating consistent multi-view images from text prompts or single images. |
Yash Kant; Aliaksandr Siarohin; Ziyi Wu; Michael Vasilkovsky; Guocheng Qian; Jian Ren; Riza Alp Guler; Bernard Ghanem; Sergey Tulyakov; Igor Gilitschenski; |
410 | T4P: Test-Time Training of Trajectory Prediction Via Masked Autoencoder and Actor-specific Token Memory Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: First previous works underfit and overfit as they only optimize the last layer of the motion decoder. To this end we employ the masked autoencoder (MAE) for representation learning to encourage complex interaction modeling in shifted test distribution for updating deeper layers. Second utilizing the sequential nature of driving data we propose an actor-specific token memory that enables the test-time learning of actor-wise motion characteristics. |
Daehee Park; Jaeseok Jeong; Sung-Hoon Yoon; Jaewoo Jeong; Kuk-Jin Yoon; |
411 | LEAD: Learning Decomposition for Source-free Universal Domain Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we propose a new idea of LEArning Decomposition (LEAD) which decouples features into source-known and -unknown components to identify target-private data. |
Sanqing Qu; Tianpei Zou; Lianghua He; Florian Röhrbein; Alois Knoll; Guang Chen; Changjun Jiang; |
412 | Joint-Task Regularization for Partially Labeled Multi-Task Learning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Unfortunately curating such datasets can be prohibitively expensive and impractical especially for dense prediction tasks which require per-pixel labels for each image. With this in mind we propose Joint-Task Regularization (JTR) an intuitive technique which leverages cross-task relations to simultaneously regularize all tasks in a single joint-task latent space to improve learning when data is not fully labeled for all tasks. |
Kento Nishi; Junsik Kim; Wanhua Li; Hanspeter Pfister; |
413 | Paint3D: Paint Anything 3D with Lighting-Less Texture Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper presents Paint3D a novel coarse-to-fine generative framework that is capable of producing high-resolution lighting-less and diverse 2K UV texture maps for untextured 3D meshes conditioned on text or image inputs. |
Xianfang Zeng; Xin Chen; Zhongqi Qi; Wen Liu; Zibo Zhao; Zhibin Wang; Bin Fu; Yong Liu; Gang Yu; |
414 | PACER+: On-Demand Pedestrian Animation Controller in Driving Scenarios Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We address the challenge of content diversity and controllability in pedestrian simulation for driving scenarios. |
Jingbo Wang; Zhengyi Luo; Ye Yuan; Yixuan Li; Bo Dai; |
415 | MACE: Mass Concept Erasure in Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we introduce MACE a finetuning framework for the task of mass concept erasure. |
Shilin Lu; Zilan Wang; Leyang Li; Yanzhu Liu; Adams Wai-Kin Kong; |
416 | DiffInDScene: Diffusion-based High-Quality 3D Indoor Scene Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Inspired by KinectFusion’s incremental alignment and fusion of local TSDF volumes we propose a diffusion-based SDF fusion approach that iteratively diffuses and fuses local TSDF volumes facilitating the generation of an entire room environment. |
Xiaoliang Ju; Zhaoyang Huang; Yijin Li; Guofeng Zhang; Yu Qiao; Hongsheng Li; |
417 | ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Most existing studies are devoted to designing vision-specific transformers to solve the above problems which introduce additional pre-training costs. Therefore we present a plain pre-training-free and feature-enhanced ViT backbone with Convolutional Multi-scale feature interaction named ViT-CoMer which facilitates bidirectional interaction between CNN and transformer. |
Chunlong Xia; Xinliang Wang; Feng Lv; Xin Hao; Yifeng Shi; |
418 | ProxyCap: Real-time Monocular Full-body Capture in World Space Via Human-Centric Proxy-to-Motion Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we introduce ProxyCap a human-centric proxy-to-motion learning scheme to learn world-space motions from a proxy dataset of 2D skeleton sequences and 3D rotational motions. |
Yuxiang Zhang; Hongwen Zhang; Liangxiao Hu; Jiajun Zhang; Hongwei Yi; Shengping Zhang; Yebin Liu; |
419 | Relation Rectification in Diffusion Model Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To resolve this we introduce a novel task termed Relation Rectification aiming to refine the model to accurately represent a given relationship it initially fails to generate. To address this we propose an innovative solution utilizing a Heterogeneous Graph Convolutional Network (HGCN). |
Yinwei Wu; Xingyi Yang; Xinchao Wang; |
420 | FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To this end we propose FreeCustom a novel tuning-free method to generate customized images of multi-concept composition based on reference concepts using only one image per concept as input. |
Ganggui Ding; Canyu Zhao; Wen Wang; Zhen Yang; Zide Liu; Hao Chen; Chunhua Shen; |
421 | GeoReF: Geometric Alignment Across Shape Variation for Category-level Object Pose Refinement Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Yet category-level pose refinement is a more challenging problem due to large shape variations within a category and the discrepancies between the target object and the shape prior. To address these challenges we introduce a novel architecture for category-level object pose refinement. |
Linfang Zheng; Tze Ho Elden Tse; Chen Wang; Yinghan Sun; Hua Chen; Ales Leonardis; Wei Zhang; Hyung Jin Chang; |
422 | Osprey: Pixel Understanding with Visual Instruction Tuning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we propose Osprey a mask-text instruction tuning approach to extend MLLMs by incorporating fine-grained mask regions into language instruction aiming at achieving pixel-wise visual understanding. |
Yuqian Yuan; Wentong Li; Jian Liu; Dongqi Tang; Xinjie Luo; Chi Qin; Lei Zhang; Jianke Zhu; |
423 | Visual Layout Composer: Image-Vector Dual Diffusion Model for Design Layout Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: This paper proposes an image-vector dual diffusion model for generative layout design. |
Mohammad Amin Shabani; Zhaowen Wang; Difan Liu; Nanxuan Zhao; Jimei Yang; Yasutaka Furukawa; |
424 | VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Our contributions include a novel spatio-temporal video grounding model surpassing state-of-the-art results in closed-set evaluations on multiple datasets and demonstrating superior performance in open-vocabulary scenarios. |
Syed Talal Wasim; Muzammal Naseer; Salman Khan; Ming-Hsuan Yang; Fahad Shahbaz Khan; |
425 | TASeg: Temporal Aggregation Network for LiDAR Semantic Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Specifically we propose a Temporal LiDAR Aggregation and Distillation (TLAD) algorithm which leverages historical priors to assign different aggregation steps for different classes. |
Xiaopei Wu; Yuenan Hou; Xiaoshui Huang; Binbin Lin; Tong He; Xinge Zhu; Yuexin Ma; Boxi Wu; Haifeng Liu; Deng Cai; Wanli Ouyang; |
426 | DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However generating high-resolution images with diffusion models is still challenging due to the enormous computational costs resulting in a prohibitive latency for interactive applications. In this paper we propose DistriFusion to tackle this problem by leveraging parallelism across multiple GPUs. |
Muyang Li; Tianle Cai; Jiaxin Cao; Qinsheng Zhang; Han Cai; Junjie Bai; Yangqing Jia; Kai Li; Song Han; |
427 | Hallucination Augmented Contrastive Learning for Multimodal Large Language Model Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we address hallucinations in MLLMs from a novel perspective of representation learning. |
Chaoya Jiang; Haiyang Xu; Mengfan Dong; Jiaxing Chen; Wei Ye; Ming Yan; Qinghao Ye; Ji Zhang; Fei Huang; Shikun Zhang; |
428 | Boosting Spike Camera Image Reconstruction from A Perspective of Dealing with Spike Fluctuations Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we present an approach to deal with spike fluctuations and boost spike camera image reconstruction. |
Rui Zhao; Ruiqin Xiong; Jing Zhao; Jian Zhang; Xiaopeng Fan; Zhaofei Yu; Tiejun Huang; |
429 | VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To embark on video editing with shape change we explore customized video subject swapping in this work where we aim to replace the main subject in a source video with a target subject having a distinct identity and potentially different shape. In contrast to previous methods that rely on dense correspondences we introduce the VideoSwap framework that exploits semantic point correspondences inspired by our observation that only a small number of semantic points are necessary to align the subject’s motion trajectory and modify its shape. |
Yuchao Gu; Yipin Zhou; Bichen Wu; Licheng Yu; Jia-Wei Liu; Rui Zhao; Jay Zhangjie Wu; David Junhao Zhang; Mike Zheng Shou; Kevin Tang; |
430 | Ensemble Diversity Facilitates Adversarial Transferability Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To address the issues we propose a novel method of Stochastic Mini-batch black-box attack with Ensemble Reweighing using reinforcement learning (SMER) to produce highly transferable adversarial examples. |
Bowen Tang; Zheng Wang; Yi Bin; Qi Dou; Yang Yang; Heng Tao Shen; |
431 | Efficient 3D Implicit Head Avatar with Mesh-anchored Hash Table Blendshapes Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: While attempts have been made to develop fast neural rendering approaches for static scenes these methods cannot be simply employed to support realistic facial expressions such as in the case of a dynamic facial performance. To address these challenges we propose a novel fast 3D neural implicit head avatar model that achieves real-time rendering while maintaining fine-grained controllability and high rendering quality. |
Ziqian Bai; Feitong Tan; Sean Fanello; Rohit Pandey; Mingsong Dou; Shichen Liu; Ping Tan; Yinda Zhang; |
432 | LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Large Language Models (LLMs) have demonstrated impressive reasoning capabilities showing potential to bridge this gap. In this paper we present LaMPilot a novel framework that integrates LLMs into AD systems enabling them to follow user instructions by generating code that leverages established functional primitives. |
Yunsheng Ma; Can Cui; Xu Cao; Wenqian Ye; Peiran Liu; Juanwu Lu; Amr Abdelraouf; Rohit Gupta; Kyungtae Han; Aniket Bera; James M. Rehg; Ziran Wang; |
433 | CPR-Coach: Recognizing Composite Error Actions Based on Single-class Training Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: To solve the unavoidable "Single-class Training & Multi-class Testing" problem we propose a human-cognition-inspired framework named ImagineNet to improve the model’s multi-error recognition performance under restricted supervision. |
Shunli Wang; Shuaibing Wang; Dingkang Yang; Mingcheng Li; Haopeng Kuang; Xiao Zhao; Liuzhen Su; Peng Zhai; Lihua Zhang; |
434 | GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Inspired by recent 3D Gaussian splatting we propose a systematic framework named GaussianEditor to edit 3D scenes delicately via 3D Gaussians with text instructions. |
Junjie Wang; Jiemin Fang; Xiaopeng Zhang; Lingxi Xie; Qi Tian; |
435 | X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However the adaptation of these models to egocentric videos has been largely unexplored. To address this gap we propose a simple yet effective cross-modal adaptation framework which we call X-MIC. |
Anna Kukleva; Fadime Sener; Edoardo Remelli; Bugra Tekin; Eric Sauser; Bernt Schiele; Shugao Ma; |
436 | Diffusion 3D Features (Diff3F): Decorating Untextured Shapes with Distilled Semantic Features Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We present Diff3F as a simple robust and class-agnostic feature descriptor that can be computed for untextured input shapes (meshes or point clouds). |
Niladri Shekhar Dutt; Sanjeev Muralikrishnan; Niloy J. Mitra; |
437 | Diffusion-EDFs: Bi-equivariant Denoising Generative Modeling on SE(3) for Visual Robotic Manipulation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we present Diffusion-EDFs a novel SE(3)-equivariant diffusion-based approach for visual robotic manipulation tasks. |
Hyunwoo Ryu; Jiwoo Kim; Hyunseok An; Junwoo Chang; Joohwan Seo; Taehan Kim; Yubin Kim; Chaewon Hwang; Jongeun Choi; Roberto Horowitz; |
438 | DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However prior works are constrained to local editing since they disrupt the structure path of the pre-trained T2I model. To overcome this we propose a novel plug-in method called DreamMatcher which reformulates T2I personalization as semantic matching. |
Jisu Nam; Heesu Kim; DongJae Lee; Siyoon Jin; Seungryong Kim; Seunggyu Chang; |
439 | COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However compressed views like TPV representation lose 3D geometry information while raw and sparse OCC representation requires heavy but redundant computational costs. To address the above limitations we propose Compact Occupancy TRansformer (COTR) with a geometry-aware occupancy encoder and a semantic-aware group decoder to reconstruct a compact 3D OCC representation. |
Qihang Ma; Xin Tan; Yanyun Qu; Lizhuang Ma; Zhizhong Zhang; Yuan Xie; |
440 | Building A Strong Pre-Training Baseline for Universal 3D Large-Scale Perception Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Such inconsiderate consistency greatly hampers the promising path of reaching a universal pre-training framework: (1) the cross-scene semantic self-conflict i.e. the intense collision between primitive segments of the same semantics from different scenes; (2) lacking a globally unified bond that pushes the cross-scene semantic consistency into 3D representation learning. To address the above challenges we propose a CSC framework that puts scene-level semantic consistency at its heart bridging the connection of similar semantic segments across various scenes. |
Haoming Chen; Zhizhong Zhang; Yanyun Qu; Ruixin Zhang; Xin Tan; Yuan Xie; |
441 | Motion Blur Decomposition with Cross-shutter Guidance Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper inspired by the complementary exposure characteristics of a global shutter (GS) camera and a rolling shutter (RS) camera we propose to utilize the ordered scanline-wise delay in a rolling shutter image to robustify motion decomposition of a single blurry image. |
Xiang Ji; Haiyang Jiang; Yinqiang Zheng; |
442 | Total-Decom: Decomposed 3D Scene Reconstruction with Minimal Interaction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we present Total-Decom a novel method for decomposed 3D reconstruction with minimal human interaction. |
Xiaoyang Lyu; Chirui Chang; Peng Dai; Yang-Tian Sun; Xiaojuan Qi; |
443 | LEDITS++: Limitless Image Editing Using Text-to-Image Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: They either require time-consuming fine-tuning, deviate unnecessarily strongly from the input image, and/or lack support for multiple simultaneous edits. To address these issues we introduce LEDITS++ an efficient yet versatile and precise textual image manipulation technique. |
Manuel Brack; Felix Friedrich; Katharia Kornmeier; Linoy Tsaban; Patrick Schramowski; Kristian Kersting; Apolinario Passos; |
444 | Free3D: Consistent Novel View Synthesis Without 3D Representation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We introduce Free3D a simple accurate method for monocular open-set novel view synthesis (NVS). |
Chuanxia Zheng; Andrea Vedaldi; |
445 | ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce ECLIPSE a novel contrastive learning method that is both parameter and data-efficient. |
Maitreya Patel; Changhoon Kim; Sheng Cheng; Chitta Baral; Yezhou Yang; |
446 | Do Vision and Language Encoders Represent The World Similarly? Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In the absence of statistical similarity in aligned encoders like CLIP we show that a possible matching of unaligned encoders exists without any training. We frame this as a seeded graph-matching problem exploiting the semantic similarity between graphs and propose two methods – a Fast Quadratic Assignment Problem optimization and a novel localized CKA metric-based matching/retrieval. |
Mayug Maniparambil; Raiymbek Akshulakov; Yasser Abdelaziz Dahou Djilali; Mohamed El Amine Seddik; Sanath Narayan; Karttikeya Mangalam; Noel E. O’Connor; |
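For reference, the localized matching above builds on centered kernel alignment (CKA). A minimal linear-CKA implementation of the standard global formula is given below; the paper's localized variant differs:

```python
# Minimal linear CKA between two encoders' features on the same n samples
# (the standard global formula; the paper contributes a *localized* variant).
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """X: (n, d1), Y: (n, d2); returns similarity in [0, 1]."""
    X = X - X.mean(axis=0, keepdims=True)     # center each feature dimension
    Y = Y - Y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
A = rng.normal(size=(512, 768))               # e.g. vision-encoder features
print(linear_cka(A, A))                       # 1.0: identical representations
print(linear_cka(A, rng.normal(size=(512, 384))))  # near 0: unrelated features
```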
447 | MicroDiffusion: Implicit Representation-Guided Diffusion for 3D Reconstruction from Limited 2D Microscopy Projections Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Volumetric optical microscopy using non-diffracting beams enables rapid imaging of 3D volumes by projecting them axially to 2D images but lacks crucial depth information. Addressing this we introduce MicroDiffusion a pioneering tool facilitating high-quality depth-resolved 3D volume reconstruction from limited 2D projections. |
Mude Hui; Zihao Wei; Hongru Zhu; Fei Xia; Yuyin Zhou; |
448 | ASAM: Boosting Segment Anything Model with Adversarial Tuning Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This paper introduces ASAM a novel methodology that amplifies SAM’s performance through adversarial tuning. |
Bo Li; Haoke Xiao; Lv Tang; |
449 | Multimodal Representation Learning By Alternating Unimodal Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However existing multimodal learning methods often struggle with challenges where some modalities appear more dominant than others during multimodal learning resulting in suboptimal performance. To address this challenge we propose MLA (Multimodal Learning with Alternating Unimodal Adaptation). |
Xiaohui Zhang; Jaehong Yoon; Mohit Bansal; Huaxiu Yao; |
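A schematic of the alternating idea follows, under our own simplifying assumptions (a shared head and one full phase per modality); MLA's gradient modification and test-time dynamic fusion are omitted:

```python
# Alternating unimodal adaptation, schematically: optimize one modality's
# encoder per phase against a shared head instead of training jointly, so a
# dominant modality cannot drown out the weaker one's updates.
import torch
import torch.nn as nn

audio_enc = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
video_enc = nn.Sequential(nn.Linear(512, 64), nn.ReLU(), nn.Linear(64, 32))
head = nn.Linear(32, 10)                  # shared classification head
loss_fn = nn.CrossEntropyLoss()

def train_phase(encoder, x, y, steps=100):
    opt = torch.optim.Adam(list(encoder.parameters()) + list(head.parameters()))
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(head(encoder(x)), y)
        loss.backward()
        opt.step()

xa, xv = torch.randn(256, 128), torch.randn(256, 512)   # toy paired data
y = torch.randint(0, 10, (256,))
for epoch in range(5):                    # alternate phases, never joint
    train_phase(audio_enc, xa, y)
    train_phase(video_enc, xv, y)
```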
450 | 360Loc: A Dataset and Benchmark for Omnidirectional Visual Localization with Cross-device Queries Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We propose a virtual camera approach to generate lower-FoV query frames from 360° images which ensures a fair comparison of performance among different query types in visual localization tasks. |
Huajian Huang; Changkun Liu; Yipeng Zhu; Hui Cheng; Tristan Braud; Sai-Kit Yeung; |
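The virtual-camera step is a standard equirectangular-to-pinhole reprojection. Here is a self-contained sketch of the textbook mapping (our implementation, not the dataset's tooling):

```python
# Render a virtual perspective camera of a chosen FoV and heading out of a
# 360° equirectangular panorama: cast a ray per output pixel, rotate it by
# yaw/pitch, convert to spherical coordinates, and sample the panorama.
import numpy as np
import cv2

def equirect_to_pinhole(pano, fov_deg, yaw_deg, pitch_deg, out_w, out_h):
    f = 0.5 * out_w / np.tan(np.radians(fov_deg) / 2)    # focal length in px
    u, v = np.meshgrid(np.arange(out_w), np.arange(out_h))
    d = np.stack([(u - out_w / 2) / f, (v - out_h / 2) / f, np.ones(u.shape)], -1)
    d /= np.linalg.norm(d, axis=-1, keepdims=True)        # per-pixel view rays
    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    Ry = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                   [0, 1, 0],
                   [-np.sin(yaw), 0, np.cos(yaw)]])
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(pitch), -np.sin(pitch)],
                   [0, np.sin(pitch), np.cos(pitch)]])
    d = d @ (Ry @ Rx).T                                   # rotate the rays
    lon = np.arctan2(d[..., 0], d[..., 2])                # ray -> spherical
    lat = np.arcsin(np.clip(d[..., 1], -1, 1))
    H, W = pano.shape[:2]
    map_x = ((lon / (2 * np.pi) + 0.5) * W).astype(np.float32)
    map_y = ((lat / np.pi + 0.5) * H).astype(np.float32)
    return cv2.remap(pano, map_x, map_y, cv2.INTER_LINEAR)

# e.g. view = equirect_to_pinhole(pano, 90, yaw_deg=30, pitch_deg=0,
#                                 out_w=640, out_h=480)
```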
451 | Photo-SLAM: Real-time Simultaneous Localization and Photorealistic Mapping for Monocular Stereo and RGB-D Cameras Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we present Photo-SLAM a novel SLAM framework with a hyper primitives map. |
Huajian Huang; Longwei Li; Hui Cheng; Sai-Kit Yeung; |
452 | Fair Federated Learning Under Domain Skew with Local Consistency and Domain Diversity Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This leads to a biased model convergence objective and uneven performance across domains. We discover a pronounced directional update consistency in Federated Learning and propose a novel framework to tackle the above issues. |
Yuhang Chen; Wenke Huang; Mang Ye; |
453 | HiFi4G: High-Fidelity Human Performance Rendering Via Compact Gaussian Splatting Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we present HiFi4G an explicit and compact Gaussian-based approach for high-fidelity human performance rendering from dense footage. |
Yuheng Jiang; Zhehao Shen; Penghao Wang; Zhuo Su; Yu Hong; Yingliang Zhang; Jingyi Yu; Lan Xu; |
454 | Revisiting Single Image Reflection Removal In The Wild Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: We devise an advanced reflection collection pipeline that is highly adaptable to a wide range of real-world reflection scenarios and incurs reduced costs in collecting large-scale aligned reflection pairs. |
Yurui Zhu; Xueyang Fu; Peng-Tao Jiang; Hao Zhang; Qibin Sun; Jinwei Chen; Zheng-Jun Zha; Bo Li; |
455 | Discriminative Pattern Calibration Mechanism for Source-Free Domain Adaptation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Hence understanding the specific content of discriminative patterns and adjusting their representation in the target domain become the key to overcoming SFDA. To achieve this vision this paper proposes a novel explanation paradigm the "Discriminative Pattern Calibration (DPC)" mechanism for solving the SFDA issue. |
Haifeng Xia; Siyu Xia; Zhengming Ding; |
456 | Shallow-Deep Collaborative Learning for Unsupervised Visible-Infrared Person Re-Identification Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: The shallow features from the early layers contain nuanced detail information which is critical for effective cross-modality learning but is regrettably disregarded by existing methods. To address the above issues we design a Shallow-Deep Collaborative Learning (SDCL) framework based on the transformer with shallow-deep contrastive learning incorporating Collaborative Neighbor Learning (CNL) and Collaborative Ranking Association (CRA) modules. |
Bin Yang; Jun Chen; Mang Ye; |
457 | BANF: Band-Limited Neural Fields for Levels of Detail Reconstruction Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing methods that attempt to decompose neural fields in the frequency domain either resort to heuristics or require extensive modifications to the neural field architecture. We show that via a simple modification one can obtain neural fields that are low-pass filtered and in turn show how this can be exploited to obtain a frequency decomposition of the entire signal. |
Akhmedkhan Shabanov; Shrisudhan Govindarajan; Cody Reading; Lily Goli; Daniel Rebain; Kwang Moo Yi; Andrea Tagliasacchi; |
458 | Neural Fields As Distributions: Signal Processing Beyond Euclidean Space Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, in contrast to classical discrete digital signal processing, the portfolio of tools to process such representations is still severely limited and restricted to Euclidean domains. In this paper we address this problem by showing how a probabilistic re-interpretation of neural fields can enable their training and inference processes to become "filter-aware". |
Daniel Rebain; Soroosh Yazdani; Kwang Moo Yi; Andrea Tagliasacchi; |
459 | MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To address this issue we present our pioneering work that enables parameter-efficient VTR using a pre-trained model with only a small number of tunable parameters during training. Towards this goal we propose a new method dubbed Multimodal Video Adapter (MV-Adapter) for efficiently transferring the knowledge in the pre-trained CLIP from image-text to video-text. |
Xiaojie Jin; Bowen Zhang; Weibo Gong; Kai Xu; Xueqing Deng; Peng Wang; Zhao Zhang; Xiaohui Shen; Jiashi Feng; |
460 | Aligning Logits Generatively for Principled Black-Box Knowledge Distillation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we formalize a two-step workflow consisting of deprivatization and distillation, and theoretically provide a new optimization direction, from logits to cell boundary, that differs from direct logits alignment. |
Jing Ma; Xiang Xiang; Ke Wang; Yuchuan Wu; Yongbin Li; |
461 | EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Being able to map the activities of others into one’s own point of view is a fundamental human skill, present even from a very early age. Taking a step toward understanding this human ability, we introduce EgoExoLearn, a large-scale dataset that emulates the human demonstration-following process, in which individuals record egocentric videos as they execute tasks guided by demonstration videos. |
Yifei Huang; Guo Chen; Jilan Xu; Mingfang Zhang; Lijin Yang; Baoqi Pei; Hongjie Zhang; Lu Dong; Yali Wang; Limin Wang; Yu Qiao; |
462 | Diffusion-ES: Gradient-free Planning with Diffusion for Autonomous and Instruction-guided Driving Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we propose Diffusion-ES, a method that combines gradient-free optimization with trajectory denoising to optimize black-box, non-differentiable objectives while staying in the data manifold. |
Brian Yang; Huangyuan Su; Nikolaos Gkanatsios; Tsung-Wei Ke; Ayush Jain; Jeff Schneider; Katerina Fragkiadaki; |
463 | MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present MM-Narrator a novel system leveraging GPT-4 with multimodal in-context learning for the generation of audio descriptions (AD). |
Chaoyi Zhang; Kevin Lin; Zhengyuan Yang; Jianfeng Wang; Linjie Li; Chung-Ching Lin; Zicheng Liu; Lijuan Wang; |
464 | HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This work aims to investigate various hallucinations (i.e., object, relation, and attribute hallucinations) and mitigate those hallucinatory toxicities in large-scale machine-generated visual instruction datasets. |
Qifan Yu; Juncheng Li; Longhui Wei; Liang Pang; Wentao Ye; Bosheng Qin; Siliang Tang; Qi Tian; Yueting Zhuang; |
465 | Generalized Predictive Model for Autonomous Driving Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we introduce the first large-scale video prediction model in the autonomous driving discipline. |
Jiazhi Yang; Shenyuan Gao; Yihang Qiu; Li Chen; Tianyu Li; Bo Dai; Kashyap Chitta; Penghao Wu; Jia Zeng; Ping Luo; Jun Zhang; Andreas Geiger; Yu Qiao; Hongyang Li; |
466 | Boosting Order-Preserving and Transferability for Neural Architecture Search: A Joint Architecture Refined Search and Fine-tuning Approach Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work we analyze the order-preserving ability on the whole search space (global) and on a sub-space of top architectures (local), and empirically show that the local order-preserving ability of current two-stage NAS methods still needs to be improved. |
Beichen Zhang; Xiaoxing Wang; Xiaohan Qin; Junchi Yan; |
467 | DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However due to its reliance on generative adversarial networks (GANs) its generality is limited by the capacity of pretrained GAN models. In this work we extend this editing framework to diffusion models and propose a novel approach DragDiffusion. |
Yujun Shi; Chuhui Xue; Jun Hao Liew; Jiachun Pan; Hanshu Yan; Wenqing Zhang; Vincent Y. F. Tan; Song Bai; |
468 | Learning Multi-Dimensional Human Preference for Text-to-Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, preference results vary when humans evaluate images on different aspects. Therefore, to learn multi-dimensional human preferences, we propose the Multi-dimensional Preference Score (MPS), the first multi-dimensional preference scoring model for the evaluation of text-to-image models. |
Sixian Zhang; Bohan Wang; Junqiang Wu; Yan Li; Tingting Gao; Di Zhang; Zhongyuan Wang; |
469 | DifFlow3D: Toward Robust Uncertainty-Aware Scene Flow Estimation with Iterative Diffusion-Based Refinement Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However previous works commonly suffer from unreliable correlation caused by locally constrained searching ranges and struggle with accumulated inaccuracy arising from the coarse-to-fine structure. To alleviate these problems we propose a novel uncertainty-aware scene flow estimation network (DifFlow3D) with the diffusion probabilistic model. |
Jiuming Liu; Guangming Wang; Weicai Ye; Chaokang Jiang; Jinru Han; Zhe Liu; Guofeng Zhang; Dalong Du; Hesheng Wang; |
470 | Holo-Relighting: Controllable Volumetric Portrait Relighting from A Single Image Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we propose Holo-Relighting a volumetric relighting method that is capable of synthesizing novel viewpoints and novel lighting from a single image. |
Yiqun Mei; Yu Zeng; He Zhang; Zhixin Shu; Xuaner Zhang; Sai Bi; Jianming Zhang; HyunJoon Jung; Vishal M. Patel; |
471 | DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans Via Natural Language Feedback Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We present DRESS a large vision language model (LVLM) that innovatively exploits Natural Language feedback (NLF) from Large Language Models to enhance its alignment and interactions by addressing two key limitations in the state-of-the-art LVLMs. |
Yangyi Chen; Karan Sikka; Michael Cogswell; Heng Ji; Ajay Divakaran; |
472 | DiverGen: Improving Instance Segmentation By Learning Wider Data Distribution with More Diverse Generative Data Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: While recent works have delved into exploiting generative models to create synthetic datasets for data augmentation, these approaches do not efficiently harness the full potential of generative models. To address these issues, we introduce a more efficient strategy to construct generative datasets for data augmentation, termed DiverGen. |
Chengxiang Fan; Muzhi Zhu; Hao Chen; Yang Liu; Weijia Wu; Huaqi Zhang; Chunhua Shen; |
473 | Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we aim to study parameter-efficient transfer learning for point cloud analysis with an ideal trade-off between task performance and parameter efficiency. |
Xin Zhou; Dingkang Liang; Wei Xu; Xingkui Zhu; Yihan Xu; Zhikang Zou; Xiang Bai; |
474 | MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Inspired by the idea of divide and conquer we introduce an innovative approach named Multi-Instance Generation Controller (MIGC) to address the challenges of the MIG task. |
Dewei Zhou; You Li; Fan Ma; Xiaoting Zhang; Yi Yang; |
475 | PeVL: Pose-Enhanced Vision-Language Model for Fine-Grained Human Action Recognition Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we propose a novel framework called the Pose-enhanced Vision-Language (PeVL) model to adapt the VL model with pose modality to learn effective knowledge of fine-grained human actions. |
Haosong Zhang; Mei Chee Leong; Liyuan Li; Weisi Lin; |
476 | SingularTrajectory: Universal Trajectory Predictor Using Diffusion Model Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we propose SingularTrajectory a diffusion-based universal trajectory prediction framework to reduce the performance gap across the five tasks. |
Inhwan Bae; Young-Jae Park; Hae-Gon Jeon; |
477 | Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Here we propose a beam-search-based most-likely prediction and a temperature-based multimodal prediction to implement both deterministic and stochastic inferences. |
Inhwan Bae; Junoh Lee; Hae-Gon Jeon; |
478 | MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Existing datasets of high-fidelity 3D human capture continue to be mid-sized due to the significant challenges in acquiring large-scale high-quality 3D human data. To bridge this gap we present MVHumanNet a dataset that comprises multi-view human action sequences of 4500 human identities. |
Zhangyang Xiong; Chenghong Li; Kenkun Liu; Hongjie Liao; Jianqiao Hu; Junyi Zhu; Shuliang Ning; Lingteng Qiu; Chongjie Wang; Shijie Wang; Shuguang Cui; Xiaoguang Han; |
479 | GART: Gaussian Articulated Template Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: We introduce Gaussian Articulated Template Model (GART) an explicit efficient and expressive representation for non-rigid articulated subject capturing and rendering from monocular videos. |
Jiahui Lei; Yufu Wang; Georgios Pavlakos; Lingjie Liu; Kostas Daniilidis; |
480 | MorpheuS: Neural Dynamic 360° Surface Reconstruction from Monocular RGB-D Video Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: Despite this, real-world video scenarios often feature large unobserved regions where neural representations struggle to achieve realistic completion. To tackle this challenge, we introduce MorpheuS, a framework for dynamic 360° surface reconstruction from a casually captured RGB-D video. |
Hengyi Wang; Jingwen Wang; Lourdes Agapito; |
481 | SecondPose: SE(3)-Consistent Dual-Stream Feature Fusion for Category-Level Pose Estimation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Existing works utilizing mean shapes often fall short of capturing this variation. To address this issue we present SecondPose a novel approach integrating object-specific geometric features with semantic category priors from DINOv2. |
Yamei Chen; Yan Di; Guangyao Zhai; Fabian Manhardt; Chenyangguang Zhang; Ruida Zhang; Federico Tombari; Nassir Navab; Benjamin Busam; |
482 | DemoCaricature: Democratising Caricature Generation with A Rough Sketch Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we democratise caricature generation empowering individuals to effortlessly craft personalised caricatures with just a photo and a conceptual sketch. |
Dar-Yen Chen; Ayan Kumar Bhunia; Subhadeep Koley; Aneeshan Sain; Pinaki Nath Chowdhury; Yi-Zhe Song; |
483 | ProS: Prompting-to-simulate Generalized Knowledge for Universal Cross-Domain Retrieval Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Concretely, in the Prompt Units Learning stage, we introduce two Prompt Units to individually capture domain and semantic knowledge in a mask-and-align way. |
Kaipeng Fang; Jingkuan Song; Lianli Gao; Pengpeng Zeng; Zhi-Qi Cheng; Xiyao Li; Heng Tao Shen; |
484 | Imagine Before Go: Self-Supervised Generative Map for Object Goal Navigation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this work we propose the self-supervised generative map (SGM) a modular method that learns the explicit context relation via self-supervised learning. |
Sixian Zhang; Xinyao Yu; Xinhang Song; Xiaohan Wang; Shuqiang Jiang; |
485 | LASO: Language-guided Affordance Segmentation on 3D Object Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This oversight not only limits their generalization to unseen objects but, more importantly, hinders their synergy with large language models (LLMs), which are excellent task planners that can decompose an overarching command into agent-actionable instructions. In this regard, we propose a novel task, Language-guided Affordance Segmentation on 3D Object (LASO), which challenges a model to segment a 3D object’s part relevant to a given affordance question. |
Yicong Li; Na Zhao; Junbin Xiao; Chun Feng; Xiang Wang; Tat-seng Chua; |
486 | HHMR: Holistic Hand Mesh Recovery By Enhancing The Multimodal Controllability of Graph Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we extend the ability of controllable generative models to a more comprehensive hand mesh recovery task: direct hand mesh generation, inpainting, reconstruction, and fitting in a single framework, which we name Holistic Hand Mesh Recovery (HHMR). |
Mengcheng Li; Hongwen Zhang; Yuxiang Zhang; Ruizhi Shao; Tao Yu; Yebin Liu; |
487 | FreePoint: Unsupervised Point Cloud Instance Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Based on the point features we perform a bottom-up multicut algorithm to segment point clouds into coarse instance masks as pseudo labels which are used to train a point cloud instance segmentation model. We propose an id-as-feature strategy at this stage to alleviate the randomness of the multicut algorithm and improve the pseudo labels’ quality. |
Zhikai Zhang; Jian Ding; Li Jiang; Dengxin Dai; Guisong Xia; |
488 | Exact Fusion Via Feature Distribution Matching for Few-shot Image Generation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we propose an exact Fusion via Feature Distribution matching Generative Adversarial Network (F2DGAN) for few-shot image generation. |
Yingbo Zhou; Yutong Ye; Pengyu Zhang; Xian Wei; Mingsong Chen; |
489 | DART: Implicit Doppler Tomography for Radar Novel View Synthesis Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: However, simulating realistic radar scans is a challenging task that requires an accurate model of the scene, radio-frequency material properties, and a corresponding radar synthesis function. Rather than specifying these models explicitly, we propose DART (Doppler Aided Radar Tomography), a Neural Radiance Field-inspired method which uses radar-specific physics to create a reflectance- and transmittance-based rendering pipeline for range-Doppler images. |
Tianshu Huang; John Miller; Akarsh Prabhakara; Tao Jin; Tarana Laroia; Zico Kolter; Anthony Rowe; |
490 | COCONut: Modernizing COCO Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this study we undertake a comprehensive reevaluation of the COCO segmentation annotations. |
Xueqing Deng; Qihang Yu; Peng Wang; Xiaohui Shen; Liang-Chieh Chen; |
491 | A Picture Is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: To show the value of dense and highly-aligned image-text pairs we collect the Densely Captioned Images (DCI) dataset containing 8012 natural images human-annotated with mask-aligned descriptions averaging above 1000 words each. With precise and reliable captions associated with specific parts of an image we can evaluate vision-language models’ (VLMs) understanding of image content with a novel task that matches each caption with its corresponding subcrop. |
Jack Urbanek; Florian Bordes; Pietro Astolfi; Mary Williamson; Vasu Sharma; Adriana Romero-Soriano; |
492 | Calibrating Multi-modal Representations: A Pursuit of Group Robustness Without Annotations Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Following recent work on Deep Feature Reweighting (DFR), we first verify that last-layer retraining can greatly improve group robustness of pretrained CLIP. In view of this, we advocate a lightweight representation calibration method for fine-tuning CLIP: first generating a calibration set using the pretrained CLIP, and then calibrating the representations of samples within this set through contrastive learning, all without the need for group labels. |
Chenyu You; Yifei Min; Weicheng Dai; Jasjeet S. Sekhon; Lawrence Staib; James S. Duncan; |
493 | Content-Adaptive Non-Local Convolution for Remote Sensing Pansharpening Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: In this paper we introduce content-adaptive non-local convolution (CANConv), a novel method tailored for remote sensing image pansharpening. |
Yule Duan; Xiao Wu; Haoyu Deng; Liang-Jian Deng; |
494 | CFAT: Unleashing Triangular Windows for Image Super-resolution Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However, it suffers from distortion at the boundaries and has limited unique shifting modes. To overcome these weaknesses, we propose a non-overlapping triangular window technique that works synchronously with the rectangular one to mitigate boundary-level distortion and allows the model to access more unique shifting modes. |
Abhisek Ray; Gaurav Kumar; Maheshkumar H. Kolekar; |
495 | Multi-Space Alignments Towards Universal LiDAR Segmentation Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: This work presents M3Net, a one-of-a-kind framework for fulfilling multi-task, multi-dataset, multi-modality LiDAR segmentation in a universal manner using just a single set of parameters. |
Youquan Liu; Lingdong Kong; Xiaoyang Wu; Runnan Chen; Xin Li; Liang Pan; Ziwei Liu; Yuexin Ma; |
496 | SimDA: Simple Diffusion Adapter for Efficient Video Generation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we propose a Simple Diffusion Adapter (SimDA) that fine-tunes only 24M out of 1.1B parameters of a strong T2I model adapting it to video generation in a parameter-efficient way. |
Zhen Xing; Qi Dai; Han Hu; Zuxuan Wu; Yu-Gang Jiang; |
497 | GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: However the generalizability of existing methods is constrained due to their framework designs and their reliance on 3D data. We address this limitation by introducing Generalizable Open-Vocabulary Neural Semantic Fields (GOV-NeSF) a novel approach offering a generalizable implicit representation of 3D scenes with open-vocabulary semantics. |
Yunsong Wang; Hanlin Chen; Gim Hee Lee; |
498 | Image Processing GNN: Breaking Rigidity in Super-Resolution Related Papers Related Patents Related Grants Related Venues Related Experts Related Code View Highlight: Alternatively we leverage the flexibility of graphs and propose the Image Processing GNN (IPG) model to break the rigidity that dominates previous SR methods. |
Yuchuan Tian; Hanting Chen; Chao Xu; Yunhe Wang; |
499 | Spectral Meets Spatial: Harmonising 3D Shape Matching and Interpolation Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this work we present a unified framework to predict both point-wise correspondences and shape interpolation between 3D shapes. |
Dongliang Cao; Marvin Eisenberger; Nafie El Amrani; Daniel Cremers; Florian Bernard; |
500 | CCEdit: Creative and Controllable Video Editing Via Diffusion Models Related Papers Related Patents Related Grants Related Venues Related Experts View Highlight: In this paper we present CCEdit a versatile generative video editing framework based on diffusion models. |
Ruoyu Feng; Wenming Weng; Yanhui Wang; Yuhui Yuan; Jianmin Bao; Chong Luo; Zhibo Chen; Baining Guo; |
This table only includes 500 papers selected by our daily digest algorithm. To continue with the full list (~2,700 papers), please visit Paper Digest: CVPR-2024 (Full List).