Most Influential ECCV Papers (2024-09)
The European Conference on Computer Vision (ECCV) is one of the top computer vision conferences in the world. The Paper Digest Team analyzes all papers published at ECCV over the years and presents the 15 most influential papers from each year. The ranking is constructed automatically from citations in both research papers and granted patents, and is updated frequently to reflect the most recent changes. To find the latest version of this list, or the most influential papers from other conferences and journals, please visit the Best Paper Digest page. Note: the most influential papers may or may not include those that won best paper awards. (Version: 2024-09)
To search or review ECCV papers on a specific topic, please use the search by venue (ECCV) and review by venue (ECCV) services. To browse the most productive ECCV authors by year, ranked by the number of accepted papers, see the most productive ECCV authors grouped by year.
This list is created by the Paper Digest Team. Experience the cutting-edge capabilities of Paper Digest, an innovative AI-powered research platform that empowers you to write, review, get answers, and more.
Paper Digest Team
New York City, New York, 10017
team@paperdigest.org
TABLE 1: Most Influential ECCV Papers (2024-09)
Year | Rank | Paper | Author(s) |
---|---|---|---|
2024 | 1 | Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection (IF: 8) Highlight: In this paper, we develop an open-set object detector, called Grounding DINO, by marrying the Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects given human inputs such as category names or referring expressions. | SHILONG LIU et al. |
2024 | 2 | MMBench: Is Your Multi-Modal Model An All-around Player? (IF: 6) Highlight: Meanwhile, subjective benchmarks, such as OwlEval, offer comprehensive evaluations of a model's abilities by incorporating human labor, which is not scalable and may display significant bias. In response to these challenges, we propose MMBench, a bilingual benchmark for assessing the multi-modal capabilities of VLMs. | YUAN LIU et al. |
2024 | 3 | YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information (IF: 5) Highlight: We propose the concept of programmable gradient information (PGI) to cope with the various changes required by deep networks to achieve multiple objectives. | Chien-Yao Wang; I-Hau Yeh; Hong-Yuan Mark Liao; |
2024 | 4 | ShareGPT4V: Improving Large Multi-Modal Models with Better Captions (IF: 5) Highlight: In this paper, we delve into the influence of training data on LMMs, uncovering three pivotal findings: 1) highly detailed captions enable more nuanced vision-language alignment, significantly boosting the performance of LMMs on diverse benchmarks and surpassing outcomes from brief captions or VQA data; 2) cutting-edge LMMs can approach the captioning capability of costly human annotators, and open-source LMMs can reach similar quality after lightweight fine-tuning; 3) the performance of LMMs scales with the number of detailed captions, exhibiting remarkable improvements across a range from thousands to millions of captions. | LIN CHEN et al. |
2024 | 5 | Adversarial Diffusion Distillation (IF: 4) Highlight: We introduce Adversarial Diffusion Distillation (ADD), a novel training approach that efficiently samples large-scale foundational image diffusion models in just 1–4 steps while maintaining high image quality. | Axel Sauer; Dominik Lorenz; Andreas Blattmann; Robin Rombach; |
2024 | 6 | LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation (IF: 4) Highlight: In this paper, we introduce the Large Multi-View Gaussian Model (LGM), a novel framework designed to generate high-resolution 3D models from text prompts or single-view images. | JIAXIANG TANG et al. |
2024 | 7 | CoTracker: It Is Better to Track Together (IF: 4) Highlight: We introduce CoTracker, a transformer-based model that tracks a large number of 2D points in long video sequences. | NIKITA KARAEV et al. |
2024 | 8 | LLaMA-VID: An Image Is Worth 2 Tokens in Large Language Models (IF: 3) Highlight: In this work, we present a novel method, called LLaMA-VID, to tackle the token generation challenge in Vision Language Models (VLMs) for video and image understanding. | Yanwei Li; Chengyao Wang; Jiaya Jia; |
2024 | 9 | DriveLM: Driving with Graph Visual Question Answering (IF: 3) Highlight: We instantiate datasets (DriveLM-Data) built upon nuScenes and CARLA, and propose a VLM-based baseline approach (DriveLM-Agent) for jointly performing Graph VQA and end-to-end driving. | CHONGHAO SIMA et al. |
2024 | 10 | GRiT: A Generative Region-to-text Transformer for Object Understanding (IF: 3) Highlight: This paper presents a Generative RegIon-to-Text transformer, GRiT, for object understanding. | JIALIAN WU et al. |
2024 | 11 | PointLLM: Empowering Large Language Models to Understand Point Clouds (IF: 3) Highlight: The unprecedented advancements in Large Language Models (LLMs) have had a profound impact on natural language processing but have yet to fully embrace the realm of 3D understanding. This paper introduces PointLLM, a preliminary effort to fill this gap, empowering LLMs to understand point clouds and offering a new avenue beyond 2D data. | RUNSEN XU et al. |
2024 | 12 | VideoMamba: State Space Model for Efficient Video Understanding (IF: 3) Highlight: Addressing the dual challenges of local redundancy and global dependencies in video understanding, this work innovatively adapts Mamba to the video domain. | KUNCHANG LI et al. |
2024 | 13 | Photorealistic Video Generation with Diffusion Models (IF: 3) Highlight: We present a diffusion transformer for photorealistic video generation from text prompts. | AGRIM GUPTA et al. |
2024 | 14 | DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors (IF: 3) Highlight: Traditional image animation techniques mainly focus on animating natural scenes with stochastic dynamics (e.g. clouds and fluid) or domain-specific motions (e.g. human hair or body motions), which limits their applicability to more general visual content. To overcome this limitation, we explore the synthesis of dynamic content for open-domain images, converting them into animated videos. | JINBO XING et al. |
2024 | 15 | DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving (IF: 3) Highlight: However, a critical limitation of relevant research lies in its predominant focus on gaming environments or simulated settings, thereby lacking the representation of real-world driving scenarios. Therefore, we introduce DriveDreamer, a pioneering world model entirely derived from real-world driving scenarios. | XIAOFENG WANG et al. |
2022 | 1 | Visual Prompt Tuning (IF: 8) Highlight: This paper introduces Visual Prompt Tuning (VPT) as an efficient and effective alternative to full fine-tuning for large-scale Transformer models in vision. | MENGLIN JIA et al. |
2022 | 2 | ByteTrack: Multi-Object Tracking By Associating Every Detection Box (IF: 8) Highlight: Objects with low detection scores, e.g. occluded objects, are simply thrown away, which causes non-negligible missed objects and fragmented trajectories. To solve this problem, we present a simple, effective and generic association method, tracking by associating almost every detection box instead of only the high-scoring ones. | YIFU ZHANG et al. |
2022 | 3 | TensoRF: Tensorial Radiance Fields (IF: 7) Highlight: We present TensoRF, a novel approach to model and reconstruct radiance fields. | Anpei Chen; Zexiang Xu; Andreas Geiger; Jingyi Yu; Hao Su; |
2022 | 4 | BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images Via Spatiotemporal Transformers (IF: 7) Highlight: In this work, we present a new framework termed BEVFormer, which learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks. | ZHIQI LI et al. |
2022 | 5 | Exploring Plain Vision Transformer Backbones for Object Detection (IF: 7) Highlight: We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. | Yanghao Li; Hanzi Mao; Ross Girshick; Kaiming He; |
2022 | 6 | Simple Baselines for Image Restoration (IF: 7) Highlight: In this paper, we propose a simple baseline that exceeds the SOTA methods and is computationally efficient. | Liangyu Chen; Xiaojie Chu; Xiangyu Zhang; Jian Sun; |
2022 | 7 | Detecting Twenty-Thousand Classes Using Image-Level Supervision (IF: 6) Highlight: We propose Detic, which simply trains the classifiers of a detector on image classification data and thus expands the vocabulary of detectors to tens of thousands of concepts. | Xingyi Zhou; Rohit Girdhar; Armand Joulin; Philipp Krähenbühl; Ishan Misra; |
2022 | 8 | MaxViT: Multi-axis Vision Transformer (IF: 6) Highlight: In this paper we introduce an efficient and scalable attention model we call multi-axis attention, which consists of two aspects: blocked local and dilated global attention. | ZHENGZHONG TU et al. |
2022 | 9 | Make-a-Scene: Scene-Based Text-to-Image Generation with Human Priors (IF: 6) Highlight: While these methods have incrementally improved the generated image fidelity and text relevancy, several pivotal gaps remain unaddressed, limiting applicability and quality. We propose a novel text-to-image method that addresses these gaps by (i) enabling a simple control mechanism complementary to text in the form of a scene, (ii) introducing elements that substantially improve the tokenization process by employing domain-specific knowledge over key image regions (faces and salient objects), and (iii) adapting classifier-free guidance for the transformer use case. | ORAN GAFNI et al. |
2022 | 10 | PETR: Position Embedding Transformation for Multi-View 3D Object Detection (IF: 6) Highlight: In this paper, we develop position embedding transformation (PETR) for multi-view 3D object detection. | Yingfei Liu; Tiancai Wang; Xiangyu Zhang; Jian Sun; |
2022 | 11 | MOTR: End-to-End Multiple-Object Tracking with TRansformer (IF: 6) Highlight: In this paper, we propose MOTR, which extends DETR and introduces a "track query" to model the tracked instances across an entire video. | FANGAO ZENG et al. |
2022 | 12 | SLIP: Self-Supervision Meets Language-Image Pre-training (IF: 6) Highlight: In this work, we explore whether self-supervised learning can aid in the use of language supervision for visual representation learning with Vision Transformers. | Norman Mu; Alexander Kirillov; David Wagner; Saining Xie; |
2022 | 13 | Compositional Visual Generation with Composable Diffusion Models (IF: 6) Highlight: In this paper, we propose an alternative structured approach for compositional generation using diffusion models. | Nan Liu; Shuang Li; Yilun Du; Antonio Torralba; Joshua B. Tenenbaum; |
2022 | 14 | Masked Autoencoders for Point Cloud Self-Supervised Learning (IF: 6) Highlight: As a promising scheme of self-supervised learning, masked autoencoding has significantly advanced natural language processing and computer vision. Inspired by this, we propose a neat scheme of masked autoencoders for point cloud self-supervised learning, addressing the challenges posed by point clouds' properties, including leakage of location information and uneven information density. | YATIAN PANG et al. |
2022 | 15 | VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance (IF: 6) Highlight: Current methods rely heavily on training for a specific domain (e.g., only faces), manual work or algorithmic tuning for latent vector discovery, and manual effort in mask selection to alter only part of an image. We address all of these usability constraints while producing images of high visual and semantic quality through a unique combination of OpenAI's CLIP (Radford et al., 2021), VQGAN (Esser et al., 2021), and a generation augmentation strategy to produce VQGAN-CLIP. | KATHERINE CROWSON et al. |
2020 | 1 | End-to-End Object Detection With Transformers (IF: 9) Highlight: We present a new method that views object detection as a direct set prediction problem. | NICOLAS CARION et al. |
2020 | 2 | NeRF: Representing Scenes As Neural Radiance Fields For View Synthesis (IF: 9) Highlight: We present a method that achieves state-of-the-art results for synthesizing novel views of complex scenes by optimizing an underlying continuous volumetric scene function using a sparse set of input views. | BEN MILDENHALL et al. |
2020 | 3 | Contrastive Multiview Coding (IF: 9) Highlight: We study this hypothesis under the framework of multiview contrastive learning, where we learn a representation that aims to maximize mutual information between different views of the same scene but is otherwise compact. | Yonglong Tian; Dilip Krishnan; Phillip Isola; |
2020 | 4 | RAFT: Recurrent All-Pairs Field Transforms For Optical Flow (IF: 8) Highlight: We introduce Recurrent All-Pairs Field Transforms (RAFT), a new deep network architecture for estimating optical flow. | Zachary Teed; Jia Deng; |
2020 | 5 | UNITER: UNiversal Image-TExt Representation Learning (IF: 8) Highlight: In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. | YEN-CHUN CHEN et al. |
2020 | 6 | Oscar: Object-Semantics Aligned Pre-training For Vision-Language Tasks (IF: 8) Highlight: While existing methods simply concatenate image region features and text features as input to the model to be pre-trained and use self-attention to learn image-text semantic alignments in a brute-force manner, in this paper we propose a new learning method, Oscar, which uses object tags detected in images as anchor points to significantly ease the learning of alignments. | XIUJUN LI et al. |
2020 | 7 | Object-Contextual Representations For Semantic Segmentation (IF: 8) Highlight: In this paper, we address the semantic segmentation problem with a focus on the context aggregation strategy. | Yuhui Yuan; Xilin Chen; Jingdong Wang; |
2020 | 8 | Big Transfer (BiT): General Visual Representation Learning (IF: 8) Highlight: We scale up pre-training, and propose a simple recipe that we call Big Transfer (BiT). | ALEXANDER KOLESNIKOV et al. |
2020 | 9 | Contrastive Learning For Unpaired Image-to-Image Translation (IF: 8) Highlight: We propose a straightforward method for doing so: maximizing mutual information between the two, using a framework based on contrastive learning. | Taesung Park; Alexei A. Efros; Richard Zhang; Jun-Yan Zhu; |
2020 | 10 | Tracking Objects As Points (IF: 8) Highlight: In this paper, we present a simultaneous detection and tracking algorithm that is simpler, faster, and more accurate than the state of the art. | Xingyi Zhou; Vladlen Koltun; Philipp Krähenbühl; |
2020 | 11 | Single Path One-Shot Neural Architecture Search With Uniform Sampling (IF: 8) Highlight: This work proposes a Single Path One-Shot model to address the challenge in training. | ZICHAO GUO et al. |
2020 | 12 | Convolutional Occupancy Networks (IF: 8) Highlight: In this paper, we propose Convolutional Occupancy Networks, a more flexible implicit representation for detailed reconstruction of objects and 3D scenes. | Songyou Peng; Michael Niemeyer; Lars Mescheder; Marc Pollefeys; Andreas Geiger; |
2020 | 13 | Square Attack: A Query-efficient Black-box Adversarial Attack Via Random Search (IF: 8) Highlight: We propose the Square Attack, a score-based black-box $l_2$- and $l_\infty$-adversarial attack that does not rely on local gradient information and thus is not affected by gradient masking. | Maksym Andriushchenko; Francesco Croce; Nicolas Flammarion; Matthias Hein; |
2020 | 14 | Rethinking Few-shot Image Classification: A Good Embedding Is All You Need? (IF: 7) Highlight: In this work, we show that a simple baseline: learning a supervised or self-supervised representation on the meta-training set, followed by training a linear classifier on top of this representation, outperforms state-of-the-art few-shot learning methods. | Yonglong Tian; Yue Wang; Dilip Krishnan; Joshua B. Tenenbaum; Phillip Isola; |
2020 | 15 | Lift, Splat, Shoot: Encoding Images From Arbitrary Camera Rigs By Implicitly Unprojecting To 3D (IF: 8) Highlight: We propose a new end-to-end architecture that directly extracts a bird's-eye-view representation of a scene given image data from an arbitrary number of cameras. | Jonah Philion; Sanja Fidler; |
2018 | 1 | CBAM: Convolutional Block Attention Module (IF: 9) Highlight: We propose the Convolutional Block Attention Module (CBAM), a simple and effective attention module that can be integrated with any feed-forward convolutional neural network. | Sanghyun Woo; Jongchan Park; Joon-Young Lee; In So Kweon; |
2018 | 2 | Encoder-Decoder With Atrous Separable Convolution For Semantic Image Segmentation (IF: 9) Highlight: In this work, we propose to combine the advantages from both methods. | Liang-Chieh Chen; Yukun Zhu; George Papandreou; Florian Schroff; Hartwig Adam; |
2018 | 3 | ShuffleNet V2: Practical Guidelines For Efficient CNN Architecture Design (IF: 9) Highlight: Taking these factors into account, this work proposes practical guidelines for efficient network design. | Ningning Ma; Xiangyu Zhang; Hai-Tao Zheng; Jian Sun; |
2018 | 4 | Image Super-Resolution Using Very Deep Residual Channel Attention Networks (IF: 9) Highlight: To solve these problems, we propose the very deep residual channel attention networks (RCAN). | YULUN ZHANG et al. |
2018 | 5 | CornerNet: Detecting Objects As Paired Keypoints (IF: 9) Highlight: We propose CornerNet, a new approach to object detection where we detect an object bounding box as a pair of keypoints, the top-left corner and the bottom-right corner, using a single convolutional neural network. | Hei Law; Jia Deng; |
2018 | 6 | Group Normalization (IF: 9) Highlight: In this paper, we present Group Normalization (GN) as a simple alternative to BN. | Yuxin Wu; Kaiming He; |
2018 | 7 | Multimodal Unsupervised Image-to-image Translation (IF: 9) Highlight: To address this limitation, we propose a Multimodal Unsupervised Image-to-image Translation (MUNIT) framework. | Xun Huang; Ming-Yu Liu; Serge Belongie; Jan Kautz; |
2018 | 8 | BiSeNet: Bilateral Segmentation Network For Real-time Semantic Segmentation (IF: 9) Highlight: In this paper, we address this dilemma with a novel Bilateral Segmentation Network (BiSeNet). | CHANGQIAN YU et al. |
2018 | 9 | Progressive Neural Architecture Search (IF: 9) Highlight: We propose a new method for learning the structure of convolutional neural networks (CNNs) that is more efficient than recent state-of-the-art methods based on reinforcement learning and evolutionary algorithms. | CHENXI LIU et al. |
2018 | 10 | Image Inpainting For Irregular Holes Using Partial Convolutions (IF: 9) Highlight: We propose to use partial convolutions, where the convolution is masked and renormalized to be conditioned on only valid pixels. | GUILIN LIU et al. |
2018 | 11 | Deep Clustering For Unsupervised Learning Of Visual Features (IF: 9) Highlight: In this work, we present DeepCluster, a clustering method that jointly learns the parameters of a neural network and the cluster assignments of the resulting features. | Mathilde Caron; Piotr Bojanowski; Armand Joulin; Matthijs Douze; |
2018 | 12 | Simple Baselines For Human Pose Estimation And Tracking (IF: 9) Highlight: This work provides simple and effective baseline methods. | Bin Xiao; Haiping Wu; Yichen Wei; |
2018 | 13 | Unified Perceptual Parsing For Scene Understanding (IF: 8) Highlight: In this paper, we study a new task called Unified Perceptual Parsing, which requires machine vision systems to recognize as many visual concepts as possible from a given image. | Tete Xiao; Yingcheng Liu; Bolei Zhou; Yuning Jiang; Jian Sun; |
2018 | 14 | Memory Aware Synapses: Learning What (not) To Forget (IF: 8) Highlight: In this paper, we argue that, given the limited model capacity and the unlimited new information to be learned, knowledge has to be preserved or erased selectively. | Rahaf Aljundi; Francesca Babiloni; Mohamed Elhoseiny; Marcus Rohrbach; Tinne Tuytelaars; |
2018 | 15 | ICNet For Real-Time Semantic Segmentation On High-Resolution Images (IF: 9) Highlight: We focus on the challenging task of real-time semantic segmentation in this paper. | Hengshuang Zhao; Xiaojuan Qi; Xiaoyong Shen; Jianping Shi; Jiaya Jia; |