Paper Digest: ACM Multimedia 2020 Highlights
Readers are also encouraged to read our ACM Multimedia 2020 Papers with Code/Data page, which lists the papers that have released their code or data.
The ACM Multimedia Conference is one of the top multimedia conferences in the world. In 2020, it was held virtually due to the COVID-19 pandemic.
To help the community quickly catch up on the work presented at this conference, the Paper Digest team processed all accepted papers and generated one highlight sentence (typically the main topic) for each paper. Readers are encouraged to use these machine-generated highlights/summaries to quickly grasp the main idea of each paper.
If you do not want to miss any interesting academic paper, you are welcome to sign up for our free daily paper digest service to receive updates on new papers published in your area every day. You are also welcome to follow us on Twitter and LinkedIn to receive new conference digests.
Paper Digest Team
team@paperdigest.org
TABLE 1: Paper Digest: ACM Multimedia 2020 Highlights
# | Title | Authors | Highlight |
---|---|---|---|
1 | Image Inpainting Based on Multi-frequency Probabilistic Inference Model | Jin Wang; Chen Wang; Qingming Huang; Yunhui Shi; Jian-Feng Cai; Qing Zhu; Baocai Yin; | This paper handles this problem from a novel perspective of predicting low-frequency semantic structural contents and high-frequency detailed textures respectively, and proposes a multi-frequency probabilistic inference model(MPI model) to predict the multi-frequency information of missing regions by estimating the parametric distribution of multi-frequency features over the corresponding latent spaces. |
2 | Dual Adversarial Network for Unsupervised Ground/Satellite-to-Aerial Scene Adaptation | Jianzhe Lin; Lichao Mou; Tianze Yu; Xiaoxiang Zhu; Z. Jane Wang; | Motivated by this, we propose a dual adversarial network for domain adaptation, where two adversarial learning processes are conducted iteratively, in correspondence with the feature adaptation and the classification task respectively. |
3 | Adversarial Bipartite Graph Learning for Video Domain Adaptation | Yadan Luo; Zi Huang; Zijian Wang; Zheng Zhang; Mahsa Baktashmotlagh; | To overcome this limitation, in this paper, we learn a domain-agnostic video classifier instead of learning domain-invariant representations, and propose an Adversarial Bipartite Graph (ABG) learning framework which directly models the source-target interactions with a network topology of the bipartite graph. |
4 | Give Me Something to Eat: Referring Expression Comprehension with Commonsense Knowledge | Peng Wang; Dongyang Liu; Hui Li; Qi Wu; | In this paper, we collect a new referring expression dataset, called KB-Ref, containing 43k expressions on 16k images. |
5 | Single Image De-noising via Staged Memory Network | Weijiang Yu; Jian Liang; Lu Li; Nong Xiao; | In this paper, we first propose a Staged Memory Network (SMNet) consisting of a noise memory stage and an image memory stage for explicitly exploring the staged memories of our network in single image de-noising with different noise levels. |
6 | Self-supervised Dance Video Synthesis Conditioned on Music | Xuanchi Ren; Haoran Li; Zijian Huang; Qifeng Chen; | We present a self-supervised approach with pose perceptual loss for automatic dance video generation. |
7 | Dynamic GCN: Context-enriched Topology Learning for Skeleton-based Action Recognition | Fanfan Ye; Shiliang Pu; Qiaoyong Zhong; Chao Li; Di Xie; Huiming Tang; | In this paper, we propose Dynamic GCN, in which a novel convolutional neural network named Context-encoding Network (CeN) is introduced to learn skeleton topology automatically. |
8 | Meta Parsing Networks: Towards Generalized Few-shot Scene Parsing with Adaptive Metric Learning | Peike Li; Yunchao Wei; Yi Yang; | In this work, we advance this few-shot segmentation paradigm towards a more challenging yet general scenario, i.e., Generalized Few-shot Scene Parsing (GFSP). |
9 | CODAN: Counting-driven Attention Network for Vehicle Detection in Congested Scenes | Wei Li; Zhenting Wang; Xiao Wu; Ji Zhang; Qiang Peng; Hongliang Li; | In this paper, we explore the dense vehicle detection given the number of vehicles. |
10 | Webly Supervised Image Classification with Metadata: Automatic Noisy Label Correction via Visual-Semantic Graph | Jingkang Yang; Weirong Chen; Litong Feng; Xiaopeng Yan; Huabin Zheng; Wayne Zhang; | In this paper, we propose an automatic label corrector VSGraph-LC based on the visual-semantic graph. |
11 | CRSSC: Salvage Reusable Samples from Noisy Data for Robust Learning | Zeren Sun; Xian-Sheng Hua; Yazhou Yao; Xiu-Shen Wei; Guosheng Hu; Jian Zhang; | To this end, we propose a certainty-based reusable sample selection and correction approach, termed as CRSSC, for coping with label noise in training deep FG models with web images. |
12 | Learning From Music to Visual Storytelling of Shots: A Deep Interactive Learning Mechanism | Jen-Chun Lin; Wen-Li Wei; Yen-Yu Lin; Tyng-Luh Liu; Hong-Yuan Mark Liao; | In this study, we present a deep interactive learning (DIL) mechanism for building a compact yet accurate sequence-to-sequence model to accomplish the task. |
13 | TextRay: Contour-based Geometric Modeling for Arbitrary-shaped Scene Text Detection | Fangfang Wang; Yifeng Chen; Fei Wu; Xi Li; | In this work, we propose an arbitrary-shaped text detection method, namely TextRay, which conducts top-down contour-based geometric modeling and geometric parameter learning within a single-shot anchor-free framework. |
14 | Weakly Supervised Real-time Image Cropping based on Aesthetic Distributions | Peng Lu; Jiahui Liu; Xujun Peng; Xiaojie Wang; | In order to tackle this problem, a weakly supervised cropping framework is proposed, where the distribution dissimilarity between high quality images and cropped images is used to guide the coordinate predictor’s training and the ground truths of cropping windows are not required by the proposed method. |
15 | Towards Unsupervised Crowd Counting via Regression-Detection Bi-knowledge Transfer | Yuting Liu; Zheng Wang; Miaojing Shi; Shin’ichi Satoh; Qijun Zhao; Hongyu Yang; | In this paper, we explore it in a transfer learning setting where we learn to detect and count persons in an unlabeled target set by transferring bi-knowledge learnt from regression- and detection-based models in a labeled source set. |
16 | Occluded Prohibited Items Detection: An X-ray Security Inspection Benchmark and De-occlusion Attention Module | Yanlu Wei; Renshuai Tao; Zhangjie Wu; Yuqing Ma; Libo Zhang; Xianglong Liu; | In this work, we contribute the first high-quality object detection dataset for security inspection, named Occluded Prohibited Items X-ray (OPIXray) image benchmark. |
17 | Temporally Guided Music-to-Body-Movement Generation | Hsuan-Kai Kao; Li Su; | This paper presents a neural network model to generate virtual violinist’s 3-D skeleton movements from music audio. |
18 | Compositional Few-Shot Recognition with Primitive Discovery and Enhancing | Yixiong Zou; Shanghang Zhang; Ke Chen; Yonghong Tian; Yaowei Wang; José M. F. Moura; | Inspired by such capability of humans, to imitate humans’ ability of learning visual primitives and composing primitives to recognize novel classes, we propose an approach to FSL to learn a feature representation composed of important primitives, which is jointly trained with two parts, i.e. primitive discovery and primitive enhancing. |
19 | InteractGAN: Learning to Generate Human-Object Interaction | Chen Gao; Si Liu; Defa Zhu; Quan Liu; Jie Cao; Haoqian He; Ran He; Shuicheng Yan; | In this work, we introduce an Interact-GAN to solve this challenging task. |
20 | Category-specific Semantic Coherency Learning for Fine-grained Image Recognition | Shijie Wang; Zhihui Wang; Haojie Li; Wanli Ouyang; | To address this issue, we propose an end-to-end Category-specific Semantic Coherency Network (CSC-Net) to semantically align the discriminative regions of the same subcategory. |
21 | Scene-Aware Context Reasoning for Unsupervised Abnormal Event Detection in Videos | Che Sun; Yunde Jia; Yao Hu; Yuwei Wu; | In this paper, we propose a scene-aware context reasoning method that exploits context information from visual features for unsupervised abnormal event detection in videos, which bridges the semantic gap between visual context and the meaning of abnormal events. |
22 | Light Field Super-resolution via Attention-Guided Fusion of Hybrid Lenses | Jing Jin; Junhui Hou; Jie Chen; Sam Kwong; Jingyi Yu; | To tackle this challenge, we propose a novel end-to-end learning-based approach, which can comprehensively utilize the specific characteristics of the input from two complementary and parallel perspectives. |
23 | Trajectory Prediction in Heterogeneous Environment via Attended Ecology Embedding | Wei-Cheng Lai; Zi-Xiang Xia; Hao-Siang Lin; Lien-Feng Hsu; Hong-Han Shuai; I-Hong Jhuo; Wen-Huang Cheng; | In this paper, we consider the practical environment of predicting trajectory in the heterogeneous traffic ecology. |
24 | Text-Embedded Bilinear Model for Fine-Grained Visual Recognition | Liang Sun; Xiang Guan; Yang Yang; Lei Zhang; | In this paper, we propose a Text-Embedded Bilinear (TEB) model which incorporates texts as extra guidance for fine-grained recognition. |
25 | Learning Scales from Points: A Scale-aware Probabilistic Model for Crowd Counting | Zhiheng Ma; Xing Wei; Xiaopeng Hong; Yihong Gong; | In this paper, we propose a scale-aware probabilistic model to handle this problem. |
26 | Learning Global Structure Consistency for Robust Object Tracking | Bi Li; Chengquan Zhang; Zhibin Hong; Xu Tang; Jingtuo Liu; Junyu Han; Errui Ding; Wenyu Liu; | Specifically, we propose an effective and efficient short-term model that learns to exploit the global structure consistency in a short time and thus can handle fast variations and distractors. |
27 | Campus3D: A Photogrammetry Point Cloud Benchmark for Hierarchical Understanding of Outdoor Scene | Xinke Li; Chongshou Li; Zekun Tong; Andrew Lim; Junsong Yuan; Yuwei Wu; Jing Tang; Raymond Huang; | To facilitate the research of this area, we present a richly-annotated 3D point cloud dataset for multiple outdoor scene understanding tasks and also an effective learning framework for its hierarchical segmentation task. |
28 | Instability of Successive Deep Image Compression | Jun-Hyuk Kim; Soobeom Jang; Jun-Ho Choi; Jong-Seok Lee; | In this paper, we conduct comprehensive analysis of successive deep image compression. |
29 | ALANET: Adaptive Latent Attention Network for Joint Video Deblurring and Interpolation | Akash Gupta; Abhishek Aich; Amit K. Roy-Chowdhury; | We introduce a novel architecture, Adaptive Latent Attention Network (ALANET), which synthesizes sharp high frame-rate videos with no prior knowledge of input frames being blurry or not, thereby performing the task of both deblurring and interpolation. |
30 | PCPL: Predicate-Correlation Perception Learning for Unbiased Scene Graph Generation | Shaotian Yan; Chen Shen; Zhongming Jin; Jianqiang Huang; Rongxin Jiang; Yaowu Chen; Xian-Sheng Hua; | Therefore, we propose a novel Predicate-Correlation Perception Learning (PCPL for short) scheme to adaptively seek out appropriate loss weights by directly perceiving and utilizing the correlation among predicate classes. |
31 | Discriminative Spatial Feature Learning for Person Re-Identification | Peixi Peng; Yonghong Tian; Yangru Huang; Xiangqian Wang; Huilong An; | To handle this challenge, a novel method is proposed to learn the discriminative spatial features. |
32 | AdaHGNN: Adaptive Hypergraph Neural Networks for Multi-Label Image Classification | Xiangping Wu; Qingcai Chen; Wei Li; Yulun Xiao; Baotian Hu; | In this paper, we propose a high-order semantic learning model based on adaptive hypergraph neural networks (AdaHGNN) to boost multi-label classification performance. |
33 | Reinforced Similarity Learning: Siamese Relation Networks for Robust Object Tracking | Dawei Zhang; Zhonglong Zheng; Minglu Li; Xiaowei He; Tianxiang Wang; Liyuan Chen; Riheng Jia; Feilong Lin; | In this paper, we pay more attention to learn an outstanding similarity measure for robust tracking. |
34 | Deep Structural Contour Detection | Ruoxi Deng; Shengjun Liu; | In this work, we aim to develop a high-performance contour detection system. |
35 | Cross-modal Non-linear Guided Attention and Temporal Coherence in Multi-modal Deep Video Models | Saurabh Sahu; Palash Goyal; Shalini Ghosh; Chul Lee; | We show how using non-linear guided cross-modal signals and temporal coherence can improve the performance of multi-modal machine learning (ML) models for video analysis tasks like categorization. |
36 | IR-GAN: Image Manipulation with Linguistic Instruction by Increment Reasoning | Zhenhuan Liu; Jincan Deng; Liang Li; Shaofei Cai; Qianqian Xu; Shuhui Wang; Qingming Huang; | To address this issue, we propose an Increment Reasoning Generative Adversarial Network (IR-GAN), which aims to reason the consistency between visual increment in images and semantic increment in instructions. |
37 | Fine-Grained Similarity Measurement between Educational Videos and Exercises | Xin Wang; Wei Huang; Qi Liu; Yu Yin; Zhenya Huang; Le Wu; Jianhui Ma; Xue Wang; | In this paper, we explore to measure the fine-grained similarity by leveraging multimodal information. |
38 | One-shot Text Field labeling using Attention and Belief Propagation for Structure Information Extraction | Mengli Cheng; Minghui Qiu; Xing Shi; Jun Huang; Wei Lin; | To alleviate these problems, we propose a novel deep end-to-end trainable approach for one-shot text field labeling, which makes use of an attention mechanism to transfer the layout information between document images. |
39 | Grad: Learning for Overhead-aware Adaptive Video Streaming with Scalable Video Coding | Yunzhuo Liu; Bo Jiang; Tian Guo; Ramesh K. Sitaraman; Don Towsley; Xinbing Wang; | In this work, we propose a deep reinforcement learning method called Grad for designing ABR algorithms that take advantage of the quality upgrade mechanism of SVC. |
40 | Efficient Adaptation of Neural Network Filter for Video Compression | Yat-Hong Lam; Alireza Zare; Francesco Cricri; Jani Lainema; Miska M. Hannuksela; | We present an efficient finetuning methodology for neural-network filters which are applied as a postprocessing artifact-removal step in video coding pipelines. |
41 | SonoSpace: Visual Feedback of Timbre with Unsupervised Learning | Naoki Kimura; Keisuke Shiro; Yota Takakura; Hiromi Nakamura; Jun Rekimoto; | Our goal is to develop a low-cost learning system that substitutes the teacher. |
42 | Single Image Deraining via Scale-space Invariant Attention Neural Network | Bo Pang; Deming Zhai; Junjun Jiang; Xianming Liu; | In this paper, we tackle the notion of scale that deals with visual changes in appearance of rain streaks with respect to the camera. |
43 | Every Moment Matters: Detail-Aware Networks to Bring a Blurry Image Alive | Kaihao Zhang; Wenhan Luo; Björn Stenger; Wenqi Ren; Lin Ma; Hongdong Li; | In order to alleviate this problem, we propose a detail-aware network with three consecutive stages to improve the reconstruction quality by addressing specific aspects in the recovery process. |
44 | ISIA Food-500: A Dataset for Large-Scale Food Recognition via Stacked Global-Local Attention Network | Weiqing Min; Linhu Liu; Zhiling Wang; Zhengdong Luo; Xiaoming Wei; Xiaolin Wei; Shuqiang Jiang; | To encourage further progress in food recognition, we introduce the dataset ISIA Food-500 with 500 categories from the list in Wikipedia and 399,726 images, a more comprehensive food dataset that surpasses existing popular benchmark datasets in category coverage and data volume. |
45 | An Egocentric Action Anticipation Framework via Fusing Intuition and Analysis | Tianyu Zhang; Weiqing Min; Ying Zhu; Yong Rui; Shuqiang Jiang; | In this paper, we focus on egocentric action anticipation from videos, which enables various applications, such as helping intelligent wearable assistants understand users’ needs and enhance their capabilities in the interaction process. |
46 | Multi-Person Action Recognition in Microwave Sensors | Diangang Li; Jianquan Liu; Shoji Nishimura; Yuka Hayashi; Jun Suzuki; Yihong Gong; | To address the challenges, we propose a novel learning framework by designed original loss functions with the considerations on weakly-supervised multi-label learning and attention mechanism to improve the accuracy for action recognition. |
47 | Coupling Deep Textural and Shape Features for Sketch Recognition | Qi Jia; Xin Fan; Meiyu Yu; Yuqing Liu; Dingrong Wang; Longin Jan Latecki; | In this paper, we explicitly explore the shape properties of sketches, which has almost been neglected before in the context of deep learning, and propose a sequential dual learning strategy that combines both shape and texture features. |
48 | Look, Read and Feel: Benchmarking Ads Understanding with Multimodal Multitask Learning | Huaizheng Zhang; Yong Luo; Qiming Ai; Yonggang Wen; Han Hu; | This motivates us to develop a novel deep multimodal multitask framework that integrates multiple modalities to achieve effective topic and sentiment prediction simultaneously for ads understanding. |
49 | Not made for each other- Audio-Visual Dissonance-based Deepfake Detection and Localization | Komal Chugh; Parul Gupta; Abhinav Dhall; Ramanathan Subramanian; | We propose detection of deepfake videos based on the dissimilarity between the audio and visual modalities, termed as the Modality Dissonance Score (MDS). |
50 | Hearing like Seeing: Improving Voice-Face Interactions and Associations via Adversarial Deep Semantic Matching Network | Kai Cheng; Xin Liu; Yiu-ming Cheung; Rui Wang; Xing Xu; Bineng Zhong; | In this paper, we present a novel adversarial deep semantic matching network for efficient voice-face interactions and associations, which can well learn the correspondence between voices and faces for various cross-modal matching and retrieval tasks. |
51 | Multimodal Multi-Task Financial Risk Forecasting | Ramit Sawhney; Puneet Mathur; Ayush Mangal; Piyush Khanna; Rajiv Ratn Shah; Roger Zimmermann; | In this work, we present a multi-task solution that utilizes domain specialized textual features and audio attentive alignment for predictive financial risk and price modeling. |
52 | Down to the Last Detail: Virtual Try-on with Fine-grained Details | Jiahang Wang; Tong Sha; Wei Zhang; Zhoujun Li; Tao Mei; | In this work, we propose a multi-stage framework to synthesize person images, where fine-grained details can be well preserved. |
53 | Temporal Denoising Mask Synthesis Network for Learning Blind Video Temporal Consistency | Yifeng Zhou; Xing Xu; Fumin Shen; Lianli Gao; Huimin Lu; Heng Tao Shen; | In this paper, we cast enforcing temporal consistency in a video as a temporal denoising problem, i.e., removing the flickering effect from given unstable pre-processed frames. |
54 | A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild | K R Prajwal; Rudrabha Mukhopadhyay; Vinay P. Namboodiri; C.V. Jawahar; | In this work, we investigate the problem of lip-syncing a talking face video of an arbitrary identity to match a target speech segment. |
55 | MEmoR: A Dataset for Multimodal Emotion Reasoning in Videos | Guangyao Shen; Xin Wang; Xuguang Duan; Hongzhi Li; Wenwu Zhu; | In this work, we present the task of multimodal emotion reasoning in videos. |
56 | Modeling both Intra- and Inter-modal Influence for Real-Time Emotion Detection in Conversations | Dong Zhang; Weisheng Zhang; Shoushan Li; Qiaoming Zhu; Guodong Zhou; | Therefore, this paper proposes a bidirectional dynamic dual influence network for real-time emotion detection in conversations, which can simultaneously model both intra- and inter-modal influence with bidirectional information propagation for current utterance and its historical utterances. |
57 | Transformer-based Label Set Generation for Multi-modal Multi-label Emotion Detection | Xincheng Ju; Dong Zhang; Junhui Li; Guodong Zhou; | To simultaneously model above two kinds of dependency, we propose a unified approach, namely multi-modal emotion set generation network (MESGN) to generate an emotion set for an utterance. |
58 | CM-BERT: Cross-Modal BERT for Text-Audio Sentiment Analysis | Kaicheng Yang; Hua Xu; Kai Gao; | In this paper, we propose the Cross-Modal BERT (CM-BERT), which relies on the interaction of text and audio modality to fine-tune the pre-trained BERT model. |
59 | AffectI: A Game for Diverse, Reliable, and Efficient Affective Image Annotation | Xingkun Zuo; Jiyi Li; Qili Zhou; Jianjun Li; Xiaoyang Mao; | This paper proposes a novel affective image annotation technique, AffectI, for efficiently collecting diverse and reliable emotional labels with the estimated emotion distribution for images, based on the concept of Game With a Purpose (GWAP). |
60 | Attentive One-Dimensional Heatmap Regression for Facial Landmark Detection and Tracking | Shi Yin; Shangfei Wang; Xiaoping Chen; Enhong Chen; Cong Liang; | To address this, we propose a novel attentive one-dimensional heatmap regression method for facial landmark localization. |
61 | Domain Adaptive Person Re-Identification via Coupling Optimization | Xiaobin Liu; Shiliang Zhang; | To handle those two challenges, this paper proposes a coupling optimization method including the Domain-Invariant Mapping (DIM) method and the Global-Local distance Optimization (GLO), respectively. |
62 | Dual-Structure Disentangling Variational Generation for Data-Limited Face Parsing | Peipei Li; Yinglu Liu; Hailin Shi; Xiang Wu; Yibo Hu; Ran He; Zhenan Sun; | Since there are inaccurate pixel-level labels in synthesized parsing maps, we introduce a coarseness-tolerant learning algorithm, to effectively handle these noisy or uncertain labels. |
63 | Accurate UAV Tracking with Distance-Injected Overlap Maximization | Chunhui Zhang; Shiming Ge; Kangkai Zhang; Dan Zeng; | In this work, we propose to alleviate this issue with distance-injected overlap maximization. |
64 | PiRhDy: Learning Pitch-, Rhythm-, and Dynamics-aware Embeddings for Symbolic Music | Hongru Liang; Wenqiang Lei; Paul Yaozhu Chan; Zhenglu Yang; Maosong Sun; Tat-Seng Chua; | In this work, we provide a comprehensive solution by proposing a novel framework named PiRhDy that integrates pitch, rhythm, and dynamics information seamlessly. |
65 | Cloze Test Helps: Effective Video Anomaly Detection via Learning to Complete Video Events | Guang Yu; Siqi Wang; Zhiping Cai; En Zhu; Chuanfu Xu; Jianping Yin; Marius Kloft; | Inspired by frequently-used cloze test in language study, we propose a brand-new VAD solution named Video Event Completion (VEC) to bridge gaps above: First, we propose a novel pipeline to achieve both precise and comprehensive enclosure of video activities. |
66 | Pose-native Network Architecture Search for Multi-person Human Pose Estimation | Qian Bao; Wu Liu; Jun Hong; Lingyu Duan; Tao Mei; | In this work, we present the Pose-native Network Architecture Search (PoseNAS) to simultaneously design a better pose encoder and pose decoder for pose estimation. |
67 | Beyond the Attention: Distinguish the Discriminative and Confusable Features For Fine-grained Image Classification | Xiruo Shi; Liutong Xu; Pengfei Wang; Yuanyuan Gao; Haifang Jian; Wu Liu; | In this paper, we introduce a novel classification approach, named Logical-based Feature Extraction Model (LAFE for short) to address this issue. |
68 | BlockMix: Meta Regularization and Self-Calibrated Inference for Metric-Based Meta-Learning | Hao Tang; Zechao Li; Zhimao Peng; Jinhui Tang; | Toward this end, we propose new mechanisms to learn generalized and discriminative feature embeddings as well as improve the robustness of classifiers against prediction corruptions for meta-learning. |
69 | Fine-grained Feature Alignment with Part Perspective Transformation for Vehicle ReID | Dechao Meng; Liang Li; Shuhui Wang; Xingyu Gao; Zheng-Jun Zha; Qingming Huang; | In this paper, we propose part perspective transformation module (PPT) to map the different parts of vehicle into a unified perspective respectively. |
70 | Compact Bilinear Augmented Query Structured Attention for Sport Highlights Classification | Yanbin Hao; Hao Zhang; Chong-Wah Ngo; Qiang Liu; Xiaojun Hu; | Specifically, we adapt the hierarchical attention neural networks, which contain learnable query-scheme, on the video to identify discriminative spatial/temporal visual clues within highlight clips. |
71 | Semantic Image Analogy with a Conditional Single-Image GAN | Jiacheng Li; Zhiwei Xiong; Dong Liu; Xuejin Chen; Zheng-Jun Zha; | To accomplish this task, we propose a novel method to model the patch-level correspondence between semantic layout and appearance of a single image by training a single-image GAN that takes semantic labels as conditional input. |
72 | A Structured Graph Attention Network for Vehicle Re-Identification | Yangchun Zhu; Zheng-Jun Zha; Tianzhu Zhang; Jiawei Liu; Jiebo Luo; | In this paper, we propose a Structured Graph ATtention network (SGAT) to fully exploit these relationships and allow the message propagation to update the features of graph nodes. |
73 | Contextual Multi-Scale Feature Learning for Person Re-Identification | Baoyu Fan; Li Wang; Runze Zhang; Zhenhua Guo; Yaqian Zhao; Rengang Li; Weifeng Gong; | In this paper, we propose a novel architecture, namely contextual multi-scale network (CMSNet), for learning common and contextual multi-scale representations simultaneously. |
74 | Space-Time Video Super-Resolution Using Temporal Profiles | Zeyu Xiao; Zhiwei Xiong; Xueyang Fu; Dong Liu; Zheng-Jun Zha; | In this paper, we propose a novel space-time video super-resolution method, which aims to recover a high-frame-rate and high-resolution video from its low-frame-rate and low-resolution observation. |
75 | Black Re-ID: A Head-shoulder Descriptor for the Challenging Problem of Person Re-Identification | Boqiang Xu; Lingxiao He; Xingyu Liao; Wu Liu; Zhenan Sun; Tao Mei; | To solve this problem, rather than relying on the clothing information, we propose to exploit head-shoulder features to assist person Re-ID. |
76 | SalGCN: Saliency Prediction for 360-Degree Images Based on Spherical Graph Convolutional Networks | Haoran Lv; Qin Yang; Chenglin Li; Wenrui Dai; Junni Zou; Hongkai Xiong; | In this paper, we propose a saliency prediction framework for 360-degree images based on graph convolutional networks (SalGCN), which directly applies to the spherical graph signals. |
77 | LIGHTEN: Learning Interactions with Graph and Hierarchical TEmporal Networks for HOI in videos | Sai Praneeth Reddy Sunkesula; Rishabh Dabral; Ganesh Ramakrishnan; | While traditional methods formulate the problem as inference on a sequence of video segments, we present a hierarchical approach, LIGHTEN, to learn visual features to effectively capture spatio-temporal cues at multiple granularities in a video. |
78 | Concept-based Explanation for Fine-grained Images and Its Application in Infectious Keratitis Classification | Zhengqing Fang; Kun Kuang; Yuxiao Lin; Fei Wu; Yu-Feng Yao; | In this paper, we focus on the real application problem of classification of infectious keratitis and propose a visual concept mining (VCM) method to explain the fine-grained infectious keratitis images. |
79 | Guided Attention Network for Object Detection and Counting on Drones | Cai YuanQiang; Dawei Du; Libo Zhang; Longyin Wen; Weiqiang Wang; Yanjun Wu; Siwei Lyu; | In this paper, we propose a new Guided Attention network (GAnet) to deal with both object detection and counting tasks based on the feature pyramid. |
80 | PIDNet: An Efficient Network for Dynamic Pedestrian Intrusion Detection | Jingchen Sun; Jiming Chen; Tao Chen; Jiayuan Fan; Shibo He; | In this paper, we propose a novel and efficient multi-task deep neural network, PIDNet, to solve this problem. |
81 | VONAS: Network Design in Visual Odometry using Neural Architecture Search | Xing Cai; Lanqing Zhang; Chengyuan Li; Ge Li; Thomas H. Li; | Therefore, this paper explores the network design for the VO task and proposes a more general single path based one-shot NAS, named VONAS, which can model sequential information for video-related tasks. |
82 | Learning from the Past: Meta-Continual Learning with Knowledge Embedding for Jointly Sketch, Cartoon, and Caricature Face Recognition | Wenbo Zheng; Lan Yan; Fei-Yue Wang; Chao Gou; | We propose a novel framework termed as Meta-Continual Learning with Knowledge Embedding to address the task of jointly sketch, cartoon, and caricature face recognition. |
83 | ChoreoNet: Towards Music to Dance Synthesis with Choreographic Action Unit | Zijie Ye; Haozhe Wu; Jia Jia; Yaohua Bu; Wei Chen; Fanbo Meng; Yanfeng Wang; | Inspired by these, we systematically study such two-stage choreography approach and construct a dataset to incorporate such choreography knowledge. |
84 | InvisibleFL: Federated Learning over Non-Informative Intermediate Updates against Multimedia Privacy Leakages | Qiushi Li; Wenwu Zhu; Chao Wu; Xinglin Pan; Fan Yang; Yuezhi Zhou; Yaoxue Zhang; | In this paper, we propose a privacy-preserving solution that avoids multimedia privacy leakages in federated learning. |
85 | Asymmetric Deep Hashing for Efficient Hash Code Compression | Shu Zhao; Dayan Wu; Wanqian Zhang; Yu Zhou; Bo Li; Weiping Wang; | In this paper, we propose a novel deep hashing method, called Code Compression oriented Deep Hashing (CCDH), for efficiently compressing hash codes. |
86 | A Human-Computer Duet System for Music Performance | Yuen-Jen Lin; Hsuan-Kai Kao; Yih-Chih Tseng; Ming Tsai; Li Su; | In this paper, we first create a virtual violinist, who can collaborate with a human pianist to perform chamber music automatically without any intervention. |
87 | Photo Stand-Out: Photography with Virtual Character | Yujia Wang; Sifan Hou; Bing Ning; Wei Liang; | In this paper, we propose a novel optimization framework to synthesize an aesthetic pose for the virtual character with respect to the presented user’s pose. |
88 | Norm-in-Norm Loss with Faster Convergence and Better Performance for Image Quality Assessment | Dingquan Li; Tingting Jiang; Ming Jiang; | Therefore, we explore normalization in the design of loss functions for IQA. |
89 | Context-aware Attention Network for Predicting Image Aesthetic Subjectivity | Munan Xu; Jia-Xing Zhong; Yurui Ren; Shan Liu; Ge Li; | Based on the attention model, we predict the distribution of human aesthetic ratings of images, which reflects the diversity and similarity of human subjective opinions. |
90 | Scoring High: Analysis and Prediction of Viewer Behavior and Engagement in the Context of 2018 FIFA WC Live Streaming | Nikolas Wehner; Michael Seufert; Sebastian Egger-Lampl; Bruno Gardlo; Pedro Casas; Raimund Schatz; | In this paper, we analyze a unique dataset consisting of more than a million 2018 FIFA World Cup mobile live streaming sessions, collected at a large national public broadcaster. |
91 | Object-level Attention for Aesthetic Rating Distribution Prediction | Jingwen Hou; Sheng Yang; Weisi Lin; | We study the problem of image aesthetic assessment (IAA) and aim to automatically predict the image aesthetic quality in the form of discrete distribution, which is particularly important in IAA due to its nature of having possibly higher diversification of agreement for aesthetics. |
92 | ARSketch: Sketch-Based User Interface for Augmented Reality Glasses | Zhaohui Zhang; Haichao Zhu; Qian Zhang; | To tackle this problem, we introduce a sketch-based neural network-driven user interface for AR/MR glasses, called ARSketch, which enables drawing sketches freely in air to interact with the devices. |
93 | RIRNet: Recurrent-In-Recurrent Network for Video Quality Assessment | Pengfei Chen; Leida Li; Lei Ma; Jinjian Wu; Guangming Shi; | Partially inspired by psychophysical and vision science studies revealing the speed tuning property of neurons in visual cortex when performing motion perception (i.e., sensitive to different temporal frequencies), we propose a novel no-reference (NR) VQA framework named Recurrent-In-Recurrent Network (RIRNet) to incorporate this characteristic to prompt an accurate representation of motion perception in VQA task. |
94 | Cognitive Representation Learning of Self-Media Online Article Quality | Yiru Wang; Shen Huang; Gongfu Li; Qiang Deng; Dongliang Liao; Pengda Si; Yujiu Yang; Jin Xu; | To solve these challenges, we establish a joint model CoQAN in combination with the layout organization, writing characteristics and text semantics, designing different representation learning subnetworks, especially for the feature learning process and interactive reading habits on mobile terminals. |
95 | Describing Subjective Experiment Consistency by p-Value P–P Plot | Jakub Nawala; Lucjan Janowski; Bogdan Cmiel; Krzysztof Rusek; | We provide a tool to classify subjective experiment (and all its results) as either consistent or inconsistent. |
96 | Increasing Video Perceptual Quality with GANs and Semantic Coding | Leonardo Galteri; Marco Bertini; Lorenzo Seidenari; Tiberio Uricchio; Alberto Del Bimbo; | In this work we show how such videos can be efficiently generated by shifting bitrate with masks derived via computer vision and how a deep generative adversarial network can be trained to restore video quality. |
97 | Label Embedding Online Hashing for Cross-Modal Retrieval | Yongxin Wang; Xin Luo; Xin-Shun Xu; | To address these issues, in this paper, we propose a novel supervised online cross-modal hashing method, i.e., Label EMbedding ONline hashing, LEMON for short. |
98 | Quaternion-Based Knowledge Graph Network for Recommendation | Zhaopeng Li; Qianqian Xu; Yangbangyan Jiang; Xiaochun Cao; Qingming Huang; | In this paper, we propose Quaternion-based Knowledge Graph Network (QKGN) for recommendation, which represents users and items with quaternion embeddings in hypercomplex space, so that the latent inter-dependencies between entities and relations could be captured effectively. |
99 | Class-Aware Modality Mix and Center-Guided Metric Learning for Visible-Thermal Person Re-Identification | Yongguo Ling; Zhun Zhong; Zhiming Luo; Paolo Rota; Shaozi Li; Nicu Sebe; | In this paper, we design a novel framework to jointly bridge the modality gap in pixel- and feature-level without additional parameters, as well as reduce the inter- and intra-modalities variations by a center-guided metric learning constraint. |
100 | Adversarial Video Moment Retrieval by Jointly Modeling Ranking and Localization | Da Cao; Yawen Zeng; Xiaochi Wei; Liqiang Nie; Richang Hong; Zheng Qin; | Toward this end, we contribute a novel solution to thoroughly investigate the video moment retrieval issue under the adversarial learning paradigm. |
101 | Beyond the Parts: Learning Multi-view Cross-part Correlation for Vehicle Re-identification | Xinchen Liu; Wu Liu; Jinkai Zheng; Chenggang Yan; Tao Mei; | Different from existing methods, we propose a Parsing-guided Cross-part Reasoning Network, named as PCRNet, for vehicle Re-Id. |
102 | Weakly-Supervised Image Hashing through Masked Visual-Semantic Graph-based Reasoning | Lu Jin; Zechao Li; Yonghua Pan; Jinhui Tang; | To address this issue, this work proposes a novel Masked visual-semantic Graph-based Reasoning Network, termed as MGRN, to learn joint visual-semantic representations for image hashing. |
103 | Semantic Consistency Guided Instance Feature Alignment for 2D Image-Based 3D Shape Retrieval | Heyu Zhou; Weizhi Nie; Dan Song; Nian Hu; Xuanya Li; An-An Liu; | In this paper, we propose a novel semantic consistency guided instance feature alignment network (SC-IFA) to address these limitations. |
104 | RGB2LIDAR: Towards Solving Large-Scale Cross-Modal Visual Localization | Niluthpol Chowdhury Mithun; Karan Sikka; Han-Pang Chiu; Supun Samarasekera; Rakesh Kumar; | To enable large-scale evaluation, we introduce a new dataset containing over 550K pairs (covering a 143 km² area) of RGB and aerial LIDAR depth images. |
105 | Performance Optimization of Federated Person Re-identification via Benchmark Analysis | Weiming Zhuang; Yonggang Wen; Xuesen Zhang; Xin Gan; Daiying Yin; Dongzhan Zhou; Shuai Zhang; Shuai Yi; | In this work, we implement federated learning to person re-identification (FedReID) and optimize its performance affected by statistical heterogeneity in the real-world scenario. |
106 | Traffic-Aware Multi-Camera Tracking of Vehicles Based on ReID and Camera Link Model | Hung-Min Hsu; Yizhou Wang; Jenq-Neng Hwang; | In this paper, we propose an effective and reliable MTMCT framework for vehicles, which consists of a traffic-aware single camera tracking (TSCT) algorithm, a trajectory-based camera link model (CLM) for vehicle re-identification (ReID), and a hierarchical clustering algorithm to obtain the cross camera vehicle trajectories. |
107 | Active Object Search | Jie Wu; Tianshui Chen; Lishan Huang; Hefeng Wu; Guanbin Li; Ling Tian; Liang Lin; | In this work, we investigate an Active Object Search (AOS) task that is not explicitly addressed in the literature. |
108 | An Analysis of Delay in Live 360° Video Streaming Systems | Jun Yi; Md Reazul Islam; Shivang Aggarwal; Dimitrios Koutsonikolas; Y. Charlie Hu; Zhisheng Yan; | In this paper, we conduct the first in-depth measurement study of task-level time consumption for five system components in live 360° video streaming. |
109 | DeepFacePencil: Creating Face Images from Freehand Sketches | Yuhang Li; Xuejin Chen; Binxin Yang; Zihan Chen; Zhihua Cheng; Zheng-Jun Zha; | In this paper, we explore the task of generating photo-realistic face images from hand-drawn sketches. |
110 | When Bitstream Prior Meets Deep Prior: Compressed Video Super-resolution with Learning from Decoding | Peilin Chen; Wenhan Yang; Long Sun; Shiqi Wang; | In this paper, we systematically investigate the SR of compressed LR videos by leveraging the interactivity between decoding prior and deep prior. |
111 | RL-Bélády: A Unified Learning Framework for Content Caching | Gang Yan; Jian Li; | This paper instead proposes a novel framework that can simultaneously learn both content admission and content eviction for caching in CDNs. |
112 | ShapeCaptioner: Generative Caption Network for 3D Shapes by Learning a Mapping from Parts Detected in Multiple Views to Sentences | Zhizhong Han; Chao Chen; Yu-Shen Liu; Matthias Zwicker; | To resolve this issue, we propose ShapeCaptioner, a generative caption network, to perform 3D shape captioning from semantic parts detected in multiple views. |
113 | Co-Attentive Lifting for Infrared-Visible Person Re-Identification | Xing Wei; Diangang Li; Xiaopeng Hong; Wei Ke; Yihong Gong; | This paper proposes a novel attention-based approach to handle the two difficulties in a unified framework. |
114 | Multimodal Representation with Embedded Visual Guiding Objects for Named Entity Recognition in Social Media Posts | Zhiwei Wu; Changmeng Zheng; Yi Cai; Junying Chen; Ho-fung Leung; Qing Li; | In this paper, we propose a neural network which combines object-level image information and character-level text information to predict entities. |
115 | Context-Aware Multi-View Summarization Network for Image-Text Matching | Leigang Qu; Meng Liu; Da Cao; Liqiang Nie; Qi Tian; | Toward this end, we present a novel context-aware multi-view summarization network to summarize context-enhanced visual region information from multiple views. |
116 | Performance over Random: A Robust Evaluation Protocol for Video Summarization Methods | Evlampios Apostolidis; Eleni Adamantidou; Alexandros I. Metsai; Vasileios Mezaris; Ioannis Patras; | This paper proposes a new evaluation approach for video summarization algorithms. |
117 | Concept Drift Detection for Multivariate Data Streams and Temporal Segmentation of Daylong Egocentric Videos | Pravin Nagar; Mansi Khemka; Chetan Arora; | In this paper, we present a novel unsupervised temporal segmentation technique especially suited for day-long egocentric videos. |
118 | Distributed Multi-agent Video Fast-forwarding | Shuyue Lan; Zhilu Wang; Amit K. Roy-Chowdhury; Ermin Wei; Qi Zhu; | This paper presents a consensus-based distributed multi-agent video fast-forwarding framework, named DMVF, that fast-forwards multi-view video streams collaboratively and adaptively. |
119 | Controllable Video Captioning with an Exemplar Sentence | Yitian Yuan; Lin Ma; Jingwen Wang; Wenwu Zhu; | In this paper, we investigate a novel and challenging task, namely controllable video captioning with an exemplar sentence. |
120 | MMFL: Multimodal Fusion Learning for Text-Guided Image Inpainting | Qing Lin; Bo Yan; Jichun Li; Weimin Tan; | We propose a multimodal fusion learning method for image inpainting (MMFL). |
121 | Vision Meets Wireless Positioning: Effective Person Re-identification with Recurrent Context Propagation | Yiheng Liu; Wengang Zhou; Mao Xi; Sanjing Shen; Houqiang Li; | In this work, we approach person re-identification with the sensing data from both vision and wireless positioning. |
122 | Structural Semantic Adversarial Active Learning for Image Captioning | Beichen Zhang; Liang Li; Li Su; Shuhui Wang; Jincan Deng; Zheng-Jun Zha; Qingming Huang; | To solve this problem, we propose a structural semantic adversarial active learning (SSAAL) model that leverages both visual and textual information for deriving the most representative samples while maximizing the image captioning performance. |
123 | MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis | Devamanyu Hazarika; Roger Zimmermann; Soujanya Poria; | In this paper, we aim to learn effective modality representations to aid the process of fusion. |
124 | Multi-modal Cooking Workflow Construction for Food Recipes | Liang-Ming Pan; Jingjing Chen; Jianlong Wu; Shaoteng Liu; Chong-Wah Ngo; Min-Yen Kan; Yugang Jiang; Tat-Seng Chua; | In this paper, we build MM-ReS, the first large-scale dataset for cooking workflow construction, consisting of 9,850 recipes with human-labeled workflow graphs. |
125 | Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition | Yuqian Fu; Li Zhang; Junke Wang; Yanwei Fu; Yu-Gang Jiang; | In this paper, we propose a depth guided Adaptive Meta-Fusion Network for few-shot video recognition which is termed as AMeFu-Net. |
126 | Adaptive Temporal Triplet-loss for Cross-modal Embedding Learning | David Semedo; João Magalhães; | In this work, we seek for highly expressive loss functions that allow the encoding of data temporal traits into cross-modal embedding spaces. |
127 | Scene-Aware Background Music Synthesis | Yujia Wang; Wei Liang; Wanwan Li; Dingzeyu Li; Lap-Fai Yu; | In this paper, we introduce an interactive background music synthesis algorithm guided by visual content. |
128 | Deep-Modal: Real-Time Impact Sound Synthesis for Arbitrary Shapes | Xutong Jin; Sheng Li; Tianshu Qu; Dinesh Manocha; Guoping Wang; | We present a novel learning-based impact sound synthesis algorithm called Deep-Modal. |
129 | Pop Music Transformer: Beat-based Modeling and Generation of Expressive Pop Piano Compositions | Yu-Siang Huang; Yi-Hsuan Yang; | In contrast with this general approach, this paper shows that Transformers can do even better for music modeling, when we improve the way a musical score is converted into the data fed to a Transformer model. |
130 | Make Your Favorite Music Curative: Music Style Transfer for Anxiety Reduction | Zhejing Hu; Yan Liu; Gong Chen; Sheng-hua Zhong; Aiwei Zhang; | This paper proposes a novel style transfer model to generate the therapeutic music according to user’s preference. |
131 | PopMAG: Pop Music Accompaniment Generation | Yi Ren; Jinzheng He; Xu Tan; Tao Qin; Zhou Zhao; Tie-Yan Liu; | To improve harmony, in this paper, we propose a novel MUlti-track MIDI representation (MuMIDI), which enables simultaneous multi-track generation in a single sequence and explicitly models the dependency of the notes from different tracks. |
132 | DeepSonar: Towards Effective and Robust Detection of AI-Synthesized Fake Voices | Run Wang; Felix Juefei-Xu; Yihao Huang; Qing Guo; Xiaofei Xie; Lei Ma; Yang Liu; | In this paper, we devise a novel approach, named DeepSonar, based on monitoring neuron behaviors of speaker recognition (SR) system, i.e., a deep neural network (DNN), to discern AI-synthesized fake voices. |
133 | FakePolisher: Making DeepFakes More Detection-Evasive by Shallow Reconstruction | Yihao Huang; Felix Juefei-Xu; Run Wang; Qing Guo; Lei Ma; Xiaofei Xie; Jianwen Li; Weikai Miao; Yang Liu; Geguang Pu; | Towards reducing the artifacts in the synthesized images, in this paper, we devise a simple yet powerful approach termed FakePolisher that performs shallow reconstruction of fake images through a learned linear dictionary, intending to effectively and efficiently reduce the artifacts introduced during image synthesis. |
134 | Boosting Visual Question Answering with Context-aware Knowledge Aggregation | Guohao Li; Xin Wang; Wenwu Zhu; | To solve the challenging issue, we propose a Knowledge Graph Augmented (KG-Aug) model which conducts context-aware knowledge aggregation on external knowledge graphs, requiring no ground-truth knowledge facts for extra supervision. |
135 | Memory-Augmented Relation Network for Few-Shot Learning | Jun He; Richang Hong; Xueliang Liu; Mingliang Xu; Zheng-Jun Zha; Meng Wang; | We investigate a new metric-learning method to explicitly exploit these relationships. |
136 | K-armed Bandit based Multi-Modal Network Architecture Search for Visual Question Answering | Yiyi Zhou; Rongrong Ji; Xiaoshuai Sun; Gen Luo; Xiaopeng Hong; Jinsong Su; Xinghao Ding; Ling Shao; | In this paper, we propose a cross-modal network architecture search (NAS) algorithm for VQA, termed as k-Armed Bandit based NAS (KAB-NAS). |
137 | Adversarial Graph Representation Adaptation for Cross-Domain Facial Expression Recognition | Yuan Xie; Tianshui Chen; Tao Pu; Hefeng Wu; Liang Lin; | In this work, we propose a novel Adversarial Graph Representation Adaptation (AGRA) framework that unifies graph representation propagation with adversarial learning for cross-domain holistic-local feature co-adaptation. |
138 | KBGN: Knowledge-Bridge Graph Network for Adaptive Vision-Text Reasoning in Visual Dialogue | Xiaoze Jiang; Siyi Du; Zengchang Qin; Yajing Sun; Jing Yu; | In this paper, we propose a novel Knowledge-Bridge Graph Network (KBGN) model by using graph to bridge the cross-modal semantic relations between vision and text knowledge in fine granularity, as well as retrieving required knowledge via an adaptive information selection mode. |
139 | Cascade Grouped Attention Network for Referring Expression Segmentation | Gen Luo; Yiyi Zhou; Rongrong Ji; Xiaoshuai Sun; Jinsong Su; Chia-Wen Lin; Qi Tian; | In this paper, we focus on addressing this issue by proposing a Cascade Grouped Attention Network (CGAN) with two innovative designs: Cascade Grouped Attention (CGA) and Instance-level Attention (ILA) loss. |
140 | Reinforcement Learning for Weakly Supervised Temporal Grounding of Natural Language in Untrimmed Videos | Jie Wu; Guanbin Li; Xiaoguang Han; Liang Lin; | In this paper, we propose a Boundary Adaptive Refinement (BAR) framework that resorts to reinforcement learning (RL) to guide the process of progressively refining the temporal boundary. |
141 | Poet: Product-oriented Video Captioner for E-commerce | Shengyu Zhang; Ziqi Tan; Jin Yu; Zhou Zhao; Kun Kuang; Jie Liu; Jingren Zhou; Hongxia Yang; Fei Wu; | To address this problem, we propose a product-oriented video captioner framework, abbreviated as Poet. |
142 | Text-Guided Neural Image Inpainting | Lisai Zhang; Qingcai Chen; Baotian Hu; Shuoran Jiang; | The goal of this paper is to fill the semantic information in corrupted images according to the provided descriptive text. |
143 | Single-Shot Two-Pronged Detector with Rectified IoU Loss | Keyang Wang; Lei Zhang; | In this paper, we introduce a novel two-pronged transductive idea to explore the relationship among different layers in both backward and forward directions, which can enrich the semantic information of low-level features and detailed information of high-level features at the same time. |
144 | Dynamic Context-guided Capsule Network for Multimodal Machine Translation | Huan Lin; Fandong Meng; Jinsong Su; Yongjing Yin; Zhengyuan Yang; Yubin Ge; Jie Zhou; Jiebo Luo; | To address the above issues, in this paper, we propose a novel Dynamic Context-guided Capsule Network (DCCN) for MMT. |
145 | Differentiable Manifold Reconstruction for Point Cloud Denoising | Shitong Luo; Wei Hu; | To this end, we propose to learn the underlying manifold of a noisy point cloud from differentiably subsampled points with trivial noise perturbation and their embedded neighborhood feature, aiming to capture intrinsic structures in point clouds. |
146 | BS-MCVR: Binary-sensing based Mobile-cloud Visual Recognition | Hongyi Zheng; Wangmeng Zuo; Lei Zhang; | In this work, we present a machine-perception-oriented MCVR system, called BS-MCVR, where the mobile end is designed to efficiently sense highly compact and discriminative features directly from the scene, and the sensed features are analyzed on the cloud for recognition. |
147 | Learning Modality-Invariant Latent Representations for Generalized Zero-shot Learning | Jingjing Li; Mengmeng Jing; Lei Zhu; Zhengming Ding; Ke Lu; Yang Yang; | In this paper, therefore, we present a new method which can simultaneously generate both visual representations and semantic representations so that the essential multi-modal information associated with unseen classes can be captured. |
148 | Describe What to Change: A Text-guided Unsupervised Image-to-image Translation Approach | Yahui Liu; Marco De Nadai; Deng Cai; Huayang Li; Xavier Alameda-Pineda; Nicu Sebe; Bruno Lepri; | In this work, we propose a novel unsupervised approach, based on image-to-image translation, that alters the attributes of a given image through a command-like sentence such as "change the hair color to black". |
149 | INCLUDE: A Large Scale Dataset for Indian Sign Language Recognition | Advaith Sridhar; Rohith Gandhi Ganesan; Pratyush Kumar; Mitesh Khapra; | In this work, we present the Indian Lexicon Sign Language Dataset – INCLUDE – an ISL dataset that contains 0.27 million frames across 4,287 videos over 263 word signs from 15 different word categories. |
150 | Amora: Black-box Adversarial Morphing Attack | Run Wang; Felix Juefei-Xu; Qing Guo; Yihao Huang; Xiaofei Xie; Lei Ma; Yang Liu; | In this paper, we investigate and introduce a new type of adversarial attack to evade FR systems by manipulating facial content, called adversarial morphing attack (a.k.a. Amora). |
151 | Visual Relation of Interest Detection | Fan Yu; Haonan Wang; Tongwei Ren; Jinhui Tang; Gangshan Wu; | In this paper, we propose a novel Visual Relation of Interest Detection (VROID) task, which aims to detect visual relations that are important for conveying the main content of an image, motivated from the intuition that not all correctly detected relations are really "interesting" in semantics and only a fraction of them really make sense for representing the image main content. |
152 | University-1652: A Multi-view Multi-source Benchmark for Drone-based Geo-localization | Zhedong Zheng; Yunchao Wei; Yi Yang; | To verify the effectiveness of the drone platform, we introduce a new multi-view multi-source benchmark for drone-based geo-localization, named University-1652. |
153 | DIPDefend: Deep Image Prior Driven Defense against Adversarial Examples | Tao Dai; Yan Feng; Dongxian Wu; Bin Chen; Jian Lu; Yong Jiang; Shu-Tao Xia; | Motivated by deep image prior that can capture rich image statistics from a single image, we propose an effective Deep Image Prior Driven Defense (DIPDefend) method against adversarial examples. |
154 | TRIE: End-to-End Text Reading and Information Extraction for Document Understanding | Peng Zhang; Yunlu Xu; Zhanzhan Cheng; Shiliang Pu; Jing Lu; Liang Qiao; Yi Niu; Fei Wu; | In this paper, we propose a unified end-to-end text reading and information extraction network, where the two tasks can reinforce each other. |
155 | Adversarial Privacy-preserving Filter | Jiaming Zhang; Jitao Sang; Xian Zhao; Xiaowen Huang; Yanfeng Sun; Yongli Hu; | This work aims to develop a privacy-preserving solution, called Adversarial Privacy-preserving Filter (APF), to protect the online shared face images from being maliciously used. |
156 | Mix Dimension in Poincaré Geometry for 3D Skeleton-based Action Recognition | Wei Peng; Jingang Shi; Zhaoqiang Xia; Guoying Zhao; | In this paper, we provide an orthogonal way to explore the underlying connections. |
157 | Dynamic Extension Nets for Few-shot Semantic Segmentation | Lizhao Liu; Junyi Cao; Minqian Liu; Yong Guo; Qi Chen; Mingkui Tan; | To address the above issues, we propose a Dynamic Extension Network (DENet) in which we dynamically construct and maintain a classifier for the novel class by leveraging the knowledge from the base classes and the information from novel data. |
158 | Fast Enhancement for Non-Uniform Illumination Images using Light-weight CNNs | Feifan Lv; Bo Liu; Feng Lu; | This paper proposes a new light-weight convolutional neural network (~5k params) for non-uniform illumination image enhancement to handle color, exposure, contrast, noise and artifacts, etc., simultaneously and effectively. |
159 | Animating Through Warping: An Efficient Method for High-Quality Facial Expression Animation | Zili Yi; Qiang Tang; Vishnu Sanjay Ramiya Srinivasan; Zhan Xu; | Motivated by the idea that HD images can be generated by adding high-frequency residuals to low-resolution results produced by a neural network, we propose a novel framework known as Animating Through Warping (ATW) to enable efficient animation of HD images. |
160 | Exploiting Better Feature Aggregation for Video Object Detection | Liang Han; Pichao Wang; Zhaozheng Yin; Fan Wang; Hao Li; | To exploit better feature aggregation, in this paper, we propose two improvements over previous works: a class-constrained spatial-temporal relation network and a correlation-based feature alignment module. |
161 | NuI-Go: Recursive Non-Local Encoder-Decoder Network for Retinal Image Non-Uniform Illumination Removal | Chongyi Li; Huazhu Fu; Runmin Cong; Zechao Li; Qianqian Xu; | To address this issue, we propose a non-uniform illumination removal network for retinal image, called NuI-Go, which consists of three Recursive Non-local Encoder-Decoder Residual Blocks (NEDRBs) for enhancing the degraded retinal images in a progressive manner. |
162 | Online Filtering Training Samples for Robust Visual Tracking | Jie Zhao; Kenan Dai; Dong Wang; Huchuan Lu; Xiaoyun Yang; | In this paper, we propose an optimization module named MetricNet for online filtering training samples. |
163 | Boosting Continuous Sign Language Recognition via Cross Modality Augmentation | Junfu Pu; Wengang Zhou; Hezhen Hu; Houqiang Li; | To tackle this issue, we propose a novel architecture with cross modality augmentation. |
164 | ThumbNet: One Thumbnail Image Contains All You Need for Recognition | Chen Zhao; Bernard Ghanem; | Based on the fact that input images of a CNN contain substantial redundancy, in this paper, we propose a unified framework, dubbed as ThumbNet, to simultaneously accelerate and compress CNN models by enabling them to infer on one thumbnail image. |
165 | Dual Temporal Memory Network for Efficient Video Object Segmentation | Kaihua Zhang; Long Wang; Dong Liu; Bo Liu; Qingshan Liu; Zhu Li; | We present an end-to-end network which stores short- and long-term video sequence information preceding the current frame as the temporal memories to address the temporal modeling in VOS. |
166 | Cooperative Bi-path Metric for Few-shot Learning | Zeyuan Wang; Yifan Zhao; Jia Li; Yonghong Tian; | In this paper, we make two contributions to investigate the few-shot classification problem. |
167 | From Design Draft to Real Attire: Unaligned Fashion Image Translation | Yu Han; Shuai Yang; Wenjing Wang; Jiaying Liu; | In this paper, we study a new unaligned translation problem between design drafts and real fashion items, whose main challenge lies in the huge misalignment between the two modalities. |
168 | Siamese Attentive Graph Tracking | Fei Zhao; Ting Zhang; Chao Ma; Ming Tang; Jinqiao Wang; Xiaobo Wang; | In this paper, we propose to advance Siamese trackers with graph convolutional networks, which pay more attention to the structural layout of target objects, to learn features robust to large appearance changes over time. |
169 | HiFaceGAN: Face Renovation via Collaborative Suppression and Replenishment | Lingbo Yang; Shanshe Wang; Siwei Ma; Wen Gao; Chang Liu; Pan Wang; Peiran Ren; | In this paper, we investigate a more challenging and practical "dual-blind" version of the problem by lifting the requirements on both types of prior, termed as "Face Renovation"(FR). |
170 | Discernible Image Compression | Zhaohui Yang; Yunhe Wang; Chang Xu; Peng Du; Chao Xu; Chunjing Xu; Qi Tian; | Based on the encoder-decoder framework, we propose using a pre-trained CNN to extract features of the original and compressed images, and making them similar. |
171 | Forest R-CNN: Large-Vocabulary Long-Tailed Object Detection and Instance Segmentation | Jialian Wu; Liangchen Song; Tiancai Wang; Qian Zhang; Junsong Yuan; | To alleviate the imbalanced learning caused by the long-tail phenomena, we propose a simple yet effective resampling method, NMS Resampling, to re-balance the data distribution. |
172 | Adv-watermark: A Novel Watermark Perturbation for Adversarial Examples | Xiaojun Jia; Xingxing Wei; Xiaochun Cao; Xiaoguang Han; | In this paper, we propose a novel watermark perturbation for adversarial examples (Adv-watermark) which combines image watermarking techniques and adversarial example algorithms. |
173 | Dual In-painting Model for Unsupervised Gaze Correction and Animation in the Wild | Jichao Zhang; Jingjing Chen; Hao Tang; Wei Wang; Yan Yan; Enver Sangineto; Nicu Sebe; | We address the problem of unsupervised gaze correction in the wild, presenting a solution that works without the need of precise annotations of the gaze angle and the head pose. |
174 | Learning Hierarchical Graph for Occluded Pedestrian Detection | Gang Li; Jian Li; Shanshan Zhang; Jian Yang; | In this paper, we propose a novel Hierarchical Graph Pedestrian Detector (HGPD), which integrates semantic and spatial relation information to construct two graphs named intra-proposal graph and inter-proposal graph, without relying on extra cues w.r.t. visible regions. |
175 | Adaptively-Accumulated Knowledge Transfer for Partial Domain Adaptation | Taotao Jing; Haifeng Xia; Zhengming Ding; | In this work, we propose an Adaptively-Accumulated Knowledge Transfer framework (A²KT) to align the relevant categories across two domains for effective domain adaptation. |
176 | Box Guided Convolution for Pedestrian Detection | Jinpeng Li; Shengcai Liao; Hangzhi Jiang; Ling Shao; | In particular, we proposed a box guided convolution (BGC) that can dynamically adjust the sizes of convolution kernels guided by the predicted bounding boxes. |
177 | Stronger, Faster and More Explainable: A Graph Convolutional Baseline for Skeleton-based Action Recognition | Yi-Fan Song; Zhang Zhang; Caifeng Shan; Liang Wang; | In this work, we propose an efficient but strong baseline based on Graph Convolutional Network (GCN), where three main improvements are aggregated, i.e., early fused Multiple Input Branches (MIB), Residual GCN (ResGCN) with bottleneck structure and Part-wise Attention (PartAtt) block. |
178 | Adversarial Image Attacks Using Multi-Sample and Most-Likely Ensemble Methods | Xia Du; Chi-Man Pun; | In this paper, we propose the multi-sample ensemble method (MSEM) and most-likely ensemble method (MLEM) to generate adversarial attacks that successfully fool the classifier for images in both the digital and real worlds. |
179 | DCSFN: Deep Cross-scale Fusion Network for Single Image Rain Removal | Cong Wang; Xiaoying Xing; Yutong Wu; Zhixun Su; Junyang Chen; | In this paper, we explore the cross-scale manner between networks and inner-scale fusion operation to solve the image rain removal task. |
180 | Self-Paced Video Data Augmentation by Generative Adversarial Networks with Insufficient Samples | Yumeng Zhang; Gaoguo Jia; Li Chen; Mingrui Zhang; Junhai Yong; | In this study, we propose a generative data augmentation method for video classification using dynamic images. |
181 | CF-SIS: Semantic-Instance Segmentation of 3D Point Clouds by Context Fusion with Self-Attention | Xin Wen; Zhizhong Han; Geunhyuk Youk; Yu-Shen Liu; | To address the above two issues, we propose a novel network of feature context fusion for SIS task, named CF-SIS. |
182 | Hybrid Resolution Network Using Edge Guided Region Mutual Information Loss for Human Parsing | Yunan Liu; Liang Zhao; Shanshan Zhang; Jian Yang; | In this paper, we propose a new method for human parsing, which effectively maintains high-resolution representations and leverages body edge details to improve the performance. |
183 | Meta-RCNN: Meta Learning for Few-Shot Object Detection | Xiongwei Wu; Doyen Sahoo; Steven Hoi; | In this paper, we investigate the problem of few-shot object detection, where a detector has access to only limited amounts of annotated data. |
184 | Objectness Consistent Representation for Weakly Supervised Object Detection | Ke Yang; Peng Zhang; Peng Qiao; Zhiyuan Wang; Dongsheng Li; Yong Dou; | In this paper, we propose a novel object representation named Objectness Consistent Representation (OCRepr) to meet the consistency criterion of objectness. |
185 | Unpaired Image Enhancement with Quality-Attention Generative Adversarial Network | Zhangkai Ni; Wenhan Yang; Shiqi Wang; Lin Ma; Sam Kwong; | In this work, we aim to learn an unpaired image enhancement model, which can enrich low-quality images with the characteristics of high-quality images provided by users. |
186 | ASTA-Net: Adaptive Spatio-Temporal Attention Network for Person Re-Identification in Videos | Xierong Zhu; Jiawei Liu; Haoze Wu; Meng Wang; Zheng-Jun Zha; | In this work, we propose a novel Adaptive Spatio-Temporal Attention Network (ASTA-Net) to adaptively aggregate the spatial and temporal attention features into discriminative pedestrian representation for person re-identification in videos. |
187 | Talking Face Generation with Expression-Tailored Generative Adversarial Network | Dan Zeng; Han Liu; Hui Lin; Shiming Ge; | In this paper, we propose an end-to-end Expression-Tailored Generative Adversarial Network (ET-GAN) to generate an expression enriched talking face video of arbitrary identity. |
188 | Cross-Modal Omni Interaction Modeling for Phrase Grounding | Tianyu Yu; Tianrui Hui; Zhihao Yu; Yue Liao; Sansi Yu; Faxi Zhang; Si Liu; | In this paper, we propose a novel Cross-Modal Omni Interaction network (COI Net) composed of a neighboring interaction module, a global interaction module, a cross-modal interaction module and a multilevel alignment module. |
189 | Bridging the Web Data and Fine-Grained Visual Recognition via Alleviating Label Noise and Domain Mismatch | Yazhou Yao; Xiansheng Hua; Guanyu Gao; Zeren Sun; Zhibin Li; Jian Zhang; | Specifically, we propose an end-to-end deep denoising network (DDN) model to jointly solve these problems in the process of web images selection. |
190 | Is Depth Really Necessary for Salient Object Detection? | Jiawei Zhao; Yifan Zhao; Jia Li; Xiaowu Chen; | Taking the advantages of RGB and RGBD methods, we propose a novel depth-aware salient object detection framework, which has following superior designs: 1) It does not rely on depth data in the testing phase. |
191 | Self-Play Reinforcement Learning for Fast Image Retargeting | Nobukatsu Kajiura; Satoshi Kosugi; Xueting Wang; Toshihiko Yamasaki; | In this study, we address image retargeting, which is a task that adjusts input images to arbitrary sizes. |
192 | Brain-media: A Dual Conditioned and Lateralization Supported GAN (DCLS-GAN) towards Visualization of Image-evoked Brain Activities | Ahmed Fares; Sheng-hua Zhong; Jianmin Jiang; | To ensure that such extracted multimedia elements remain meaningful, we introduce a dually conditioned learning technique in the proposed deep framework, where one condition is analyzing EEGs through deep learning to extract a class-dependent and more compact brain feature space utilizing the distinctive characteristics of hemispheric lateralization and brain stimulation, and the other is to extract expressive visual features assisting our automated analysis of brain activities as well as their visualizations aided by artificial intelligence. |
193 | Mesh Guided One-shot Face Reenactment Using Graph Convolutional Networks | Guangming Yao; Yi Yuan; Tianjia Shao; Kun Zhou; | In this paper, we introduce a method for one-shot face reenactment, which uses the reconstructed 3D meshes (i.e., the source mesh and driving mesh) as guidance to learn the optical flow needed for the reenacted face synthesis. |
194 | Controllable Continuous Gaze Redirection | Weihao Xia; Yujiu Yang; Jing-Hao Xue; Wensen Feng; | In this work, we present interpGaze, a novel framework for controllable gaze redirection that achieves both precise redirection and continuous interpolation. |
195 | Preserving Global and Local Temporal Consistency for Arbitrary Video Style Transfer | Xinxiao Wu; Jialu Chen; | In this paper, we propose a novel fast method that explores both global and local temporal consistency for video style transfer without estimating optical flow. |
196 | Deep Shapely Portraits | Qinjie Xiao; Xiangjun Tang; You Wu; Leyang Jin; Yong-Liang Yang; Xiaogang Jin; | We present deep shapely portraits, a novel method based on deep learning, to automatically reshape an input portrait to be better proportioned and more shapely while keeping personal facial characteristics. |
197 | Depth Super-Resolution via Deep Controllable Slicing Network | Xinchen Ye; Baoli Sun; Zhihui Wang; Jingyu Yang; Rui Xu; Haojie Li; Baopu Li; | To alleviate these problems, we propose a deep controllable slicing network from a novel perspective. |
198 | Efficient Joint Gradient Based Attack Against SOR Defense for 3D Point Cloud Classification | Chengcheng Ma; Weiliang Meng; Baoyuan Wu; Shibiao Xu; Xiaopeng Zhang; | In this paper, we propose a novel white-box attack method, Joint Gradient Based Attack (JGBA), aiming to break the SOR defense. |
199 | Discrete Haze Level Dehazing Network | Xiaofeng Cong; Jie Gui; Kai-Chao Miao; Jun Zhang; Bing Wang; Peng Chen; | In this paper, a Discrete Haze Level Dehazing network (DHL-Dehaze), a very effective method to dehaze multiple different haze level images, is proposed. |
200 | Deep Heterogeneous Multi-Task Metric Learning for Visual Recognition and Retrieval | Shikang Gan; Yong Luo; Yonggang Wen; Tongliang Liu; Han Hu; | To overcome this drawback, we propose a deep heterogeneous MTML (DHMTML) method, in which a nonlinear mapping is learned for each task by using a deep neural network. |
201 | HOSE-Net: Higher Order Structure Embedded Network for Scene Graph Generation | Meng Wei; Chun Yuan; Xiaoyu Yue; Kuo Zhong; | Accordingly, in this paper, we propose a Higher Order Structure Embedded Network (HOSE-Net) to mitigate this issue. |
202 | Dual Semantic Fusion Network for Video Object Detection | Lijian Lin; Haosheng Chen; Honglun Zhang; Jun Liang; Yu Li; Ying Shan; Hanzi Wang; | In this work, we propose a dual semantic fusion network (abbreviated as DSFNet) to fully exploit both frame-level and instance-level semantics in a unified fusion framework without external guidance. |
203 | Sharp Multiple Instance Learning for DeepFake Video Detection | Xiaodan Li; Yining Lang; Yuefeng Chen; Xiaofeng Mao; Yuan He; Shuhui Wang; Hui Xue; Quan Lu; | In this paper, we introduce a new problem of partial face attack in DeepFake video, where only video-level labels are provided but not all the faces in the fake videos are manipulated. |
204 | Learning to Detect Specular Highlights from Real-world Images | Gang Fu; Qing Zhang; Qifeng Lin; Lei Zhu; Chunxia Xiao; | In this paper, we present a large-scale real-world highlight dataset containing a rich variety of material categories, with diverse highlight shapes and appearances, in which each image has an annotated ground-truth mask. |
205 | Video Super-Resolution using Multi-scale Pyramid 3D Convolutional Networks | Jianping Luo; Shaofei Huang; Yuan Yuan; | In this paper, we propose a multi-scale pyramid 3D convolutional (MP3D) network for video SR, where 3D convolution can explore temporal correlation directly without explicit motion compensation. |
206 | PCA-SRGAN: Incremental Orthogonal Projection Discrimination for Face Super-resolution | Hao Dou; Chen Chen; Xiyuan Hu; Zuxing Xuan; Zhisen Hu; Silong Peng; | To further improve the performance of GAN-based models on super-resolving face images, we propose PCA-SRGAN which pays attention to the cumulative discrimination in the orthogonal projection space spanned by PCA projection matrix of face data. |
207 | Exploring Font-independent Features for Scene Text Recognition | Yizhi Wang; Zhouhui Lian; | Specifically, we introduce trainable font embeddings to shape the font styles of generated glyphs, with the image feature of scene text only representing its essential patterns. |
208 | Context-aware Feature Generation For Zero-shot Semantic Segmentation | Zhangxuan Gu; Siyuan Zhou; Li Niu; Zihan Zhao; Liqing Zhang; | In this paper, we propose a novel context-aware feature generation method for zero-shot segmentation named CaGNet. |
209 | Defending Adversarial Examples via DNN Bottleneck Reinforcement | Wenqing Liu; Miaojing Shi; Teddy Furon; Li Li; | In order to reinforce the information bottleneck, we introduce the multi-scale low-pass objective and multi-scale high-frequency communication for better frequency steering in the network. |
210 | Weakly-Supervised Video Object Grounding by Exploring Spatio-Temporal Contexts | Xun Yang; Xueliang Liu; Meng Jian; Xinjian Gao; Meng Wang; | To fill the research gap, this paper presents a weakly-supervised framework for linking objects mentioned in a sentence with the corresponding regions in videos. |
211 | S2SiamFC: Self-supervised Fully Convolutional Siamese Network for Visual Tracking | Chon Hou Sio; Yu-Jen Ma; Hong-Han Shuai; Jun-Cheng Chen; Wen-Huang Cheng; | To exploit rich information from unlabeled data, in this work, we propose a novel self-supervised framework for visual tracking which can easily adapt the state-of-the-art supervised Siamese-based trackers into unsupervised ones by utilizing the fact that an image and any cropped region of it can form a natural pair for self-training. |
212 | Learnable Optimal Sequential Grouping for Video Scene Detection | Daniel Rotman; Yevgeny Yaroker; Elad Amrani; Udi Barzelay; Rami Ben-Ari; | In this work, we extend the capabilities of OSG to the learning regime. |
213 | NOH-NMS: Improving Pedestrian Detection by Nearby Objects Hallucination | Penghao Zhou; Chong Zhou; Pai Peng; Junlong Du; Xing Sun; Xiaowei Guo; Feiyue Huang; | Thus, we propose the Nearby Objects Hallucinator (NOH), which pinpoints the objects nearby each proposal with a Gaussian distribution, together with NOH-NMS, which dynamically eases the suppression for the space that might contain other objects with a high likelihood. |
214 | Dual-Gradients Localization Framework for Weakly Supervised Object Localization | Chuangchuang Tan; Guanghua Gu; Tao Ruan; Shikui Wei; Yao Zhao; | In this work, we propose an offline framework to achieve precise localization on any convolutional layer of a classification model by exploiting two kinds of gradients, called Dual-Gradients Localization (DGL) framework. |
215 | DualLip: A System for Joint Lip Reading and Generation | Weicong Chen; Xu Tan; Yingce Xia; Tao Qin; Yu Wang; Tie-Yan Liu; | In this paper, we develop DualLip, a system that jointly improves lip reading and generation by leveraging the task duality and using unlabeled text and lip video data. |
216 | Dual Attention GANs for Semantic Image Synthesis | Hao Tang; Song Bai; Nicu Sebe; | In this paper, we focus on the semantic image synthesis task that aims at transferring semantic label maps to photo-realistic images. |
217 | SimSwap: An Efficient Framework For High Fidelity Face Swapping | Renwang Chen; Xuanhong Chen; Bingbing Ni; Yanhao Ge; | We propose an efficient framework, called Simple Swap (SimSwap), aiming for generalized and high fidelity face swapping. |
218 | Self-Mimic Learning for Small-scale Pedestrian Detection | Jialian Wu; Chunluan Zhou; Qian Zhang; Ming Yang; Junsong Yuan; | In this paper, we conduct an in-depth analysis of the small-scale pedestrian detection problem, which reveals that weak representations of small-scale pedestrians are the main cause for a classifier to miss them. |
219 | Action2Motion: Conditioned Generation of 3D Human Motions | Chuan Guo; Xinxin Zuo; Sen Wang; Shihao Zou; Qingyao Sun; Annan Deng; Minglun Gong; Li Cheng; | This paper, on the other hand, considers a relatively new problem, which could be thought of as an inverse of action recognition: given a prescribed action type, we aim to generate plausible human motion sequences in 3D. |
220 | Skin Textural Generation via Blue-noise Gabor Filtering based Generative Adversarial Network | Hui Zhang; Chuan Wang; Nenglun Chen; Jue Wang; Wenping Wang; | To this end, we propose a new facial noise generation method. |
221 | A Slow-I-Fast-P Architecture for Compressed Video Action Recognition | Jiapeng Li; Ping Wei; Yongchi Zhang; Nanning Zheng; | In this paper, we propose a novel Slow-I-Fast-P (SIFP) neural network model for compressed video action recognition. |
222 | DMVOS: Discriminative Matching for Real-time Video Object Segmentation | Peisong Wen; Ruolin Yang; Qianqian Xu; Chen Qian; Qingming Huang; Runmin Cong; Jianlou Si; | In this work, we propose Discriminative Matching for real-time Video Object Segmentation (DMVOS), a real-time VOS framework with high-accuracy to fill this gap. |
223 | Multi-Group Multi-Attention: Towards Discriminative Spatiotemporal Representation | Zhensheng Shi; Liangjie Cao; Cheng Guan; Ju Liang; Qianqian Li; Zhaorui Gu; Haiyong Zheng; Bing Zheng; | In this paper, we propose Multi-Group Multi-Attention, dubbed MGMA, paying more attention to "where and when" the action happens, for learning discriminative spatiotemporal representation in videos. |
224 | Vaccine-style-net: Point Cloud Completion in Implicit Continuous Function Space | Wei Yan; Ruonan Zhang; Jing Wang; Shan Liu; Thomas H. Li; Ge Li; | In this paper, we propose Vaccine-Style-Net, a new point cloud completion method that can produce high resolution 3D shapes with complete smooth surface. |
225 | Adaptive Wasserstein Hourglass for Weakly Supervised RGB 3D Hand Pose Estimation | Yumeng Zhang; Li Chen; Yufeng Liu; Wen Zheng; Junhai Yong; | In this paper, we propose a domain adaptation method called Adaptive Wasserstein Hourglass for weakly-supervised 3D hand pose estimation to close the large gap between synthetic and real-world datasets flexibly. |
226 | Weakly Supervised Segmentation with Maximum Bipartite Graph Matching | Weide Liu; Chi Zhang; Guosheng Lin; Tzu-Yi HUNG; Chunyan Miao; | We propose to improve the CAMs from a novel graph perspective. |
227 | Recognizing Camera Wearer from Hand Gestures in Egocentric Videos: https://egocentricbiometric.github.io/ | Daksh Thapar; Aditya Nigam; Chetan Arora; | In this work, we take the privacy threat a notch higher and show that even the wearer’s hand gestures, as seen through an egocentric video, leak the wearer’s identity. |
228 | Prototype-Matching Graph Network for Heterogeneous Domain Adaptation | Zijian Wang; Yadan Luo; Zi Huang; Mahsa Baktashmotlagh; | To tackle this problem, in this paper, we propose the Prototype-Matching Graph Network (PMGN), which gradually explores the domain-invariant class prototype representations. |
229 | Towards Lighter and Faster: Learning Wavelets Progressively for Image Super-Resolution | Huanrong Zhang; Zhi Jin; Xiaojun Tan; Xiying Li; | To address this trade-off issue between reconstruction performance, the number of network parameters, and inference time, we propose a lightweight and fast network (WSR) to learn wavelet coefficients of the target image progressively for single image super-resolution. |
230 | Spatio-Temporal Inception Graph Convolutional Networks for Skeleton-Based Action Recognition | Zhen Huang; Xu Shen; Xinmei Tian; Houqiang Li; Jianqiang Huang; Xian-Sheng Hua; | Specifically, we design a simple and highly modularized graph convolutional network architecture for skeleton-based action recognition. |
231 | Dynamic Future Net: Diversified Human Motion Generation | Wenheng Chen; He Wang; Yi Yuan; Tianjia Shao; Kun Zhou; | In this paper, we present Dynamic Future Net, a new deep learning model that explicitly focuses on the aforementioned motion stochasticity by constructing a generative model with non-trivial modelling capacity in temporal stochasticity. |
232 | ATF: Towards Robust Face Alignment via Leveraging Similarity and Diversity across Different Datasets | Xing Lan; Qinghao Hu; Fangzhou Xiong; Cong Leng; Jian Cheng; | To address the above problems, we propose a novel Alternating Training Framework (ATF), which leverages similarity and diversity across multi-media sources for a more robust detector. |
233 | Dual Gaussian-based Variational Subspace Disentanglement for Visible-Infrared Person Re-Identification | Nan Pu; Wei Chen; Yu Liu; Erwin M. Bakker; Michael S. Lew; | To solve the problem, we present a carefully designed dual Gaussian-based variational auto-encoder (DG-VAE), which disentangles an identity-discriminable and an identity-ambiguous cross-modality feature subspace, following a mixture-of-Gaussians (MoG) prior and a standard Gaussian distribution prior, respectively. |
234 | Attention Based Dual Branches Fingertip Detection Network and Virtual Key System | Chong Mou; Xin Zhang; | To rectify these problems, this paper proposes an attention-based dual branches network that can efficiently fulfill both fingertip detection and gesture recognition tasks. |
235 | Action Completeness Modeling with Background Aware Networks for Weakly-Supervised Temporal Action Localization | Md Moniruzzaman; Zhaozheng Yin; Zhihai He; Ruwen Qin; Ming C. Leu; | To solve these problems, we introduce a novel weakly-supervised Action Completeness Modeling with Background Aware Networks (ACM-BANets). |
236 | Adversarial Knowledge Transfer from Unlabeled Data | Akash Gupta; Rameswar Panda; Sujoy Paul; Jianming Zhang; Amit K. Roy-Chowdhury; | In this paper, we present a novel Adversarial Knowledge Transfer (AKT) framework for transferring knowledge from internet-scale unlabeled data to improve the performance of a classifier on a given visual recognition task. |
237 | Task Decoupled Knowledge Distillation For Lightweight Face Detectors | Xiaoqing Liang; Xu Zhao; Chaoyang Zhao; Nanfei Jiang; Ming Tang; Jinqiao Wang; | In this paper, we propose a task decoupled knowledge distillation method, which decouples the detection distillation task into two subtasks and uses different samples in distilling the features of different subtasks. |
238 | Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework | Li Tao; Xueting Wang; Toshihiko Yamasaki; | We propose a self-supervised method to learn feature representations from videos. |
239 | Memory Recursive Network for Single Image Super-Resolution | Jie Liu; Minqiang Zou; Jie Tang; Gangshan Wu; | To address these issues, we propose the memory recursive network (MRNet) to make full use of the output features at each stage. |
240 | Scale-aware Progressive Optimization Network | Ying Chen; Lifeng Huang; Chengying Gao; Ning Liu; | To this end, we propose a scale-aware progressive optimization network (SPO-Net) for crowd counting, which trains a scale adaptive network to achieve high-quality density map estimation and overcome the variable scale dilemma in highly congested scenes. |
241 | Resource Efficient Domain Adaptation | Junguang Jiang; Ximei Wang; Mingsheng Long; Jianmin Wang; | In this paper, we propose Resource Efficient Domain Adaptation (REDA), a general framework that can adaptively adjust computation resources across ‘easier’ and ‘harder’ inputs. |
242 | MGAAttack: Toward More Query-efficient Black-box Attack by Microbial Genetic Algorithm | Lina Wang; Kang Yang; Wenqi Wang; Run Wang; Aoshuang Ye; | To address the efficiency of querying in black-box attack, we propose a novel attack, called MGAAttack, which is a query-efficient and gradient-free black-box attack without obtaining any knowledge of the target model. |
243 | A Novel Graph-TCN with a Graph Structured Representation for Micro-expression Recognition | Ling Lei; Jianfeng Li; Tong Chen; Shigang Li; | To the best of our knowledge, we are the first to use the learning-based video motion magnification method to extract the features of shape representations from the intermediate layer while magnifying MEs. |
244 | Masked Face Recognition with Generative Data Augmentation and Domain Constrained Ranking | Mengyue Geng; Peixi Peng; Yangru Huang; Yonghong Tian; | To obtain sufficient training data, based on the MFSR, we introduce a novel Identity Aware Mask GAN (IAMGAN) with segmentation guided multi-level identity preserve module to generate the synthetic masked face images from the full face images. |
245 | Occlusion Detection for Automatic Video Editing | Junhua Liao; Haihan Duan; Xin Li; Haoran Xu; Yanbing Yang; Wei Cai; Yanru Chen; Liangyin Chen; | In this paper, to relieve the burden on video editors, a frame-level video occlusion detection method is proposed, which is a fundamental component of automatic video editing. |
246 | Cartoon Face Recognition: A Benchmark Dataset | Yi Zheng; Yifan Zhao; Mengyuan Ren; He Yan; Xiangju Lu; Junhui Liu; Jia Li; | To further investigate this challenging dataset, we propose a multi-task domain adaptation approach that jointly utilizes the human and cartoon domain knowledge with three discriminative regularizations. |
247 | Reversible Watermarking in Deep Convolutional Neural Networks for Integrity Authentication | Xiquan Guan; Huamin Feng; Weiming Zhang; Hang Zhou; Jie Zhang; Nenghai Yu; | In this paper, we propose a reversible watermarking algorithm for integrity authentication. |
248 | Masked Face Recognition with Latent Part Detection | Feifei Ding; Peixi Peng; Yangru Huang; Mengyue Geng; Yonghong Tian; | This paper focuses on a novel task named masked faces recognition (MFR), which aims to match masked faces with common faces and is important especially during the global outbreak of COVID-19. |
249 | PanelNet: A Novel Deep Neural Network for Predicting Collective Diagnostic Ratings by a Panel of Radiologists for Pulmonary Nodules | Chunyan Zhang; Songhua Xu; Zongfang Li; | To fill the overlooked gap, this study introduces a novel deep neural network, titled PanelNet, that is able to computationally model and reproduce the aforesaid collective diagnosis capability demonstrated by a group of medical experts. |
250 | Privacy-Preserving Visual Content Tagging using Graph Transformer Networks | Xuan-Son Vu; Duc-Trong Le; Christoffer Edlund; Lili Jiang; Hoang D. Nguyen; | Therefore, this paper proposes an end-to-end framework (SGTN) using Graph Transformer and Convolutional Networks to significantly improve classification and privacy preservation of visual data. |
251 | Rotationally-Consistent Novel View Synthesis for Humans | Youngjoong Kwon; Stefano Petrangeli; Dahun Kim; Haoliang Wang; Henry Fuchs; Viswanathan Swaminathan; | To solve these problems, we present in this paper a learning framework for the novel view synthesis of human subjects, which explicitly enforces consistency across different generated views of the subject. |
252 | Integrating Semantic Segmentation and Retinex Model for Low-Light Image Enhancement | Minhao Fan; Wenjing Wang; Wenhan Yang; Jiaying Liu; | We propose an enhancement pipeline with three parts that effectively utilize the semantic layer information. |
253 | Alleviating Human-level Shift: A Robust Domain Adaptation Method for Multi-person Pose Estimation | Xixia Xu; Qi Zou; Xue Lin; | Therefore, we propose a novel domain adaptation method for multi-person pose estimation to conduct the human-level topological structure alignment and fine-grained feature alignment. |
254 | SpatialGAN: Progressive Image Generation Based on Spatial Recursive Adversarial Expansion | Lei Zhao; Sihuan Lin; Ailin Li; Huaizhong Lin; Wei Xing; Dongming Lu; | In this paper, we propose a progressive spatial recursive adversarial expansion model (called SpatialGAN) capable of producing high-quality samples of the natural image. |
255 | Medical Visual Question Answering via Conditional Reasoning | Li-Ming Zhan; Bo Liu; Lu Fan; Jiaxin Chen; Xiao-Ming Wu; | In this paper, we propose a novel conditional reasoning framework for Med-VQA, aiming to automatically learn effective reasoning skills for various Med-VQA tasks. |
256 | Nighttime Dehazing with a Synthetic Benchmark | Jing Zhang; Yang Cao; Zheng-Jun Zha; Dacheng Tao; | To address this issue, we propose a novel synthetic method called 3R to simulate nighttime hazy images from daytime clear images, which first reconstructs the scene geometry, then simulates the light rays and object reflectance, and finally renders the haze effects. |
257 | Pay Attention Selectively and Comprehensively: Pyramid Gating Network for Human Pose Estimation without Pre-training | Chenru Jiang; Kaizhu Huang; Shufei Zhang; Xinheng Wang; Jimin Xiao; | To mitigate these problems, we propose a novel comprehensive recalibration model called Pyramid GAting Network (PGA-Net) that is capable of distilling, selecting, and fusing the discriminative and attention-aware features at different scales and different levels (i.e., both semantic and natural levels). |
258 | Data-driven Meta-set Based Fine-Grained Visual Recognition | Chuanyi Zhang; Yazhou Yao; Xiangbo Shu; Zechao Li; Zhenmin Tang; Qi Wu; | To this end, we propose a data-driven meta-set based approach to deal with noisy web images for fine-grained recognition. |
259 | WildDeepfake: A Challenging Real-World Dataset for Deepfake Detection | Bojia Zi; Minghao Chang; Jingjing Chen; Xingjun Ma; Yu-Gang Jiang; | To better support detection against real-world deepfakes, in this paper, we introduce a new dataset WildDeepfake, which consists of 7,314 face sequences extracted from 707 deepfake videos collected completely from the internet. |
260 | LodoNet: A Deep Neural Network with 2D Keypoint Matching for 3D LiDAR Odometry Estimation | Ce Zheng; Yecheng Lyu; Ming Li; Ziming Zhang; | In contrast, motivated by the success of image based feature extractors, we propose to transfer the LiDAR frames to image space and reformulate the problem as image feature extraction. |
261 | Memory-Based Network for Scene Graph with Unbalanced Relations | Weitao Wang; Ruyang Liu; Meng Wang; Sen Wang; Xiaojun Chang; Yang Chen; | For these reasons, we propose a novel scene graph generation model that can effectively improve the detection of low-frequency relations. |
262 | Pairwise Similarity Regularization for Adversarial Domain Adaptation | Haotian Wang; Wenjing Yang; Ji Wang; Ruxin Wang; Long Lan; Mingyang Geng; | To resolve this issue, we propose a Pairwise Similarity Regularization (PSR) approach that exploits cluster structures of the target domain data and minimizes the divergence between the pairwise similarity of clustering partition and that of pseudo predictions. |
263 | Generalized Zero-Shot Video Classification via Generative Adversarial Networks | Mingyao Hong; Guorong Li; Xinfeng Zhang; Qingming Huang; | In order to solve this problem, we propose a description text dataset based on the UCF101 action recognition dataset. |
264 | Drum Synthesis and Rhythmic Transformation with Adversarial Autoencoders | Maciej Tomczak; Masataka Goto; Jason Hockman; | This paper presents a method for joint synthesis and rhythm transformation of drum sounds through the use of adversarial autoencoders (AAE). |
265 | MMNet: Multi-Stage and Multi-Scale Fusion Network for RGB-D Salient Object Detection | Guibiao Liao; Wei Gao; Qiuping Jiang; Ronggang Wang; Ge Li; | To effectively capture multi-scale cross-modal fusion features, this paper proposes a novel Multi-stage and Multi-Scale Fusion Network (MMNet), which consists of a cross-modal multi-stage fusion module (CMFM) and a bi-directional multi-scale decoder (BMD). |
266 | Stable Video Style Transfer Based on Partial Convolution with Depth-Aware Supervision | Songhua Liu; Hao Wu; Shoutong Luo; Zhengxing Sun; | This work presents a novel training framework for video style transfer without dependency on a video dataset of the target style. |
267 | Video Synthesis via Transform-Based Tensor Neural Network | Yimeng Zhang; Xiao-Yang Liu; Bo Wu; Anwar Walid; | In this paper, we propose a novel multi-phase deep neural network Transform-Based Tensor-Net that exploits the low-rank structure of video data in a learned transform domain, which unfolds an Iterative Shrinkage-Thresholding Algorithm (ISTA) for tensor signal recovery. |
268 | Cluster Attention Contrast for Video Anomaly Detection | Ziming Wang; Yuexian Zou; Zeming Zhang; | To avoid these problems, we introduce a novel contrastive representation learning task, Cluster Attention Contrast, to establish subcategories of normality as clusters. |
269 | Automatic Interest Recognition from Posture and Behaviour | Wolmer Bigi; Claudio Baecchi; Alberto Del Bimbo; | To address all these aspects, we propose an automatic system that aims to recognize the user’s interest towards a garment by just looking at body posture and behaviour. |
270 | Referenceless Rate-Distortion Modeling with Learning from Bitstream and Pixel Features | Yangfan Sun; Li Li; Zhu Li; Shan Liu; | Therefore, to improve the fidelity of prediction, we propose a referenceless prediction-based R-QP modeling (PmR-QP) method to estimate bitrate by leveraging a deep learning algorithm with only one-pass coding. |
271 | MS2L: Multi-Task Self-Supervised Learning for Skeleton Based Action Recognition | Lilang Lin; Sijie Song; Wenhan Yang; Jiaying Liu; | In this paper, we address self-supervised representation learning from human skeletons for action recognition. |
272 | Domain-Adaptive Object Detection via Uncertainty-Aware Distribution Alignment | Dang-Khoa Nguyen; Wei-Lun Tseng; Hong-Han Shuai; | Specifically, we propose a Multi-level Entropy Attention Alignment (MEAA) method that consists of two main components: (1) Local Uncertainty Attentional Alignment (LUAA) module to accelerate the model better perceiving structure-invariant objects of interest by utilizing information theory to measure the uncertainty of each local region via the entropy of the pixel-wise domain classifier and (2) Multi-level Uncertainty-Aware Context Alignment (MUCA) module to enrich domain-invariant information of relevant objects based on the entropy of multi-level domain classifiers. |
273 | MM-Hand: 3D-Aware Multi-Modal Guided Hand Generation for 3D Hand Pose Synthesis | Zhenyu Wu; Duc Hoang; Shih-Yao Lin; Yusheng Xie; Liangjian Chen; Yen-Yu Lin; Zhangyang Wang; Wei Fan; | We propose a 3D-aware multi-modal guided hand generative network (MM-Hand), together with a novel geometry-based curriculum learning strategy. |
274 | Joint Self-Attention and Scale-Aggregation for Self-Calibrated Deraining Network | Cong Wang; Yutong Wu; Zhixun Su; Junyang Chen; | In this paper, we propose an effective algorithm, called JDNet, to solve the single image deraining problem and conduct the segmentation and detection task for applications. |
275 | Hybrid Dynamic-static Context-aware Attention Network for Action Assessment in Long Videos | Ling-An Zeng; Fa-Ting Hong; Wei-Shi Zheng; Qi-Zhi Yu; Wei Zeng; Yao-Wei Wang; Jian-Huang Lai; | In this work, we present a novel hybrid dynAmic-static Context-aware attenTION NETwork (ACTION-NET) for action assessment in long videos. |
276 | F2GAN: Fusing-and-Filling GAN for Few-shot Image Generation | Yan Hong; Li Niu; Jianfu Zhang; Weijie Zhao; Chen Fu; Liqing Zhang; | In this paper, we propose a Fusing-and-Filling Generative Adversarial Network (F2GAN) to generate realistic and diverse images for a new category with only a few images. |
277 | JAFPro: Joint Appearance Fusion and Propagation for Human Video Motion Transfer from Multiple Reference Images | Xianggang Yu; Haolin Liu; Xiaoguang Han; Zhen Li; Zixiang Xiong; Shuguang Cui; | We present a novel framework for human video motion transfer. |
278 | A W2VV++ Case Study with Automated and Interactive Text-to-Video Retrieval | Jakub Lokoč; Tomáš Souček; Patrik Veselý; František Mejzlík; Jiaqi Ji; Chaoxi Xu; Xirong Li; | To report on this challenging problem, we present two orthogonal task-based performance studies centered around the state-of-the-art W2VV++ query representation learning model for video retrieval. |
279 | Attention Cube Network for Image Restoration | Yucheng Hang; Qingmin Liao; Wenming Yang; Yupeng Chen; Jie Zhou; | To address these issues, we propose an attention cube network (A-CubeNet) for image restoration for more powerful feature expression and feature correlation learning. |
280 | CRNet: A Center-aware Representation for Detecting Text of Arbitrary Shapes | Yu Zhou; Hongtao Xie; Shancheng Fang; Yan Li; Yongdong Zhang; | To tackle these problems, we propose an anchor-free scene text detector leveraging Center-aware Representation to achieve accurate arbitrary-shaped scene text detection, namely CRNet. |
281 | Expressional Region Retrieval | Xiaoqian Guo; Xiangyang Li; Shuqiang Jiang; | In this paper, we introduce a new task called expressional region retrieval, in which the query is formulated as a region of image with the associated description. |
282 | ATRW: A Benchmark for Amur Tiger Re-identification in the Wild | Shuyuan Li; Jianguo Li; Hanlin Tang; Rui Qian; Weiyao Lin; | This paper tries to fill the gap by introducing a novel large-scale dataset, the Amur Tiger Re-identification in the Wild (ATRW) dataset. |
283 | VideoIC: A Video Interactive Comments Dataset and Multimodal Multitask Learning for Comments Generation | Weiying Wang; Jieting Chen; Qin Jin; | In order to support various related research, we build a large scale video interactive comments dataset called VideoIC, which consists of 4951 videos spanning 557 hours and 5 million comments. |
284 | Human Identification and Interaction Detection in Cross-View Multi-Person Videos with Wearable Cameras | Jiewen Zhao; Ruize Han; Yiyang Gan; Liang Wan; Wei Feng; Song Wang; | By focusing on two wearable cameras and the interactive activities that involve only two people, in this paper we develop a new approach that can simultaneously: (i) identify the same persons across the two videos, (ii) detect the interactive activities of interest, including their occurrence intervals and involved people, and (iii) recognize the category of each interactive activity. |
285 | Surface Reconstruction with Unconnected Normal Maps: An Efficient Mesh-based Approach | Miaohui Wang; Wuyuan Xie; Maolin Cui; | For the first time, this paper presents an efficient approach to address the fundamental problem of surface reconstruction from unconnected normal maps (denoted as "SfN+") using discrete geometry. |
286 | MOR-UAV: A Benchmark Dataset and Baselines for Moving Object Recognition in UAV Videos | Murari Mandal; Lav Kush Kumar; Santosh Kumar Vipparthi; | Therefore, in this paper, we introduce MOR-UAV, a large-scale video dataset for MOR in aerial videos. |
287 | Learning Tuple Compatibility for Conditional Outfit Recommendation | Xuewen Yang; Dongliang Xie; Xin Wang; Jiangbo Yuan; Wanying Ding; Pengyun Yan; | To better define the fashion compatibility and more flexibly meet different needs, we propose a novel problem of learning compatibility among multiple tuples (each consisting of an item and category pair), and recommending fashion items following the category choices from customers. |
288 | Efficient Crowd Counting via Structured Knowledge Transfer | Lingbo Liu; Jiaqi Chen; Hefeng Wu; Tianshui Chen; Guanbin Li; Liang Lin; | To liberate these crowd counting models, we propose a novel Structured Knowledge Transfer (SKT) framework, which fully exploits the structured knowledge of a well-trained teacher network to generate a lightweight but still highly effective student network. |
289 | DeSmoothGAN: Recovering Details of Smoothed Images via Spatial Feature-wise Transformation and Full Attention | Yifei Huang; Chenhui Li; Xiaohu Guo; Jing Liao; Chenxu Zhang; Changbo Wang; | In this work, we propose DeSmoothGAN to utilize both characteristics specifically. |
290 | PatchMatch based Multiview Stereo with Local Quadric Window | Hyewon Song; Jaeseong Park; Suwoong Heo; Jiwoo Kang; Sanghoon Lee; | In this paper, we propose an accurate PatchMatch based multiview stereo matching method with a quadric support window that efficiently captures the surface of a complex structured object. |
291 | Expert Performance in the Examination of Interior Surfaces in an Automobile: Virtual Reality vs. Reality | Alexander Tesch; Ralf Dörner; | In this paper, we evaluate the applicability of VR using head mounted displays (HMDs) in an experiment where we had experts examine the design quality of an interior in VR and compared the results with the examination on a powerwall as well as in reality. |
292 | Uncertainty-based Traffic Accident Anticipation with Spatio-Temporal Relational Learning | Wentao Bao; Qi Yu; Yu Kong; | In this paper, we propose an uncertainty-based accident anticipation model with spatio-temporal relational learning. |
293 | A Tightly-coupled Semantic SLAM System with Visual, Inertial and Surround-view Sensors for Autonomous Indoor Parking | Xuan Shao; Lin Zhang; Tianjun Zhang; Ying Shen; Hongyu Li; Yicong Zhou; | To this end, this paper proposes a novel tightly-coupled semantic SLAM system by integrating Visual, Inertial, and Surround-view sensors, VIS SLAM for short, for autonomous indoor parking. |
294 | Searching Privately by Imperceptible Lying: A Novel Private Hashing Method with Differential Privacy | Yimu Wang; Shiyin Lu; Lijun Zhang; | In this paper, we tackle this valuable yet challenging problem and formulate a task termed as private hashing, which takes into account both searching performance and privacy protection. |
295 | Leverage Social Media for Personalized Stress Detection | Xin Wang; Huijun Zhang; Lei Cao; Ling Feng; | We construct a three-level framework, aiming at personalized stress detection based on social media. |
296 | Arbitrary Style Transfer via Multi-Adaptation Network | Yingying Deng; Fan Tang; Weiming Dong; Wen Sun; Feiyue Huang; Changsheng Xu; | In this paper, we propose the multi-adaptation network which involves two self-adaptation (SA) modules and one co-adaptation (CA) module: the SA modules adaptively disentangle the content and style representations, i.e., the content SA module uses position-wise self-attention to enhance content representation and the style SA module uses channel-wise self-attention to enhance style representation; the CA module rearranges the distribution of style representation based on content representation distribution by calculating the local similarity between the disentangled content and style features in a non-local fashion. |
297 | Dual-view Attention Networks for Single Image Super-Resolution | Jingcai Guo; Shiheng Ma; Jie Zhang; Qihua Zhou; Song Guo; | In this paper, we propose the Dual-view Attention Networks to alleviate these problems for SISR. |
298 | MRI Measurement Matrix Learning via Correlation Reweighting | Zhongnian Li; Tao Zhang; Ruoyu Chen; Daoqiang Zhang; | In this paper, we propose a novel Measurement Matrix Learning via Correlation Reweighting (MML-CR) approach for exploring and solving this problem by optimizing a reweighted model. |
299 | Complementary-View Co-Interest Person Detection | Ruize Han; Jiewen Zhao; Wei Feng; Yiyang Gan; Liang Wan; Song Wang; | In this paper, we study a much more realistic and challenging problem, namely co-interest person~(CIP) detection from multiple temporally-synchronized videos taken from complementary and time-varying views. |
300 | Multimodal Dialogue Systems via Capturing Context-aware Dependencies of Semantic Elements | Weidong He; Zhi Li; Dongcai Lu; Enhong Chen; Tong Xu; Baoxing Huai; Jing Yuan; | To address these issues, we propose a Multimodal diAlogue systems with semanTic Elements, MATE for short. |
301 | EyeShopper: Estimating Shoppers’ Gaze using CCTV Cameras | Carlos Bermejo; Dimitris Chatzopoulos; Pan Hui; | In this paper, we present EyeShopper, an innovative system that tracks the gaze of shoppers when facing away from the camera and provides insights about their behavior in physical stores. |
302 | Exploiting Active Learning in Novel Refractive Error Detection with Smartphones | Eugene Yujun Fu; Zhongqi Yang; Hong Va Leong; Grace Ngai; Chi-wai Do; Lily Chan; | To address these challenges, this paper exploits active learning methods with a set of Convolutional Neural Network features encoding information of human eyes from pre-trained gaze estimation model. |
303 | Price Suggestion for Online Second-hand Items with Texts and Images | Liang Han; Zhaozheng Yin; Zhurong Xia; Minqian Tang; Rong Jin; | This paper presents an intelligent price suggestion system for online second-hand listings based on their uploaded images and text descriptions. |
304 | An Advanced LiDAR Point Cloud Sequence Coding Scheme for Autonomous Driving | Xuebin Sun; Sukai Wang; Miaohui Wang; Shing Shin Cheng; Ming Liu; | Learning from the high efficiency video coding (HEVC) coding framework, we propose an advanced coding scheme for large-scale LiDAR point cloud sequences, in which several techniques have been developed to remove the spatial and temporal redundancy. |
305 | Learning Optimization-based Adversarial Perturbations for Attacking Sequential Recognition Models | Xing Xu; Jiefu Chen; Jinhui Xiao; Zheng Wang; Yang Yang; Heng Tao Shen; | In this paper, we study the adversarial attack on the general and popular DNN structure of CNN+RNN, i.e., the combination of convolutional neural network (CNN) and recurrent neural network (RNN), which has been widely used in various SR tasks. |
306 | Emotions Don’t Lie: An Audio-Visual Deepfake Detection Method using Affective Cues | Trisha Mittal; Uttaran Bhattacharya; Rohan Chandra; Aniket Bera; Dinesh Manocha; | We present a learning-based method for detecting real and fake deepfake multimedia content. |
307 | Deep Disturbance-Disentangled Learning for Facial Expression Recognition | Delian Ruan; Yan Yan; Si Chen; Jing-Hao Xue; Hanzi Wang; | In this paper, we propose a novel Deep Disturbance-disentangled Learning (DDL) method for FER. |
308 | Unsupervised Learning Facial Parameter Regressor for Action Unit Intensity Estimation via Differentiable Renderer | Xinhui Song; Tianyang Shi; Zunlei Feng; Mingli Song; Jackie Lin; Chuanjie Lin; Changjie Fan; Yi Yuan; | In this paper, we present a framework to predict the facial parameters (including identity parameters and AU parameters) based on a bone-driven face model (BDFM) under different views. |
309 | Semi-supervised Multi-modal Emotion Recognition with Cross-Modal Distribution Matching | Jingjun Liang; Ruichen Li; Qin Jin; | In this work, we propose a novel semi-supervised multi-modal emotion recognition model based on cross-modality distribution matching, which leverages abundant unlabeled data to enhance the model training under the assumption that the inner emotional status is consistent at the utterance level across modalities. |
310 | PersonalitySensing: A Multi-View Multi-Task Learning Approach for Personality Detection based on Smartphone Usage | Songcheng Gao; Wenzhong Li; Lynda J. Song; Xiao Zhang; Mingkai Lin; Sanglu Lu; | In this paper, we propose a deep learning approach to infer people’s Big Five personality traits based on smartphone data. |
311 | AU-assisted Graph Attention Convolutional Network for Micro-Expression Recognition | Hong-Xia Xie; Ling Lo; Hong-Han Shuai; Wen-Huang Cheng; | In this paper, we propose a novel micro-expression recognition approach by combining Action Units (AUs) and emotion category labels. |
312 | DFEW: A Large-Scale Database for Recognizing Dynamic Facial Expressions in the Wild | Xingxun Jiang; Yuan Zong; Wenming Zheng; Chuangao Tang; Wanchuang Xia; Cheng Lu; Jiateng Liu; | In this paper, we focus on this challenging but interesting topic and make contributions from three aspects. |
313 | Region of Interest Based Graph Convolution: A Heatmap Regression Approach for Action Unit Detection | Zheng Zhang; Taoyue Wang; Lijun Yin; | In this paper, we first extend the heatmaps to ROI maps, encoding the location of both positive and negative occurred AUs, then employ a well-designed backbone network to regress them. |
314 | IExpressNet: Facial Expression Recognition with Incremental Classes | Junjie Zhu; Bingjun Luo; Sicheng Zhao; Shihui Ying; Xibin Zhao; Yue Gao; | To address these problems, we develop an Incremental Facial Expression Recognition Network (IExpressNet), which can learn a competitive multi-class classifier at any time with a lower requirement of computing resources. |
315 | SST-EmotionNet: Spatial-Spectral-Temporal based Attention 3D Dense Network for EEG Emotion Recognition | Ziyu Jia; Youfang Lin; Xiyang Cai; Haobin Chen; Haijun Gou; Jing Wang; | In this paper, we propose a novel spatial-spectral-temporal based attention 3D dense network, named SST-EmotionNet, for EEG emotion recognition. |
316 | Language Models as Emotional Classifiers for Textual Conversation | Connor T. Heaton; David M. Schwartz; | In this study we present a novel methodology for classifying emotion in a conversation. |
317 | Occluded Facial Expression Recognition with Step-Wise Assistance from Unpaired Non-Occluded Images | Bin Xia; Shangfei Wang; | Considering facial images without occlusions usually provide more information for facial expression recognition compared to occluded facial images, we propose a step-wise learning strategy for occluded facial expression recognition that utilizes unpaired non-occluded images as guidance in the feature and label space. |
318 | Learning from Macro-expression: a Micro-expression Recognition Framework | Bin Xia; Weikang Wang; Shangfei Wang; Enhong Chen; | Since micro-expression and macro-expression share some similarities in facial muscle movements and texture changes, in this paper we propose a micro-expression recognition framework that leverages macro-expression samples as guidance. |
319 | Emotion-Based End-to-End Matching Between Image and Music in Valence-Arousal Space | Sicheng Zhao; Yaxian Li; Xingxu Yao; Weizhi Nie; Pengfei Xu; Jufeng Yang; Kurt Keutzer; | In this paper, we study end-to-end matching between image and music based on emotions in the continuous valence-arousal (VA) space. |
320 | Exploiting Multi-Emotion Relations at Feature and Label Levels for Emotion Tagging | Zhiwei Xu; Shangfei Wang; Can Wang; | In this paper, we propose a novel emotion tagging method, that thoroughly explores emotion relations from both the feature and label levels. |
321 | Uncertainty-aware Cross-dataset Facial Expression Recognition via Regularized Conditional Alignment | Linyi Zhou; Xijian Fan; Yingjie Ma; Tardi Tjahjadi; Qiaolin Ye; | To mitigate this problem, this paper proposes an unsupervised domain adaptation method via regularized conditional alignment for FER, which adversarially reduces domain- and class-wise discrepancies while explicitly dealing with uncertainties within and across domains. |
322 | Fonts Like This but Happier: A New Way to Discover Fonts | Tugba Kulahcioglu; Gerard de Melo; | In this study, we propose a new multimodal font discovery method in which users provide a reference font together with the changes they wish to obtain in order to get closer to their ideal font. |
323 | Adaptive Multimodal Fusion for Facial Action Units Recognition | Huiyuan Yang; Taoyue Wang; Lijun Yin; | In this paper, we propose a novel Adaptive Multimodal Fusion method (AMF) for AU detection, which learns to select the most relevant feature representations from different modalities by a re-sampling procedure conditioned on a feature scoring module. |
324 | Exploiting Self-Supervised and Semi-Supervised Learning for Facial Landmark Tracking with Unlabeled Data | Shi Yin; Shangfei Wang; Xiaoping Chen; Enhong Chen; | To relieve the burden of manual annotations, we propose a novel facial landmark tracking method that makes full use of unlabeled facial videos by exploiting both self-supervised and semi-supervised learning mechanisms. |
325 | Cross Corpus Physiological-based Emotion Recognition Using a Learnable Visual Semantic Graph Convolutional Network | Woan-Shiuan Chien; Hao-Chun Yang; Chi-Chun Lee; | In this study, we aim to develop a network learning strategy for robust cross-corpus emotion recognition using physiological features jointly with affective video content. |
326 | Few-Shot Ensemble Learning for Video Classification with SlowFast Memory Networks | Mengshi Qi; Jie Qin; Xiantong Zhen; Di Huang; Yi Yang; Jiebo Luo; | In this paper, we address few-shot video classification by learning an ensemble of SlowFast networks augmented with memory units. |
327 | Look Through Masks: Towards Masked Face Recognition with De-Occlusion Distillation | Chenyu Li; Shiming Ge; Daichi Zhang; Jia Li; | Inspired by recent progress on amodal perception, we propose to migrate the mechanism of amodal completion for the task of masked face recognition with an end-to-end de-occlusion distillation framework, which consists of two modules. |
328 | Privacy-sensitive Objects Pixelation for Live Video Streaming | Jizhe Zhou; Chi-Man Pun; Yu Tong; | To cope with the inevitable but impacting detection issue, we propose a novel Privacy-sensitive Objects Pixelation (PsOP) framework for automatic personal privacy filtering during live video streaming. |
329 | Deep Local Binary Coding for Person Re-Identification by Delving into the Details | Jiaxin Chen; Jie Qin; Yichao Yan; Lei Huang; Li Liu; Fan Zhu; Ling Shao; | In this work, we present a novel binary representation learning framework for efficient person ReID, namely Deep Local Binary Coding (DLBC). |
330 | March on Data Imperfections: Domain Division and Domain Generalization for Semantic Segmentation | Hai Xu; Hongtao Xie; Zheng-Jun Zha; Sun-ao Liu; Yongdong Zhang; | In contrast to previous works, we present a novel model-agnostic training optimization algorithm which has two prominent components: Domain Division and Domain Generalization. |
331 | Gait Recognition with Multiple-Temporal-Scale 3D Convolutional Neural Network | Beibei Lin; Shunli Zhang; Feng Bao; | To address the above issues, we propose a novel multiple-temporal-scale gait recognition framework which integrates the temporal information in multiple temporal scales, making use of both the frame and interval fusion information. |
332 | SRHEN: Stepwise-Refining Homography Estimation Network via Parsing Geometric Correspondences in Deep Latent Space | Yi Li; Wenjie Pei; Zhenyu He; | In this paper, we propose to parse the geometric correspondences between related images explicitly to bridge the gap between deep appearance features and the homography. |
333 | Tactile Sketch Saliency | Jianbo Jiao; Ying Cao; Manfred Lau; Rynson Lau; | In this paper, we aim to understand the functionality of 2D sketches by predicting how humans would interact with the objects depicted by sketches in real life. |
334 | Towards Clustering-friendly Representations: Subspace Clustering via Graph Filtering | Zhengrui Ma; Zhao Kang; Guangchun Luo; Ling Tian; Wenyu Chen; | To recover the "clustering-friendly" representation and facilitate the subsequent clustering, we propose a graph filtering approach by which a smooth representation is achieved. |
335 | One-shot Scene Graph Generation | Yuyu Guo; Jingkuan Song; Lianli Gao; Heng Tao Shen; | In this paper, we propose Multiple Structured Knowledge (Relational Knowledge and Commonsense Knowledge) for the one-shot scene graph generation task. |
336 | Cross-Granularity Learning for Multi-Domain Image-to-Image Translation | Huiyuan Fu; Ting Yu; Xin Wang; Huadong Ma; | To ensure that important instances are more realistically translated, we propose a cross-granularity learning model for multi-domain image-to-image translation. |
337 | Enhancing Self-supervised Monocular Depth Estimation via Incorporating Robust Constraints | Rui Li; Xiantuo He; Yu Zhu; Xianjun Li; Jinqiu Sun; Yanning Zhang; | In this paper, we address this issue by enhancing the robustness of the self-supervised paradigm using a set of image-based and geometry-based constraints. |
338 | A Novel Object Re-Track Framework for 3D Point Clouds | Tuo Feng; Licheng Jiao; Hao Zhu; Long Sun; | In this paper, we propose a two-stage 3D object re-track framework that directly takes point clouds as input, without using the ground truth as the reference box. |
339 | Video Relation Detection via Multiple Hypothesis Association | Zixuan Su; Xindi Shang; Jingjing Chen; Yu-Gang Jiang; Zhiyong Qiu; Tat-Seng Chua; | In this paper, we propose a novel relation association method called Multiple Hypothesis Association (MHA). |
340 | HOT-Net: Non-Autoregressive Transformer for 3D Hand-Object Pose Estimation | Lin Huang; Jianchao Tan; Jingjing Meng; Ji Liu; Junsong Yuan; | To overcome these issues, we propose to fully utilize the structural correlations among hand joints and object corners in order to obtain more reliable poses. |
341 | Multi-Features Fusion and Decomposition for Age-Invariant Face Recognition | Lixuan Meng; Chenggang Yan; Jun Li; Jian Yin; Wu Liu; Hongtao Xie; Liang Li; | To address this issue, in this work we propose a novel Multi-Features Fusion and Decomposition (MFFD) framework to learn more discriminative feature representations and alleviate the intra-class variations for AIFR. |
342 | Part-Aware Interactive Learning for Scene Graph Generation | Hongshuo Tian; Ning Xu; An-An Liu; Yongdong Zhang; | In this paper, we propose a part-aware interactive learning method, which is divided into the intra-object and inter-object scenarios. |
343 | Retrieval Guided Unsupervised Multi-domain Image to Image Translation | Raul Gomez; Yahui Liu; Marco De Nadai; Dimosthenis Karatzas; Bruno Lepri; Nicu Sebe; | In this paper we propose the use of an image retrieval system to assist the image-to-image translation task. |
344 | GangSweep: Sweep out Neural Backdoors by GAN | Liuwan Zhu; Rui Ning; Cong Wang; Chunsheng Xin; Hongyi Wu; | This work proposes GangSweep, a new backdoor detection framework that leverages the super reconstructive power of Generative Adversarial Networks (GAN) to detect and "sweep out" neural backdoors. |
345 | Iterative Back Modification for Faster Image Captioning | Zhengcong Fei; | In this paper, we propose a non-autoregressive approach for faster image caption generation. |
346 | VIMES: A Wearable Memory Assistance System for Automatic Information Retrieval | Carlos Bermejo; Tristan Braud; Ji Yang; Shayan Mirjafari; Bowen Shi; Yu Xiao; Pan Hui; | In this work, we propose VIMES, an augmented reality-based memory assistance system that helps recall declarative memory, such as whom the user meets and what they chat about. |
347 | Neutral Face Game Character Auto-Creation via PokerFace-GAN | Tianyang Shi; Zhengxia Zou; Xinhui Song; Zheng Song; Changjian Gu; Changjie Fan; Yi Yuan; | In this paper, considering the above problems, we propose a novel method named "PokerFace-GAN" for neutral face game character auto-creation. |
348 | Gray2ColorNet: Transfer More Colors from Reference Image | Peng Lu; Jinbei Yu; Xujun Peng; Zhaoran Zhao; Xiaojie Wang; | Thus, an end-to-end colorization network Gray2ColorNet is proposed in this work, where an attention gating mechanism based color fusion network is designed to accomplish the colorization tasks. |
349 | Crossing You in Style: Cross-modal Style Transfer from Music to Visual Arts | Cheng-Che Lee; Wan-Yi Lin; Yen-Ting Shih; Pei-Yi (Patricia) Kuo; Li Su; | Assuming that musical features can be properly mapped to visual contents through semantic links between the two domains, we solve the music-to-visual style transfer problem in two steps: music visualization and style transfer. |
350 | Modeling Caricature Expressions by 3D Blendshape and Dynamic Texture | Keyu Chen; Jianmin Zheng; Jianfei Cai; Juyong Zhang; | This paper presents a solution to the problem, with an emphasis on enhancing the ability to create desired expressions and meanwhile preserve the identity exaggeration style of the caricature, which imposes challenges due to the complicated nature of caricatures. |
351 | SketchMan: Learning to Create Professional Sketches | Jia Li; Nan Gao; Tong Shen; Wei Zhang; Tao Mei; Hui Ren; | We propose a new and challenging task, sketch enhancement (SE), defined in an ill-posed space, i.e., enhancing a non-professional sketch (NPS) to a professional sketch (PS), which is a creative generation task different from sketch abstraction, sketch completion and sketch variation. |
352 | Anisotropic Stroke Control for Multiple Artists Style Transfer | Xuanhong Chen; Xirui Yan; Naiyuan Liu; Ting Qiu; Bingbing Ni; | To circumvent this issue, we propose a Stroke Control Multi-Artist Style Transfer framework. |
353 | A Multi-update Deep Reinforcement Learning Algorithm for Edge Computing Service Offloading | Hao Hao; Changqiao Xu; Lujie Zhong; Gabriel-Miro Muntean; | Instead, this paper proposes an innovative deep reinforcement learning method to solve it. |
354 | Identity-Aware Attribute Recognition via Real-Time Distributed Inference in Mobile Edge Clouds | Zichuan Xu; Jiangkai Wu; Qiufen Xia; Pan Zhou; Jiankang Ren; Huizhi Liang; | In this paper, we design novel models for pedestrian attribute recognition with re-ID in an MEC-enabled camera monitoring system. |
355 | Deep Unsupervised Hybrid-similarity Hadamard Hashing | Wanqian Zhang; Dayan Wu; Yu Zhou; Bo Li; Weiping Wang; Dan Meng; | In this paper, we propose a simple yet effective unsupervised hashing method, dubbed Deep Unsupervised Hybrid-similarity Hadamard Hashing (DU3H), which tackles these issues in an end-to-end deep hashing framework. |
356 | Incomplete Cross-modal Retrieval with Dual-Aligned Variational Autoencoders | Mengmeng Jing; Jingjing Li; Lei Zhu; Ke Lu; Yang Yang; Zi Huang; | In this paper, we propose Dual-Aligned Variational Autoencoders (DAVAE) to address the incomplete CMR problem. |
357 | MRS-Net: Multi-Scale Recurrent Scalable Network for Face Quality Enhancement of Compressed Videos | Tie Liu; Mai Xu; Shengxi Li; Rui Ding; Huaida Liu; | Motivated by scalable video coding, we propose a multi-scale recurrent scalable network (MRS-Net) to enhance the quality of multi-scale faces in compressed videos. |
358 | Panoptic Image Annotation with a Collaborative Assistant | Jasper R.R. Uijlings; Mykhaylo Andriluka; Vittorio Ferrari; | This paper aims to reduce the time to annotate images for panoptic segmentation, which requires annotating segmentation masks and class labels for all object instances and stuff regions. |
359 | Blind Natural Video Quality Prediction via Statistical Temporal Features and Deep Spatial Features | Jari Korhonen; Yicheng Su; Junyong You; | In this study, we combine the hand-crafted statistical temporal features used in a state-of-the-art video quality model and spatial features obtained from a convolutional neural network trained for image quality assessment via transfer learning. |
360 | Aesthetic-Aware Image Style Transfer | Zhiyuan Hu; Jia Jia; Bei Liu; Yaohua Bu; Jianlong Fu; | In this paper, we propose a novel problem called Aesthetic-Aware Image Style Transfer task, which aims to transfer colour and texture separately and independently to manipulate the aesthetic effect of an image. |
361 | Building Movie Map – A Tool for Exploring Areas in a City – and its Evaluations | Naoki Sugimoto; Yoshihito Ebine; Kiyoharu Aizawa; | We propose a new Movie Map, which will enable users to explore a given city area using omnidirectional videos. |
362 | A Probabilistic Graphical Model for Analyzing the Subjective Visual Quality Assessment Data from Crowdsourcing | Jing Li; Suiyi Ling; Junle Wang; Patrick Le Callet; | In this paper, we propose a probabilistic graphical annotation model to infer the underlying ground truth and discover the annotator’s behavior. |
363 | DroidCloud: Scalable High Density AndroidTM Cloud Rendering | Linsheng Li; Bin Yang; Cathy Bao; Shuo Liu; Randy Xu; Yong Yao; Mohammad R. Haghighat; Jerry W. Hu; Shoumeng Yan; Zhengwei Qi; | This paper presents DroidCloud, the first open-source Android cloud rendering system (Android is a trademark of Google LLC). |
364 | Interpretable Embedding for Ad-Hoc Video Search | Jiaxin Wu; Chong-Wah Ngo; | This paper integrates feature embedding and concept interpretation into a neural network for unified dual-task learning. |
365 | Joint Attribute Manipulation and Modality Alignment Learning for Composing Text and Image to Image Retrieval | Feifei Zhang; Mingliang Xu; Qirong Mao; Changsheng Xu; | In this paper, we devote to an emerging task in cross-modal retrieval, Composing Text and Image to Image Retrieval (CTI-IR), which aims at retrieving images relevant to a query image with text describing desired modifications to the query image. |
366 | Semi-supervised Online Multi-Task Metric Learning for Visual Recognition and Retrieval | Yangxi Li; Han Hu; Jin Li; Yong Luo; Yonggang Wen; | In this paper, we propose a novel semi-supervised online multi-task DML method termed SOMTML, which enables the models describing different tasks to help each other during the metric learning procedure, thus improving their respective performance. |
367 | Supervised Hierarchical Deep Hashing for Cross-Modal Retrieval | Yu-Wei Zhan; Xin Luo; Yongxin Wang; Xin-Shun Xu; | In this paper, we propose an effective cross-modal hashing method, named Supervised Hierarchical Deep Cross-modal Hashing, SHDCH for short, to learn hash codes by explicitly delving into the hierarchical labels. |
368 | Multi-graph Convolutional Network for Unsupervised 3D Shape Retrieval | Weizhi Nie; Yue Zhao; An-An Liu; Zan Gao; Yuting Su; | To solve these problems, we propose a novel multi-graph network (MGN) for unsupervised 3D shape retrieval, which utilizes the correlations among modalities and structural similarity between two models to guide the shape representation learning process without category information. |
369 | Bottom-Up Foreground-Aware Feature Fusion for Person Search | Wenjie Yang; Dangwei Li; Xiaotang Chen; Kaiqi Huang; | In this work, we propose a subnet to fuse the bounding box features that pooled from multiple ConvNet stages in a bottom-up manner, termed bottom-up fusion (BUF) network. |
370 | Rethinking Generative Zero-Shot Learning: An Ensemble Learning Perspective for Recognising Visual Patches | Zhi Chen; Sen Wang; Jingjing Li; Zi Huang; | To address these issues, we propose a novel framework called multi-patch generative adversarial nets (MPGAN) that synthesises local patch features and labels unseen classes with a novel weighted voting strategy. |
371 | Surpassing Real-World Source Training Data: Random 3D Characters for Generalizable Person Re-Identification | Yanan Wang; Shengcai Liao; Ling Shao; | To address this, we propose to automatically synthesize a large-scale person re-identification dataset following a set-up similar to real surveillance but with virtual environments, and then use the synthesized person images to train a generalizable person re-identification model. |
372 | Zero-Shot Multi-View Indoor Localization via Graph Location Networks | Meng-Jiun Chiou; Zhenguang Liu; Yifang Yin; An-An Liu; Roger Zimmermann; | In this paper, we propose a novel neural network based architecture Graph Location Networks (GLN) to perform infrastructure-free, multi-view image based indoor localization. |
373 | Hierarchical Gumbel Attention Network for Text-based Person Search | Kecheng Zheng; Wu Liu; Jiawei Liu; Zheng-Jun Zha; Tao Mei; | In this work, we propose a novel hierarchical Gumbel attention network for text-based person search via Gumbel top-k re-parameterization algorithm. |
374 | Dual Context-Aware Refinement Network for Person Search | Jiawei Liu; Zheng-Jun Zha; Richang Hong; Meng Wang; Yongdong Zhang; | In this work, we propose a novel dual context-aware refinement network (DCRNet) for person search, which jointly explores two kinds of contexts including intra-instance context and inter-instance context to learn discriminative representation. |
375 | Heterogeneous Fusion of Semantic and Collaborative Information for Visually-Aware Food Recommendation | Lei Meng; Fuli Feng; Xiangnan He; Xiaoyan Gao; Tat-Seng Chua; | To address this problem, this paper presents a heterogeneous multi-task learning framework, termed privileged-channel infused network (PiNet). |
376 | How to Learn Item Representation for Cold-Start Multimedia Recommendation? | Xiaoyu Du; Xiang Wang; Xiangnan He; Zechao Li; Jinhui Tang; Tat-Seng Chua; | In this work, we pay special attention to cold items in multimedia recommender training. |
377 | Personalized Item Recommendation for Second-hand Trading Platform | Xuzheng Yu; Tian Gan; Yinwei Wei; Zhiyong Cheng; Liqiang Nie; | Accordingly, we propose a method to simultaneously learn representations of items and users from coarse-grained and fine-grained features, and design a multi-task learning strategy to address the issue of data sparsity. |
378 | What Aspect Do You Like: Multi-scale Time-aware User Interest Modeling for Micro-video Recommendation | Hao Jiang; Wenjie Wang; Yinwei Wei; Zan Gao; Yinglong Wang; Liqiang Nie; | In view of this, we propose an end-to-end Multi-scale Time-aware user Interest modeling Network (MTIN). |
379 | Domain-Specific Alignment Network for Multi-Domain Image-Based 3D Object Retrieval | Yuting Su; Yuqian Li; Dan Song; Zhendong Mao; Xuanya Li; An-An Liu; | To address these issues, we propose an unsupervised Domain-Specific Alignment Network (DSAN) for multi-domain image-based 3D object retrieval. |
380 | Multi-modal Attentive Graph Pooling Model for Community Question Answer Matching | Jun Hu; Quan Fang; Shengsheng Qian; Changsheng Xu; | In this paper, we propose a multi-modal attentive graph pooling approach (MMAGP) to model the multi-modal content of questions and answers with GNNs in a unified framework, which explores the multi-modal and redundant properties of CQA systems. |
381 | Task-distribution-aware Meta-learning for Cold-start CTR Prediction | Tianwei Cao; Qianqian Xu; Zhiyong Yang; Qingming Huang; | In this paper, we propose an adaptive loss that ensures the consistency between the task weight and difficulty. |
382 | CFVMNet: A Multi-branch Network for Vehicle Re-identification Based on Common Field of View | Ziruo Sun; Xiushan Nie; Xiaoming Xi; Yilong Yin; | In this study, we propose a multi-branch network based on common field of view (CFVMNet) to address these issues. |
383 | Exploiting Heterogeneous Artist and Listener Preference Graph for Music Genre Classification | Chunyuan Yuan; Qianwen Ma; Junyang Chen; Wei Zhou; Xiaodan Zhang; Xuehai Tang; Jizhong Han; Songlin Hu; | In this paper, we make use of both artist-music and listener-music preference relations to construct a heterogeneous preference graph. |
384 | Graph-Refined Convolutional Network for Multimedia Recommendation with Implicit Feedback | Yinwei Wei; Xiang Wang; Liqiang Nie; Xiangnan He; Tat-Seng Chua; | In this work, we focus on adaptively refining the structure of interaction graph to discover and prune potential false-positive edges. |
385 | Visually Precise Query | Riddhiman Dasgupta; Francis Tom; Sudhir Kumar; Mithun Das Gupta; Yokesh Kumar; Badri N. Patro; Vinay P. Namboodiri; | In this paper we introduce the task of VPQ generation, which takes a product image and its title as input and provides a word-level extractive summary of the title, containing a list of salient attributes, which can then be used as a query to search for similar products. |
386 | All-in-depth via Cross-baseline Light Field Camera | Dingjian Jin; Anke Zhang; Jiamin Wu; Gaochang Wu; Haoqian Wang; Lu Fang; | Aiming for all-in-depth solution, we propose a cross-baseline LF camera using a commercial LF camera and a monocular camera, which naturally form a ‘stereo camera’ enabling compensated baseline for LF camera. |
387 | Revealing True Identity: Detecting Makeup Attacks in Face-based Biometric Systems | Mohammad Amin Arab; Puria Azadi Moghadam; Mohamed Hussein; Wael Abd-Almageed; Mohamed Hefeeda; | In this paper, we propose a novel solution to address makeup attacks, which are the hardest to detect in such systems because makeup can substantially alter the facial features of a person, including making them appear older/younger by adding/hiding wrinkles, modifying the shape of eyebrows, beard, and moustache, and changing the color of lips and cheeks. |
388 | Relevance-Based Compression of Cataract Surgery Videos Using Convolutional Neural Networks | Negin Ghamsarian; Hadi Amirpourazarian; Christian Timmerer; Mario Taschwer; Klaus Schöffmann; | To address this problem, we propose a relevance-based compression technique consisting of two modules: (i) relevance detection, which uses neural networks for semantic segmentation and classification of the videos to detect relevant spatio-temporal information, and (ii) content-adaptive compression, which restricts the amount of distortion applied to the relevant content while allocating less bitrate to irrelevant content. |
389 | A Modular Approach for Synchronized Wireless Multimodal Multisensor Data Acquisition in Highly Dynamic Social Settings | Chirag Raman; Stephanie Tan; Hayley Hung; | In this work, we propose a modular and cost-effective wireless approach for synchronized multisensor data acquisition of social human behavior. |
390 | SphericRTC: A System for Content-Adaptive Real-Time 360-Degree Video Communication | Shuoqian Wang; Xiaoyang Zhang; Mengbai Xiao; Kenneth Chiu; Yao Liu; | We present the SphericRTC system for real-time 360-degree video communication. |
391 | Single Image Shape-from-Silhouettes | Yawen Lu; Yuxing Wang; Guoyu Lu; | In this work, we present a novel shape-from-silhouette method based on just a single image, which is an end-to-end learning framework relying on view synthesis and shape-from-silhouette methodology to reconstruct a 3D shape. |
392 | VVSec: Securing Volumetric Video Streaming via Benign Use of Adversarial Perturbation | Zhongze Tang; Xianglong Feng; Yi Xie; Huy Phan; Tian Guo; Bo Yuan; Sheng Wei; | We for the first time identify an effective threat model that extracts 3D face models from volumetric videos and compromises face ID-based authentication. To defend against such attacks, we develop a novel volumetric video security mechanism, namely VVSec, which makes benign use of adversarial perturbations to obfuscate the security- and privacy-sensitive 3D face models. |
393 | Bitrate Requirements of Non-Panoramic VR Remote Rendering | Viktor Kelkkanen; Markus Fiedler; David Lindero; | This paper shows the impact of bitrate settings on objective quality measures when streaming non-panoramic remote-rendered Virtual Reality (VR) images. |
394 | Kalman Filter-based Head Motion Prediction for Cloud-based Mixed Reality | Serhan Gül; Sebastian Bosse; Dimitri Podborski; Thomas Schierl; Cornelius Hellge; | In this paper, we design a Kalman filter for head motion prediction in our cloud-based volumetric video streaming system. |
395 | Perception-Lossless Codec of Haptic Data with Low Delay | Chaoyang Zeng; Tiesong Zhao; Qian Liu; Yiwen Xu; Kai Wang; | In this paper, we propose an end-to-end haptic codec for high-efficiency, low-delay and perception-lossless compression of kinesthetic signal, one of two major components of haptic signals. |
396 | Neural3D: Light-weight Neural Portrait Scanning via Context-aware Correspondence Learning | Xin Suo; Minye Wu; Yanshun Zhang; Yingliang Zhang; Lan Xu; Qiang Hu; Jingyi Yu; | Aiming at light-weight and realistic human portrait reconstruction, in this paper we propose Neural3D: a novel neural human portrait scanning system using only a single RGB camera. |
397 | Presence, Embodied Interaction and Motivation: Distinct Learning Phenomena in an Immersive Virtual Environment | Jack Ratcliffe; Laurissa Tokarchuk; | This paper describes an experiment designed to interrogate these approaches, and provides evidence that embodied controls and presence encourage learning in immersive virtual environments, but for distinct,non-interacting reasons, which are also not explained by motivational benefits. |
398 | User Centered Adaptive Streaming of Dynamic Point Clouds with Low Complexity Tiling | Shishir Subramanyam; Irene Viola; Alan Hanjalic; Pablo Cesar; | In this paper, we present a low-complexity tiling approach to perform adaptive streaming of point cloud content. |
399 | Leveraging QoE Heterogeneity for Large-Scale Livecast Scheduling | Rui-Xiao Zhang; Ming Ma; Tianchi Huang; Hanyu Li; Jiangchuan Liu; Lifeng Sun; | In this paper, we conduct measurement studies over large-scale data provided by a top livecast platform in China. |
400 | Towards Viewport-dependent 6DoF 360 Video Tiled Streaming for Virtual Reality Systems | Jong-Beom Jeong; Soonbin Lee; Il-Woong Ryu; Tuan Thanh Le; Eun-Seok Ryu; | Therefore, this paper proposes a viewport-dependent high-efficiency video coding (HEVC)-compliant tiled streaming system built on the test model for immersive video (TMIV), the MPEG-Immersive multiview compression reference software. |
401 | Low-latency FoV-adaptive Coding and Streaming for Interactive 360° Video Streaming | Yixiang Mao; Liyang Sun; Yong Liu; Yao Wang; | This work focuses on developing low-latency and FoV-adaptive coding and streaming strategies for interactive 360° video streaming. |
402 | Towards Modality Transferable Visual Information Representation with Optimal Model Compression | Rongqun Lin; Linwei Zhu; Shiqi Wang; Sam Kwong; | In this paper, we propose a new scheme for visual signal representation that leverages the philosophy of transferable modality. |
403 | AdaP-360: User-Adaptive Area-of-Focus Projections for Bandwidth-Efficient 360-Degree Video Streaming | Chao Zhou; Shuoqian Wang; Mengbai Xiao; Sheng Wei; Yao Liu; | In this work, we motivate a user-adaptive approach to address inefficiencies in 360-degree streaming through an analysis of user-viewing traces. |
404 | Tile Rate Allocation for 360-Degree Tiled Adaptive Video Streaming | Praveen Kumar Yadav; Wei Tsang Ooi; | In this paper, we model the tile rate allocation problem as a multiclass knapsack problem with a dynamic profit function that is a function of the FoV and the buffer occupancy. |
405 | Lab2Pix: Label-Adaptive Generative Adversarial Network for Unsupervised Image Synthesis | Lianli Gao; Junchen Zhu; Jingkuan Song; Feng Zheng; Heng Tao Shen; | Therefore, we propose an unsupervised framework named Lab2Pix to adaptively synthesize images from labels by elegantly considering the particular properties of the label-to-image synthesis task. |
406 | Deep Multimodal Neural Architecture Search | Zhou Yu; Yuhao Cui; Jun Yu; Meng Wang; Dacheng Tao; Qi Tian; | In this paper, we devise a generalized deep multimodal neural architecture search (MMnas) framework for various multimodal learning tasks. |
407 | DIMC-net: Deep Incomplete Multi-view Clustering Network | Jie Wen; Zheng Zhang; Zhao Zhang; Zhihao Wu; Lunke Fei; Yong Xu; Bob Zhang; | In this paper, a new deep incomplete multi-view clustering network, called DIMC-net, is proposed to address the challenge of multi-view clustering on missing views. |
408 | Cross-domain Cross-modal Food Transfer | Bin Zhu; Chong-Wah Ngo; Jing-jing Chen; | This paper addresses the challenge of resource scarcity in the scenario that only partial data instead of a complete view of data is accessible for model transfer. |
409 | Texture Semantically Aligned with Visibility-aware for Partial Person Re-identification | Lishuai Gao; Hua Zhang; Zan Gao; Weili Guan; Zhiyong Cheng; Meng Wang; | In this work, we propose a novel texture semantic alignment (TSA) approach with the visibility-aware for partial person ReID task where the occlusion issue and changes in poses are simultaneously explored in an end-to-end unified framework. |
410 | KTN: Knowledge Transfer Network for Multi-person DensePose Estimation | Xuanhan Wang; Lianli Gao; Jingkuan Song; Heng Tao Shen; | In this paper, we address the multi-person densepose estimation problem, which aims at learning dense correspondences between 2D pixels of human body and 3D surface. |
411 | Activity-driven Weakly-Supervised Spatio-Temporal Grounding from Untrimmed Videos | Junwen Chen; Wentao Bao; Yu Kong; | In this paper, we study the problem of weakly-supervised spatio-temporal grounding from raw untrimmed video streams. |
412 | Modeling Temporal Concept Receptive Field Dynamically for Untrimmed Video Analysis | Zhaobo Qi; Shuhui Wang; Chi Su; Li Su; Weigang Zhang; Qingming Huang; | Accordingly, we introduce temporal dynamic convolution (TDC) to give stronger flexibility to concept-based event analytics. |
413 | Relational Graph Learning for Grounded Video Description Generation | Wenqiao Zhang; Xin Eric Wang; Siliang Tang; Haizhou Shi; Haochen Shi; Jun Xiao; Yueting Zhuang; William Yang Wang; | To tackle the above limitations, we design a novel relational graph learning framework for GVD, in which a language-refined scene graph representation is designed to explore fine-grained visual concepts. |
414 | Finding Achilles’ Heel: Adversarial Attack on Multi-modal Action Recognition | Deepak Kumar; Chetan Kumar; Chun Wei Seah; Siyu Xia; Ming Shao; | Unfortunately, frame selection is usually computationally expensive at run-time, and adding noise to all frames is also unrealistic. In this paper, we present a novel yet efficient approach to address this issue. |
415 | Online Multi-view Subspace Learning with Mixed Noise | Jinxing Li; Hongwei Yong; Feng Wu; Mu Li; | To tackle these problems, a novel online multi-view subspace learning method is proposed in this paper. |
416 | LSOTB-TIR: A Large-Scale High-Diversity Thermal Infrared Object Tracking Benchmark | Qiao Liu; Xin Li; Zhenyu He; Chenglong Li; Jun Li; Zikun Zhou; Di Yuan; Jing Li; Kai Yang; Nana Fan; Feng Zheng; | In this paper, we present a Large-Scale and high-diversity general Thermal InfraRed (TIR) Object Tracking Benchmark, called LSOTB-TIR, which consists of an evaluation dataset and a training dataset with a total of 1,400 TIR sequences and more than 600K frames. |
417 | Towards More Explainability: Concept Knowledge Mining Network for Event Recognition | Zhaobo Qi; Shuhui Wang; Chi Su; Li Su; Qingming Huang; Qi Tian; | To address the above issues, we propose a concept knowledge mining network (CKMN) for event recognition. |
418 | Simultaneous Semantic Alignment Network for Heterogeneous Domain Adaptation | Shuang Li; Binhui Xie; Jiashu Wu; Ying Zhao; Chi Harold Liu; Zhengming Ding; | In this paper, we propose a Simultaneous Semantic Alignment Network (SSAN) to simultaneously exploit correlations among categories and align the centroids for each category across domains. |
419 | Diverter-Guider Recurrent Network for Diverse Poems Generation from Image | Liang Li; Shijie Yang; Li Su; Shuhui Wang; Chenggang Yan; Zheng-jun Zha; Qingming Huang; | This paper proposes the paradigm of generating multiple poems from one image, which is closer to human poetizing but more challenging. |
420 | Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning | Ying Cheng; Ruize Wang; Zhihao Pan; Rui Feng; Yuejie Zhang; | In this paper, we propose a novel self-supervised framework with co-attention mechanism to learn generic cross-modal representations from unlabelled videos in the wild, and further benefit downstream tasks. |
421 | Cross-Modal Relation-Aware Networks for Audio-Visual Event Localization | Haoming Xu; Runhao Zeng; Qingyao Wu; Mingkui Tan; Chuang Gan; | Motivated by these, in this paper, we propose a relation-aware network to leverage both audio and visual information for accurate event localization. |
422 | Learning Deep Multimodal Feature Representation with Asymmetric Multi-layer Fusion | Yikai Wang; Fuchun Sun; Ming Lu; Anbang Yao; | We propose a compact and effective framework to fuse multimodal features at multiple layers in a single network. |
423 | Look, Listen and Infer | Ruijian Jia; Xinsheng Wang; Shanmin Pang; Jihua Zhu; Jianru Xue; | In this work, for the first time, a Look, Listen and Infer Network (LLINet) is proposed to learn a zero-shot model that can infer the relations of visual scenes and sounds from novel categories that have never appeared before. |
424 | DCNet: Dense Correspondence Neural Network for 6DoF Object Pose Estimation in Occluded Scenes | Zhi Chen; Wei Yang; Zhenbo Xu; Xike Xie; Liusheng Huang; | In this work, we propose DCNet, an end-to-end framework for estimating 6DoF object poses. |
425 | Transferrable Referring Expression Grounding with Concept Transfer and Context Inheritance | Xuejing Liu; Liang Li; Shuhui Wang; Zheng-Jun Zha; Dechao Meng; Qingming Huang; | In this paper, we explore REG in a new scenario, where the REG model can ground novel objects out of REG training data. |
426 | Deep Multi-modality Soft-decoding of Very Low Bit-rate Face Videos | Yanhui Guo; Xi Zhang; Xiaolin Wu; | We propose a novel deep multi-modality neural network for restoring very low bit rate videos of talking heads. |
427 | Multi-modal Multi-relational Feature Aggregation Network for Medical Knowledge Representation Learning | Yingying Zhang; Quan Fang; Shengsheng Qian; Changsheng Xu; | In this paper, we propose a Multi-modal Multi-Relational Feature Aggregation Network (MMRFAN) for medical knowledge representation learning. |
428 | Photo Stream Question Answer | Wenqiao Zhang; Siliang Tang; Yanpeng Cao; Jun Xiao; Shiliang Pu; Fei Wu; Yueting Zhuang; | In this paper, we present a new visual question answering (VQA) task — Photo Stream QA, which aims to answer the open-ended questions about a narrative photo stream. |
429 | Generalized Zero-shot Learning with Multi-source Semantic Embeddings for Scene Recognition | Xinhang Song; Haitao Zeng; Sixian Zhang; Luis Herranz; Shuqiang Jiang; | In this paper we focus on zero-shot scene recognition, a more challenging setting with hundreds of categories where their differences can be subtle and often localized in certain objects or regions. |
430 | A Unified Framework for Detecting Audio Adversarial Examples | Xia Du; Chi-Man Pun; Zheng Zhang; | In this paper, we propose a unified adversarial detection framework for detecting adaptive audio adversarial examples, which combines noise padding with sound reverberation. |
431 | Emerging Topic Detection on the Meta-data of Images from Fashion Social Media | Kunihiro Miyazaki; Takayuki Uchiba; Scarlett Young; Yuichi Sasaki; Kenji Tanaka; | Therefore, in this research, we propose a novel framework for capturing changes in people’s tastes in terms of coordination rather than individual items. |
432 | Deep Concept-wise Temporal Convolutional Networks for Action Localization | Xin Li; Tianwei Lin; Xiao Liu; Wangmeng Zuo; Chao Li; Xiang Long; Dongliang He; Fu Li; Shilei Wen; Chuang Gan; | In this paper, we empirically find that stacking more conventional temporal convolution layers actually deteriorates action classification performance, possibly because all channels of the 1D feature map, which are generally highly abstract and can be regarded as latent concepts, are excessively recombined in temporal convolution. |
433 | Who You Are Decides How You Tell | Shuang Wu; Shaojing Fan; Zhiqi Shen; Mohan Kankanhalli; Anthony K.H. Tung; | In this paper, we focus on human-centered automatic image captioning. |
434 | Query Twice: Dual Mixture Attention Meta Learning for Video Summarization | Junyan Wang; Yang Bai; Yang Long; Bingzhang Hu; Zhenhua Chai; Yu Guan; Xiaolin Wei; | In this paper, we propose a novel framework named Dual Mixture Attention (DMASum) with Meta Learning for video summarization that tackles the softmax bottleneck problem: the Mixture of Attention layer (MoA) effectively increases model capacity by employing self-query attention twice to capture second-order changes in addition to the initial query-key attention, and a novel Single Frame Meta Learning rule is then introduced to achieve better generalization to small datasets with limited training sources. |
435 | Textual Dependency Embedding for Person Search by Language | Kai Niu; Yan Huang; Liang Wang; | In this work, we focus on the long-distance dependencies in a sentence for better textual encoding, and accordingly propose the Textual Dependency Embedding (TDE) method. |
436 | Visual-Semantic Graph Matching for Visual Grounding | Chenchen Jing; Yuwei Wu; Mingtao Pei; Yao Hu; Yunde Jia; Qi Wu; | In this paper, we formulate visual grounding as a graph matching problem to find node correspondences between a visual scene graph and a language scene graph. |
437 | LAL: Linguistically Aware Learning for Scene Text Recognition | Yi Zheng; Wenda Qin; Derry Wijaya; Margrit Betke; | In this work, we propose a bimodal framework that simultaneously utilizes visual and linguistic information to enhance recognition performance. |
438 | Cascade Reasoning Network for Text-based Visual Question Answering | Fen Liu; Guanghui Xu; Qi Wu; Qing Du; Wei Jia; Mingkui Tan; | We study the problem of text-based visual question answering (T-VQA) in this paper. |
439 | Jointly Cross- and Self-Modal Graph Attention Network for Query-Based Moment Localization | Daizong Liu; Xiaoye Qu; Xiao-Yang Liu; Jianfeng Dong; Pan Zhou; Zichuan Xu; | To this end, we propose a novel Cross- and Self-Modal Graph Attention Network (CSMGAN) that recasts this task as a process of iterative messages passing over a joint graph. |
440 | Text-Guided Image Inpainting | Zijian Zhang; Zhou Zhao; Zhu Zhang; Baoxing Huai; Jing Yuan; | Based on these observations, we propose a new inpainting problem that introduces text as a kind of guidance to direct and control the inpainting process. |
441 | RT-VENet: A Convolutional Network for Real-time Video Enhancement | Mohan Zhang; Qiqi Gao; Jinglu Wang; Henrik Turbell; David Zhao; Jinhui Yu; Yan Lu; | We present a novel convolutional network that can perform high-quality enhancement on 1080p videos at 45 FPS with a single CPU, which has high potential for real-world deployment. |
442 | Regularized Two-Branch Proposal Networks for Weakly-Supervised Moment Retrieval in Videos | Zhu Zhang; Zhijie Lin; Zhou Zhao; Jieming Zhu; Xiuqiang He; | In this paper, we propose a novel Regularized Two-Branch Proposal Network to simultaneously consider the inter-sample and intra-sample confrontments. |
443 | Feature Reintegration over Differential Treatment: A Top-down and Adaptive Fusion Network for RGB-D Salient Object Detection | Miao Zhang; Yu Zhang; Yongri Piao; Beiqi Hu; Huchuan Lu; | In this paper, we propose a novel top-down multi-level fusion structure where different fusion strategies are utilized to effectively explore the low-level and high-level features. |
444 | Dual Path Interaction Network for Video Moment Localization | Hao Wang; Zheng-Jun Zha; Xuejin Chen; Zhiwei Xiong; Jiebo Luo; | In this paper, we propose a unified top-down and bottom-up approach called Dual Path Interaction Network (DPIN), where the alignment and discrimination information are closely connected to jointly make the prediction. |
445 | Cap2Seg: Inferring Semantic and Spatial Context from Captions for Zero-Shot Image Segmentation | Guiyu Tian; Shuai Wang; Jie Feng; Li Zhou; Yadong Mu; | In this work we describe Cap2Seg, a novel solution of zero-shot image segmentation that harnesses accompanying image captions for intelligently inferring spatial and semantic context for the zero-shot image segmentation task. |
446 | Spatial-Temporal Knowledge Integration: Robust Self-Supervised Facial Landmark Tracking | Congcong Zhu; Xiaoqiang Li; Jide Li; Guangtai Ding; Weiqin Tong; | To address these problems, we propose a Spatial-Temporal Knowledge Integration (STKI) approach. |
447 | Weakly Supervised 3D Object Detection from Point Clouds | Zengyi Qin; Jinglu Wang; Yan Lu; | In this work, we propose VS3D, a framework for weakly supervised 3D object detection from point clouds without using any ground truth 3D bounding box for training. |
448 | Bridging the Gap between Vision and Language Domains for Improved Image Captioning | Fenglin Liu; Xian Wu; Shen Ge; Xiaoyu Zhang; Wei Fan; Yuexian Zou; | In this paper, we propose to bridge the gap between the vision and language domains in the encoder, by enriching visual information with textual concepts, to achieve deep image understandings. |
449 | STRONG: Spatio-Temporal Reinforcement Learning for Cross-Modal Video Moment Localization | Da Cao; Yawen Zeng; Meng Liu; Xiangnan He; Meng Wang; Zheng Qin; | In this article, we tackle the cross-modal video moment localization issue, namely, localizing the most relevant video moment in an untrimmed video given a sentence as the query. |
450 | Language-Aware Fine-Grained Object Representation for Referring Expression Comprehension | Heqian Qiu; Hongliang Li; Qingbo Wu; Fanman Meng; Hengcan Shi; Taijin Zhao; King Ngi Ngan; | To address these problems, we propose a language-aware deformable convolution model (LDC) to learn language-aware fine-grained object representations. |
451 | Hierarchical Scene Graph Encoder-Decoder for Image Paragraph Captioning | Xu Yang; Chongyang Gao; Hanwang Zhang; Jianfei Cai; | We propose irredundant attention in SSG-RNN to improve the possibility of abstracting topics from rarely described sub-graphs and inheriting attention in WSG-RNN to generate more grounded sentences with the abstracted topics, both of which give rise to more distinctive paragraphs. |
452 | Improving Intra- and Inter-Modality Visual Relation for Image Captioning | Yong Wang; WenKai Zhang; Qing Liu; Zhengyuan Zhang; Xin Gao; Xian Sun; | In this paper, we present a novel Intra- and Inter-modality visual Relation Transformer to improve connections among visual features, termed I2RT. |
453 | Exploring Language Prior for Mode-Sensitive Visual Attention Modeling | Xiaoshuai Sun; Xuying Zhang; Liujuan Cao; Yongjian Wu; Feiyue Huang; Rongrong Ji; | In this paper, we propose a new probabilistic framework for attention, and introduce the concept of mode to model the flexibility and adaptability of attention modulation in complex environments. |
454 | Topic Adaptation and Prototype Encoding for Few-Shot Visual Storytelling | Jiacheng Li; Siliang Tang; Juncheng Li; Jun Xiao; Fei Wu; Shiliang Pu; Yueting Zhuang; | In this paper, we focus on enhancing the generalization ability of the VIST model by considering the few-shot setting. |
455 | ICECAP: Information Concentrated Entity-aware Image Captioning | Anwen Hu; Shizhe Chen; Qin Jin; | In this work, we focus on the entity-aware news image captioning task which aims to generate informative captions by leveraging the associated news articles to provide background knowledge about the target image. |
456 | Attacking Image Captioning Towards Accuracy-Preserving Target Words Removal | Jiayi Ji; Xiaoshuai Sun; Yiyi Zhou; Rongrong Ji; Fuhai Chen; Jianzhuang Liu; Qi Tian; | In this paper, we investigate the fragility of deep image captioning models against adversarial attacks. |
457 | ConsNet: Learning Consistency Graph for Zero-Shot Human-Object Interaction Detection | Ye Liu; Junsong Yuan; Chang Wen Chen; | Leveraging the compositional and relational peculiarities of HOI labels, we propose ConsNet, a knowledge-aware framework that explicitly encodes the relations among objects, actions and interactions into an undirected graph called consistency graph, and exploits Graph Attention Networks (GATs) to propagate knowledge among HOI categories as well as their constituents. |
458 | ChefGAN: Food Image Generation from Recipes | Siyuan Pan; Ling Dai; Xuhong Hou; Huating Li; Bin Sheng; | To achieve this, we propose a GANs based method termed ChefGAN. |
459 | Dual Hierarchical Temporal Convolutional Network with QA-Aware Dynamic Normalization for Video Story Question Answering | Fei Liu; Jing Liu; Xinxin Zhu; Richang Hong; Hanqing Lu; | In this paper, we propose a novel framework named Dual Hierarchical Temporal Convolutional Network (DHTCN) to address the aforementioned defects together. |
460 | Generalized Zero-Shot Learning using Generated Proxy Unseen Samples and Entropy Separation | Omkar Gune; Biplab Banerjee; Subhasis Chaudhuri; Fabio Cuzzolin; | In this work, we propose to use a generative model (GAN) for synthesizing the visual proxy samples while strictly adhering to the standard assumptions of the GZSL. |
461 | Answer-Driven Visual State Estimator for Goal-Oriented Visual Dialogue | Zipeng Xu; Fangxiang Feng; Xiaojie Wang; Yushu Yang; Huixing Jiang; Zhongyuan Wang; | In this paper, we propose an Answer-Driven Visual State Estimator (ADVSE) to impose the effects of different answers on visual states. |
462 | Fine-grained Iterative Attention Network for Temporal Language Localization in Videos | Xiaoye Qu; Pengwei Tang; Zhikang Zou; Yu Cheng; Jianfeng Dong; Pan Zhou; Zichuan Xu; | In this paper, we propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction. |
463 | Hierarchical Bi-Directional Feature Perception Network for Person Re-Identification | Zhipu Liu; Lei Zhang; Yang Yang; | To solve this issue, we propose a novel model named Hierarchical Bi-directional Feature Perception Network (HBFP-Net) to correlate multi-level information and reinforce each other. |
464 | Hard Negative Samples Emphasis Tracker without Anchors | Zhongzhou Zhang; Lei Zhang; | To address this issue, we propose a simple yet efficient hard negative samples emphasis method, which constrains Siamese network to learn features that are aware of hard negative samples and enhance the discrimination of embedding features. |
465 | JointFontGAN: Joint Geometry-Content GAN for Font Generation via Few-Shot Learning | Yankun Xi; Guoli Yan; Jing Hua; Zichun Zhong; | In this paper, we propose a novel model, JointFontGAN, to derive fonts whose geometric structures and shape contents are both correct and consistent, using very few available font samples. |
466 | DeepRhythm: Exposing DeepFakes with Attentional Visual Heartbeat Rhythms | Hua Qi; Qing Guo; Felix Juefei-Xu; Xiaofei Xie; Lei Ma; Wei Feng; Yang Liu; Jianjun Zhao; | In this work, we propose DeepRhythm, a DeepFake detection technique that exposes DeepFakes by monitoring the heartbeat rhythms. |
467 | FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire | Jinglin Liu; Yi Ren; Zhou Zhao; Chen Zhang; Baoxing Huai; Jing Yuan; | To break through this constraint, we propose FastLR, a non-autoregressive (NAR) lipreading model which generates all target tokens simultaneously. |
468 | Multimodal Attention with Image Text Spatial Relationship for OCR-Based Image Captioning | Jing Wang; Jinhui Tang; Jiebo Luo; | In this paper, we present a novel design – Multimodal Attention Captioner with OCR Spatial Relationship (dubbed as MMA-SR) architecture, which manages information from different modalities with a multimodal attention network and explores spatial relationships between text tokens for OCR-based image captioning. |
469 | Towards Accuracy-Fairness Paradox: Adversarial Example-based Data Augmentation for Visual Debiasing | Yi Zhang; Jitao Sang; | Specifically, to ensure the adversarial generalization as well as cross-task transferability, we propose to couple the operations of target task classifier training, bias task classifier training, and adversarial example generation. |
470 | Learning Semantic Concepts and Temporal Alignment for Narrated Video Procedural Captioning | Botian Shi; Lei Ji; Zhendong Niu; Nan Duan; Ming Zhou; Xilin Chen; | In this paper, we start with an encoder-decoder backbone using transformer models. |
471 | LGNN: A Context-aware Line Segment Detector | Quan Meng; Jiakai Zhang; Qiang Hu; Xuming He; Jingyi Yu; | When a 3D point cloud is accessible, we present a multi-modal line segment classification technique for extracting a 3D wireframe of the environment robustly and efficiently. |
472 | DeVLBert: Learning Deconfounded Visio-Linguistic Representations | Shengyu Zhang; Tan Jiang; Tan Wang; Kun Kuang; Zhou Zhao; Jianke Zhu; Jin Yu; Hongxia Yang; Fei Wu; | In this paper, we propose to investigate the problem of out-of-domain visio-linguistic pretraining, where the pretraining data distribution differs from that of downstream data on which the pretrained model will be fine-tuned. |
473 | Sequential Attention GAN for Interactive Image Editing | Yu Cheng; Zhe Gan; Yitong Li; Jingjing Liu; Jianfeng Gao; | To explore more practical and interactive real-life applications, we introduce a new task – Interactive Image Editing, where users can guide an agent to edit images via multi-turn textual commands on-the-fly. |
474 | Portraits of No One: An Internet Artwork | Tiago Martins; João Correia; Sérgio Rebelo; João Bicker; Penousal Machado; | Portraits of No One: An Internet Artwork |
475 | MaLiang: An Emotion-driven Chinese Calligraphy Artwork Composition System | Ruixue Liu; Shaozu Yuan; Meng Chen; Baoyang Chen; Zhijie Qiu; Xiaodong He; | We present a novel Chinese calligraphy artwork composition system (MaLiang) which can generate aesthetic, stylistic and diverse calligraphy images based on the emotion status from the input text. |
476 | First Impression: AI Understands Personality | Xiaohui Wang; Xia Liang; Miao Lu; Jingyan Qin; | First impression, an interactive art, is proposed to let AI understand human personality at first glance. |
477 | Draw Portraits by Music: A Music based Image Style Transformation | Siyu Jin; Jingyan Qin; Wenfa Li; | Draw Portraits by Music: A Music based Image Style Transformation |
478 | Little World: Virtual Humans Accompany Children on Dramatic Performance | Xiaohui Wang; Xiaoxue Ding; Jinke Li; Jingyan Qin; | To help them achieve performance, an interactive art called ‘little world’ is proposed to let virtual humans accompany children in dramatic performances. |
479 | Keep Running – AI Paintings of Horse Figure and Portrait | James She; Carmen Ng; Wadia Sheng; | “Keep Running” is a collection of human and machine generated paintings using a generative adversarial network technology. |
480 | AI Mirror: Visualize AI’s Self-knowledge | Siyu Hu; Bo Shui; Siyu Jin; Xiaohui Wang; | “AI mirror”, an interactive art, aims to visualize the self-knowledge mechanism from the AI’s perspective and arouse people’s reflection on artificial intelligence. |
481 | Image Sentiment Transfer | Tianlang Chen; Wei Xiong; Haitian Zheng; Jiebo Luo; | In this work, we introduce an important but still unexplored research task — image sentiment transfer. |
482 | Personal Food Model | Ali Rostami; Vaibhav Pandey; Nitish Nag; Vesper Wang; Ramesh Jain; | In this paper, we adopt a person-centric multimedia and multimodal perspective on food computing and show how multimedia and food computing are synergistic and complementary. |
483 | Helping Users Tackle Algorithmic Threats on Social Media: A Multimedia Research Agenda | Christian von der Weth; Ashraf Abdul; Shaojing Fan; Mohan Kankanhalli; | We investigate how multimedia researchers can help tackle these problems to level the playing field for social media users. |
484 | Reproducibility Companion Paper: Instance of Interest Detection | Fan Yu; Dandan Wang; Haonan Wang; Tongwei Ren; Jinhui Tang; Gangshan Wu; Jingjing Chen; Michael Riegler; | In this paper, we explain the file structure of the source code and publish the details of our IOID dataset, which can be used to retrain the model with custom parameters. |
485 | Reproducibility Companion Paper: Outfit Compatibility Prediction and Diagnosis with Multi-Layered Comparison Network | Xin Wang; Bo Wu; Yueqi Zhong; Wei Hu; Jan Zahálka; | We provide the software package for replicating the implementation of Multi-Layered Comparison Network (MCN), as well as the Polyvore-T dataset and baseline methods compared in the original paper. |
486 | Reproducibility Companion Paper: Visual Sentiment Analysis for Review Images with Item-Oriented and User-Oriented CNN | Quoc-Tuan Truong; Hady W. Lauw; Martin Aumüller; Naoko Nitta; | We revisit our contributions on visual sentiment analysis for online review images published at ACM Multimedia 2017, where we develop item-oriented and user-oriented convolutional neural networks that better capture the interaction of image features with specific expressions of users or items. |
487 | Reproducibility Companion Paper: Selective Deep Convolutional Features for Image Retrieval | Tuan Hoang; Thanh-Toan Do; Ngai-Man Cheung; Michael Riegler; Jan Zahálka; | In this companion paper, firstly, we briefly summarize the contributions of our main manuscript: Selective Deep Convolutional Features for Image Retrieval, published in ACM MultiMedia 2017. |
488 | MLModelCI: An Automatic Cloud Platform for Efficient MLaaS | Huaizheng Zhang; Yuanming Li; Yizheng Huang; Yonggang Wen; Jianxiong Yin; Kyle Guan; | MLModelCI provides multimedia researchers and developers with a one-stop platform for efficient machine learning (ML) services. |
489 | Hysia: Serving DNN-Based Video-to-Retail Applications in Cloud | Huaizheng Zhang; Yuanming Li; Qiming Ai; Yong Luo; Yonggang Wen; Yichao Jin; Nguyen Binh Duong Ta; | In this paper, we provide practitioners and researchers in multimedia with a cloud-based platform named Hysia for easy development and deployment of V2R applications. |
490 | PyRetri: A PyTorch-based Library for Unsupervised Image Retrieval by Deep Convolutional Neural Networks | Benyi Hu; Ren-Jie Song; Xiu-Shen Wei; Yazhou Yao; Xian-Sheng Hua; Yuehu Liu; | In order to fill this gap, we introduce PyRetri, an open source library for deep learning based unsupervised image retrieval. |
491 | Cottontail DB: An Open Source Database System for Multimedia Retrieval and Analysis | Ralph Gasser; Luca Rossetto; Silvan Heller; Heiko Schuldt; | In this paper we introduce Cottontail DB, an open source database management system that integrates support for scalar and vector attributes in a unified data and query model that allows for both Boolean retrieval and nearest neighbour search. |
492 | BMXNet 2: An Open Source Framework for Low-bit Networks – Reproducing, Understanding, Designing and Showcasing | Joseph Bethge; Christian Bartz; Haojin Yang; Christoph Meinel; | BMXNet 2 is an open-source framework that provides a broad basis for academia and industry. |
493 | PyAnomaly: A Pytorch-based Toolkit for Video Anomaly Detection | Yuhao Cheng; Wu Liu; Pengrui Duan; Jingen Liu; Tao Mei; | In this paper, we present a PyTorch-based video anomaly detection toolbox, namely PyAnomaly, which contains highly modular and extensible components, comprehensive and impartial evaluation platforms, a friendly and manageable system configuration, and abundant engineering deployment functions. |
494 | TAPAS-360°: A Tool for the Design and Experimental Evaluation of 360° Video Streaming Systems | Giuseppe Ribezzo; Luca De Cicco; Vittorio Palmisano; Saverio Mascolo; | In this paper, we present TAPAS-360°, an open-source tool that enables designing and experimenting all the components required to build omnidirectional video streaming systems. |
495 | SOMHunter: Lightweight Video Search System with SOM-Guided Relevance Feedback | Miroslav Kratochvil; František Mejzlík; Patrik Veselý; Tomáš Soućek; Jakub Lokoć; | To partially alleviate this difficulty, we provide an open-source version of the lightweight known-item search system SOMHunter that competed successfully at VBS 2020. |
496 | Text-to-Image Synthesis via Aesthetic Layout | Samah Saeed Baraheem; Trung-Nghia Le; Tam V. Nguyen; | In this work, we introduce a practical system which synthesizes an appealing image from natural language descriptions such that the generated image should maintain the aesthetic level of photographs. |
497 | Progressive Domain Adaptation for Robot Vision Person Re-identification | Zijun Sha; Zelong Zeng; Zheng Wang; Yoichi Natori; Yasuhiro Taniguchi; Shin’ichi Satoh; | In this paper, we demonstrate a guiding robot with person followers system, which recognizes the follower using a person re-identification technology. |
498 | Semantic Storytelling Automation: A Context-Aware and Metadata-Driven Approach | Paula Viana; Pedro Carvalho; Maria Teresa Andrade; Pieter P. Jonker; Vasileios Papanikolaou; Inês N. Teixeira; Luis Vilaça; José P. Pinto; Tiago Costa; | We propose an innovative approach that uses context and content information to transform a still photo into an appealing context-aware video clip. |
499 | ADHD Intelligent Auxiliary Diagnosis System Based on Multimodal Information Fusion | Yanyi Zhang; Ming Kong; Tianqi Zhao; Wenchen Hong; Qiang Zhu; Fei Wu; | We have designed and developed the ADHD intelligent auxiliary diagnosis system with software and hardware cooperation. |
500 | Video 360 Content Navigation for Mobile HMD Devices | Jounsup Park; Mingyuan Wu; Eric Lee; Klara Nahrstedt; Yash Shah; Arielle Rosenthal; John Murray; Kevin Spiteri; Michael Zink; Ramesh Sitaraman; | We demonstrate a video 360 navigation and streaming system for Mobile HMD devices. |
501 | GoldenRetriever: A Speech Recognition System Powered by Modern Information Retrieval | Yuanfeng Song; Di Jiang; Xiaoling Huang; Yawen Li; Qian Xu; Raymond Chi-Wing Wong; Qiang Yang; | Exploiting their commonality, this demonstration proposes a novel system named GoldenRetriever that marries IR with ASR. |
502 | Integrating Event Camera Sensor Emulator | Andrew C. Freeman; Ketan Mayer-Patel; | Arguing the potential usefulness of this sensor, this paper introduces a system for simulating the sensor’s event outputs and pixel firing rate control from 3D-rendered input images. |
503 | Scene-segmented Video Information Annotation System V2.0 | Alex Lee; Chang-Uk Kwak; Jeong-Woo Son; Gyeong-June Hahm; Min-Ho Han; Sun-Joong Kim; | We have built the scene-segmented video information annotation system and upgraded it to version 2.0. |
504 | SmartShots: Enabling Automatic Generation of Videos with Data Visualizations Embedded | Tan Tang; Junxiu Tang; Jiewen Lai; Lu Ying; Peiran Ren; Lingyun Yu; Yingcai Wu; | Specifically, we propose a computational framework that integrates non-verbal video clips, images, a melody, and a data table to create a video with data visualizations embedded. |
505 | A Smart-Site-Survey System using Image-based 3D Metric Reconstruction and Interactive Panorama Visualization | Sha Yu; Kevin Mcguinness; Patricia Moore; David Azcona; Noel O’Connor; | This work presents the Smart Site Survey (SSS) system, an efficient, web-based platform for virtual inspection of remote sites with absolute 3D metrics. |
506 | AI-SAS: Automated In-match Soccer Analysis System | Ning Zhang; Tong Shen; Yue Chen; Wei Zhang; Dan Zeng; Jingen Liu; Tao Mei; | In this work, we present an Automated In-match Soccer Analysis System (AI-SAS), using a domain-knowledge-based multi-view global tracking. |
507 | Detecting Urban Issues With the Object Detection Kit | Maarten Sukel; Stevan Rudinac; Marcel Worring; | In the Object Detection Kit demo we will demonstrate how the framework can be used to detect urban issues and showcase the capabilities of the system. |
508 | Visual-speech Synthesis of Exaggerated Corrective Feedback | Yaohua Bu; Weijun Li; Tianyi Ma; Shengqi Chen; Jia Jia; Kun Li; Xiaobo Lu; | To provide more discriminative feedback for the second language (L2) learners to better identify their mispronunciation, we propose a method for exaggerated visual-speech feedback in computer-assisted pronunciation training (CAPT). |
509 | TindART: A Personal Visual Arts Recommender | Gjorgji Strezoski; Lucas Fijen; Jonathan Mitnik; Dániel László; Pieter de Marez Oyens; Yoni Schirris; Marcel Worring; | We present TindART – a comprehensive visual arts recommender system. |
510 | Fashionist: Personalising Outfit Recommendation for Cold-Start Scenarios | Dhruv Verma; Kshitij Gulati; Vasu Goel; Rajiv Ratn Shah; | We attempt to address the cold-start problem for new users, by leveraging a novel visual preference modelling approach on a small set of input images. |
511 | EmotionTracker: A Mobile Real-time Facial Expression Tracking System with the Assistant of Public AI-as-a-Service | Xuncheng Liu; Jingyi Wang; Weizhan Zhang; Qinghu Zheng; Xuanya Li; | In this demonstration, we present EmotionTracker, a real-time mobile facial expression tracking system combining AIaaS and mobile local auxiliary computing, including facial expression tracking and the corresponding task offloading. |
512 | AvatarMeeting: An Augmented Reality Remote Interaction System With Personalized Avatars | Xuanyu Wang; Yang Wang; Yan Shi; Weizhan Zhang; Qinghua Zheng; | Specifically, we propose a novel framework including a consumer-grade set-up, a complete transmission scheme and a processing pipeline, which consists of prescan modeling, pose detection and action reconstruction. |
513 | An Interactive Design for Visualizable Person Re-Identification | Haolin Ren; Zheng Wang; Zhixiang Wang; Lixiong Chen; Shin’ichi Satoh; Daning Hu; | As system operators need a comfortable access to these important elements, we introduce an interactive design of a person ReID system to visualize these quantities. |
514 | Image and Video Restoration and Compression Artefact Removal Using a NoGAN Approach | Filippo Mameli; Marco Bertini; Leonardo Galteri; Alberto Del Bimbo; | In this work, we report results obtained using the NoGAN training approach and adapting the popular DeOldify architecture used for colorization, for image and video compression artefact removal and restoration. |
515 | Beautify As You Like | Wentao Jiang; Si Liu; Chen Gao; Ran He; Bo Li; Shuicheng Yan; | In this demo, we present the first fast makeup transfer system, named Fast Pose and expression robust Spatial-Aware GAN (FPSGAN). |
516 | iDirector: An Intelligent Directing System for Live Broadcast | Jiawei Zuo; Yue Chen; Linfang Wang; Yingwei Pan; Ting Yao; Ke Wang; Tao Mei; | In this paper, we demonstrate an end-to-end intelligent system for live sports broadcasting, namely iDirector, which aims to mimic the human-in-the-loop live broadcasting process by aggregating the input multi-camera video streams into the final output program video (PGM video) for the audience. |
517 | Multimedia Food Logger | Ali Rostami; Bihao Xu; Ramesh Jain; | We will demonstrate the complete functionality of such a system in this demo. |
518 | A Cross-modality and Progressive Person Search System | Xiaodong Chen; Wu Liu; Xinchen Liu; Yongdong Zhang; Tao Mei; | This demonstration presents an instant and progressive cross-modality person search system, called ‘CMPS’. |
519 | Binocular Multi-CNN System for Real-Time 3D Pose Estimation | Teo T. Niemirepo; Marko Viitanen; Jarno Vanne; | This paper introduces the first open-source algorithm for binocular 3D pose estimation. |
520 | An Interaction-based Video Viewing Support System using Geographical Relationships | Itsuki Hashimoto; Yuanyuan Wang; Yukiko Kawai; Kazutoshi Sumiya; | Therefore, in this paper, we propose a video viewing support system to recommend supplementary information using geographical relationships based on user interaction. |
521 | Infinity Battle: A Glance at How Blockchain Techniques Serve in a Serverless Gaming System | Feijie Wu; Ho Yin Yuen; Henry C.B. Chan; Victor C.M. Leung; Wei Cai; | In this work, we present the Infinity Battle, a serverless turn-based strategy game supported by a novel Proof-of-Play consensus model. |
522 | ConfFlow: A Tool to Encourage New Diverse Collaborations | Ekin Gedik; Hayley Hung; | ConfFlow is an interactive web application that allows conference participants to inspect other attendees through a visualized similarity space. |
523 | HyFea: Winning Solution to Social Media Popularity Prediction for Multimedia Grand Challenge 2020 | Xin Lai; Yihong Zhang; Wei Zhang; | In this paper, we present HyFea, our winning solution to the Social Media Prediction (SMP) Challenge for multimedia grand challenge of ACM Multimedia 2020. |
524 | A Feature Generalization Framework for Social Media Popularity Prediction | Kai Wang; Penghui Wang; Xin Chen; Qiushi Huang; Zhendong Mao; Yongdong Zhang; | In this paper, we propose a novel combined framework for social media popularity prediction, which accomplishes feature generalization and temporal modeling based on multi-modal feature extraction. |
525 | Curriculum Learning for Wide Multimedia-Based Transformer with Graph Target Detection | Weilong Chen; Feng Hong; Chenghao Huang; Shaoliang Zhang; Rui Wang; Ruobing Xie; Feng Xia; Leyu Lin; Yanru Zhang; Yan Wang; | In this paper, we propose a novel approach named curriculum learning for wide multimedia-based transformer with graph target detection(CL-WMTG). |
526 | Multimodal Deep Learning for Social Media Popularity Prediction With Attention Mechanism | Kele Xu; Zhimin Lin; Jianqiao Zhao; Peicang Shi; Wei Deng; Huaimin Wang; | Inspired by the recent success of multimodal learning, we propose a novel multimodal deep learning framework for the popularity prediction task, which aims to leverage the complementary knowledge from different modalities. |
527 | Rethinking Relation between Model Stacking and Recurrent Neural Networks for Social Media Prediction | Chih-Chung Hsu; Wen-Hai Tseng; Hao-Ting Yang; Chia-Hsiang Lin; Chi-Hung Kao; | In this paper, we discover a more dominant feature representation of text information and propose a single ensemble learning model to obtain the popularity scores for the social media prediction challenge. |
528 | Video Relation Detection with Trajectory-aware Multi-modal Features | Wentao Xie; Guanghui Ren; Si Liu; | In this paper, we present video relation detection with trajectory-aware multi-modal features to solve this task. |
529 | A Strong Baseline for Multiple Object Tracking on VidOR Dataset | Zhipeng Luo; Zhiguang Zhang; Yuehan Yao; | According to the above characteristics, we design a robust detection model, propose a new deep metric learning method, and explore several useful tracking algorithms to help complete the video object detection task. |
530 | XlanV Model with Adaptively Multi-Modality Feature Fusing for Video Captioning | Yiqing Huang; Qiuyu Cai; Siyu Xu; Jiansheng Chen; | We adaptively fuse these two kinds of features in the X-Linear Attention Network Video and propose XlanV model for video captioning. |
531 | VideoTRM: Pre-training for Video Captioning Challenge 2020 | Jingwen Chen; Hongyang Chao; | As a part of the submission to this challenge, we propose a Transformer based framework named VideoTRM, which consists of four modules: a textual encoder for encoding the linguistic relationship among words in the input sentence, a visual encoder for capturing the temporal dynamics in the input video, a cross-modal encoder for modeling the interactions between the two modalities (i.e., textual and visual) and a decoder for sentence generation conditioned on the input video and words generated previously. |
532 | Multi-stage Tag Guidance Network in Video Caption | Lanxiao Wang; Chao Shang; Heqian Qiu; Taijin Zhao; Benliu Qiu; Hongliang Li; | In this work, we propose a tag guidance module to learn a representation which can better build the interaction in cross-modal between visual content and textual sentences. |
533 | Dense Scene Multiple Object Tracking with Box-Plane Matching | Jinlong Peng; Yueyang Gu; Yabiao Wang; Chengjie Wang; Jilin Li; Feiyue Huang; | Following the tracking-by-detection framework, we propose the Box-Plane Matching (BPM) method to improve MOT performance in dense scenes. |
534 | Transductive Multi-Object Tracking in Complex Events by Interactive Self-Training | Ancong Wu; Chengzhi Lin; Bogao Chen; Weihao Huang; Zeyu Huang; Wei-Shi Zheng; | We propose a transductive interactive self-training method to adapt the tracking model to unseen crowded scenes with unlabeled testing data by means of teacher-student interactive learning. |
535 | Application of Multi-Object Tracking with Siamese Track-RCNN to the Human in Events Dataset | Bing Shuai; Andrew Berneshawi; Manchen Wang; Chunhui Liu; Davide Modolo; Xinyu Li; Joseph Tighe; | Towards this, we propose Siamese Track-RCNN, a two stage detect-and-track framework which consists of three functional branches: (1) the detection branch localizes object instances; (2) the Siamese-based track branch estimates the object motion and (3) the object re-identification branch re-activates the previously terminated tracks when they re-emerge. |
536 | Towards Accurate Human Pose Estimation in Videos of Crowded Scenes | Shuning Chang; Li Yuan; Xuecheng Nie; Ziyuan Huang; Yichen Zhou; Yupeng Chen; Jiashi Feng; Shuicheng Yan; | In this paper, we focus on improving human pose estimation in videos of crowded scenes from the perspectives of exploiting temporal context and collecting new data. |
537 | Combined Distillation Pose | Lei Yuan; Shu Zhang; Feng Fubiao; Naike Wei; Huadong Pan; | In this paper, knowledge distillation is used to make the network prediction results act as supervision information; a multi-stage supervision training framework is then designed from shallow to deep layers. |
538 | Deep Relationship Analysis in Video with Multimodal Feature Fusion | Fan Yu; DanDan Wang; Beibei Zhang; Tongwei Ren; | In this paper, we propose a novel multimodal feature fusion method based on scene segmentation to detect the relationships between entities in a long duration video. |
539 | Towards Using Semantic-Web Technologies for Multi-Modal Knowledge Graph Construction | Matthias Baumgartner; Luca Rossetto; Abraham Bernstein; | In this paper we present our approaches used in the first instance of the Deep Video Understanding Challenge, combining several multi-modal detectors with an integration scheme informed by methods from the semantic-web context, in order to determine the capabilities and limitations of currently available methods for extracting semantic relations between the characters and locations relevant to the narrative of a movie. |
540 | Story Semantic Relationships from Multimodal Cognitions | Vishal Anand; Raksha Ramesh; Ziyin Wang; Yijing Feng; Jiana Feng; Wenfeng Lyu; Tianle Zhu; Serena Yuan; Ching-Yung Lin; | We consider the problem of building semantic relationships of unseen entities from free-form multi-modal sources. |
541 | ACM Multimedia BioMedia 2020 Grand Challenge Overview | Steven A. Hicks; Vajira Thambawita; Hugo L. Hammer; Trine B. Haugen; Jorunn M. Andersen; Oliwia Witczak; Pål Halvorsen; Michael A. Riegler; | In this year’s challenge, participants are asked to develop algorithms that automatically predict the quality of a given human semen sample using a combination of visual, patient-related, and laboratory-analysis-related data. |
542 | A Quantitative Comparison of Different Machine Learning Approaches for Human Spermatozoa Quality Prediction Using Multimodal Datasets | Ming Feng; Kele Xu; Yin Wang; | In this paper, we make a quantitative comparison of different machine learning approaches for the human spermatozoa quality prediction task, leveraging multiple modalities dataset. |
543 | Enhancing Anomaly Detection in Surveillance Videos with Transfer Learning from Action Recognition | Kun Liu; Minzhi Zhu; Huiyuan Fu; Huadong Ma; Tat-Seng Chua; | In this paper, we propose to utilize transfer learning to leverage the good results from action recognition for anomaly detection in surveillance videos. |
544 | Modularized Framework with Category-Sensitive Abnormal Filter for City Anomaly Detection | Jie Wu; Yingying Li; Wei Zhang; Yi Wu; Xiao Tan; Hongwu Zhang; Shilei Wen; Errui Ding; Guanbin Li; | In this paper, we propose a modularized framework to perform general and specific anomaly detection. |
545 | Large Scale Hierarchical Anomaly Detection and Temporal Localization | Soumil Kanwal; Vineet Mehta; Abhinav Dhall; | In this paper, we propose a multiple feature-based approach for CitySCENE challenge-based anomaly detection. |
546 | Global Information Guided Video Anomaly Detection | Hui Lv; Chunyan Xu; Zhen Cui; | In this paper, we propose an end-to-end Global Information Guided (GIG) anomaly detection framework for anomaly detection using the video-level annotations (i.e., weak labels). |
547 | A Simple Baseline for Pose Tracking in Videos of Crowed Scenes | Li Yuan; Shuning Chang; Ziyuan Huang; Yichen Zhou; Yupeng Chen; Xuecheng Nie; Francis E.H. Tay; Jiashi Feng; Shuicheng Yan; | This paper presents our solution to the ACM MM challenge Large-scale Human-centric Video Analysis in Complex Events [13]; specifically, we focus on Track 3: Crowd Pose Tracking in Complex Events. |
548 | HiEve ACM MM Grand Challenge 2020: Pose Tracking in Crowded Scenes | Lumin Xu; Ruihan Xu; Sheng Jin; | We propose a simple yet effective top-down crowd pose tracking algorithm. |
549 | Toward Accurate Person-level Action Recognition in Videos of Crowed Scenes | Li Yuan; Yichen Zhou; Shuning Chang; Ziyuan Huang; Yupeng Chen; Xuecheng Nie; Tao Wang; Jiashi Feng; Shuicheng Yan; | In this paper, we focus on improving spatio-temporal action recognition by fully-utilizing the information of scenes and collecting new data. |
550 | Person-level Action Recognition in Complex Events via TSD-TSM Networks | Yanbin Hao; Zi-Niu Liu; Hao Zhang; Bin Zhu; Jingjing Chen; Yu-Gang Jiang; Chong-Wah Ngo; | In this paper, we present a simple yet efficient pipeline for this task, referred to as TSD-TSM networks. |
551 | Group-Skeleton-Based Human Action Recognition in Complex Events | Tingtian Li; Zixun Sun; Xiao Chen; | In this paper, we propose a novel group-skeleton-based human action recognition method in complex events. |
552 | Attention Based Beauty Product Retrieval Using Global and Local Descriptors | Jun Yu; Guochen Xie; Mengyan Li; Haonian Xie; Xinlong Hao; Fang Gao; Feng Shuang; | In this paper, we first introduce attention mechanism into a global image descriptor, i.e., Maximum Activation of Convolutions (MAC), and propose Attention-based MAC (AMAC). |
553 | Multi-Feature Fusion Method Based on Salient Object Detection for Beauty Product Retrieval | Runming Yan; Yongchun Lin; Zhichao Deng; Liang Lei; Chudong Xu; | In this paper, we propose a multi-feature fusion method based on salient object detection to improve retrieval performance. |
554 | Attention-driven Unsupervised Image Retrieval for Beauty Products with Visual and Textual Clues | Jingwen Hou; Sijie Ji; Annan Wang; | Therefore, we propose a search method utilizing both images and product descriptions in this work. |
555 | Learning Visual Features from Product Title for Image Retrieval | Fangxiang Feng; Tianrui Niu; Ruifan Li; Xiaojie Wang; Huixing Jiang; | In this paper, we utilize easily accessible text information, namely the product title, as a supervision signal to learn the features of the product image. |
556 | Learning to Remember Beauty Products | Toan H. Vu; An Dang; Jia-Ching Wang; | This paper develops a deep learning model for the beauty product image retrieval problem. |
557 | Multi-Scale Generalized Attention-Based Regional Maximum Activation of Convolutions for Beauty Product Retrieval | Kele Xu; Yuzhong Liu; Ming Feng; Jianqiao Zhao; Huaimin Wang; Hengxing Cai; | In this paper, we propose a novel descriptor, named Multi-Scale Generalized Attention-Based Regional Maximum Activation of Convolutions (MS-GRMAC). |
558 | Low-level Optimizations for Faster Mobile Deep Learning Inference Frameworks | Mathieu Febvay; | In this paper, we present the performance benchmark of four popular open-source deep learning inference frameworks used on mobile devices on three different convolutional neural network models. |
559 | Deep Neural Networks for Predicting Affective Responses from Movies | Ha Thi Phuong Thao; | In this work, we develop deep neural networks for predicting affective responses from movies taking both audio and video streams into account. |
560 | Learning Self-Supervised Multimodal Representations of Human Behaviour | Abhinav Shukla; | In this extended abstract, I present the direction of research that I have followed during the first half of my PhD, along with ideas and work in progress for the second half. |
561 | Multi-person Pose Estimation in Complex Physical Interactions | Wen Guo; | In this work we feed the initial pose along with its interactees into a recurrent network to refine the pose of the person of interest, and we demonstrate the effectiveness of our method on the MuPoTS dataset, setting a new state of the art. |