Paper Digest: KDD 2015 Highlights

August 1, 2015 (updated June 25, 2020), admin

The ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) is one of the top data mining conferences in the world.

To help the community quickly catch up on the work presented at this conference, the Paper Digest Team processed all accepted papers and generated one highlight sentence (typically the main topic) for each paper. Readers are encouraged to read these machine-generated highlights and summaries to quickly get the main idea of each paper.

If you do not want to miss any interesting academic papers, you are welcome to sign up for our free daily paper digest service to get updates on new papers published in your area every day. You are also welcome to follow us on Twitter and LinkedIn to stay updated on new conference digests.

Paper Digest Team
team@paperdigest.org

TABLE 1: KDD 2015 Papers

No.	Title	Authors	Highlight
1	Online Controlled Experiments: Lessons from Running A/B/n Tests for 12 Years	Ron Kohavi	We provide an introduction, share real examples, key lessons, and cultural challenges.
2	MOOCS: What Have We Learned?	Daphne Koller	I will show how MOOCs provide opportunities for open-ended projects, intercultural learner interactions, and collaborative learning.
3	Machine Learning and Causal Inference for Policy Evaluation	Susan Athey	Specifically, we propose to divide the features of a model into causal features, whose values may be manipulated in a counterfactual policy environment, and attributes.
4	Data, Knowledge and Discovery: Machine Learning meets Natural Science	Hugh Durrant-Whyte	This talk will describe a number of applied machine learning projects addressing real-world inference problems in physical, life and social science areas.
5	Large-Scale Distributed Bayesian Matrix Factorization using Stochastic Gradient MCMC	Sungjin Ahn, Anoop Korattikara, Nathan Liu, Suju Rajan, Max Welling	In this paper, we propose a scalable distributed Bayesian matrix factorization algorithm using stochastic gradient MCMC.
6	TimeMachine: Timeline Generation for Knowledge-Base Entities	Tim Althoff, Xin Luna Dong, Kevin Murphy, Safa Alai, Van Dang, Wei Zhang	We present a method called TIMEMACHINE to generate a timeline of events and relations for entities in a knowledge base.
7	Estimating Local Intrinsic Dimensionality	Laurent Amsaleg, Oussama Chelly, Teddy Furon, Stéphane Girard, Michael E. Houle, Ken-ichi Kawarabayashi, Michael Nett	This paper is concerned with the estimation of a local measure of intrinsic dimensionality (ID) recently proposed by Houle.
8	Portraying Collective Spatial Attention in Twitter	Émilien Antoine, Adam Jatowt, Shoko Wakamiya, Yukiko Kawai, Toyokazu Akiyama	In this paper we demonstrate a novel visualization system for analyzing how Twitter users collectively talk about space and for uncovering correlations between geographical locations of Twitter users and the locations they tweet about.
9	Accelerating Dynamic Time Warping Clustering with a Novel Admissible Pruning Strategy	Nurjahan Begum, Liudmila Ulanova, Jun Wang, Eamonn Keogh	In this work, we address this lethargy in two ways.
10	Efficient Online Evaluation of Big Data Stream Classifiers	Albert Bifet, Gianmarco de Francisci Morales, Jesse Read, Geoff Holmes, Bernhard Pfahringer	In this paper we propose a new evaluation methodology for big data streams.
11	Dynamically Modeling Patient’s Health State from Electronic Medical Records: A Time Series Approach	Karla L. Caballero Barajas, Ram Akella	In this paper, we present a method to dynamically estimate the probability of mortality inside the Intensive Care Unit (ICU) by combining heterogeneous data.
12	Facets: Fast Comprehensive Mining of Coevolving High-order Time Series	Yongjie Cai, Hanghang Tong, Wei Fan, Ping Ji, Qing He	In this paper, we propose a comprehensive method, FACETS, to simultaneously model all these three challenges.
13	Online Outlier Exploration Over Large Datasets	Lei Cao, Mingrui Wei, Di Yang, Elke A. Rundensteiner	In this work, we present the first online outlier exploration platform, called ONION, that enables analysts to effectively explore anomalies even in large datasets.
14	BatchRank: A Novel Batch Mode Active Learning Framework for Hierarchical Classification	Shayok Chakraborty, Vineeth Balasubramanian, Adepu Ravi Sankar, Sethuraman Panchanathan, Jieping Ye	In this paper, we propose a novel BMAL algorithm (BatchRank) for hierarchical classification.
15	On the Formation of Circles in Co-authorship Networks	Tanmoy Chakraborty, Sikhar Patranabis, Pawan Goyal, Animesh Mukherjee	In this paper, we propose an unsupervised approach to automatically detect circles in an ego network such that each circle represents a densely knit community of researchers.
16	Heterogeneous Network Embedding via Deep Architectures	Shiyu Chang, Wei Han, Jiliang Tang, Guo-Jun Qi, Charu C. Aggarwal, Thomas S. Huang	In this paper, we examine the scenario of a heterogeneous network with nodes and content of various types.
17	Differentially Private High-Dimensional Data Publication via Sampling-Based Inference	Rui Chen, Qian Xiao, Yu Zhang, Jianliang Xu	In this paper, we consider the problem of releasing high-dimensional data with differential privacy guarantees.
18	Efficient Algorithms for Public-Private Social Networks	Flavio Chierichetti, Alessandro Epasto, Ravi Kumar, Silvio Lattanzi, Vahab Mirrokni	We introduce the public-private model of graphs.
19	Warm Start for Parameter Selection of Linear Classifiers	Bo-Yu Chu, Chia-Hua Ho, Cheng-Hao Tsai, Chieh-Yen Lin, Chih-Jen Lin	Our aim is to devise effective warm-start strategies to efficiently solve this sequence of optimization problems.
20	Stream Sampling for Frequency Cap Statistics	Edith Cohen	We present a sampling framework for unaggregated data that uses a single pass (for streams) or two passes (for distributed data) and state proportional to the desired sample size.
21	Adaptation Algorithm and Theory Based on Generalized Discrepancy	Corinna Cortes, Mehryar Mohri, Andrés Muñoz Medina	We present a new algorithm for domain adaptation improving upon the discrepancy minimization algorithm (DM), which was previously shown to outperform a number of popular algorithms designed for this task.
22	Optimal Action Extraction for Random Forests and Boosted Trees	Zhicheng Cui, Wenlin Chen, Yujie He, Yixin Chen	To address this problem, we present a novel framework to post-process any ATM classifier to extract an optimal actionable plan that can change a given input to a desired class with a minimum cost.
23	Dynamic Matrix Factorization with Priors on Unknown Values	Robin Devooght, Nicolas Kourtellis, Amin Mantrach	In this work, we build on this assumption, and introduce a novel dynamic matrix factorization framework that allows setting an explicit prior on unknown values.
24	CoupledLP: Link Prediction in Coupled Networks	Yuxiao Dong, Jing Zhang, Jie Tang, Nitesh V. Chawla, Bai Wang	We propose a unified framework, CoupledLP, to solve the problem.
25	Unsupervised Feature Selection with Adaptive Structure Learning	Liang Du, Yi-Dong Shen	To address this, we propose a unified learning framework which performs structure learning and feature selection simultaneously.
26	Dirichlet-Hawkes Processes with Applications to Clustering Continuous-Time Document Streams	Nan Du, Mehrdad Farajtabar, Amr Ahmed, Alexander J. Smola, Le Song	In this paper, we propose a novel random process, referred to as the Dirichlet-Hawkes process, to take into account both information in a unified framework.
27	Beyond Triangles: A Distributed Framework for Estimating 3-profiles of Large Graphs	Ethan R. Elenberg, Karthikeyan Shanmugam, Michael Borokhovich, Alexandros G. Dimakis	For the harder problem of ego 3-profiles, we introduce an algorithm that can estimate profiles of hundreds of thousands of vertices in parallel, in the timescale of minutes.
28	Hierarchical Graph-Coupled HMMs for Heterogeneous Personalized Health Data	Kai Fan, Marisa Eisenberg, Alison Walsh, Allison Aiello, Katherine Heller	The purpose of this study is to leverage modern technology (mobile or web apps) to enrich epidemiology data and infer the transmission of disease.
29	More Constraints, Smaller Coresets: Constrained Matrix Approximation of Sparse Big Data	Dan Feldman, Tamir Tassa	We suggest a generic data reduction technique with provable guarantees for computing the low rank approximation of a matrix under some $\ell_z$ error, and constrained factorizations, such as the Non-negative Matrix Factorization (NMF).
30	Certifying and Removing Disparate Impact	Michael Feldman, Sorelle A. Friedler, John Moeller, Carlos Scheidegger, Suresh Venkatasubramanian	Instead of requiring access to the process, we propose making inferences based on the data it uses.
31	RSC: Mining and Modeling Temporal Activity in Social Media	Alceu Ferraz Costa, Yuto Yamaguchi, Agma Juci Machado Traina, Caetano Traina, Christos Faloutsos	In this paper we analyze time-stamp data from social media services and find that the distribution of postings inter-arrival times (IAT) is characterized by four patterns: (i) positive correlation between consecutive IATs, (ii) heavy tails, (iii) periodic spikes and (iv) bimodal distribution.
32	A Clustering-Based Framework to Control Block Sizes for Entity Resolution	Jeffrey Fisher, Peter Christen, Qing Wang, Erhard Rahm	We propose two novel hierarchical clustering approaches which can generate blocks within a specified size range, and we present a penalty function which allows control of the trade-off between block quality and block size in the clustering process.
33	Who Supported Obama in 2012?: Ecological Inference through Distribution Regression	Seth R. Flaxman, Yu-Xiang Wang, Alexander J. Smola	We present a new solution to the “ecological inference” problem, of learning individual-level associations from aggregate data.
34	Real Estate Ranking via Mixed Land-use Latent Models	Yanjie Fu, Guannan Liu, Spiros Papadimitriou, Hui Xiong, Yong Ge, Hengshu Zhu, Chen Zhu	To that end, in this paper, we develop a geographical function ranking method, named FuncDivRank, by incorporating the functional diversity of communities into real estate appraisal.
35	Adaptive Message Update for Fast Affinity Propagation	Yasuhiro Fujiwara, Makoto Nakatsuji, Hiroaki Shiokawa, Yasutoshi Ida, Machiko Toyoda	This paper proposes an efficient algorithm that guarantees the same clustering results as the original algorithm.
36	Monitoring Least Squares Models of Distributed Streams	Moshe Gabel, Daniel Keren, Assaf Schuster	We propose the first monitoring algorithm for multivariate regression models of distributed data streams that guarantees a bounded model error.
37	Reconstructing Textual Documents from n-grams	Matthias Gallé, Matías Tealdi	Instead, we propose another method that strategically adds fictitious n-grams, and show that such a noised corpus is much harder to reconstruct while only slightly increasing the perplexity of a language model obtained from it.
38	Anatomical Annotations for Drosophila Gene Expression Patterns via Multi-Dimensional Visual Descriptors Integration: Multi-Dimensional Feature Learning	Hongchang Gao, Lin Yan, Weidong Cai, Heng Huang	We propose a novel structured sparsity-inducing norms based feature learning model to integrate the multi-dimensional visual descriptors for Drosophila gene expression patterns annotations.
39	Selective Hashing: Closing the Gap between Radius Search and k-NN Search	Jinyang Gao, H.V. Jagadish, Beng Chin Ooi, Sheng Wang	We propose a novel indexing scheme called Selective Hashing, where a disjoint set of indices are built with different granularities and each point is only stored in the most effective index.
40	Using Local Spectral Methods to Robustify Graph-Based Learning Algorithms	David F. Gleich, Michael W. Mahoney	We study robustness with respect to the details of graph constructions, errors in node labeling, degree variability, and a variety of other real-world heterogeneities, studying these methods through a precise relationship with mincut problems.
41	Instance Weighting for Patient-Specific Risk Stratification Models	Jen J. Gong, Thoralf M. Sundt, James D. Rawn, John V. Guttag	In this paper, we present an approach to address the problem of small data using transfer learning methods in the context of developing risk models for cardiac surgeries.
42	A Deep Hybrid Model for Weather Forecasting	Aditya Grover, Ashish Kapoor, Eric Horvitz	We explore new directions with forecasting weather as a data-intensive challenge that involves inferences across space and time.
43	Network Lasso: Clustering and Optimization in Large Graphs	David Hallac, Jure Leskovec, Stephen Boyd	In this paper, we introduce the network lasso, a generalization of the group lasso to a network setting that allows for simultaneous clustering and optimization on graphs.
44	Learning Tree Structure in Multi-Task Learning	Lei Han, Yu Zhang	To the best of our knowledge, there is no work to learn the tree structure among tasks and model parameters simultaneously under the regularization framework and in this paper, we develop a TAsk Tree (TAT) model for MTL to achieve this.
45	Probabilistic Community and Role Model for Social Networks	Yu Han, Jie Tang	In this paper, we propose a unified probabilistic framework, the Community Role Model (CRM), to model a social network.
46	Real-Time Top-R Topic Detection on Twitter with Topic Hijack Filtering	Kohei Hayashi, Takanori Maehara, Masashi Toyoda, Ken-ichi Kawarabayashi	In this paper, we integrate both the extraction of meaningful topics and the filtering of messages over the Twitter stream.
47	Non-exhaustive, Overlapping Clustering via Low-Rank Semidefinite Programming	Yangyang Hou, Joyce Jiyoung Whang, David F. Gleich, Inderjit S. Dhillon	We propose a novel convex semidefinite program (SDP) as a relaxation of the non-exhaustive, overlapping clustering problem.
48	Inferring Air Quality for Station Location Recommendation Based on Urban Big Data	Hsun-Ping Hsieh, Shou-De Lin, Yu Zheng	We design a semi-supervised inference model utilizing existing monitoring data together with heterogeneous city dynamics, including meteorology, human mobility, structure of road networks, and points of interest (POIs).
49	Website Optimization Problem and Its Solutions	Shuhei Iitsuka, Yutaka Matsuo	By combining organized algorithms and devices, we propose a rapid testing method that detects high-performing variations with few users.
50	Reciprocity in Social Networks with Capacity Constraints	Bo Jiang, Zhi-Li Zhang, Don Towsley	In this paper we study the problem of maximizing achievable reciprocity for an ensemble of digraphs with the same prescribed in- and out-degree sequences.
51	Learning with Similarity Functions on Graphs using Matchings of Geometric Embeddings	Fredrik D. Johansson, Devdatt Dubhashi	We develop and apply the Balcan-Blum-Srebro (BBS) theory of classification via similarity functions (which are not necessarily kernels) to the problem of graph classification.
52	Structured Hedging for Resource Allocations with Leverage	Nicholas Johnson, Arindam Banerjee	In this paper, we present a formulation for hedging online resource allocations with leverage and propose an efficient data mining algorithm (SHERAL). We pose the problem as a constrained online convex optimization problem.
53	Improved Bounds on the Dot Product under Random Projection and Random Sign Projection	Ata Kaban	In this paper we provide improved bounds on the dot product under random projection that matches the optimal bounds on the Euclidean distance.
54	Accelerated Alternating Direction Method of Multipliers	Mojtaba Kadkhodaie, Konstantina Christakopoulou, Maziar Sanjabi, Arindam Banerjee	In this paper, we introduce the Accelerated Alternating Direction Method of Multipliers (A2DM2) which solves problems with the same structure as ADMM.
55	Deep Computational Phenotyping	Zhengping Che, David Kale, Wenzhe Li, Mohammad Taha Bahadori, Yan Liu	We propose two novel modifications to standard neural net training that address challenges and exploit properties that are peculiar, if not exclusive, to medical data.
56	Leveraging Social Context for Modeling Topic Evolution	Janani Kalyanam, Amin Mantrach, Diego Saez-Trumper, Hossein Vahabi, Gert Lanckriet	In particular, our goal is to both qualitatively and quantitatively analyze when social context actually helps with TDE.
57	Scalable Blocking for Privacy Preserving Record Linkage	Alexandros Karakasidis, Georgia Koloniari, Vassilios S. Verykios	To this end, we propose Multi-Sampling Transitive Closure for Encrypted Fields (MS-TCEF), a novel privacy preserving blocking technique based on the use of reference sets.
58	Real Time Recommendations from Connoisseurs	Noriaki Kawamae	In this paper, we set the goal of real time recommendation, to present these items instantly.
59	Towards Decision Support and Goal Achievement: Identifying Action-Outcome Relationships From Social Media	Emre Kıcıman, Matthew Richardson	In this paper, we investigate the feasibility of mining the relationship between actions and their outcomes from the aggregated timelines of individuals posting experiential microblog reports.
60	On Estimating the Swapping Rate for Categorical Data	Daniel Kifer	In this paper, we consider the problem of inferring such parameters from the data.
61	Simultaneous Discovery of Common and Discriminative Topics via Joint Nonnegative Matrix Factorization	Hannah Kim, Jaegul Choo, Jingu Kim, Chandan K. Reddy, Haesun Park	To address such needs, this paper presents a novel topic modeling method based on joint nonnegative matrix factorization, which simultaneously discovers common as well as discriminative topics given multiple document sets.
62	A Decision Tree Framework for Spatiotemporal Sequence Prediction	Taehwan Kim, Yisong Yue, Sarah Taylor, Iain Matthews	We present a decision tree framework for learning an accurate non-parametric spatiotemporal sequence predictor.
63	TOPTRAC: Topical Trajectory Pattern Mining	Younghoon Kim, Jiawei Han, Cangzhou Yuan	In this paper, we present a latent topic-based clustering algorithm to discover patterns in the trajectories of geo-tagged text messages.
64	From Group to Individual Labels Using Deep Features	Dimitrios Kotzias, Misha Denil, Nando de Freitas, Padhraic Smyth	In this paper we focus on the problem of learning classifiers to make predictions at the instance level.
65	VEWS: A Wikipedia Vandal Early Warning System	Srijan Kumar, Francesca Spezzano, V.S. Subrahmanian	We describe specific behaviors that distinguish between vandals and non-vandals. We leverage multiple classical ML approaches, but develop 3 novel sets of features.
66	Unified and Contrasting Cuts in Multiple Graphs: Application to Medical Imaging Segmentation	Chia-Tung Kuo, Xiang Wang, Peter Walker, Owen Carmichael, Jieping Ye, Ian Davidson	In this paper we study two such questions: i) For a collection of graphs find a single cut that is good for all the graphs and ii) For two collections of graphs find a single cut that is good for one collection but poor for the other.
67	Reducing the Unlabeled Sample Complexity of Semi-Supervised Multi-View Learning	Chao Lan, Jun Huan	In this paper, we improve the state-of-the-art unlabeled sample complexity (u.s.c.) from O(1/ε) to O(log 1/ε) for small error ε, under mild conditions.
68	Maximum Likelihood Postprocessing for Differential Privacy under Consistency Constraints	Jaewoo Lee, Yue Wang, Daniel Kifer	In this paper, to further improve accuracy, we formulate this post-processing step as a constrained maximum likelihood estimation problem, which is equivalent to constrained L₁ minimization.
69	Online Influence Maximization	Siyu Lei, Silviu Maniu, Luyi Mo, Reynold Cheng, Pierre Senellart	In this paper, we study IM in the absence of complete information on influence probability.
70	The Child is Father of the Man: Foresee the Success at the Early Stage	Liangyue Li, Hanghang Tong	In this paper, we propose a joint predictive model to forecast the long-term scientific impact at the early stage, which simultaneously addresses a number of these open challenges, including the scholarly feature design, the non-linearity, the domain-heterogeneity and dynamics.
71	0-Bit Consistent Weighted Sampling	Ping Li	We provide a simple solution by discarding t^* (which we refer to as the "0-bit" scheme).
72	On the Discovery of Evolving Truth	Yaliang Li, Qi Li, Jing Gao, Lu Su, Bo Zhao, Wei Fan, Jiawei Han	To address this problem, we investigate the temporal relations among both object truths and source reliability, and propose an incremental truth discovery framework that can dynamically update object truths and source weights upon the arrival of new data.
73	MASCOT: Memory-efficient and Accurate Sampling for Counting Local Triangles in Graph Streams	Yongsub Lim, U Kang	In this paper, we propose MASCOT, a memory-efficient and accurate method for local triangle estimation in a graph stream based on edge sampling.
74	A Learning-based Framework to Handle Multi-round Multi-party Influence Maximization on Social Networks	Su-Chen Lin, Shou-De Lin, Ming-Syan Chen	Considering that companies providing similar products or services nowadays compete with each other for resources and customers, this work proposes a learning-based framework to tackle the multi-round competitive influence maximization problem on a social network.
75	Temporal Phenotyping from Longitudinal Electronic Health Records: A Graph Based Framework	Chuanren Liu, Fei Wang, Jianying Hu, Hui Xiong	To address this challenge, in this paper, we develop a novel representation, namely the temporal graph, for such event sequences.
76	Spectral Ensemble Clustering	Hongfu Liu, Tongliang Liu, Junjie Wu, Dacheng Tao, Yun Fu	We therefore propose SEC, an efficient Spectral Ensemble Clustering method based on co-association matrix.
77	Fast and Memory-Efficient Significant Pattern Mining via Permutation Testing	Felipe Llinares-López, Mahito Sugiyama, Laetitia Papaxanthos, Karsten Borgwardt	We present a novel algorithm for significant pattern mining, Westfall-Young light.
78	Influence at Scale: Distributed Computation of Complex Contagion in Networks	Brendan Lucier, Joel Oren, Yaron Singer	We describe a novel sampling approach that can be used to design scalable algorithms with provable performance guarantees.
79	FaitCrowd: Fine Grained Truth Discovery for Crowdsourced Data Aggregation	Fenglong Ma, Yaliang Li, Qi Li, Minghui Qiu, Jing Gao, Shi Zhi, Lu Su, Bo Zhao, Heng Ji, Jiawei Han	To capture various expertise levels on different topics, we propose FaitCrowd, a fine grained truth discovery model for the task of aggregating conflicting data collected from multiple users/sources.
80	Algorithmic Cartography: Placing Points of Interest and Ads on Maps	Mohammad Mahdian, Okke Schrijvers, Sergei Vassilvitskii	We present simple, approximately optimal selection algorithms, coupled with incentive compatible pricing schemes in case of advertiser supplied points of interest.
81	Dimensionality Reduction Via Graph Structure Learning	Qi Mao, Li Wang, Steve Goodison, Yijun Sun	We present a new dimensionality reduction setting for a large family of real-world problems.
82	Robust Treecode Approximation for Kernel Machines	William B. March, Bo Xiao, Sameer Tharakan, Chenhan D. Yu, George Biros	We present a theoretical error analysis of our treecode and relate it to the error of Nystrom methods.
83	Inferring Networks of Substitutable and Complementary Products	Julian McAuley, Rahul Pandey, Jure Leskovec	Our goal in this paper is to learn the semantics of substitutes and complements from the text of online reviews.
84	Data-Driven Activity Prediction: Algorithms, Evaluation Methodology, and Applications	Bryan Minor, Janardhan Rao Doppa, Diane J. Cook	In this paper, we make three main contributions.
85	Scalable Large Near-Clique Detection in Large-Scale Networks via Sampling	Michael Mitzenmacher, Jakub Pachocki, Richard Peng, Charalampos Tsourakakis, Shen Chen Xu	In this paper we focus on a family of poly-time solvable formulations, known as the k-clique densest subgraph problem (k-Clique-DSP) [57].
86	Graph Query Reformulation with Diversity	Davide Mottin, Francesco Bonchi, Francesco Gullo	We study a problem of graph-query reformulation enabling explorative query-driven discovery in graph databases.
87	Flexible and Robust Multi-Network Clustering	Jingchao Ni, Hanghang Tong, Wei Fan, Xiang Zhang	In this paper, we propose a flexible and robust framework that allows multiple underlying clustering structures across different networks.
88	Extreme States Distribution Decomposition Method for Search Engine Online Evaluation	Kirill Nikolaev, Alexey Drutsa, Ekaterina Gladkikh, Alexander Ulianov, Gleb Gusev, Pavel Serdyukov	We provide a thorough theoretical analysis of our approach and show experimentally that, other things being equal, it produces more sensitive OEC than the average.
89	Simultaneous Modeling of Multiple Diseases for Mortality Prediction in Acute Hospital Care	Nozomi Nori, Hisashi Kashima, Kazuto Yamashita, Hiroshi Ikai, Yuichi Imanaka	In this paper, we incorporate disease-specific contexts into mortality modeling by formulating the mortality prediction problem as a multi-task learning problem in which a task corresponds to a disease.
90	Fast and Robust Parallel SGD Matrix Factorization	Jinoh Oh, Wook-Shin Han, Hwanjo Yu, Xiaoqian Jiang	This paper proposes a fast and robust parallel SGD matrix factorization algorithm, called MLGF-MF, which is robust to skewed matrices and runs efficiently on block-storage devices (e.g., SSD disks) as well as shared-memory.
91	Efficient PageRank Tracking in Evolving Networks	Naoto Ohsaka, Takanori Maehara, Ken-ichi Kawarabayashi	In this paper, we propose an efficient online algorithm for tracking personalized PageRank in an evolving network.
92	Quick Sensitivity Analysis for Incremental Data Modification and Its Application to Leave-one-out CV in Linear Classification Problems	Shota Okumura, Yoshiki Suzuki, Ichiro Takeuchi	We introduce a novel sensitivity analysis framework for large scale classification problems that can be used when a small number of instances are incrementally added or removed.
93	Non-transitive Hashing with Latent Similarity Components	Mingdong Ou, Peng Cui, Fei Wang, Jun Wang, Wenwu Zhu	In this paper, we propose a non-transitive hashing method, namely Multi-Component Hashing (MuCH), to identify the latent similarity components to cope with the non-transitive similarity relationships.
94	Optimal Kernel Group Transformation for Exploratory Regression Analysis and Graphics	Pan Chao, Qiming Huang, Michael Zhu	In this article, we propose to use optimal group transformations as a general approach for exploring the relationship between Y and X.
95	Discovering and Exploiting Deterministic Label Relationships in Multi-Label Learning	Christina Papagiannopoulou, Grigorios Tsoumakas, Ioannis Tsamardinos	This work presents a probabilistic method for enforcing adherence of the marginal probabilities of a multi-label model to automatically discovered deterministic relationships among labels.
96	Subspace Clustering Using Log-determinant Rank Approximation	Chong Peng, Zhao Kang, Huiqing Li, Qiang Cheng	We apply the method of augmented Lagrangian multipliers to optimize this non-convex rank approximation-based objective function and obtain closed-form solutions for all subproblems of minimizing different variables alternatively.
97	A PCA-Based Change Detection Framework for Multidimensional Data Streams: Change Detection in Multidimensional Data Streams	Abdulhakim A. Qahtan, Basma Alharbi, Suojin Wang, Xiangliang Zhang	In this paper, we propose a framework for detecting changes in multidimensional data streams based on principal component analysis, which is used for projecting data into a lower dimensional space, thus facilitating density estimation and change-score calculations.
98	State-Driven Dynamic Sensor Selection and Prediction with State-Stacked Sparseness	Guo-Jun Qi, Charu Aggarwal, Deepak Turaga, Daby Sow, Phil Anno	We introduce the notion of state-stacked sparseness to select a subset of the most critical sensors as a function of evolving system state.
99	SCRAM: A Sharing Considered Route Assignment Mechanism for Fair Taxi Route Recommendations	Shiyou Qian, Jian Cao, Frédéric Le Mouël, Issam Sahel, Minglu Li	In the paper, we propose SCRAM, a sharing considered route assignment mechanism for fair taxi route recommendations.
100	Locally Densest Subgraph Discovery	Lu Qin, Rong-Hua Li, Lijun Chang, Chengqi Zhang	In this paper, we aim to discover top-k such representative locally densest subgraphs of a graph.
101	Virus Propagation in Multiple Profile Networks	Angeliki Rapti, Spyros Sioutas, Kostas Tsichlas, Giannis Tzimas	Can we predict what proportion of the network will actually get "infected" (e.g., spread the idea or buy the competing product), when the nodes of the network appear to have different sensitivity based on their profile?
102	Collective Opinion Spam Detection: Bridging Review Networks and Metadata	Shebuti Rayana, Leman Akoglu	In this work, we propose a new holistic approach called SPEAGLE that utilizes clues from all metadata (text, timestamp, rating) as well as relational data (network), and harness them collectively under a unified framework to spot suspicious users and reviews, as well as products targeted by spam.
103	ClusType: Effective Entity Recognition and Typing by Relation Phrase-Based Clustering	Xiang Ren, Ahmed El-Kishky, Chi Wang, Fangbo Tao, Clare R. Voss, Jiawei Han	In this paper, we investigate entity recognition (ER) with distant-supervision and propose a novel relation phrase-based ER framework, called ClusType, that runs data-driven phrase mining to generate entity mention candidates and relation phrases, and enforces the principle that relation phrases should be softly clustered when propagating type information between their argument entities.
104	Mining Frequent Itemsets through Progressive Sampling with Rademacher Averages	Matteo Riondato, Eli Upfal	We present an algorithm to extract a high-quality approximation of the (top-k) Frequent itemsets (FIs) from random samples of a transactional dataset.
105	Why It Happened: Identifying and Modeling the Reasons of the Happening of Social Events	Yu Rong, Hong Cheng, Zhiyu Mo	Many models have been proposed to explain how information diffuses.
106	Matrix Completion with Queries	Natali Ruchansky, Mark Crovella, Evimaria Terzi	In this work, we address this problem by proposing an active version of matrix completion, where queries can be made to the true underlying matrix.
107	Stochastic Divergence Minimization for Online Collapsed Variational Bayes Zero Inference of Latent Dirichlet Allocation	Issei Sato, Hiroshi Nakagawa	We reformulate the existing SCVB0 inference by using the stochastic divergence minimization algorithm, with which convergence can be analyzed in terms of Martingale convergence theory.
108	Bayesian Poisson Tensor Factorization for Inferring Multilateral Relations from Sparse Dyadic Event Counts	Aaron Schein, John Paisley, David M. Blei, Hanna Wallach	We present a Bayesian tensor factorization model for inferring latent group structures from dynamic pairwise interaction patterns.
109	TimeCrunch: Interpretable Dynamic Graph Summarization	Neil Shah, Danai Koutra, Tianmin Zou, Brian Gallagher, Christos Faloutsos	Our main contributions are (a) formulation: we show how to formalize this problem as minimizing the encoding cost in a data compression paradigm, (b) algorithm: we propose TIMECRUNCH, an effective, scalable and parameter-free method for finding coherent, temporal patterns in dynamic graphs and (c) practicality: we apply our method to several large, diverse real-world datasets with up to 36 million edges and 6.3 million nodes.
110	Inside Jokes: Identifying Humorous Cartoon Captions	Dafna Shahaf, Eric Horvitz, Robert Mankoff	Motivated by the prospect of creating computational models of humor, we study the influence of the language of cartoon captions on the perceived humorousness of the cartoons.
111	Community Detection based on Distance Dynamics	Junming Shao, Zhichao Han, Qinli Yang, Tao Zhou	In this paper, we introduce a new community detection algorithm, called Attractor, which automatically spots communities in a network by examining the changes of "distances" among nodes (i.e. distance dynamics).
112	Discovery of Meaningful Rules in Time Series	Mohammad Shokoohi-Yekta, Yanping Chen, Bilson Campana, Bing Hu, Jesin Zakaria, Eamonn Keogh	In this work, we show why these ideas are not directly suitable for rule discovery in time series.
113	An Evaluation of Parallel Eccentricity Estimation Algorithms on Undirected Real-World Graphs	Julian Shun	This paper presents efficient shared-memory parallel implementations and the first comprehensive experimental study of graph eccentricity estimation algorithms in the literature.
114	Efficient Latent Link Recommendation in Signed Networks	Dongjin Song, David A. Meyer, Dacheng Tao	Since GAUC weights each pairwise comparison equally and the calculation of GAUC requires quadratic time, we derive two lower bounds of GAUC which can be computed in linear time and put more emphasis on ranking positive links on the top and negative links at the bottom of a ranking list.
115	Turn Waste into Wealth: On Simultaneous Clustering and Cleaning over Dirty Data	Shaoxu Song, Chunping Li, Xiaoquan Zhang	To this end, we study a novel problem of clustering and repairing over dirty data at the same time.
116	Set Cover at Web Scale	Stergios Stergiou, Kostas Tsioutsiouliklis	In this work we give the first MapReduce Set Cover algorithm that scales to problem sizes of ∼ 1 trillion elements and runs in log_p Δ iterations for a nearly optimum approximation ratio of p ln Δ, where Δ is the cardinality of the largest set in F. A web crawler is a system for bulk downloading of web pages.
117	Exploiting Relevance Feedback in Knowledge Graph Search	Yu Su, Shengqi Yang, Huan Sun, Mudhakar Srivatsa, Sue Kase, Michelle Vanni, Xifeng Yan	In this paper, we study how to improve graph query by relevance feedback.
118	LINKAGE: An Approach for Comprehensive Risk Prediction for Care Management	Zhaonan Sun, Fei Wang, Jianying Hu	In this paper, we propose a data-driven comprehensive risk prediction method, named LINKAGE, which can be used to jointly assess a set of associated risks in support of holistic care management.
119	Transitive Transfer Learning	Ben Tan, Yangqiu Song, Erheng Zhong, Qiang Yang	To solve the TTL problem, we propose a learning framework to mimic the human learning process.
120	PTE: Predictive Text Embedding through Large-scale Heterogeneous Text Networks	Jian Tang, Meng Qu, Qiaozhu Mei	In this paper, we fill this gap by proposing a semi-supervised representation learning method for text data, which we call the predictive text embedding (PTE).
121	An Effective Marketing Strategy for Revenue Maximization with a Quantity Constraint	Ya-Wen Teng, Chih-Hua Tai, Philip S. Yu, Ming-Syan Chen	To fill this gap, in this paper, we aim to maximize the revenue by considering the quantity constraint on the promoted commodity.
122	Scaling Up Stochastic Dual Coordinate Ascent	Kenneth Tran, Saghar Hosseini, Lin Xiao, Thomas Finley, Mikhail Bilenko	In this paper, we introduce an asynchronous parallel version of the algorithm, analyze its convergence properties, and propose a solution for primal-dual synchronization required to achieve convergence in practice.
123	Discovering Valuable Items from Massive Data	Hastagiri P. Vanchinathan, Andreas Marfurt, Charles-Antoine Robelin, Donald Kossmann, Andreas Krause	We present an algorithm, GP-SELECT, which utilizes prior knowledge about similarity between items, expressed as a kernel function.
124	Deep Learning Architecture with Dynamically Programmed Layers for Brain Connectome Prediction	Vivek Veeriah, Rohit Durvasula, Guo-Jun Qi	It is critical in the research for epilepsy and other neuropathological diseases.
125	Incorporating World Knowledge to Document Clustering via Heterogeneous Information Networks	Chenguang Wang, Yangqiu Song, Ahmed El-Kishky, Dan Roth, Ming Zhang, Jiawei Han	In this paper, we provide an example of using world knowledge for domain dependent document clustering.
126	Towards Interactive Construction of Topical Hierarchy: A Recursive Tensor Decomposition Approach	Chi Wang, Xueqing Liu, Yanglei Song, Jiawei Han	In this study, we propose a novel method, called STROD, that allows efficient and consistent modification of topic hierarchies, based on a recursive generative model and a scalable tensor decomposition inference algorithm with theoretical performance guarantee.
127	Collaborative Deep Learning for Recommender Systems	Hao Wang, Naiyan Wang, Dit-Yan Yeung	To address this problem, we generalize recent advances in deep learning from i.i.d. input to non-i.i.d. (CF-based) input and propose in this paper a hierarchical Bayesian model called collaborative deep learning (CDL), which jointly performs deep representation learning for the content information and collaborative filtering for the ratings (feedback) matrix.
128	Trading Interpretability for Accuracy: Oblique Treed Sparse Additive Models	Jialei Wang, Ryohei Fujimaki, Yosuke Motohashi	This paper proposes oblique treed sparse additive models (OT-SpAMs).
129	Geo-SAGE: A Geographical Sparse Additive Generative Model for Spatial Item Recommendation	Weiqing Wang, Hongzhi Yin, Ling Chen, Yizhou Sun, Shazia Sadiq, Xiaofang Zhou	In light of this, we propose Geo-SAGE, a geographical sparse additive generative model for spatial item recommendation in this paper.
130	Rubik: Knowledge Guided Tensor Factorization and Completion for Health Data Analytics	Yichen Wang, Robert Chen, Joydeep Ghosh, Joshua C. Denny, Abel Kho, You Chen, Bradley A. Malin, Jimeng Sun	We propose Rubik, a constrained non-negative tensor factorization and completion method for phenotyping.
131	Regularity and Conformity: Location Prediction Using Heterogeneous Mobility Data	Yingzi Wang, Nicholas Jing Yuan, Defu Lian, Linli Xu, Xing Xie, Enhong Chen, Yong Rui	To address these challenges, in this paper we propose a hybrid predictive model integrating both the regularity and conformity of human mobility as well as their mutual reinforcement.
132	Dynamic Poisson Autoregression for Influenza-Like-Illness Case Count Prediction	Zheng Wang, Prithwish Chakraborty, Sumiko R. Mekaru, John S. Brownstein, Jieping Ye, Naren Ramakrishnan	In this paper, we focus on short-term ILI case count prediction and develop a dynamic Poisson autoregressive model with exogenous input variables (DPARX) for flu forecasting.
133	Cinema Data Mining: The Smell of Fear	Jörg Wicker, Nicolas Krauter, Bettina Derstorff, Christof Stönner, Efstratios Bourtsoukidis, Thomas Klüpfel, Jonathan Williams, Stefan Kramer	The paper introduces a new field of application for data mining, where trace gas responses of people reacting on-line to films shown in cinemas (or movie theaters) are related to the semantic content of the films themselves.
134	Predicting Winning Price in Real Time Bidding with Censored Data	Wush Chi-Hsuan Wu, Mi-Yen Yeh, Ming-Syan Chen	We propose to leverage machine learning and statistical methods to train the winning price model from the bidding history.
135	Diversifying Restricted Boltzmann Machine for Document Modeling	Pengtao Xie, Yuntian Deng, Eric Xing	To solve this problem, we propose Diversified RBM (DRBM) which diversifies the hidden units, to make them cover not only the dominant topics, but also those in the long-tail region.
136	Edge-Weighted Personalized PageRank: Breaking A Decade-Old Performance Barrier	Wenlei Xie, David Bindel, Alan Demers, Johannes Gehrke	In this paper, we describe the first fast algorithm for computing PageRank on general graphs when the edge weights are personalized.
137	Petuum: A New Platform for Distributed Machine Learning on Big Data	Eric P. Xing, Qirong Ho, Wei Dai, Jin-Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, Yaoliang Yu	We propose a general-purpose framework that systematically addresses data- and model-parallel challenges in large-scale ML, by leveraging several fundamental properties underlying ML programs that make them different from conventional operation-centric programs: error tolerance, dynamic structure, and nonuniform convergence; all stem from the optimization-centric nature shared in ML programs’ mathematical definitions, and the iterative-convergent behavior of their algorithmic solutions.
138	Longitudinal LASSO: Jointly Learning Features and Temporal Contingency for Outcome Prediction	Tingyang Xu, Jiangwen Sun, Jinbo Bi	We propose an approach to automatically and simultaneously determine both the relevant features and the relevant temporal points that impact the current outcome of the dependent variable.
139	Performance Modeling and Scalability Optimization of Distributed Deep Learning Systems	Feng Yan, Olatunji Ruwase, Yuxiong He, Trishul Chilimbi	This paper develops performance models that quantify the impact of these partitioning and provisioning decisions on overall distributed system performance and scalability.
140	Deep Graph Kernels	Pinar Yanardag, S.V.N. Vishwanathan	In this paper, we present Deep Graph Kernels, a unified framework to learn latent representations of sub-structures for graphs, inspired by latest advancements in language modeling and deep learning.
141	Model Multiple Heterogeneity via Hierarchical Multi-Latent Space Learning	Pei Yang, Jingrui He	To address this problem, we propose a Hierarchical Multi-Latent Space (HiMLS) learning approach to jointly model the triple types of heterogeneity.
142	Structural Graphical Lasso for Learning Mouse Brain Connectivity	Sen Yang, Qian Sun, Shuiwang Ji, Peter Wonka, Ian Davidson, Jieping Ye	Motivated by the hierarchical structure of the brain networks, we consider the problem of estimating a graphical model with tree-structural regularization in this paper.
143	Entity Matching across Heterogeneous Sources	Yang Yang, Yizhou Sun, Jie Tang, Bo Ma, Juanzi Li	In this paper, we formalize the problem as entity matching across heterogeneous sources and propose a probabilistic topic model to solve the problem.
144	An Efficient Semi-Supervised Clustering Algorithm with Sequential Constraints	Jinfeng Yi, Lijun Zhang, Tianbao Yang, Wei Liu, Jun Wang	To address this challenge, we propose an efficient dynamic semi-supervised clustering framework that casts the clustering problem into a search problem over a feasible convex set, i.e., a convex hull with its extreme points being an ensemble of m data partitions.
145	Assembler: Efficient Discovery of Spatial Co-evolving Patterns in Massive Geo-sensory Data	Chao Zhang, Yu Zheng, Xiuli Ma, Jiawei Han	In this paper, we propose a two-stage method called Assembler.
146	Dynamic Topic Modeling for Monitoring Market Competition from Online Text and Image Data	Hao Zhang, Gunhee Kim, Eric P. Xing	We propose a dynamic topic model for monitoring temporal evolution of market competition by jointly leveraging tweets and their associated images.
147	Organizational Chart Inference	Jiawei Zhang, Philip S. Yu, Yuanhua Lv	In this paper, we study the IOC (Inference of Organizational Chart) problem: identifying a company's internal organizational chart based on the heterogeneous online ESN launched within it.
148	Panther: Fast Top-k Similarity Search on Large Networks	Jing Zhang, Jie Tang, Cong Ma, Hanghang Tong, Yu Jing, Juanzi Li	In this paper, we propose a sampling method that provably and accurately estimates the similarity between vertices.
149	A Collective Bayesian Poisson Factorization Model for Cold-start Local Event Recommendation	Wei Zhang, Jianyong Wang	In this work, we address the new problem of cold-start local event recommendation in EBSNs.
150	Statistical Arbitrage Mining for Display Advertising	Weinan Zhang, Jun Wang	In this paper, we propose a novel data mining paradigm called Statistical Arbitrage Mining (SAM) focusing on mining and exploiting price discrepancies between two pricing schemes.
151	Deep Model Based Transfer and Multi-Task Learning for Biological Image Analysis	Wenlu Zhang, Rongjian Li, Tao Zeng, Qian Sun, Sudhir Kumar, Jieping Ye, Shuiwang Ji	Here, we developed problem-independent feature extraction methods to generate hierarchical representations for ISH images.
152	COSNET: Connecting Heterogeneous Social Networks with Local and Global Consistency	Yutao Zhang, Jie Tang, Zhilin Yang, Jian Pei, Philip S. Yu	In this paper, we propose COSNET (COnnecting heterogeneous Social NETworks with local and global consistency), a novel energy-based model, to address this problem by considering both local and global consistency among multiple networks.
153	SAME but Different: Fast and High Quality Gibbs Parameter Estimation	Huasha Zhao, Biye Jiang, John F. Canny, Bobby Jaros	In this paper we explore the application of SAME to graphical model inference on modern hardware.
154	Multi-Task Learning for Spatio-Temporal Event Forecasting	Liang Zhao, Qian Sun, Jieping Ye, Feng Chen, Chang-Tien Lu, Naren Ramakrishnan	This paper proposes a novel multi-task learning framework which aims to concurrently address all the challenges.
155	SEISMIC: A Self-Exciting Point Process Model for Predicting Tweet Popularity	Qingyuan Zhao, Murat A. Erdogdu, Hera Y. He, Anand Rajaraman, Jure Leskovec	In this paper, we focus on predicting the final number of reshares of a given post.
156	Linear Time Samplers for Supervised Topic Models using Compositional Proposals	Xun Zheng, Yaoliang Yu, Eric P. Xing	In this work we extend the recent sampling advances for unsupervised LDA models to supervised tasks.
157	L∞ Error and Bandwidth Selection for Kernel Density Estimates of Large Data	Yan Zheng, Jeff M. Phillips	In this paper we investigate the challenges in using L∞ (or worst case) error, a stronger measure than L₁ or L₂.
158	Modeling Truth Existence in Truth Discovery	Shi Zhi, Bo Zhao, Wenzhu Tong, Jing Gao, Dian Yu, Heng Ji, Jiawei Han	By incorporating these three measures, we propose a probabilistic graphical model, which simultaneously infers truth as well as source quality without any a priori training involving ground truth answers.
159	Cuckoo Linear Algebra	Li Zhou, David G. Andersen, Mu Li, Alexander J. Smola	In this paper we present a novel data structure for sparse vectors based on Cuckoo hashing.
160	Integrating Vertex-centric Clustering with Edge-centric Clustering for Meta Path Graph Analysis	Yang Zhou, Ling Liu, David Buttler	This paper presents a meta path graph clustering framework, VEPATHCLUSTER, that combines meta path vertex-centric clustering with meta path edge-centric clustering for improving the clustering quality of heterogeneous networks.
161	Modeling User Mobility for Location Promotion in Location-based Social Networks	Wen-Yuan Zhu, Wen-Chih Peng, Ling-Jyh Chen, Kai Zheng, Xiaofang Zhou	In this paper, we investigate the key techniques that can help businesses promote their locations by advertising wisely through the underlying LBSNs.
162	Co-Clustering based Dual Prediction for Cargo Pricing Optimization	Yada Zhu, Hongxia Yang, Jingrui He	In particular, we propose a probabilistic framework to simultaneously construct dual predictive models and uncover the co-clusters of originations and destinations.
163	Debiasing Crowdsourced Batches	Honglei Zhuang, Aditya Parameswaran, Dan Roth, Jiawei Han	In this paper, we study the data annotation bias when data items are presented as batches to be judged by workers simultaneously.
164	Query Workloads for Data Series Indexes	Kostas Zoumpatianos, Yin Lou, Themis Palpanas, Johannes Gehrke	In this work, we show that random workloads are inherently not suitable for the task at hand and we argue that there is a need for carefully generating a query workload.
165	Scaling Machine Learning and Statistics for Web Applications	Deepak Agarwal	I will provide an overview of these challenges and the strategies we have adopted at LinkedIn to address those.
166	Hadoop’s Impact on the Future of Data Management	Amr Awadallah	Hadoop’s Impact on the Future of Data Management
167	Should You Trust Your Money to a Robot?	Vasant Dhar	Should You Trust Your Money to a Robot?
168	Data Science at Visa	Waqar Hasan, Min Wang	We will describe technical achievements we have made in the area of fraud and cover some open challenges in data science.
169	How Artificial Intelligence and Big Data Created Rocket Fuel: A Case Study	George John	The case study presentation will present a fast-paced overview of the business and technology context for Rocket Fuel at inception and at present, key learnings and decisions, and the road ahead.
170	Optimizing Marketing Impact through Data Driven Decisioning	Anil Kamath	In this talk we will show how data science and optimization techniques can be applied to cross channel data to attribute marketing effectiveness, drive media planning and real-time optimization of campaigns.
171	Powering Real-time Decision Engines in Finance and Healthcare using Open Source Software	Bassel Ojjeh	This presentation covers how, in collaboration with financial services and healthcare institutions, we built an OSS project to deliver a real-time decisioning engine for their respective applications.
172	Clouded Intelligence	Joseph Sirosh	In this talk I will review what these trends mean for the future of data science and show examples of revolutionary applications that you can build using cloud platforms.
173	Data Science from the Lab to the Field to the Enterprise	Christopher White	This presentation will cover previous work at DARPA, experience building real-world applications for defense and law enforcement to analyze data, and the future of computer science as an enabler for content discovery, information extraction, relevance determination, and information visualization.
174	User Modeling in Telecommunications and Internet Industry	Qiang Yang	What are the "pain" points of users? In this talk, I will discuss my own experience on user modeling with big data.
175	The Effectiveness of Marketing Strategies in Social Media: Evidence from Promotional Events	Panagiotis Adamopoulos, Vilma Todri	We use a real-world data set and employ a promising research approach combining econometric with predictive modeling techniques in a causal estimation framework that allows for more accurate counterfactuals.
176	Personalizing LinkedIn Feed	Deepak Agarwal, Bee-Chung Chen, Qi He, Zhenhao Hua, Guy Lebanon, Yiming Ma, Pannagadatta Shivaswamy, Hsiao-Ping Tseng, Jaewon Yang, Liang Zhang	More specifically, we focus on the personalization models by generating three kinds of affinity scores: Viewer-ActivityType Affinity, Viewer-Actor Affinity, and Viewer-Actor-ActivityType Affinity.
177	Whither Social Networks for Web Search?	Rakesh Agrawal, Behzad Golshan, Evangelos Papalexakis	We present the results of our empirical study that indicates that by mining Twitter data one can obtain search results that are quite distinct from those produced by Google and Bing.
178	Exploiting Data Mining for Authenticity Assessment and Protection of High-Quality Italian Wines from Piedmont	Marco Arlorio, Jean Daniel Coisson, Giorgio Leonardi, Monica Locatelli, Luigi Portinale	Following Wagstaff’s proposal for practical exploitation of machine learning (and data mining) approaches, we describe how data have been collected and prepared for the production of different datasets, how suitable classification models have been identified and how the interpretation of the results suggests the emergence of an active role of classification techniques, based on standard chemical profiling, for the assessment of the authenticity of the wines targeted by the study.
179	Predictive Approaches for Low-Cost Preventive Medicine Program in Developing Countries	Yukino Baba, Hisashi Kashima, Yasunobu Nohara, Eiko Kai, Partha Ghosh, Rafiqul Islam, Ashir Ahmed, Masahiro Kuroda, Sozo Inoue, Tatsuo Hiramatsu, Michio Kimura, Shuji Shimizu, Kunihisa Kobayashi, Koji Tsuda, Masashi Sugiyama, Mathieu Blondel, Naonori Ueda, Masaru Kitsuregawa, Naoki Nakashima	In this study, we investigate predictive modeling for providing a low-cost preventive medicine program. In our two-year-long field study in Bangladesh, we collected the health checkup results of 15,075 subjects, the data of 6,607 prescriptions, and the follow-up examination results of 2,109 subjects.
180	Dynamic Hierarchical Classification for Patient Risk-of-Readmission	Senjuti Basu Roy, Ankur Teredesai, Kiyana Zolfaghar, Rui Liu, David Hazel, Stacey Newman, Albert Marinez	In this paper, we describe a supervised learning framework, Dynamic Hierarchical Classification (DHC) for patient’s risk-of-readmission prediction.
181	ALOJA-ML: A Framework for Automating Characterization and Knowledge Discovery in Hadoop Deployments	Josep Lluís Berral, Nicolas Poggi, David Carrera, Aaron Call, Rob Reinauer, Daron Green	This article presents ALOJA-Machine Learning (ALOJA-ML), an extension to the ALOJA project that uses machine learning techniques to interpret Hadoop benchmark performance data and performance tuning; here we detail the approach, efficacy of the model and initial results.
182	Multi-View Incident Ticket Clustering for Optimal Ticket Dispatching	Mirela Madalina Botezatu, Jasmina Bogojeska, Ioana Giurgiu, Hagen Voelzer, Dorothea Wiesmann	We present a novel technique that optimizes the dispatching of incident tickets to the agents in an IT Service Support Environment.
183	Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission	Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, Noemie Elhadad	In the 30-day hospital readmission case study, we show that the same methods scale to large datasets containing hundreds of thousands of patients and thousands of attributes while remaining intelligible and providing accuracy comparable to the best (unintelligible) machine learning methods.
184	User Conditional Hashtag Prediction for Images	Emily Denton, Jason Weston, Manohar Paluri, Lubomir Bourdev, Rob Fergus	We explore two ways of combining these heterogeneous features into a learning framework: (i) simple concatenation; and (ii) a 3-way multiplicative gating, where the image model is conditioned on the user metadata.
185	Big Data System for Analyzing Risky Procurement Entities	Amit Dhurandhar, Bruce Graves, Rajesh Ravi, Gopikrishanan Maniachari, Markus Ettl	In this paper, we describe a robust tool to identify procurement related fraud/risk, though the general design and the analytical components could be adapted to detecting fraud in other domains.
186	Probabilistic Modeling of a Sales Funnel to Prioritize Leads	Brendan Andrew Duncan, Charles Peter Elkan	Specifically, we present two models, called DQM for direct qualification model and FFM for full funnel model, that can be used to rank initial leads based on their probability of conversion to a sales opportunity, probability of successful sale, and/or expected revenue.
187	Online Topic-based Social Influence Analysis for the Wimbledon Championships	Varun R. Embar, Indrajit Bhattacharya, Vinayaka Pandit, Roman Vaculin	In this paper, we define various functional and usability criteria that social influence scores should satisfy, and propose a multi-dimensional definition of influence that satisfies these criteria.
188	Collective Spammer Detection in Evolving Multi-Relational Social Networks	Shobeir Fakhraei, James Foulds, Madhusudana Shashanka, Lise Getoor	Motivated by the Tagged.com social network, we develop methods to identify spammers in evolving multi-relational social networks.
189	Utilizing Text Mining on Online Medical Forums to Predict Label Change due to Adverse Drug Reactions	Ronen Feldman, Oded Netzer, Aviv Peretz, Binyamin Rosenfeld	We present an end-to-end text mining methodology for relation extraction of adverse drug reactions (ADRs) from medical forums on the Web.
190	One-Pass Ranking Models for Low-Latency Product Recommendations	Antonino Freno, Martin Saveski, Rodolphe Jenatton, Cédric Archambeau	In this paper, we investigate how the practical challenges faced in this setting can be tackled via an online learning to rank approach.
191	On the Reliability of Profile Matching Across Large Online Social Networks	Oana Goga, Patrick Loiseau, Robin Sommer, Renata Teixeira, Krishna P. Gummadi	In this paper, we study the extent to which we can reliably match profiles in practice, across real-world social networks, by exploiting public attributes, i.e., information users publicly provide about themselves.
192	E-commerce in Your Inbox: Product Recommendations at Scale	Mihajlo Grbovic, Vladan Radosavljevic, Nemanja Djuric, Narayan Bhamidipati, Jaikit Savla, Varun Bhagwan, Doug Sharp	In this paper we describe a system that leverages user purchase history determined from e-mail receipts to deliver highly personalized product ads to Yahoo Mail users.
193	Gender and Interest Targeting for Sponsored Post Advertising at Tumblr	Mihajlo Grbovic, Vladan Radosavljevic, Nemanja Djuric, Narayan Bhamidipati, Ananth Nagarajan	In this paper, we present a framework that enabled two of the key targeted advertising components for Tumblr, gender and interest targeting.
194	Mining Administrative Data to Spur Urban Revitalization	Ben Green, Alejandra Caro, Matthew Conway, Robert Manduca, Tom Plagge, Abby Miller	In this paper, we apply data science techniques to administrative data to help the City of Memphis, Tennessee improve distressed neighborhoods.
195	Measuring Causal Impact of Online Actions via Natural Experiments: Application to Display Advertising	Daniel N. Hill, Robert Moakler, Alan E. Hubbard, Vadim Tsemekhman, Foster Provost, Kiril Tsemekhman	Here we present a novel framework for estimating causal effects that relies on neither randomized experiments nor adjusting for the potentially explosive number of variables used in predictive models.
196	Focusing on the Long-term: It’s Good for Users and Business	Henning Hohnhold, Deirdre O’Brien, Diane Tang	The results presented in this paper are generalizable in two major ways.
197	Traffic Measurement and Route Recommendation System for Mass Rapid Transit (MRT)	Thomas Holleczek, Dang The Anh, Shanyang Yin, Yunye Jin, Spiros Antonatos, Han Leong Goh, Samantha Low, Amy Shi-Nash	We have therefore developed and deployed a traffic measurement system for a key player in the transportation industry to gain insights into crowd behavior for planning purposes.
198	Real-Time Bid Prediction using Thompson Sampling-Based Expert Selection	Elena Ikonomovska, Sina Jafarpour, Ali Dasdan	In this paper we propose to use probability sampling (via Thompson Sampling) as a meta-learning algorithm that samples from the pool of experts for the purpose of bid prediction.
199	Life-stage Prediction for Product Recommendation in E-commerce	Peng Jiang, Yadong Zhu, Yi Zhang, Quan Yuan	In this paper, we find an obvious correlation between life stage and purchasing behavior in many E-commerce categories.
200	Visual Search at Pinterest	Yushi Jing, David Liu, Dmitry Kislyuk, Andrew Zhai, Jiajing Xu, Jeff Donahue, Sarah Tavel	We demonstrate that, with the availability of distributed computation platforms such as Amazon Web Services and open-source tools, it is possible for a small engineering team to build, launch and maintain a cost-effective, large-scale visual search system.
201	Discovering Collective Narratives of Theme Parks from Large Collections of Visitors’ Photo Streams	Gunhee Kim, Leonid Sigal	We present an approach for generating pictorial storylines from large collections of online photo streams shared by visitors to theme parks (e.g. Disneyland), along with publicly available information such as visitor’s maps.
202	A Machine Learning Framework to Identify Students at Risk of Adverse Academic Outcomes	Himabindu Lakkaraju, Everaldo Aguiar, Carl Shan, David Miller, Nasir Bhanpuri, Rayid Ghani, Kecia L. Addison	This paper describes a machine learning framework to identify such students, discusses features that are useful for this task, applies several classification algorithms, and evaluates them using metrics important to school administrators.
203	Probabilistic Graphical Models of Dyslexia	Yair Lakretz, Gal Chechik, Naama Friedmann, Michal Rosen-Zvi	In this study, introducing a novel approach, we use two families of probabilistic graphical models to analyze patterns of reading errors made by dyslexic people: an LDA-based model and two Naïve Bayes models which differ by their assumptions about the generation process of reading errors.
204	Promoting Positive Post-Click Experience for In-Stream Yahoo Gemini Users	Mounia Lalmas, Janette Lehmann, Guy Shaked, Fabrizio Silvestri, Gabriele Tolomei	In this paper, we describe the method we have implemented in Yahoo Gemini to measure the post-click experience on Yahoo mobile news streams via an automatic analysis of advert landing pages.
205	Generic and Scalable Framework for Automated Time-series Anomaly Detection	Nikolay Laptev, Saeed Amizadeh, Ian Flint	This paper introduces a generic and scalable framework for automated anomaly detection on large scale time-series data.
206	Leveraging Knowledge Bases for Contextual Entity Exploration	Joonseok Lee, Ariel Fuxman, Bo Zhao, Yuanhua Lv	We present a system called Lewis for retrieving contextually relevant entity results leveraging a knowledge graph, and perform a large scale crowdsourcing experiment in the context of an e-reader scenario, which shows that Lewis can outperform the state-of-the-art contextual entity recommendation systems by more than 20% in terms of the MAP score.
207	Click-through Prediction for Advertising in Twitter Timeline	Cheng Li, Yue Lu, Qiaozhu Mei, Dong Wang, Sandeep Pandey	We present the problem of click-through prediction for advertising in Twitter timeline, which displays a stream of Tweets from accounts a user chooses to follow.
208	Predicting Voice Elicited Emotions	Ying Li, Jose D. Contreras, Luis J. Salazar	We present the research, and product development and deployment, of ‘Voice Analyzer’ by Jobaline Inc.
209	Discovery of Glaucoma Progressive Patterns Using Hierarchical MDL-Based Clustering	Shigeru Maya, Kai Morino, Hiroshi Murata, Ryo Asaoka, Kenji Yamanishi	In this paper, we propose a method to cluster the spatial patterns of the visual field in glaucoma patients to analyze the progression patterns of glaucoma.
210	Distributed Personalization	Xu Miao, Chun-Te Chu, Lijun Tang, Yitong Zhou, Joel Young, Anmol Bhasin	In this paper, we formalize the generic personalization problem as an optimization problem.
211	Voltage Correlations in Smart Meter Data	Rajendu Mitra, Ramachandra Kota, Sambaran Bandyopadhyay, Vijay Arya, Brian Sullivan, Richard Mueller, Heather Storey, Gerard Labut	This work shows that voltage time series measurements collected from customer smart meters exhibit correlations that are consistent with the hierarchical structure of the distribution network.
212	Analyzing Invariants in Cyber-Physical Systems using Latent Factor Regression	Marjan Momtazpour, Jinghe Zhang, Saifur Rahman, Ratnesh Sharma, Naren Ramakrishnan	We describe a latent factor approach to infer invariants underlying system variables and how we can leverage these relationships to monitor a cyber-physical system.
213	Predicting Future Scientific Discoveries Based on a Networked Analysis of the Past Literature	Meenakshi Nagarajan, Angela D. Wilkins, Benjamin J. Bachman, Ilya B. Novikov, Shenghua Bao, Peter J. Haas, María E. Terrón-Díaz, Sumit Bhatia, Anbu K. Adikesavan, Jacques J. Labrie, Sam Regenbogen, Christie M. Buchovecky, Curtis R. Pickering, Linda Kato, Andreas M. Lisewski, Ana Lelescu, Houyin Zhang, Stephen Boyer, Griff Weber, Ying Chen, Lawrence Donehower, Scott Spangler, Olivier Lichtarge	We present KnIT, the Knowledge Integration Toolkit, a system for accelerating scientific discovery and predicting previously unknown protein-protein interactions.
214	Learning a Hierarchical Monitoring System for Detecting and Diagnosing Service Issues	Vinod Nair, Ameya Raul, Shwetabh Khanduja, Vikas Bahirwani, Qihong Shao, Sundararajan Sellamanickam, Sathiya Keerthi, Steve Herbert, Sudheer Dhulipalla	We propose a machine learning based framework for building a hierarchical monitoring system to detect and diagnose service issues.
215	Predictive Modeling for Public Health: Preventing Childhood Lead Poisoning	Eric Potash, Joe Brew, Alexander Loewi, Subhabrata Majumdar, Andrew Reece, Joe Walsh, Eric Rozier, Emile Jorgenson, Raed Mansour, Rayid Ghani	This paper describes joint work with the Chicago Department of Public Health (CDPH) in which we build a model that predicts the risk of a child being poisoned so that an intervention can take place before that happens.
216	Proof Protocol for a Machine Learning Technique Making Longitudinal Predictions in Dynamic Contexts	Kevin B. Pratt	We propose necessary components of the proof protocol and demonstrate results visualizations to support evaluation of the proof components.
217	An Architecture for Agile Machine Learning in Real-Time Applications	Johann Schleier-Smith	Machine learning techniques have proved effective in recommender systems and other applications, yet teams working to deploy them lack many of the advantages that those in more established software disciplines today take for granted.
218	Scalable Machine Learning Approaches for Neighborhood Classification Using Very High Resolution Remote Sensing Imagery	Manu Sethi, Yupeng Yan, Anand Rangarajan, Ranga Raju Vatsavai, Sanjay Ranka	A semi-supervised learning approach for identifying neighborhoods is presented which employs superpixel tessellation representations of VHR imagery.
219	Early Identification of Violent Criminal Gang Members	Elham Shaabani, Ashkan Aleali, Paulo Shakarian, John Bertetto	In this paper, we study the problem of early identification of violent gang members.
220	Spoken English Grading: Machine Learning with Crowd Intelligence	Vinay Shashidhar, Nishant Pandey, Varun Aggarwal	In this paper, we address the problem of grading spontaneous speech using a combination of machine learning and crowdsourcing.
221	Effective Audience Extension in Online Advertising	Jianqiang Shen, Sahin Cem Geyik, Ali Dasdan	In this paper, we formally define the audience extension problem, propose an algorithm that extends a given audience set efficiently under multiple desirable criteria, and experimentally validate its performance.
222	Going In-Depth: Finding Longform on the Web	Virginia Smith, Miriam Connor, Isabelle Stanton	In this work, we develop a system to automatically identify longform content across the web.
223	Early Prediction of Cardiac Arrest (Code Blue) using Electronic Medical Records	Sriram Somanchi, Samrachana Adhikari, Allen Lin, Elena Eneva, Rayid Ghani	In this paper, we describe our work, in partnership with NorthShore University HealthSystem, that preemptively flags patients who are likely to go into cardiac arrest, using signals extracted from demographic information, hospitalization history, vitals and laboratory measurements in patient-level electronic medical records.
224	When-To-Post on Social Networks	Nemanja Spasojevic, Zhisheng Li, Adithya Rao, Prantik Bhattacharyya	In this study, we formulate a when-to-post problem, where the objective is to find the best times for a user to post on social networks in order to maximize the probability of audience responses.
225	Mining for Causal Relationships: A Data-Driven Study of the Islamic State	Andrew Stanton, Amanda Thart, Ashish Jain, Priyank Vyas, Arpan Chatterjee, Paulo Shakarian	In this paper, we present a data-driven approach to analyzing this group using a dataset consisting of 2200 incidents of military activity surrounding ISIS and the forces that oppose it (including Iraqi, Syrian, and the American-led coalition).
226	Transfer Learning for Bilingual Content Classification	Qian Sun, Mohammad Amin, Baoshi Yan, Craig Martell, Vita Markman, Anmol Bhasin, Jieping Ye	In this paper, we take the spam (Spanish) job posting detection as the target problem and build a generic machine learning pipeline for multi-lingual spam detection.
227	FrauDetector: A Graph-Mining-based Framework for Fraudulent Phone Call Detection	Vincent S. Tseng, Jia-Ching Ying, Che-Wei Huang, Yimin Kao, Kuan-Ta Chen	In this paper, we develop a graph-mining-based fraudulent phone call detection framework for a mobile application to automatically annotate fraudulent phone numbers with a "fraud" tag, which is a crucial prerequisite for distinguishing fraudulent phone calls from normal phone calls.
228	Efficient Long-Term Degradation Profiling in Time Series for Complex Physical Systems	Liudmila Ulanova, Tan Yan, Haifeng Chen, Guofei Jiang, Eamonn Keogh, Kai Zhang	In this work, we introduce a novel time series analysis technique that allows the decomposition of the time series into trend and fluctuation components, providing the monitoring software with actionable information about the changes of the system’s behavior over time.
229	Interpreting Advertiser Intent in Sponsored Search	Bhanu C. Vattikonda, Santhosh Kodipaka, Hongyan Zhou, Vacha Dave, Saikat Guha, Alex C. Snoeren	Past work has employed a bag-of-words approach using features extracted from both the query and potential sponsored result to train the ranker.
230	Client Clustering for Hiring Modeling in Work Marketplaces	Vasilis Verroios, Panagiotis Papadimitriou, Ramesh Johari, Hector Garcia-Molina	We propose a Maximum Likelihood definition of the "optimal" client clustering along with an efficient Expectation-Maximization clustering algorithm that can be applied in large marketplaces.
231	Discerning Tactical Patterns for Professional Soccer Teams: An Enhanced Topic Model with Applications	Qing Wang, Hengshu Zhu, Wei Hu, Zhiyong Shen, Yuan Yao	In this paper, we propose an unsupervised approach to automatically discerning the typical tactics, i.e., tactical patterns, of soccer teams by mining historical match logs.
232	Predicting Serves in Tennis using Style Priors	Xinyu Wei, Patrick Lucey, Stuart Morgan, Peter Carr, Machar Reid, Sridha Sridharan	In this paper we present a method which recommends the most likely serves of a player in a given context.
233	Smart Pacing for Effective Online Ad Campaign Optimization	Jian Xu, Kuang-chih Lee, Wentong Li, Hang Qi, Quan Lu	In this paper, we propose a smart pacing approach in which the delivery pace of each campaign is learned from both offline and online data to achieve smooth delivery and optimal performance goals.
234	From Infrastructure to Culture: A/B Testing Challenges in Large Scale Social Networks	Ya Xu, Nanyu Chen, Addrian Fernandez, Omar Sinno, Anmol Bhasin	In this paper, we describe in depth the experimentation platform we have built at LinkedIn and the challenges that arise particularly when running A/B tests at large scale in a social network setting.
235	Tornado Forecasting with Multiple Markov Boundaries	Kui Yu, Dawei Wang, Wei Ding, Jian Pei, David L. Small, Shafiqul Islam, Xindong Wu	In this work, we provide a new solution to use the concept of multiple Markov boundaries in local causal discovery to identify multiple sets of the precursors for tornado forecasting.
236	Gas Concentration Reconstruction for Coal-Fired Boilers Using Gaussian Process	Chao Yuan, Matthias Behmann, Bernhard Meerbeck	We propose a Bayesian approach based on Gaussian process (GP) to address both image reconstruction and path arrangement problems simultaneously.
237	Annotating Needles in the Haystack without Looking: Product Information Extraction from Emails	Weinan Zhang, Amr Ahmed, Jie Yang, Vanja Josifovski, Alex J. Smola	To extract product information from emails, we propose a hybrid approach that trains a CRF model using the labels predicted by binary classifiers (weak learners).
238	Forecasting Fine-Grained Air Quality Based on Big Data	Yu Zheng, Xiuwen Yi, Ming Li, Ruiyuan Li, Zhangqing Shan, Eric Chang, Tianrui Li	In this paper, we forecast the reading of an air quality monitoring station over the next 48 hours, using a data-driven method that considers current meteorological data, weather forecasts, and air quality data of the station and that of other stations within a few hundred kilometers.
239	Building Discriminative User Profiles for Large-scale Content Recommendation	Erheng Zhong, Nathan Liu, Yue Shi, Suju Rajan	In this paper, we propose a hybrid solution that makes use of a latent factor model to infer user interest vectors.
240	Stock Constrained Recommendation in Tmall	Wenliang Zhong, Rong Jin, Cheng Yang, Xiaowei Yan, Qi Zhang, Qiang Li	We address this challenge by developing a dual method that reduces the number of variables from n^2 to n, significantly improving the computational efficiency.
241	Predicting Ambulance Demand: a Spatio-Temporal Kernel Approach	Zhengyi Zhou, David S. Matteson	We propose a predictive method using spatio-temporal kernel density estimation (stKDE) to address these challenges, and provide spatial density predictions for ambulance demand in Toronto, Canada as it varies over hourly intervals.
242	Web Personalization and Recommender Systems	Shlomo Berkovsky, Jill Freyne	This tutorial will provide the participants with a broad overview and a thorough understanding of algorithms and practically deployed Web and mobile applications of personalized technologies.
243	Graph-Based User Behavior Modeling: From Prediction to Fraud Detection	Alex Beutel, Leman Akoglu, Christos Faloutsos	In this tutorial we will answer these questions – connecting graph analysis tools for user behavior modeling to anomaly and fraud detection.
244	Data-Driven Product Innovation	Xin Fu, Hernán Asorey	In this tutorial, we introduce the framework that we created to nurture data-driven product innovations.
245	Dense Subgraph Discovery: KDD 2015 tutorial	Aristides Gionis, Charalampos E. Tsourakakis	In this tutorial we aim to provide a comprehensive overview of (i) major algorithmic techniques for finding dense subgraphs in large graphs and (ii) graph mining applications that rely on dense subgraph extraction.
246	Diffusion in Social and Information Networks: Research Problems, Probabilistic Models and Machine Learning Methods	Manuel Gomez Rodriguez, Le Song	In this tutorial, we will present several diffusion models designed for fine-grained large-scale diffusion and social event data, present some canonical research problems in the context of diffusion, and introduce state-of-the-art algorithms to solve some of these problems, in particular, network estimation, influence estimation and control, and rumor source identification.
247	Social Media Anomaly Detection: Challenges and Solutions	Yan Liu, Sanjay Chawla	In this tutorial, we survey existing work on social media anomaly detection, focusing on the new anomalous phenomena in social media and the most recent techniques to detect those special types of anomalies.
248	Automatic Entity Recognition and Typing from Massive Text Corpora: A Phrase and Network Mining Approach	Xiang Ren, Ahmed El-Kishky, Chi Wang, Jiawei Han	In this tutorial, we introduce data-driven methods to recognize typed entities of interest in massive, domain-specific text corpora.
249	VC-Dimension and Rademacher Averages: From Statistical Learning Theory to Sampling Algorithms	Matteo Riondato, Eli Upfal	In this tutorial, we survey the use of Rademacher Averages and the VC-dimension in sampling-based algorithms for graph analysis and pattern mining.
250	Large Scale Distributed Data Science using Apache Spark	James G. Shanahan, Laing Dai	This tutorial will provide an accessible introduction to Spark and its potential to revolutionize academic and commercial data science practices.
251	Medical Mining: KDD 2015 Tutorial	Myra Spiliopoulou, Pedro Pereira Rodrigues, Ernestina Menasalvas	The purpose of this tutorial is to contribute to this learning process.
252	Big Data Analytics: Optimization and Randomization	Tianbao Yang, Qihang Lin, Rong Jin	In the first part, we plan to present the state-of-the-art large-scale optimization algorithms, including various stochastic gradient descent methods, stochastic coordinate descent methods and distributed optimization algorithms, for solving various machine learning problems.
253	Data Driven Science: SIGKDD Panel	Katharina Morik, Hugh Durrant-Whyte, Gary Hill, Dietmar Müller, Tanya Berger-Wolf	Knowledge discovery methods are finding broad application in all areas of scientific endeavor, to explore experimental data, to discover new models, to propose new scientific theories and ideas.