Paper Digest: KDD 2014 Highlights

August 1, 2014 | June 25, 2020 | admin

The ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) is one of the top data mining conferences in the world.

To help the community quickly catch up on the work presented at this conference, the Paper Digest Team processed all accepted papers and generated one highlight sentence (typically the main topic) for each paper. Readers are encouraged to read these machine-generated highlights / summaries to quickly get the main idea of each paper.

If you do not want to miss any interesting academic paper, you are welcome to sign up for our free daily paper digest service to get updates on new papers published in your area every day. You are also welcome to follow us on Twitter and LinkedIn to stay updated on new conference digests.

Paper Digest Team
team@paperdigest.org

TABLE 1: KDD 2014 Papers

#	Title	Authors	Highlight
1	The battle for the future of data mining	Oren Etzioni	My talk will describe work at the new Allen Institute for AI towards building the next-generation of text-mining systems.
2	Data, predictions, and decisions in support of people and society	Eric Horvitz	I will describe efforts to harness data for making predictions and guiding decisions, touching on work in transportation, healthcare, online services, and interactive systems.
3	A data driven approach to diagnosing and treating disease	Eric Schadt	More specifically, we have constructed predictive network models for Alzheimer’s disease, along with other common human diseases such as obesity, diabetes, heart disease, inflammatory bowel disease, and cancer, and demonstrated a causal network common across all of these diseases (3, 5-10).
4	Bugbears or legitimate threats?: (social) scientists’ criticisms of machine learning?	Sendhil Mullainathan	This talk describes joint work with Jon Kleinberg and individual projects with Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, Anuj Shah, Chenhao Tan, Mike Yeomans and Tom Zimmerman.
5	Prediction of human emergency behavior and their mobility following large-scale disaster	Xuan Song, Quanshi Zhang, Yoshihide Sekimoto, Ryosuke Shibasaki	In this paper, we build up a large human mobility database (GPS records of 1.6 million users over one year) and several different datasets to capture and analyze human emergency behavior and their mobility following the Great East Japan Earthquake and Fukushima nuclear accident.
6	Inferring user demographics and social strategies in mobile social networks	Yuxiao Dong, Yang Yang, Jie Tang, Yang Yang, Nitesh V. Chawla	In this paper, we aim to harness the power of big data to automatically infer users’ demographics based on their daily mobile communication patterns.
7	Travel time estimation of a path using sparse trajectories	Yilun Wang, Yu Zheng, Yexiang Xue	In this paper, we propose a citywide, real-time model for estimating the travel time of any path (represented as a sequence of connected road segments), based on the GPS trajectories of vehicles received in current time slots and over a period of history as well as map data sources.
8	Modeling human location data with mixtures of kernel densities	Moshe Lichman, Padhraic Smyth	In this paper we address the problem of learning spatial density models, focusing specifically on individual-level data.
9	A cost-effective recommender system for taxi drivers	Meng Qu, Hengshu Zhu, Junming Liu, Guannan Liu, Hui Xiong	To this end, in this paper, we propose to develop a cost-effective recommender system for taxi drivers.
10	LUDIA: an aggregate-constrained low-rank reconstruction algorithm to leverage publicly released health data	Yubin Park, Joydeep Ghosh	This paper introduces LUDIA, a novel low-rank approximation algorithm that utilizes aggregation constraints in addition to auxiliary information in order to estimate or "reconstruct" the original individual-level values from aggregate data.
11	People on drugs: credibility of user statements in health communities	Subhabrata Mukherjee, Gerhard Weikum, Cristian Danescu-Niculescu-Mizil	In this work we propose a method for automatically establishing the credibility of user-generated medical statements and the trustworthiness of their authors by exploiting linguistic cues and distant supervision from expert sources.
12	Unfolding physiological state: mortality modelling in intensive care units	Marzyeh Ghassemi, Tristan Naumann, Finale Doshi-Velez, Nicole Brimmer, Rohit Joshi, Anna Rumshisky, Peter Szolovits	We examined the use of latent variable models (viz.
13	Unsupervised learning of disease progression models	Xiang Wang, David Sontag, Fei Wang	In this paper, we propose a probabilistic disease progression model that addresses these challenges.
14	Good-enough brain model: challenges, algorithms and discoveries in multi-subject experiments	Evangelos E. Papalexakis, Alona Fyshe, Nicholas D. Sidiropoulos, Partha Pratim Talukdar, Tom M. Mitchell, Christos Faloutsos	In this work we present a simple, novel good-enough brain model, or GeBM in short, and a novel algorithm Sparse-SysId, which are able to effectively model the dynamics of the neuron interactions and infer the functional connectivity.
15	FUNNEL: automatic mining of spatially coevolving epidemics	Yasuko Matsubara, Yasushi Sakurai, Willem G. van Panhuis, Christos Faloutsos	In this paper, we present FUNNEL, a unifying analytical model for large scale epidemiological data, as well as a novel fitting algorithm, FUNNELFIT, which solves the above problem.
16	Marble: high-throughput phenotyping from electronic health records via sparse nonnegative tensor factorization	Joyce C. Ho, Joydeep Ghosh, Jimeng Sun	We propose Marble, a novel sparse non-negative tensor factorization method to derive phenotype candidates with virtually no human supervision.
17	Scalable noise mining in long-term electrocardiographic time-series to predict death following heart attacks	Chih-Chun Chia, Zeeshan Syed	In this paper, we extend this work and focus on the question of how to reduce its computational complexity for scalable use in large datasets or energy constrained embedded devices.
18	From micro to macro: data driven phenotyping by densification of longitudinal electronic medical records	Jiayu Zhou, Fei Wang, Jianying Hu, Jieping Ye	In this paper, we propose a data driven phenotyping framework called Pacifier (PAtient reCord densIFIER), where we interpret the longitudinal EMR data of each patient as a sparse matrix with a feature dimension and a time dimension, and derive more robust patient phenotypes by exploring the latent structure of those matrices.
19	Clinical risk prediction with multilinear sparse logistic regression	Fei Wang, Ping Zhang, Buyue Qian, Xiang Wang, Ian Davidson	We propose a block proximal descent approach to solve the problem and prove its convergence.
20	Dual beta process priors for latent cluster discovery in chronic obstructive pulmonary disease	James C. Ross, Peter J. Castaldi, Michael H. Cho, Jennifer G. Dy	In this paper we introduce a transformative way of looking at the COPD subtyping task.
21	COM: a generative model for group recommendation	Quan Yuan, Gao Cong, Chin-Yew Lin	In this paper, we propose a probabilistic model named COM (COnsensus Model) to model the generative process of group activities, and make group recommendations.
22	Leveraging user libraries to bootstrap collaborative filtering	Laurent Charlin, Richard S. Zemel, Hugo Larochelle	We introduce a novel graphical model, the collaborative score topic model (CSTM), for personal recommendations of textual documents.
23	Topic-factorized ideal point estimation model for legislative voting network	Yupeng Gu, Yizhou Sun, Ning Jiang, Bingyu Wang, Ting Chen	In this paper, we propose a novel topic-factorized ideal point estimation model for a legislative voting network in a unified framework.
24	Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS)	Qiming Diao, Minghui Qiu, Chao-Yuan Wu, Alexander J. Smola, Jing Jiang, Chong Wang	In this work we propose a probabilistic model based on collaborative filtering and topic modeling.
25	User effort minimization through adaptive diversification	Mahbub Hasan, Abhijith Kashyap, Vagelis Hristidis, Vassilis Tsotras	In this paper, we show that for different search tasks there is a different ideal balance of relevance and diversity.
26	Relevant overlapping subspace clusters on categorical data	Xiao He, Jing Feng, Bettina Konte, Son T. Mai, Claudia Plant	Therefore, we propose ROCAT (Relevant Overlapping Subspace Clusters on Categorical Data), a novel technique based on the idea of data compression.
27	Batch discovery of recurring rare classes toward identifying anomalous samples	Murat Dundar, Halid Ziya Yerebakan, Bartek Rajwa	We present a clustering algorithm for discovering rare yet significant recurring classes across a batch of samples in the presence of random effects.
28	A dirichlet multinomial mixture model-based approach for short text clustering	Jianhua Yin, Jianyong Wang	In this paper, we proposed a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model for short text clustering (abbr.
29	Representative clustering of uncertain data	Andreas Züfle, Tobias Emrich, Klaus Arthur Schmid, Nikos Mamoulis, Arthur Zimek, Matthias Renz	In this paper, we describe a framework, based on possible-worlds semantics; when applied on an uncertain dataset, it computes a set of representative clusterings, each of which has a probabilistic guarantee not to exceed some maximum distance to the ground truth clustering, i.e., the clustering of the actual (but unknown) data.
30	SMVC: semi-supervised multi-view clustering in subspace projections	Stephan Günnemann, Ines Färber, Matthias Rüdiger, Thomas Seidl	In this paper, we join both research areas and present a solution for integrating prior knowledge in the process of detecting multiple clusterings.
31	FastXML: a fast, accurate and stable tree-classifier for extreme multi-label learning	Yashoteja Prabhu, Manik Varma	Our objective, in this paper, is to develop an extreme multi-label classifier that is faster to train and more accurate at prediction than the state-of-the-art Multi-label Random Forest (MLRF) algorithm [2] and the Label Partitioning for Sub-linear Ranking (LPSR) algorithm [35].
32	A multi-class boosting method with direct optimization	Shaodan Zhai, Tian Xia, Shaojun Wang	We present a direct multi-class boosting (DMCBoost) method for classification with the following properties: (i) instead of reducing the multi-class classification task to a set of binary classification tasks, DMCBoost directly solves the multi-class classification problem, and only requires very weak base classifiers; (ii) DMCBoost builds an ensemble classifier by directly optimizing the non-convex performance measures, including the empirical classification error and margin functions, without resorting to any upper bounds or approximations.
33	An efficient algorithm for weak hierarchical lasso	Yashu Liu, Jie Wang, Jieping Ye	In this paper, we propose to directly solve the non-convex weak hierarchical Lasso by making use of the GIST (General Iterative Shrinkage and Thresholding) optimization framework which has been shown to be efficient for solving non-convex sparse formulations.
34	Online multiple kernel regression	Doyen Sahoo, Steven C.H. Hoi, Bin Li	In this paper, we propose a family of OMKR algorithms for regression and discuss their application to time series prediction tasks.
35	Class-distribution regularized consensus maximization for alleviating overfitting in model combination	Sihong Xie, Jing Gao, Wei Fan, Deepak Turaga, Philip S. Yu	We propose a novel model called Regularized Consensus Maximization (RCM), which is formulated as an optimization problem to combine the maximum consensus and large margin principles.
36	Large margin distribution machine	Teng Zhang, Zhi-Hua Zhou	In this paper, we propose the Large margin Distribution Machine (LDM), which tries to achieve a better generalization performance by optimizing the margin distribution.
37	Distance metric learning using dropout: a structured regularization approach	Qi Qian, Juhua Hu, Rong Jin, Jian Pei, Shenghuo Zhu	In this paper, we exploit the dropout technique, which has been successfully applied in deep learning to alleviate the over-fitting problem, for DML.
38	Box drawings for learning with imbalanced data	Siong Thye Goh, Cynthia Rudin	We propose two machine learning algorithms to handle highly imbalanced classification problems.
39	Incremental and decremental training for linear classification	Cheng-Hao Tsai, Chieh-Yen Lin, Chih-Jen Lin	In this paper, we focus on linear classifiers including logistic regression and linear SVM because of their simplicity over kernel or other methods.
40	Supervised deep learning with auxiliary networks	Junbo Zhang, Guangjian Tian, Yadong Mu, Wei Fan	The major contribution of our work is the exposition of a novel supervised deep learning algorithm, which is distinguished by two unique traits.
41	Sleep analytics and online selective anomaly detection	Tahereh Babaie, Sanjay Chawla, Romesh Abeysuriya	We introduce a new problem, the Online Selective Anomaly Detection (OSAD), to model a specific scenario emerging from research in sleep science.
42	GLAD: group anomaly detection in social media analysis	Rose Yu, Xinran He, Yan Liu	In this paper, we take a generative approach by proposing a hierarchical Bayes model: Group Latent Anomaly Detection (GLAD) model.
43	FBLG: a simple and effective approach for temporal dependence discovery from time series data	Dehua Cheng, Mohammad Taha Bahadori, Yan Liu	We observe that when we look in reversed order of time, the temporal dependence structure of the time series is usually preserved after switching the roles of cause and effect.
44	Learning time-series shapelets	Josif Grabocka, Nicolas Schilling, Martin Wistuba, Lars Schmidt-Thieme	In contrast to the state-of-the-art, this paper proposes a novel perspective in terms of learning shapelets.
45	Utilizing temporal patterns for estimating uncertainty in interpretable early decision making	Mohamed F. Ghalwash, Vladan Radosavljevic, Zoran Obradovic	In this study, we propose a simple and yet effective method to provide uncertainty estimates for an interpretable early classification method.
46	Prototype-based learning on concept-drifting data streams	Junming Shao, Zahra Ahmadi, Stefan Kramer	In this paper, we propose a prototype-based classification model for evolving data streams, called SyncStream, which dynamically models time-changing concepts and makes predictions in a local fashion.
47	Detecting moving object outliers in massive-scale trajectory streams	Yanwei Yu, Lei Cao, Elke A. Rundensteiner, Qin Wang	Our theoretical analysis and empirical study on the Beijing Taxi and GMTI (Ground Moving Target Indicator) datasets demonstrate its effectiveness in capturing abnormal moving objects.
48	The setwise stream classification problem	Charu C. Aggarwal	In this paper, we present a first approach for real time and streaming classification of such data.
49	Streamed approximate counting of distinct elements: beating optimal batch methods	Daniel Ting	This paper advances the state of the art in probabilistic methods for estimating the number of distinct elements in a streaming setting. New streaming algorithms are given that provably beat the "optimal" errors for Min-count and HyperLogLog while using the same sketch.
50	Time-varying learning and content analytics via sparse factor analysis	Andrew S. Lan, Christoph Studer, Richard G. Baraniuk	We propose SPARFA-Trace, a new machine learning-based framework for time-varying learning and content analytics for educational applications.
51	Active-transductive learning with label-adapted kernels	Dan Kushnir	This paper presents an efficient active-transductive approach for classification.
52	Active learning for sparse bayesian multilabel classification	Deepak Vasisht, Andreas Damianou, Manik Varma, Ashish Kapoor	We propose a novel inference algorithm for the sparse Bayesian multilabel model of [17].
53	Large-scale adaptive semi-supervised learning via unified inductive and transductive model	De Wang, Feiping Nie, Heng Huang	To address these two challenges, in this paper, we propose an adaptive semi-supervised learning model.
54	Active semi-supervised learning using sampling theory for graph signals	Akshay Gadde, Aamir Anis, Antonio Ortega	We propose a novel framework for this problem based on our recent results on sampling theory for graph signals.
55	Active collaborative permutation learning	Jialei Wang, Nathan Srebro, James Evans	We consider the problem of Collaborative Permutation Recovery, i.e. recovering multiple permutations over objects (e.g. preference rankings over different options) from limited pairwise comparisons.
56	Effective global approaches for mutual information based feature selection	Xuan Vinh Nguyen, Jeffrey Chan, Simone Romano, James Bailey	In this paper, we take a systematic approach to the problem of global MI-based feature selection.
57	Gradient boosted feature selection	Zhixiang Xu, Gao Huang, Kilian Q. Weinberger, Alice X. Zheng	In this work we propose a novel feature selection algorithm, Gradient Boosted Feature Selection (GBFS), which satisfies all four of these requirements.
58	Simultaneous feature and feature group selection through hard thresholding	Shuo Xiang, Tao Yang, Jieping Ye	In this paper, we fill this gap by introducing an efficient sparse group hard thresholding algorithm.
59	Safe and efficient screening for sparse support vector machine	Zheng Zhao, Jun Liu, James Cox	In this paper, a novel screening technique is proposed to accelerate model selection for l₁-regularized l₂-SVM and effectively improve its scalability.
60	Factorized sparse learning models with interpretable high order feature interactions	Sanjay Purushotham, Martin Renqiang Min, C.-C. Jay Kuo, Rachel Ostroff	In this paper, we propose a factorization based sparse learning framework termed FHIM for identifying high-order feature interactions in linear and logistic regression models, and study several optimization methods for solving them.
61	Parallel gibbs sampling for hierarchical dirichlet processes via gamma processes equivalence	Dehua Cheng, Yan Liu	In this paper, we propose an effective parallel Gibbs sampling algorithm for HDP by exploring its connections with the gamma-gamma-Poisson process.
62	Empirical glitch explanations	Tamraparni Dasu, Ji Meng Loh, Divesh Srivastava	In this paper, we introduce the notion of Empirical Glitch Explanations – concise, multi-dimensional descriptions of subsets of potentially dirty data – and propose a scalable method for empirically generating such explanatory characterizations.
63	Learning with dual heterogeneity: a nonparametric bayes model	Hongxia Yang, Jingrui He	Based on this model, we propose the NOBLE algorithm using an efficient Gibbs sampler.
64	Online chinese restaurant process	Chien-Liang Liu, Tsung-Hsun Tsai, Chia-Hoang Lee	This work proposes an online Chinese restaurant process (CRP) algorithm, which is an online and nonparametric algorithm, to tackle this problem.
65	Knowledge vault: a web-scale approach to probabilistic knowledge fusion	Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, Wei Zhang	Here we introduce Knowledge Vault, a Web-scale probabilistic knowledge base that combines extractions from Web content (obtained via analysis of text, tabular data, page structure, and human annotations) with prior knowledge derived from existing knowledge repositories.
66	Improving the modified nyström method using spectral shifting	Shusen Wang, Chao Zhang, Hui Qian, Zhihua Zhang	In this paper, we propose a variant of the Nyström method called the modified Nyström by spectral shifting (SS-Nyström).
67	Fast flux discriminant for large-scale sparse nonlinear classification	Wenlin Chen, Yixin Chen, Kilian Q. Weinberger	In this paper, we propose a novel supervised learning method, Fast Flux Discriminant (FFD), for large-scale nonlinear classification.
68	Scalable histograms on large probabilistic data	Mingwang Tang, Feifei Li	We introduced novel synopses to reduce communication cost when running our methods in such settings.
69	Correlation clustering in MapReduce	Flavio Chierichetti, Nilesh Dalvi, Ravi Kumar	In this paper we obtain a new algorithm for correlation clustering.
70	Scaling out big data missing value imputations: pythia vs. godzilla	Christos Anagnostopoulos, Peter Triantafillou	In this paper we derive answers to these fundamental questions and develop principled methods and a framework which offer large performance speed-ups and better, or comparable, errors to those of Godzilla, independently of which missing-value imputation algorithm is used.
71	Efficient mini-batch training for stochastic optimization	Mu Li, Tong Zhang, Yuqiang Chen, Alexander J. Smola	This paper introduces a technique based on approximate optimization of a conservatively regularized objective function within each minibatch.
72	Streaming submodular maximization: massive data summarization on the fly	Ashwinkumar Badanidiyuru, Baharan Mirzasoleiman, Amin Karbasi, Andreas Krause	In this paper, we address the problem of extracting representative elements from a large stream of data.
73	Distance queries from sampled data: accurate and efficient	Edith Cohen	We derive novel estimators for estimating $L_p$ distance from sampled data.
74	Improved testing of low rank matrices	Yi Li, Zhengyu Wang, David P. Woodruff	We study the problem of determining if an input matrix $A \in \mathbb{R}^{m \times n}$ can be well-approximated by a low rank matrix.
75	DeepWalk: online learning of social representations	Bryan Perozzi, Rami Al-Rfou, Steven Skiena	We present DeepWalk, a novel approach for learning latent representations of vertices in a network.
76	Open-domain quantity queries on web tables: annotation, response, and consensus models	Sunita Sarawagi, Soumen Chakrabarti	Our goal is to respond to such queries with a ranked list of quantity distributions, suitably represented.
77	Crowdsourced time-sync video tagging using temporal and personalized topic modeling	Bin Wu, Erheng Zhong, Ben Tan, Andrew Horner, Qiang Yang	In this paper, we propose a new application which extracts time-sync video tags by automatically exploiting crowdsourced comments from video websites such as Nico Nico Douga, where videos are commented on by online crowd users in a time-sync manner.
78	Identifying and labeling search tasks via query-based hawkes processes	Liangda Li, Hongbo Deng, Anlei Dong, Yi Chang, Hongyuan Zha	In this paper, we propose a probabilistic method for identifying and labeling search tasks based on the following intuitive observations: queries that are issued temporally close by users in many sequences of queries are likely to belong to the same search task, meanwhile, different users having the same information needs tend to submit topically coherent search queries.
79	LaSEWeb: automating search strategies over semi-structured web data	Oleksandr Polozov, Sumit Gulwani	We describe the design and implementation of a domain-specific language that enables extracting data from a webpage based on its structure, visual layout, and linguistic patterns.
80	Personalized search result diversification via structured learning	Shangsong Liang, Zhaochun Ren, Maarten de Rijke	To further enhance the performance, we propose a supervised learning strategy.
81	Efficient multi-task feature learning with calibration	Pinghua Gong, Jiayu Zhou, Wei Fan, Jieping Ye	In this paper, we propose a variant of the calibrated multi-task feature learning formulation by including a squared norm regularizer.
82	Multi-task copula by sparse graph regression	Tianyi Zhou, Dacheng Tao	This paper proposes multi-task copula (MTC), which can handle a much wider class of tasks than the mean regression with Gaussian noise assumed in most prior multi-task learning (MTL) methods.
83	Unifying learning to rank and domain adaptation: enabling cross-task document scoring	Mianwei Zhou, Kevin C. Chang	We propose the Tree-structured Boltzmann Machine (T-RBM), a novel two-stage Markov Network, as our solution.
84	Scalable heterogeneous translated hashing	Ying Wei, Yangqiu Song, Yi Zhen, Bo Liu, Qiang Yang	In this paper, we propose a Heterogeneous Translated Hashing (HTH) method with such auxiliary bridge incorporated not only to improve current multi-view search but also to enable similarity search across heterogeneous media which have no direct correspondence.
85	Matching users and items across domains to improve the recommendation quality	Chung-Yi Li, Shou-De Lin	Given two homogeneous rating matrices with some overlapped users/items whose mappings are unknown, this paper aims at answering two questions.
86	Optimal recommendations under attraction, aversion, and social influence	Wei Lu, Stratis Ioannidis, Smriti Bhagat, Laks V.S. Lakshmanan	In this work, we model interest evolution through dynamic interest cascades: we consider a scenario where a user’s interests may be affected by (a) the interests of other users in her social circle, as well as (b) suggestions she receives from a recommender system.
87	ClusCite: effective citation recommendation by information network-based clustering	Xiang Ren, Jialu Liu, Xiao Yu, Urvashi Khandelwal, Quanquan Gu, Lidan Wang, Jiawei Han	In this study, we investigate the problem in the context of heterogeneous bibliographic networks and propose a novel cluster-based citation recommendation framework, called ClusCite, which explores the principle that citations tend to be softly clustered into interest groups based on multiple types of relationships in the network.
88	GeoMF: joint geographical modeling and matrix factorization for point-of-interest recommendation	Defu Lian, Cong Zhao, Xing Xie, Guangzhong Sun, Enhong Chen, Yong Rui	Besides, researchers have recently discovered a spatial clustering phenomenon in human mobility behavior on the LBSNs, i.e., individual visiting locations tend to cluster together, and also demonstrated its effectiveness in POI recommendation, thus we incorporate it into the factorization model.
89	Detecting anomalies in dynamic rating data: a robust probabilistic model for rating evolution	Stephan Günnemann, Nikou Günnemann, Christos Faloutsos	In this work, we tackle the following question: Given the time stamped rating data for a product or service, how can we detect the general rating behavior of users as well as time intervals where the ratings behave anomalously?
90	Product selection problem: improve market share by learning consumer behavior	Silei Xu, John Chi-Shing Lui	To tackle this problem, we propose an efficient greedy-based approximation algorithm with a provable solution guarantee.
91	TCS: efficient topic discovery over crowd-oriented service data	Yongxin Tong, Caleb Chen Cao, Lei Chen	In particular, in order to train TCS efficiently, we design a novel parameter inference algorithm, the Bucket Parameter Estimation (BPE), which utilizes belief propagation and a new sketching technique, called Pairwise Sketch (pSketch).
92	SigniTrend: scalable detection of emerging topics in textual streams by hashed significance thresholds	Erich Schubert, Michael Weiler, Hans-Peter Kriegel	Our contributions to the detection of emerging trends are three-fold: first of all, we propose a significance measure that can be used to detect emerging topics early, long before they become "hot tags", by drawing upon experience from outlier detection.
93	Experiments with non-parametric topic models	Wray L. Buntine, Swapnil Mishra	We look at the comparative behaviour of different models and present some experimental insights.
94	Reducing the sampling complexity of topic models	Aaron Q. Li, Amr Ahmed, Sujith Ravi, Alexander J. Smola	In this paper we propose an algorithm which scales linearly with the number of actually instantiated topics k_d in the document.
95	Dynamics of news events and social media reaction	Mikalai Tsytsarau, Themis Palpanas, Malu Castellanos	In this paper, we study the dynamics of news events and their relation to changes of sentiment expressed on relevant topics.
96	Differentially private network data release via structural inference	Qian Xiao, Rui Chen, Kian-Lee Tan	In this paper, we present a novel data sanitization solution that infers a network’s structure in a differentially private manner.
97	Exponential random graph estimation under differential privacy	Wentian Lu, Gerome Miklau	In this work we propose algorithms for privately estimating the parameters of exponential random graph models (ERGMs).
98	Top-k frequent itemsets via differentially private FP-trees	Jaewoo Lee, Christopher W. Clifton	We give an approach that first identifies top-k frequent itemsets, then uses them to construct a compact, differentially private FP-tree.
99	CatchSync: catching synchronized behavior in large directed graphs	Meng Jiang, Peng Cui, Alex Beutel, Christos Faloutsos, Shiqiang Yang	We propose a fast and effective method, CatchSync, which exploits two of the tell-tale signs left in graphs by fraudsters: (a) synchronized behavior: suspicious nodes have extremely similar behavior pattern, because they are often required to perform some task together (such as follow the same user); and (b) rare behavior: their connectivity patterns are very different from the majority.
100	Mobile app recommendations with security and privacy awareness	Hengshu Zhu, Hui Xiong, Yong Ge, Enhong Chen	To fill this crucial void, in this paper, we propose to develop a mobile App recommender system with privacy and security awareness.
101	Fast DTT: a near linear algorithm for decomposing a tensor into factor tensors	Xiaomin Fang, Rong Pan	To overcome these problems, we propose a near linear tensor factorization approach, which decomposes a tensor into factor tensors in order to model the higher-order relations, without loss of important information.
102	Clustering and projected clustering with adaptive neighbors	Feiping Nie, Xiaoqian Wang, Heng Huang	In this paper, we propose a novel clustering model to learn the data similarity matrix and clustering structure simultaneously.
103	LWI-SVD: low-rank, windowed, incremental singular value decompositions on time-evolving data sets	Xilun Chen, K. Selcuk Candan	To address these challenges, in this paper, we propose a Low-rank, Windowed, Incremental SVD (LWI-SVD) algorithm, which (a) leverages efficient and accurate low-rank approximations to speed up incremental SVD updates and (b) uses a window-based approach to aggregate multiple incoming updates (insertions or deletions of rows and columns) and, thus, reduces online processing costs.
104	Provable deterministic leverage score sampling	Dimitris Papailiopoulos, Anastasios Kyrillidis, Christos Boutsidis	In this work, we provide a novel theoretical analysis of deterministic leverage score sampling.
105	Semantic visualization for spherical representation	Tuan M.V. Le, Hady W. Lauw	In this paper, we address the semantic visualization problem.
106	Grouping students in educational settings	Rakesh Agrawal, Behzad Golshan, Evimaria Terzi	We propose a framework for rigorously studying this question, taking a computational perspective.
107	Inferring gas consumption and pollution emission of vehicles throughout a city	Jingbo Shang, Yu Zheng, Wenzhu Tong, Eric Chang, Yong Yu	As many road segments are not traversed by trajectories (i.e., data sparsity), we propose a Travel Speed Estimation (TSE) model based on a context-aware matrix factorization approach.
108	Methods for ordinal peer grading	Karthik Raman, Thorsten Joachims	Thus, in this paper we study the problem of automatically inferring student grades from ordinal peer feedback, as opposed to existing methods that require cardinal peer feedback.
109	Exploiting geographic dependencies for real estate appraisal: a mutual perspective of ranking and clustering	Yanjie Fu, Hui Xiong, Yong Ge, Zijun Yao, Yu Zheng, Zhi-Hua Zhou	To this end, in this paper, we propose a geographic method, named ClusRanking, for estate appraisal by leveraging the mutual enforcement of ranking and clustering power.
110	Towards scalable critical alert mining	Bo Zong, Yinghui Wu, Jie Song, Ambuj K. Singh, Hasan Cam, Jiawei Han, Xifeng Yan	This paper studies the critical alert mining problem: Given a set of alert sequences, we aim to find a set of k critical alerts such that the number of alerts potentially triggered by them is maximized.
111	From labor to trader: opinion elicitation via online crowds as a market	Caleb Chen Cao, Lei Chen, Hosagrahar Visvesvaraya Jagadish	In this paper, we study how to use crowds for Opinion Elicitation.
112	Optimal real-time bidding for display advertising	Weinan Zhang, Shuai Yuan, Jun Wang	In this paper we study bid optimisation for real-time bidding (RTB) based display advertising.
113	Quantifying herding effects in crowd wisdom	Ting Wang, Dashun Wang, Fei Wang	In this paper, we develop a mechanistic framework to model social influence of prior collective opinions (e.g., online product ratings) on subsequent individual decision making.
114	Modeling delayed feedback in display advertising	Olivier Chapelle	We tackle this issue by introducing an additional model that captures the conversion delay.
115	Networked bandits with disjoint linear payoffs	Meng Fang, Dacheng Tao	In this paper, we study ‘networked bandits’, a new bandit problem where a set of interrelated arms varies over time and, given the contextual information that selects one arm, invokes other correlated arms.
116	Mining topics in documents: standing on the shoulders of big data	Zhiyuan Chen, Bing Liu	In recent years, knowledge-based topic models have been proposed, which ask human users to provide some prior domain knowledge to guide the model to produce better topics.
117	Integrating spreadsheet data via accurate and low-effort extraction	Zhe Chen, Michael Cafarella	We propose a two-phase semiautomatic system that extracts accurate relational metadata while minimizing user effort.
118	Sentiment expression conditioned by affective transitions and social forces	Moritz Sudhof, Andrés Goméz Emilsson, Andrew L. Maas, Christopher Potts	We develop a theory of conditional dependencies between emotional states in which emotions are characterized not only by valence (polarity) and arousal (intensity) but also by the role they play in state transitions and social relationships.
119	Entity profiling with varying source reliabilities	Furong Li, Mong Li Lee, Wynne Hsu	In this paper, we present a framework called Comet that interleaves record linkage with error correction, taking into consideration the source reliabilities on various attributes.
120	Open question answering over curated and extracted knowledge bases	Anthony Fader, Luke Zettlemoyer, Oren Etzioni	In this paper, we present OQA, the first approach to leverage both curated and extracted KBs.
121	Non-parametric scan statistics for event detection and forecasting in heterogeneous social media graphs	Feng Chen, Daniel B. Neill	As a case study, we consider two applications using Twitter data, civil unrest event detection and rare disease outbreak detection, and present empirical evaluations illustrating the effectiveness and efficiency of our proposed approach.
122	Event detection in activity networks	Polina Rozenshtein, Aris Anagnostopoulos, Aristides Gionis, Nikolaj Tatti	We consider the problem of mining activity networks to identify interesting events, such as a big concert or a demonstration in a city, or a trending keyword in a user community in a social network.
123	FEMA: flexible evolutionary multi-faceted analysis for dynamic behavioral pattern discovery	Meng Jiang, Peng Cui, Fei Wang, Xinran Xu, Wenwu Zhu, Shiqiang Yang	In this paper, we propose a Flexible Evolutionary Multi-faceted Analysis (FEMA) framework for both behavior prediction and pattern mining.
124	Profit-maximizing cluster hires	Behzad Golshan, Theodoros Lappas, Evimaria Terzi	Our work presents a detailed analysis of the computational complexity and hardness of approximation of the problem, as well as heuristic, yet effective, algorithms for solving it in practice.
125	On social event organization	Keqian Li, Wei Lu, Smriti Bhagat, Laks V.S. Lakshmanan, Cong Yu	In this paper, we study the key computational problem involved in organization of social events, to our best knowledge, for the first time.
126	A bayesian framework for estimating properties of network diffusions	Varun R. Embar, Rama Kumar Pasumarthi, Indrajit Bhattacharya	In this paper, we propose and study this novel problem in a Bayesian framework by capturing the posterior distribution of these hidden variables given the observed cascades, and computing the expectation of these properties under this posterior distribution.
127	Scalable diffusion-aware optimization of network topology	Elias Boutros Khalil, Bistra Dilkina, Le Song	In this paper, we focus on the widely studied linear threshold diffusion model, and prove, for the first time, that the network modification problems under this model have supermodular objective functions.
128	Probabilistic latent network visualization: inferring and embedding diffusion networks	Takeshi Kurashima, Tomoharu Iwata, Noriko Takaya, Hiroshi Sawada	This paper proposes a probabilistic model for inferring the diffusion network, which we call Probabilistic Latent Network Visualization (PLNV); it is based on cascade data, a record of observed times of node influence.
129	MMRate: inferring multi-aspect diffusion networks with multi-pattern cascades	Senzhang Wang, Xia Hu, Philip S. Yu, Zhoujun Li	In this paper, we investigate a novel problem of inferring multi-aspect diffusion networks with multi-pattern cascades.
130	Stability of influence maximization	Xinran He, David Kempe	In an attempt to fix the record, the present article combines the problem motivation, models, and experimental results sections from the original incorrect article with the new hardness result.
131	Who to follow and why: link prediction with explanations	Nicola Barbieri, Francesco Bonchi, Giuseppe Manco	In this paper we study link prediction with explanations for user recommendation in social networks.
132	Activity-edge centric multi-label classification for mining heterogeneous information networks	Yang Zhou, Ling Liu	In this paper, we present an activity-edge centric multi-label classification framework for analyzing heterogeneous information networks with three unique features.
133	Meta-path based multi-network collective link prediction	Jiawei Zhang, Philip S. Yu, Zhi-Hua Zhou	In this paper, we want to predict the formation of social links in multiple partially aligned social networks at the same time, which is formally defined as the multi-network link (formation) prediction problem.
134	Fast influence-based coarsening for large networks	Manish Purohit, B. Aditya Prakash, Chanhyun Kang, Yao Zhang, V.S. Subrahmanian	Using extensive experiments on multiple real datasets, we demonstrate the quality and scalability of COARSENET, enabling us to reduce the graph by 90% in some cases without much loss of information.
135	Minimizing seed set selection with probabilistic coverage guarantee in a social network	Peng Zhang, Wei Chen, Xiaoming Sun, Yajun Wang, Jialin Zhang	In this paper, we consider the task of selecting initial seed users of a topic with minimum size so that with a guaranteed probability the number of users discussing the topic would reach a given threshold.
136	Core decomposition of uncertain graphs	Francesco Bonchi, Francesco Gullo, Andreas Kaltenbrunner, Yana Volkovich	In this paper we provide an analogous tool for uncertain graphs, i.e., graphs whose edges are assigned a probability of existence.
137	Learning multifractal structure in large networks	Austin R. Benson, Carlos Riquelme, Sven Schmit	In this paper, we analyze and improve the multifractal network generators (MFNG) introduced by Palla et al.
138	Temporal skeletonization on sequential data: patterns, categorization, and visualization	Chuanren Liu, Kai Zhang, Hui Xiong, Geoff Jiang, Qiang Yang	To this end, in this paper, we propose a ‘temporal skeletonization’ approach to proactively reduce the representation of sequences to uncover significant, hidden temporal structures.
139	Focused clustering and outlier detection in large attributed graphs	Bryan Perozzi, Leman Akoglu, Patricia Iglesias Sánchez, Emmanuel Müller	In this work, we overcome this limitation and introduce a novel user-oriented approach for mining attributed graphs.
140	Inside the atoms: ranking on a network of networks	Jingchao Ni, Hanghang Tong, Wei Fan, Xiang Zhang	In this paper, we propose a new network data model, a Network of Networks (NoN), where each node of the main network itself can be further represented as another (domain-specific) network.
141	Community membership identification from small seed sets	Isabel M. Kloumann, Jon M. Kleinberg	We evaluate our methods across multiple domains, using publicly available datasets with labeled, ground-truth communities.
142	Community detection in graphs through correlation	Lian Duan, Willian Nick Street, Yanchi Liu, Haibing Lu	This paper connects modularity-based methods with correlation analysis by subtly reformatting their math formulas and investigates how to fully make use of correlation analysis to change the objective function of modularity-based methods, which provides a more natural and effective way to solve the resolution limit problem.
143	Heat kernel based community detection	Kyle Kloster, David F. Gleich	We present the first deterministic, local algorithm to compute this diffusion and use that algorithm to study the communities that it produces.
144	On the permanence of vertices in network communities	Tanmoy Chakraborty, Sriram Srinivasan, Niloy Ganguly, Animesh Mukherjee, Sanjukta Bhowmick	In this paper, we demonstrate that compared to other metrics, permanence provides (i) a more accurate estimate of a derived community structure to the ground-truth community and (ii) is more sensitive to perturbations in the network.
145	The interplay between dynamics and networks: centrality, communities, and cheeger inequality	Rumi Ghosh, Shang-hua Teng, Kristina Lerman, Xiaoran Yan	As the first step towards this objective, we introduce an umbrella framework for defining and characterizing an ensemble of dynamic processes on a network.
146	Almost linear-time algorithms for adaptive betweenness centrality using hypergraph sketches	Yuichi Yoshida	In this paper, we present a method that directly solves the task, with an almost linear runtime no matter how large the value of k.
147	FAST-PPR: scaling personalized pagerank estimation for large graphs	Peter A. Lofgren, Siddhartha Banerjee, Ashish Goel, C. Seshadhri	We propose a new algorithm, FAST-PPR, for computing personalized PageRank: given start node s and target node t in a directed graph, and given a threshold δ, it computes the Personalized PageRank π_s(t) from s to t, guaranteeing that the relative error is small as long as π_s(t) > δ.
148	Graph sample and hold: a framework for big-graph analytics	Nesreen K. Ahmed, Nick Duffield, Jennifer Neville, Ramana Kompella	In this paper, we propose a generic stream sampling framework for big-graph analytics, called Graph Sample and Hold (gSH), which samples from massive graphs sequentially in a single pass, one edge at a time, while maintaining a small state in memory.
149	Balanced graph edge partition	Florian Bourse, Marc Lelarge, Milan Vojnovic	We report results of an extensive empirical evaluation on a set of real-world graphs, which quantifies the benefits of edge- vs. vertex-partition, and demonstrates efficiency of natural greedy online assignments for the balanced edge-partition problem with and with no aggregation.
150	Using strong triadic closure to characterize ties in social networks	Stavros Sintos, Panayiotis Tsaparas	In this paper, we use the principle of Strong Triadic Closure to characterize the strength of relationships in social networks.
151	Analyzing expert behaviors in collaborative networks	Huan Sun, Mudhakar Srivatsa, Shulong Tan, Yang Li, Lance M. Kaplan, Shu Tao, Xifeng Yan	In this work, we attempt to deduce the cognitive process of task routing, and model the decision making of experts as a generative process where a routing decision is made based on mixed routing patterns.
152	Predicting long-term impact of CQA posts: a comprehensive viewpoint	Yuan Yao, Hanghang Tong, Feng Xu, Jian Lu	In this paper, we aim to predict the long-term impact of questions/answers shortly after they are posted in the CQA sites.
153	Who are experts specializing in landscape photography?: analyzing topic-specific authority on content sharing services	Bin Bi, Ben Kao, Chang Wan, Junghoo Cho	In this paper, we propose a novel model of Topic-specific Authority Analysis (TAA), which addresses the limitations of the previous approaches, to identify authorities specific to given query topic(s) on a content sharing service.
154	Frontiers in E-commerce personalization	Sri Subramaniam	This presentation will give insight into how Groupon manages to grapple with these challenges via a data-driven system in order to delight and surprise customers.
155	Predictive modeling in practice: a case study from sprint	Tracy De Poalo, Jeremy Howard	In this talk, Sprint’s Head of Predictive Modeling, Tracey De Poalo, will talk about the process she developed using SAS and logistic regression to build a wide range of models.
156	Medicine in the age of electronic health records	Nigam Shah	We will present approaches to identify novel off-label uses of drugs using the patient feature matrix along with prior knowledge about drugs, diseases, and known usage.
157	Algorithms for interpretable machine learning	Cynthia Rudin	I will describe several approaches, including an algorithm based on discrete optimization, and an algorithm based on Bayesian analysis.
158	Data science through the lens of social science	Drew Conway	In this talk, Drew will examine data science through the lens of the social scientist.
159	Information environment security	Rand Waltzman	The purpose of this talk is to help frame a new science of Information Environment Security (IES) whose goal is to create and apply the tools needed to discover and maintain fundamental models of our ever-changing information environment and to defend us in that environment, both as individuals and collectively, against intentional as well as unintentional attempts to deceive, misinform and otherwise manipulate us.
160	Big data for social good	Nathan Eagle	After providing an overview of the mobile and social media landscapes in emerging markets, we discuss a system that implements polls & mobile subscription compensation.
161	Bringing data science to the speakers of every language	Robert Munro	I will present examples of how natural language processing and distributed human computing are improving the lives of speakers of all the world’s languages, in areas including education, disaster-response, health and access to employment.
162	Guilt by association: large scale malware detection by mining file-relation graphs	Acar Tamersoy, Kevin Roundy, Duen Horng Chau	We present AESOP, a scalable algorithm that identifies malicious executable files by applying Aesop’s moral that "a man is known by the company he keeps."
163	Mining text snippets for images on the web	Anitha Kannan, Simon Baker, Krishnan Ramnath, Juliet Fiss, Dahua Lin, Lucy Vanderwende, Rizwan Ansary, Ashish Kapoor, Qifa Ke, Matt Uyttendaele, Xin-Jing Wang, Lei Zhang	We propose an algorithm to mine multiple diverse, relevant, and interesting text snippets for images on the web.
164	Predicting student risks through longitudinal analysis	Ashay Tamhane, Shajith Ikbal, Bikram Sengupta, Mayuri Duggirala, James Appleton	In this paper, we report on a large-scale study to identify students at risk of not meeting acceptable levels of performance in one state-level and one national standardized assessment in Grade 8 of a major US school district.
165	Novel geospatial interpolation analytics for general meteorological measurements	Bingsheng Wang, Jinjun Xiong	We propose a Bayesian compressed sensing based non-parametric statistical model to efficiently perform the spatial interpolation task.
166	Targeting direct cash transfers to the extremely poor	Brian Abelson, Kush R. Varshney, Joy Sun	In this work, we streamline an important step in the operations of the NGO by developing and deploying a data-driven system for locating villages with extreme poverty in Kenya and Uganda.
167	Scalable hands-free transfer learning for online advertising	Brian Dalessandro, Daizhuo Chen, Troy Raeder, Claudia Perlich, Melinda Han Williams, Foster Provost	This paper presents a combination of strategies, deployed by the online advertising firm Dstillery, for learning many models from extremely high-dimensional data efficiently and without human intervention.
168	Correlating events with time series for incident diagnosis	Chen Luo, Jian-Guang Lou, Qingwei Lin, Qiang Fu, Rui Ding, Dongmei Zhang, Zhe Wang	In this paper, we propose an approach to evaluate the correlation between time series data and event data.
169	Proactive workflow modeling by stochastic processes with application to healthcare operation and management	Chuanren Liu, Yong Ge, Hui Xiong, Keli Xiao, Wei Geng, Matt Perkins	To that end, in this paper, we provide a focused study of workflow modeling by the integrated analysis of indoor location traces in the hospital environment.
170	Activity ranking in LinkedIn feed	Deepak Agarwal, Bee-Chung Chen, Rupesh Gupta, Joshua Hartman, Qi He, Anand Iyer, Sumanth Kolar, Yiming Ma, Pannagadatta Shivaswamy, Ajit Singh, Liang Zhang	In this paper, we report our experience with the problem of ranking activities in the LinkedIn homepage feed.
171	Budget pacing for targeted online advertisements at LinkedIn	Deepak Agarwal, Souvik Ghosh, Kai Wei, Siyu You	We describe a method for improving such ad serving systems by including a budget pacing component that serves ads by being aware of global supply patterns.
172	Large scale predictive modeling for micro-simulation of 3G air interface load	Dejan Radosavljevik, Peter van der Putten	This paper outlines the approach developed together with the Radio Network Strategy & Design Department of a large European telecom operator in order to forecast the Air-Interface load in their 3G network, which is used for planning network upgrades and budgeting purposes.
173	Unveiling clusters of events for alert and incident management in large-scale enterprise IT	Derek Lin, Rashmi Raghu, Vivek Ramamurthy, Jin Yu, Regunathan Radhakrishnan, Joseph Fernandez	We propose a framework to cluster alerts and incident tickets based on the text in them, using unsupervised machine learning.
174	Style in the long tail: discovering unique interests with latent variable models in large scale social E-commerce	Diane J. Hu, Rob Hall, Josh Attenberg	In this paper, we describe our methods and experiments for deploying two new style-based recommender systems on the Etsy site.
175	Corporate residence fraud detection	Enric Junqué de Fortuny, Marija Stankova, Julie Moeyersoms, Bart Minnaert, Foster Provost, David Martens	This is the first data mining application specifically aimed at finding corporate residence fraud, where we show the predictive value of using both structured and fine-grained invoicing data.
176	Modeling mass protest adoption in social network communities using geometric brownian motion	Fang Jin, Rupinder Paul Khandpur, Nathan Self, Edward Dougherty, Sheng Guo, Feng Chen, B. Aditya Prakash, Naren Ramakrishnan	We propose a bispace model to capture propagation in the union of (exclusively) Twitter and non-Twitter environments.
177	Shallow semantic parsing of product offering titles	Gabor Melli	We present a case study of a deployed data-driven system that first chunks individual titles into semantically classified sub-segments, and then uses this information to improve a hyperlink insertion service.
178	A case study: privacy preserving release of spatio-temporal density in paris	Gergely Acs, Claude Castelluccia	In this paper, we present a new anonymization scheme to release the spatio-temporal density of Paris, in France, i.e., the number of individuals in 989 different areas of the city released every hour over a whole week.
179	Scalable near real-time failure localization of data center networks	Herodotos Herodotou, Bolin Ding, Shobana Balakrishnan, Geoff Outhred, Percy Fitter	Our key idea is to use statistical data mining techniques on large-scale active monitoring data to determine a ranked list of suspect causes, which we refine with passive monitoring signals.
180	Improving management of aquatic invasions by integrating shipping network, ecological, and environmental data: data mining for social good	Jian Xu, Thanuka L. Wickramarathne, Nitesh V. Chawla, Erin K. Grey, Karsten Steinhaeuser, Reuben P. Keller, John M. Drake, David M. Lodge	We present here an approach for addressing the problem at hand via creative use of computational techniques and multiple data sources, thus illustrating how data mining can be used for solving crucial, yet very complex problems towards social good.
181	FoodSIS: a text mining system to improve the state of food safety in singapore	Kiran Kate, Sneha Chaudhari, Andy Prapanca, Jayant Kalagnanam	In this paper, we present FoodSIS, a system for end-to-end web information gathering for food safety.
182	A hazard based approach to user return time prediction	Komal Kapoor, Mingxuan Sun, Jaideep Srivastava, Tao Ye	In this work, we address this problem by proposing a new retention metric for web services by concentrating on the rate of user return.
183	Predicting employee expertise for talent management in the enterprise	Kush R. Varshney, Vijil Chenthamarakshan, Scott W. Fancher, Jun Wang, Dongping Fang, Aleksandra Mojsilović	In this work, we deploy an analytics-driven solution that infers the expertise of employees through the mining of enterprise and social data that is not specifically generated and collected for expertise inference.
184	Applying data mining techniques to address critical process optimization needs in advanced manufacturing	Li Zheng, Chunqiu Zeng, Lei Li, Yexi Jiang, Wei Xue, Jingxuan Li, Chao Shen, Wubai Zhou, Hongtai Li, Liang Tang, Tao Li, Bing Duan, Ming Lei, Pengnian Wang	In this paper, we design, implement and deploy an integrated solution, named PDP-Miner, which is a data analytics platform customized for process optimization in Plasma Display Panel (PDP) manufacturing.
185	EARS (earthquake alert and report system): a real time decision support system for earthquake crisis management	Marco Avvenuti, Stefano Cresci, Andrea Marchetti, Carlo Meletti, Maurizio Tesconi	In this work we describe the design, implementation and deployment of a decision support system for the detection and the damage assessment of earthquakes in Italy.
186	Knock it off: profiling the online storefronts of counterfeit merchandise	Matthew F. Der, Lawrence K. Saul, Stefan Savage, Geoffrey M. Voelker	Our approach in this paper is to extract features that reveal when Web pages linked to the same affiliate program share a similar underlying structure.
187	Up next: retrieval methods for large scale related video suggestion	Michael Bendersky, Lluis Garcia-Pueyo, Jeremiah Harmsen, Vanja Josifovski, Dima Lepikhin	In this paper, we focus on the task of video suggestion, commonly found in many online applications.
188	Identifying tourists from public transport commuters	Mingqiang Xue, Huayu Wu, Wei Chen, Wee Siong Ng, Gin Howe Goh	In this joint work with Singapore’s Land Transport Authority (LTA), we innovatively apply machine learning techniques to identify the tourists among public commuters using the public transportation data provided by LTA.
189	Spatially embedded co-offence prediction using supervised learning	Mohammad A. Tayebi, Martin Ester, Uwe Glässer, Patricia L. Brantingham	Here we address this important problem by proposing a framework for co-offence prediction using supervised learning. Considering the available information about offenders, we introduce social, geographic, geo-social and similarity feature sets which are used for classifying potential negative and positive pairs of offenders.
190	‘Beating the news’ with EMBERS: forecasting civil unrest using open source indicators	Naren Ramakrishnan, Patrick Butler, Sathappan Muthiah, Nathan Self, Rupinder Khandpur, Parang Saraf, Wei Wang, Jose Cadena, Anil Vullikanti, Gizem Korkmaz, Chris Kuhlman, Achla Marathe, Liang Zhao, Ting Hua, Feng Chen, Chang Tien Lu, Bert Huang, Aravind Srinivasan, Khoa Trinh, Lise Getoor, Graham Katz, Andy Doyle, Chris Ackermann, Ilya Zavorin, Jim Ford, Kristen Summers, Youssef Fayed, Jaime Arredondo, Dipak Gupta, David Mares	We describe the design, implementation, and evaluation of EMBERS, an automated, 24×7 continuous system for forecasting civil unrest across 10 countries of Latin America using open source indicators such as tweets, news sources, blogs, economic indicators, and other data sources.
191	LASTA: large scale topic assignment on multiple social networks	Nemanja Spasojevic, Jinyun Yan, Adithya Rao, Prantik Bhattacharyya	In this paper, we present ‘LASTA’ (Large Scale Topic Assignment), a full production system used at Klout, Inc., which mines topical interests from five social networks and assigns over 10,000 topics to hundreds of millions of users on a daily basis.
192	New algorithms for parking demand management and a city-scale deployment	Onno Zoeter, Christopher Dance, Stéphane Clinchant, Jean-Marc Andreoli	This paper introduces a novel demand management solution: using data from dedicated occupancy sensors an iteration scheme updates parking rates to better match demand.
193	Reducing gang violence through network influence based targeting of social programs	Paulo Shakarian, Joseph Salmento, William Pulleyblank, John Bertetto	In this paper, we study a variant of the social network maximum influence problem and its application to intelligently approaching individual gang members with incentives to leave a gang.
194	Modeling impression discounting in large-scale recommender systems	Pei Lee, Laks V.S. Lakshmanan, Mitul Tiwari, Sam Shah	In this paper, we address modeling impression discounting of recommended items, that is, how to model user’s no-action feedback on impressed recommended items.
195	ISIS: a networked-epidemiology based pervasive web app for infectious disease pandemic planning and response	Richard Beckman, Keith R. Bisset, Jiangzhuo Chen, Bryan Lewis, Madhav Marathe, Paula Stretz	We describe ISIS, a high-performance-computing-based application to support computational epidemiology of infectious diseases.
196	Seven rules of thumb for web site experimenters	Ron Kohavi, Alex Deng, Roger Longbotham, Ya Xu	Some rules of thumb have previously been stated, such as ‘speed matters,’ but we describe the assumptions in the experimental design and share additional experiments that improved our understanding of where speed matters more: certain areas of the web page are more critical.
197	Log-based predictive maintenance	Ruben Sipos, Dmitriy Fradkin, Fabian Moerchen, Zhuang Wang	Log-based predictive maintenance
198	Automated hypothesis generation based on mining scientific literature	Scott Spangler, Angela D. Wilkins, Benjamin J. Bachman, Meena Nagarajan, Tajhal Dayaram, Peter Haas, Sam Regenbogen, Curtis R. Pickering, Austin Comer, Jeffrey N. Myers, Ioana Stanoi, Linda Kato, Ana Lelescu, Jacques J. Labrie, Neha Parikh, Andreas Martin Lisewski, Lawrence Donehower, Ying Chen, Olivier Lichtarge	We present an initial case study on KnIT, a prototype system that mines the information contained in the scientific literature, represents it explicitly in a queriable network, and then further reasons upon these data to generate novel and experimentally testable hypotheses.
199	A system to grade computer programming skills using machine learning	Shashank Srikant, Varun Aggarwal	In this paper, we present a system to grade computer programs automatically.
200	An empirical study of reserve price optimisation in real-time bidding	Shuai Yuan, Jun Wang, Bowei Chen, Peter Mason, Sam Seljan	In this paper, we report the first empirical study and live test of the reserve price optimisation problem in the context of Real-Time Bidding (RTB) display advertising from an operational environment.
201	Large-scale high-precision topic modeling on twitter	Shuang-Hong Yang, Alek Kolcz, Andy Schlaikjer, Pankaj Gupta	We present a spectrum of topic modeling techniques that contribute to a deployed system.
202	Large scale visual recommendations from street fashion images	Vignesh Jagadeesh, Robinson Piramuthu, Anurag Bhardwaj, Wei Di, Neel Sundaresan	We describe a completely automated large scale visual recommendation system for fashion.
203	We know what you want to buy: a demographic-based system for product recommendation on microblogs	Xin Wayne Zhao, Yanwei Guo, Yulan He, Han Jiang, Yuexin Wu, Xiaoming Li	In this paper, we develop a novel product recommender system called METIS, a MErchanT Intelligence recommender System, which detects users’ purchase intents from their microblogs in near real-time and makes product recommendation based on matching the users’ demographic information extracted from their public profiles with product demographics learned from microblogs and online reviews.
204	Modeling professional similarity by mining professional career trajectories	Ye Xu, Zang Li, Abhishek Gupta, Ahmet Bugdayci, Anmol Bhasin	Using this professional profile dataset, this paper attempts to model profiles of individuals as a sequence of positions held by them as a time-series of nodes, each of which represents one particular position or job experience in the individual’s career trajectory.
205	Filling context-ad vocabulary gaps with click logs	Yukihiro Tagami, Toru Hotta, Yusuke Tanaka, Shingo Ono, Koji Tsukamoto, Akira Tajima	In this work, we propose a translation method that learns the mapping of the contextual information to the textual features of ads by using past click data.
206	Does social good justify risking personal privacy?	Raghu Ramakrishnan, Geoffrey I. Webb	How should we approach this trade-off?
207	Scaling up deep learning	Yoshua Bengio	The tutorial will introduce some of the basic algorithms, both on the supervised and unsupervised sides, as well as discuss some of the guidelines for successfully using them in practice.
208	Constructing and mining web-scale knowledge graphs: KDD 2014 tutorial	Antoine Bordes, Evgeniy Gabrilovich	In this tutorial, we will present the state of the art in constructing, mining, and growing knowledge graphs.
209	Bringing structure to text: mining phrases, entities, topics, and hierarchies	Jiawei Han, Chi Wang, Ahmed El-Kishky	In this tutorial, we provide a comprehensive survey on the state of the art of data-driven methods that automatically mine phrases, extract and infer latent structures from text corpus, and construct multi-granularity topical groupings and hierarchies of the underlying themes.
210	Computational epidemiology	Madhav V. Marathe, Anil Kumar S. Vullikanti	In this tutorial, we focus on an approach based on diffusion processes on complex networks.
211	Management and analytic of biomedical big data with cloud-based in-memory database and dynamic querying: a hands-on experience with real-world data	Mengling Feng, Mohammad Ghassemi, Thomas Brennan, John Ellenberger, Ishrar Hussain, Roger Mark	In this tutorial, the participants will learn the difference between in-memory DBMS and traditional DBMS through hands-on exercises using SAP’s cloud-based HANA in-memory DBMS in conjunction with the Multi-parameter Intelligent Monitoring in Intensive Care (MIMIC) dataset.
212	The recommender problem revisited: morning tutorial	Xavier Amatriain, Bamshad Mobasher	In this tutorial we will describe different components of modern recommender systems such as: personalized ranking, similarity, explanations, context-awareness, or search as recommendation.
213	Correlation clustering: from theory to practice	Francesco Bonchi, David Garcia-Soriano, Edo Liberty	The goal of this tutorial is to show how correlation clustering can be a powerful addition to the toolkit of the data mining researcher and practitioner, and to encourage discussions and further research in the area.
214	Deep learning	Ruslan Salakhutdinov	The goal of the tutorial is to introduce the recent developments of various deep learning methods to the KDD community.
215	Network mining and analysis for social applications	Feida Zhu, Huan Sun, Xifeng Yan	In this tutorial, we aim to examine some recent advances in network mining and analysis for social applications, covering a diverse collection of methodologies and applications from the perspectives of event, relationship, collaboration, and network pattern.
216	Sampling for big data: a tutorial	Graham Cormode, Nick Duffield	One response to the proliferation of large datasets has been to develop ingenious ways to throw resources at the problem, using massive fault tolerant storage architectures, parallel and graphical computation models such as MapReduce, Pregel and Giraph.
217	Statistically sound pattern discovery	Wilhelmiina Hämäläinen, Geoffrey I. Webb	We present the current state-of-the-art solutions and explore in detail how this approach to pattern discovery can deliver efficient and effective discovery of small sets of interesting patterns.
218	Recommendation in social media: recent advances and new frontiers	Jiliang Tang, Jie Tang, Huan Liu	In this tutorial, we aim to provide a comprehensive overview of various recommendation tasks in social media, especially their recent advances and new frontiers.