Paper Digest: KDD 2014 Highlights
The ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) is one of the top data mining conferences in the world.
To help the community quickly catch up on the work presented at this conference, the Paper Digest Team processed all accepted papers and generated one highlight sentence (typically the main topic) for each paper. Readers are encouraged to read these machine-generated highlights/summaries to quickly get the main idea of each paper.
If you do not want to miss any interesting academic paper, you are welcome to sign up for our free daily paper digest service to get updates on new papers published in your area every day. You are also welcome to follow us on Twitter and LinkedIn to stay updated on new conference digests.
Paper Digest Team
team@paperdigest.org
TABLE 1: KDD 2014 Papers
No. | Title | Authors | Highlight |
---|---|---|---|
1 | The battle for the future of data mining | Oren Etzioni | My talk will describe work at the new Allen Institute for AI towards building the next generation of text-mining systems. |
2 | Data, predictions, and decisions in support of people and society | Eric Horvitz | I will describe efforts to harness data for making predictions and guiding decisions, touching on work in transportation, healthcare, online services, and interactive systems. |
3 | A data driven approach to diagnosing and treating disease | Eric Schadt | More specifically, we have constructed predictive network models for Alzheimer’s disease, along with other common human diseases such as obesity, diabetes, heart disease, inflammatory bowel disease, and cancer, and demonstrated a causal network common across all of these diseases (3, 5-10). |
4 | Bugbears or legitimate threats?: (social) scientists’ criticisms of machine learning? | Sendhil Mullainathan | This talk describes joint work with Jon Kleinberg and individual projects with Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, Anuj Shah, Chenhao Tan, Mike Yeomans and Tom Zimmerman. |
5 | Prediction of human emergency behavior and their mobility following large-scale disaster | Xuan Song, Quanshi Zhang, Yoshihide Sekimoto, Ryosuke Shibasaki | In this paper, we build up a large human mobility database (GPS records of 1.6 million users over one year) and several different datasets to capture and analyze human emergency behavior and their mobility following the Great East Japan Earthquake and Fukushima nuclear accident. |
6 | Inferring user demographics and social strategies in mobile social networks | Yuxiao Dong, Yang Yang, Jie Tang, Yang Yang, Nitesh V. Chawla | In this paper, we aim to harness the power of big data to automatically infer users’ demographics based on their daily mobile communication patterns. |
7 | Travel time estimation of a path using sparse trajectories | Yilun Wang, Yu Zheng, Yexiang Xue | In this paper, we propose a citywide and real-time model for estimating the travel time of any path (represented as a sequence of connected road segments) in real time in a city, based on the GPS trajectories of vehicles received in current time slots and over a period of history as well as map data sources. |
8 | Modeling human location data with mixtures of kernel densities | Moshe Lichman, Padhraic Smyth | In this paper we address the problem of learning spatial density models, focusing specifically on individual-level data. |
9 | A cost-effective recommender system for taxi drivers | Meng Qu, Hengshu Zhu, Junming Liu, Guannan Liu, Hui Xiong | To this end, in this paper, we propose to develop a cost-effective recommender system for taxi drivers. |
10 | LUDIA: an aggregate-constrained low-rank reconstruction algorithm to leverage publicly released health data | Yubin Park, Joydeep Ghosh | This paper introduces LUDIA, a novel low-rank approximation algorithm that utilizes aggregation constraints in addition to auxiliary information in order to estimate or "reconstruct" the original individual-level values from aggregate data. |
11 | People on drugs: credibility of user statements in health communities | Subhabrata Mukherjee, Gerhard Weikum, Cristian Danescu-Niculescu-Mizil | In this work we propose a method for automatically establishing the credibility of user-generated medical statements and the trustworthiness of their authors by exploiting linguistic cues and distant supervision from expert sources. |
12 | Unfolding physiological state: mortality modelling in intensive care units | Marzyeh Ghassemi, Tristan Naumann, Finale Doshi-Velez, Nicole Brimmer, Rohit Joshi, Anna Rumshisky, Peter Szolovits | We examined the use of latent variable models (viz. |
13 | Unsupervised learning of disease progression models | Xiang Wang, David Sontag, Fei Wang | In this paper, we propose a probabilistic disease progression model that addresses these challenges. |
14 | Good-enough brain model: challenges, algorithms and discoveries in multi-subject experiments | Evangelos E. Papalexakis, Alona Fyshe, Nicholas D. Sidiropoulos, Partha Pratim Talukdar, Tom M. Mitchell, Christos Faloutsos | In this work we present a simple, novel good-enough brain model, or GeBM in short, and a novel algorithm Sparse-SysId, which are able to effectively model the dynamics of the neuron interactions and infer the functional connectivity. |
15 | FUNNEL: automatic mining of spatially coevolving epidemics | Yasuko Matsubara, Yasushi Sakurai, Willem G. van Panhuis, Christos Faloutsos | In this paper, we present FUNNEL, a unifying analytical model for large scale epidemiological data, as well as a novel fitting algorithm, FUNNELFIT, which solves the above problem. |
16 | Marble: high-throughput phenotyping from electronic health records via sparse nonnegative tensor factorization | Joyce C. Ho, Joydeep Ghosh, Jimeng Sun | We propose Marble, a novel sparse non-negative tensor factorization method to derive phenotype candidates with virtually no human supervision. |
17 | Scalable noise mining in long-term electrocardiographic time-series to predict death following heart attacks | Chih-Chun Chia, Zeeshan Syed | In this paper, we extend this work and focus on the question of how to reduce its computational complexity for scalable use in large datasets or energy constrained embedded devices. |
18 | From micro to macro: data driven phenotyping by densification of longitudinal electronic medical records | Jiayu Zhou, Fei Wang, Jianying Hu, Jieping Ye | In this paper, we propose a data driven phenotyping framework called Pacifier (PAtient reCord densIFIER), where we interpret the longitudinal EMR data of each patient as a sparse matrix with a feature dimension and a time dimension, and derive more robust patient phenotypes by exploring the latent structure of those matrices. |
19 | Clinical risk prediction with multilinear sparse logistic regression | Fei Wang, Ping Zhang, Buyue Qian, Xiang Wang, Ian Davidson | We propose a block proximal descent approach to solve the problem and prove its convergence. |
20 | Dual beta process priors for latent cluster discovery in chronic obstructive pulmonary disease | James C. Ross, Peter J. Castaldi, Michael H. Cho, Jennifer G. Dy | In this paper we introduce a transformative way of looking at the COPD subtyping task. |
21 | COM: a generative model for group recommendation | Quan Yuan, Gao Cong, Chin-Yew Lin | In this paper, we propose a probabilistic model named COM (COnsensus Model) to model the generative process of group activities, and make group recommendations. |
22 | Leveraging user libraries to bootstrap collaborative filtering | Laurent Charlin, Richard S. Zemel, Hugo Larochelle | We introduce a novel graphical model, the collaborative score topic model (CSTM), for personal recommendations of textual documents. |
23 | Topic-factorized ideal point estimation model for legislative voting network | Yupeng Gu, Yizhou Sun, Ning Jiang, Bingyu Wang, Ting Chen | In this paper, we propose a novel topic-factorized ideal point estimation model for a legislative voting network in a unified framework. |
24 | Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS) | Qiming Diao, Minghui Qiu, Chao-Yuan Wu, Alexander J. Smola, Jing Jiang, Chong Wang | In this work we propose a probabilistic model based on collaborative filtering and topic modeling. |
25 | User effort minimization through adaptive diversification | Mahbub Hasan, Abhijith Kashyap, Vagelis Hristidis, Vassilis Tsotras | In this paper, we show that for different search tasks there is a different ideal balance of relevance and diversity. |
26 | Relevant overlapping subspace clusters on categorical data | Xiao He, Jing Feng, Bettina Konte, Son T. Mai, Claudia Plant | Therefore, we propose ROCAT (Relevant Overlapping Subspace Clusters on Categorical Data), a novel technique based on the idea of data compression. |
27 | Batch discovery of recurring rare classes toward identifying anomalous samples | Murat Dundar, Halid Ziya Yerebakan, Bartek Rajwa | We present a clustering algorithm for discovering rare yet significant recurring classes across a batch of samples in the presence of random effects. |
28 | A dirichlet multinomial mixture model-based approach for short text clustering | Jianhua Yin, Jianyong Wang | In this paper, we proposed a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model for short text clustering (abbr. |
29 | Representative clustering of uncertain data | Andreas Züfle, Tobias Emrich, Klaus Arthur Schmid, Nikos Mamoulis, Arthur Zimek, Matthias Renz | In this paper, we describe a framework, based on possible-worlds semantics; when applied on an uncertain dataset, it computes a set of representative clusterings, each of which has a probabilistic guarantee not to exceed some maximum distance to the ground truth clustering, i.e., the clustering of the actual (but unknown) data. |
30 | SMVC: semi-supervised multi-view clustering in subspace projections | Stephan Günnemann, Ines Färber, Matthias Rüdiger, Thomas Seidl | In this paper, we join both research areas and present a solution for integrating prior knowledge in the process of detecting multiple clusterings. |
31 | FastXML: a fast, accurate and stable tree-classifier for extreme multi-label learning | Yashoteja Prabhu, Manik Varma | Our objective, in this paper, is to develop an extreme multi-label classifier that is faster to train and more accurate at prediction than the state-of-the-art Multi-label Random Forest (MLRF) algorithm [2] and the Label Partitioning for Sub-linear Ranking (LPSR) algorithm [35]. |
32 | A multi-class boosting method with direct optimization | Shaodan Zhai, Tian Xia, Shaojun Wang | We present a direct multi-class boosting (DMCBoost) method for classification with the following properties: (i) instead of reducing the multi-class classification task to a set of binary classification tasks, DMCBoost directly solves the multi-class classification problem, and only requires very weak base classifiers; (ii) DMCBoost builds an ensemble classifier by directly optimizing the non-convex performance measures, including the empirical classification error and margin functions, without resorting to any upper bounds or approximations. |
33 | An efficient algorithm for weak hierarchical lasso | Yashu Liu, Jie Wang, Jieping Ye | In this paper, we propose to directly solve the non-convex weak hierarchical Lasso by making use of the GIST (General Iterative Shrinkage and Thresholding) optimization framework which has been shown to be efficient for solving non-convex sparse formulations. |
34 | Online multiple kernel regression | Doyen Sahoo, Steven C.H. Hoi, Bin Li | In this paper, we propose a family of OMKR algorithms for regression and discuss their application to time series prediction tasks. |
35 | Class-distribution regularized consensus maximization for alleviating overfitting in model combination | Sihong Xie, Jing Gao, Wei Fan, Deepak Turaga, Philip S. Yu | We propose a novel model called Regularized Consensus Maximization (RCM), which is formulated as an optimization problem to combine the maximum consensus and large margin principles. |
36 | Large margin distribution machine | Teng Zhang, Zhi-Hua Zhou | In this paper, we propose the Large margin Distribution Machine (LDM), which tries to achieve a better generalization performance by optimizing the margin distribution. |
37 | Distance metric learning using dropout: a structured regularization approach | Qi Qian, Juhua Hu, Rong Jin, Jian Pei, Shenghuo Zhu | In this paper, we exploit the dropout technique, which has been successfully applied in deep learning to alleviate the over-fitting problem, for DML. |
38 | Box drawings for learning with imbalanced data | Siong Thye Goh, Cynthia Rudin | We propose two machine learning algorithms to handle highly imbalanced classification problems. |
39 | Incremental and decremental training for linear classification | Cheng-Hao Tsai, Chieh-Yen Lin, Chih-Jen Lin | In this paper, we focus on linear classifiers including logistic regression and linear SVM because of their simplicity over kernel or other methods. |
40 | Supervised deep learning with auxiliary networks | Junbo Zhang, Guangjian Tian, Yadong Mu, Wei Fan | The major contribution of our work is the exposition of a novel supervised deep learning algorithm, which is distinguished by two unique traits. |
41 | Sleep analytics and online selective anomaly detection | Tahereh Babaie, Sanjay Chawla, Romesh Abeysuriya | We introduce a new problem, the Online Selective Anomaly Detection (OSAD), to model a specific scenario emerging from research in sleep science. |
42 | GLAD: group anomaly detection in social media analysis | Rose Yu, Xinran He, Yan Liu | In this paper, we take a generative approach by proposing a hierarchical Bayes model: Group Latent Anomaly Detection (GLAD) model. |
43 | FBLG: a simple and effective approach for temporal dependence discovery from time series data | Dehua Cheng, Mohammad Taha Bahadori, Yan Liu | We observe that when we look in reversed order of time, the temporal dependence structure of the time series is usually preserved after switching the roles of cause and effect. |
44 | Learning time-series shapelets | Josif Grabocka, Nicolas Schilling, Martin Wistuba, Lars Schmidt-Thieme | In contrast to the state-of-the-art, this paper proposes a novel perspective in terms of learning shapelets. |
45 | Utilizing temporal patterns for estimating uncertainty in interpretable early decision making | Mohamed F. Ghalwash, Vladan Radosavljevic, Zoran Obradovic | In this study, we propose a simple and yet effective method to provide uncertainty estimates for an interpretable early classification method. |
46 | Prototype-based learning on concept-drifting data streams | Junming Shao, Zahra Ahmadi, Stefan Kramer | In this paper, we propose a prototype-based classification model for evolving data streams, called SyncStream, which dynamically models time-changing concepts and makes predictions in a local fashion. |
47 | Detecting moving object outliers in massive-scale trajectory streams | Yanwei Yu, Lei Cao, Elke A. Rundensteiner, Qin Wang | Our theoretical analysis and empirical study on the Beijing Taxi and GMTI (Ground Moving Target Indicator) datasets demonstrate its effectiveness in capturing abnormal moving objects. |
48 | The setwise stream classification problem | Charu C. Aggarwal | In this paper, we present a first approach for real time and streaming classification of such data. |
49 | Streamed approximate counting of distinct elements: beating optimal batch methods | Daniel Ting | This paper advances the state of the art in probabilistic methods for estimating the number of distinct elements in a streaming setting. New streaming algorithms are given that provably beat the "optimal" errors for Min-count and HyperLogLog while using the same sketch. |
50 | Time-varying learning and content analytics via sparse factor analysis | Andrew S. Lan, Christoph Studer, Richard G. Baraniuk | We propose SPARFA-Trace, a new machine learning-based framework for time-varying learning and content analytics for educational applications. |
51 | Active-transductive learning with label-adapted kernels | Dan Kushnir | This paper presents an efficient active-transductive approach for classification. |
52 | Active learning for sparse bayesian multilabel classification | Deepak Vasisht, Andreas Damianou, Manik Varma, Ashish Kapoor | We propose a novel inference algorithm for the sparse Bayesian multilabel model of [17]. |
53 | Large-scale adaptive semi-supervised learning via unified inductive and transductive model | De Wang, Feiping Nie, Heng Huang | To address these two challenges, in this paper, we propose an adaptive semi-supervised learning model. |
54 | Active semi-supervised learning using sampling theory for graph signals | Akshay Gadde, Aamir Anis, Antonio Ortega | We propose a novel framework for this problem based on our recent results on sampling theory for graph signals. |
55 | Active collaborative permutation learning | Jialei Wang, Nathan Srebro, James Evans | We consider the problem of Collaborative Permutation Recovery, i.e. recovering multiple permutations over objects (e.g. preference rankings over different options) from limited pairwise comparisons. |
56 | Effective global approaches for mutual information based feature selection | Xuan Vinh Nguyen, Jeffrey Chan, Simone Romano, James Bailey | In this paper, we take a systematic approach to the problem of global MI-based feature selection. |
57 | Gradient boosted feature selection | Zhixiang Xu, Gao Huang, Kilian Q. Weinberger, Alice X. Zheng | In this work we propose a novel feature selection algorithm, Gradient Boosted Feature Selection (GBFS), which satisfies all four of these requirements. |
58 | Simultaneous feature and feature group selection through hard thresholding | Shuo Xiang, Tao Yang, Jieping Ye | In this paper, we fulfill this gap by introducing an efficient sparse group hard thresholding algorithm. |
59 | Safe and efficient screening for sparse support vector machine | Zheng Zhao, Jun Liu, James Cox | In this paper, a novel screening technique is proposed to accelerate model selection for l1-regularized l2-SVM and effectively improve its scalability. |
60 | Factorized sparse learning models with interpretable high order feature interactions | Sanjay Purushotham, Martin Renqiang Min, C.-C. Jay Kuo, Rachel Ostroff | In this paper, we propose a factorization based sparse learning framework termed FHIM for identifying high-order feature interactions in linear and logistic regression models, and study several optimization methods for solving them. |
61 | Parallel gibbs sampling for hierarchical dirichlet processes via gamma processes equivalence | Dehua Cheng, Yan Liu | In this paper, we propose an effective parallel Gibbs sampling algorithm for HDP by exploring its connections with the gamma-gamma-Poisson process. |
62 | Empirical glitch explanations | Tamraparni Dasu, Ji Meng Loh, Divesh Srivastava | In this paper, we introduce the notion of Empirical Glitch Explanations – concise, multi-dimensional descriptions of subsets of potentially dirty data – and propose a scalable method for empirically generating such explanatory characterizations. |
63 | Learning with dual heterogeneity: a nonparametric bayes model | Hongxia Yang, Jingrui He | Based on this model, we propose the NOBLE algorithm using an efficient Gibbs sampler. |
64 | Online chinese restaurant process | Chien-Liang Liu, Tsung-Hsun Tsai, Chia-Hoang Lee | This work proposes an online Chinese restaurant process (CRP) algorithm, which is an online and nonparametric algorithm, to tackle this problem. |
65 | Knowledge vault: a web-scale approach to probabilistic knowledge fusion | Xin Dong, Evgeniy Gabrilovich, Geremy Heitz, Wilko Horn, Ni Lao, Kevin Murphy, Thomas Strohmann, Shaohua Sun, Wei Zhang | Here we introduce Knowledge Vault, a Web-scale probabilistic knowledge base that combines extractions from Web content (obtained via analysis of text, tabular data, page structure, and human annotations) with prior knowledge derived from existing knowledge repositories. |
66 | Improving the modified nyström method using spectral shifting | Shusen Wang, Chao Zhang, Hui Qian, Zhihua Zhang | In this paper, we propose a variant of the Nyström method called the modified Nyström by spectral shifting (SS-Nyström). |
67 | Fast flux discriminant for large-scale sparse nonlinear classification | Wenlin Chen, Yixin Chen, Kilian Q. Weinberger | In this paper, we propose a novel supervised learning method, Fast Flux Discriminant (FFD), for large-scale nonlinear classification. |
68 | Scalable histograms on large probabilistic data | Mingwang Tang, Feifei Li | We introduced novel synopses to reduce communication cost when running our methods in such settings. |
69 | Correlation clustering in MapReduce | Flavio Chierichetti, Nilesh Dalvi, Ravi Kumar | In this paper we obtain a new algorithm for correlation clustering. |
70 | Scaling out big data missing value imputations: pythia vs. godzilla | Christos Anagnostopoulos, Peter Triantafillou | In this paper we derive answers to these fundamental questions and develop principled methods and a framework which offer large performance speed-ups and better, or comparable, errors to that of Godzilla, independently of which missing-value imputation algorithm is used. |
71 | Efficient mini-batch training for stochastic optimization | Mu Li, Tong Zhang, Yuqiang Chen, Alexander J. Smola | This paper introduces a technique based on approximate optimization of a conservatively regularized objective function within each minibatch. |
72 | Streaming submodular maximization: massive data summarization on the fly | Ashwinkumar Badanidiyuru, Baharan Mirzasoleiman, Amin Karbasi, Andreas Krause | In this paper, we address the problem of extracting representative elements from a large stream of data. |
73 | Distance queries from sampled data: accurate and efficient | Edith Cohen | We derive novel estimators for estimating $L_p$ distance from sampled data. |
74 | Improved testing of low rank matrices | Yi Li, Zhengyu Wang, David P. Woodruff | We study the problem of determining if an input matrix $A \in \mathbb{R}^{m \times n}$ can be well-approximated by a low rank matrix. |
75 | DeepWalk: online learning of social representations | Bryan Perozzi, Rami Al-Rfou, Steven Skiena | We present DeepWalk, a novel approach for learning latent representations of vertices in a network. |
76 | Open-domain quantity queries on web tables: annotation, response, and consensus models | Sunita Sarawagi, Soumen Chakrabarti | Our goal is to respond to such queries with a ranked list of quantity distributions, suitably represented. |
77 | Crowdsourced time-sync video tagging using temporal and personalized topic modeling | Bin Wu, Erheng Zhong, Ben Tan, Andrew Horner, Qiang Yang | In this paper, we propose a new application which extracts time-sync video tags by automatically exploiting crowdsourced comments from video websites such as Nico Nico Douga, where videos are commented on by online crowd users in a time-sync manner. |
78 | Identifying and labeling search tasks via query-based hawkes processes | Liangda Li, Hongbo Deng, Anlei Dong, Yi Chang, Hongyuan Zha | In this paper, we propose a probabilistic method for identifying and labeling search tasks based on the following intuitive observations: queries that are issued temporally close by users in many sequences of queries are likely to belong to the same search task; meanwhile, different users having the same information needs tend to submit topically coherent search queries. |
79 | LaSEWeb: automating search strategies over semi-structured web data | Oleksandr Polozov, Sumit Gulwani | We describe the design and implementation of a domain-specific language that enables extracting data from a webpage based on its structure, visual layout, and linguistic patterns. |
80 | Personalized search result diversification via structured learning | Shangsong Liang, Zhaochun Ren, Maarten de Rijke | To further enhance the performance, we propose a supervised learning strategy. |
81 | Efficient multi-task feature learning with calibration | Pinghua Gong, Jiayu Zhou, Wei Fan, Jieping Ye | In this paper, we propose a variant of the calibrated multi-task feature learning formulation by including a squared norm regularizer. |
82 | Multi-task copula by sparse graph regression | Tianyi Zhou, Dacheng Tao | This paper proposes multi-task copula (MTC) that can handle a much wider class of tasks than mean regression with Gaussian noise in most former multi-task learning (MTL). |
83 | Unifying learning to rank and domain adaptation: enabling cross-task document scoring | Mianwei Zhou, Kevin C. Chang | We propose the Tree-structured Boltzmann Machine (T-RBM), a novel two-stage Markov Network, as our solution. |
84 | Scalable heterogeneous translated hashing | Ying Wei, Yangqiu Song, Yi Zhen, Bo Liu, Qiang Yang | In this paper, we propose a Heterogeneous Translated Hashing (HTH) method with such auxiliary bridge incorporated not only to improve current multi-view search but also to enable similarity search across heterogeneous media which have no direct correspondence. |
85 | Matching users and items across domains to improve the recommendation quality | Chung-Yi Li, Shou-De Lin | Given two homogeneous rating matrices with some overlapped users/items whose mappings are unknown, this paper aims at answering two questions. |
86 | Optimal recommendations under attraction, aversion, and social influence | Wei Lu, Stratis Ioannidis, Smriti Bhagat, Laks V.S. Lakshmanan | In this work, we model interest evolution through dynamic interest cascades: we consider a scenario where a user’s interests may be affected by (a) the interests of other users in her social circle, as well as (b) suggestions she receives from a recommender system. |
87 | ClusCite: effective citation recommendation by information network-based clustering | Xiang Ren, Jialu Liu, Xiao Yu, Urvashi Khandelwal, Quanquan Gu, Lidan Wang, Jiawei Han | In this study, we investigate the problem in the context of heterogeneous bibliographic networks and propose a novel cluster-based citation recommendation framework, called ClusCite, which explores the principle that citations tend to be softly clustered into interest groups based on multiple types of relationships in the network. |
88 | GeoMF: joint geographical modeling and matrix factorization for point-of-interest recommendation | Defu Lian, Cong Zhao, Xing Xie, Guangzhong Sun, Enhong Chen, Yong Rui | Besides, researchers have recently discovered a spatial clustering phenomenon in human mobility behavior on the LBSNs, i.e., individual visiting locations tend to cluster together, and also demonstrated its effectiveness in POI recommendation, thus we incorporate it into the factorization model. |
89 | Detecting anomalies in dynamic rating data: a robust probabilistic model for rating evolution | Stephan Günnemann, Nikou Günnemann, Christos Faloutsos | In this work, we tackle the following question: Given the time-stamped rating data for a product or service, how can we detect the general rating behavior of users as well as time intervals where the ratings behave anomalously? |
90 | Product selection problem: improve market share by learning consumer behavior | Silei Xu, John Chi-Shing Lui | To tackle this problem, we propose an efficient greedy-based approximation algorithm with a provable solution guarantee. |
91 | TCS: efficient topic discovery over crowd-oriented service data | Yongxin Tong, Caleb Chen Cao, Lei Chen | In particular, in order to train TCS efficiently, we design a novel parameter inference algorithm, the Bucket Parameter Estimation (BPE), which utilizes belief propagation and a new sketching technique, called Pairwise Sketch (pSketch). |
92 | SigniTrend: scalable detection of emerging topics in textual streams by hashed significance thresholds | Erich Schubert, Michael Weiler, Hans-Peter Kriegel | Our contributions to the detection of emerging trends are three-fold: first of all, we propose a significance measure that can be used to detect emerging topics early, long before they become "hot tags", by drawing upon experience from outlier detection. |
93 | Experiments with non-parametric topic models | Wray L. Buntine, Swapnil Mishra | We look at the comparative behaviour of different models and present some experimental insights. |
94 | Reducing the sampling complexity of topic models | Aaron Q. Li, Amr Ahmed, Sujith Ravi, Alexander J. Smola | In this paper we propose an algorithm which scales linearly with the number of actually instantiated topics $k_d$ in the document. |
95 | Dynamics of news events and social media reaction | Mikalai Tsytsarau, Themis Palpanas, Malu Castellanos | In this paper, we study the dynamics of news events and their relation to changes of sentiment expressed on relevant topics. |
96 | Differentially private network data release via structural inference | Qian Xiao, Rui Chen, Kian-Lee Tan | In this paper, we present a novel data sanitization solution that infers a network’s structure in a differentially private manner. |
97 | Exponential random graph estimation under differential privacy | Wentian Lu, Gerome Miklau | In this work we propose algorithms for privately estimating the parameters of exponential random graph models (ERGMs). |
98 | Top-k frequent itemsets via differentially private FP-trees | Jaewoo Lee, Christopher W. Clifton | We give an approach that first identifies top-k frequent itemsets, then uses them to construct a compact, differentially private FP-tree. |
99 | CatchSync: catching synchronized behavior in large directed graphs | Meng Jiang, Peng Cui, Alex Beutel, Christos Faloutsos, Shiqiang Yang | We propose a fast and effective method, CatchSync, which exploits two of the tell-tale signs left in graphs by fraudsters: (a) synchronized behavior: suspicious nodes have extremely similar behavior pattern, because they are often required to perform some task together (such as follow the same user); and (b) rare behavior: their connectivity patterns are very different from the majority. |
100 | Mobile app recommendations with security and privacy awareness | Hengshu Zhu, Hui Xiong, Yong Ge, Enhong Chen | To fill this crucial void, in this paper, we propose to develop a mobile App recommender system with privacy and security awareness. |
101 | Fast DTT: a near linear algorithm for decomposing a tensor into factor tensors | Xiaomin Fang, Rong Pan | To overcome these problems, we propose a near linear tensor factorization approach, which decomposes a tensor into factor tensors in order to model the higher-order relations, without loss of important information. |
102 | Clustering and projected clustering with adaptive neighbors | Feiping Nie, Xiaoqian Wang, Heng Huang | In this paper, we propose a novel clustering model to learn the data similarity matrix and clustering structure simultaneously. |
103 | LWI-SVD: low-rank, windowed, incremental singular value decompositions on time-evolving data sets | Xilun Chen, K. Selcuk Candan | To address these challenges, in this paper, we propose a Low-rank, Windowed, Incremental SVD (LWI-SVD) algorithm, which (a) leverages efficient and accurate low-rank approximations to speed up incremental SVD updates and (b) uses a window-based approach to aggregate multiple incoming updates (insertions or deletions of rows and columns) and, thus, reduces online processing costs. |
104 | Provable deterministic leverage score sampling | Dimitris Papailiopoulos, Anastasios Kyrillidis, Christos Boutsidis | In this work, we provide a novel theoretical analysis of deterministic leverage score sampling. |
105 | Semantic visualization for spherical representation | Tuan M.V. Le, Hady W. Lauw | In this paper, we address the semantic visualization problem. |
106 | Grouping students in educational settings | Rakesh Agrawal, Behzad Golshan, Evimaria Terzi | We propose a framework for rigorously studying this question, taking a computational perspective. |
107 | Inferring gas consumption and pollution emission of vehicles throughout a city | Jingbo Shang, Yu Zheng, Wenzhu Tong, Eric Chang, Yong Yu | As many road segments are not traversed by trajectories (i.e., data sparsity), we propose a Travel Speed Estimation (TSE) model based on a context-aware matrix factorization approach. |
108 | Methods for ordinal peer grading | Karthik Raman, Thorsten Joachims | Thus, in this paper we study the problem of automatically inferring student grades from ordinal peer feedback, as opposed to existing methods that require cardinal peer feedback. |
109 | Exploiting geographic dependencies for real estate appraisal: a mutual perspective of ranking and clustering | Yanjie Fu, Hui Xiong, Yong Ge, Zijun Yao, Yu Zheng, Zhi-Hua Zhou | To this end, in this paper, we propose a geographic method, named ClusRanking, for estate appraisal by leveraging the mutual enforcement of ranking and clustering power. |
110 | Towards scalable critical alert mining | Bo Zong, Yinghui Wu, Jie Song, Ambuj K. Singh, Hasan Cam, Jiawei Han, Xifeng Yan | This paper studies the critical alert mining problem: Given a set of alert sequences, we aim to find a set of k critical alerts such that the number of alerts potentially triggered by them is maximized. |
111 | From labor to trader: opinion elicitation via online crowds as a market | Caleb Chen Cao, Lei Chen, Hosagrahar Visvesvaraya Jagadish | In this paper, we study how to use crowds for Opinion Elicitation. |
112 | Optimal real-time bidding for display advertising | Weinan Zhang, Shuai Yuan, Jun Wang | In this paper we study bid optimisation for real-time bidding (RTB) based display advertising. |
113 | Quantifying herding effects in crowd wisdom | Ting Wang, Dashun Wang, Fei Wang | In this paper, we develop a mechanistic framework to model social influence of prior collective opinions (e.g., online product ratings) on subsequent individual decision making. |
114 | Modeling delayed feedback in display advertising | Olivier Chapelle | We tackle this issue by introducing an additional model that captures the conversion delay. |
115 | Networked bandits with disjoint linear payoffs | Meng Fang, Dacheng Tao | In this paper, we study ‘networked bandits’, a new bandit problem where a set of interrelated arms varies over time and, given the contextual information that selects one arm, invokes other correlated arms. |
116 | Mining topics in documents: standing on the shoulders of big data | Zhiyuan Chen, Bing Liu | In recent years, knowledge-based topic models have been proposed, which ask human users to provide some prior domain knowledge to guide the model to produce better topics. |
117 | Integrating spreadsheet data via accurate and low-effort extraction | Zhe Chen, Michael Cafarella | We propose a two-phase semiautomatic system that extracts accurate relational metadata while minimizing user effort. |
118 | Sentiment expression conditioned by affective transitions and social forces | Moritz Sudhof, Andrés Gómez Emilsson, Andrew L. Maas, Christopher Potts | We develop a theory of conditional dependencies between emotional states in which emotions are characterized not only by valence (polarity) and arousal (intensity) but also by the role they play in state transitions and social relationships. |
119 | Entity profiling with varying source reliabilities | Furong Li, Mong Li Lee, Wynne Hsu | In this paper, we present a framework called Comet that interleaves record linkage with error correction, taking into consideration the source reliabilities on various attributes. |
120 | Open question answering over curated and extracted knowledge bases | Anthony Fader, Luke Zettlemoyer, Oren Etzioni | In this paper, we present OQA, the first approach to leverage both curated and extracted KBs. |
121 | Non-parametric scan statistics for event detection and forecasting in heterogeneous social media graphs | Feng Chen, Daniel B. Neill | As a case study, we consider two applications using Twitter data, civil unrest event detection and rare disease outbreak detection, and present empirical evaluations illustrating the effectiveness and efficiency of our proposed approach. |
122 | Event detection in activity networks | Polina Rozenshtein, Aris Anagnostopoulos, Aristides Gionis, Nikolaj Tatti | We consider the problem of mining activity networks to identify interesting events, such as a big concert or a demonstration in a city, or a trending keyword in a user community in a social network. |
123 | FEMA: flexible evolutionary multi-faceted analysis for dynamic behavioral pattern discovery | Meng Jiang, Peng Cui, Fei Wang, Xinran Xu, Wenwu Zhu, Shiqiang Yang | In this paper, we propose a Flexible Evolutionary Multi-faceted Analysis (FEMA) framework for both behavior prediction and pattern mining. |
124 | Profit-maximizing cluster hires | Behzad Golshan, Theodoros Lappas, Evimaria Terzi | Our work presents a detailed analysis of the computational complexity and hardness of approximation of the problem, as well as heuristic, yet effective, algorithms for solving it in practice. |
125 | On social event organization | Keqian Li, Wei Lu, Smriti Bhagat, Laks V.S. Lakshmanan, Cong Yu | In this paper, we study, for the first time to the best of our knowledge, the key computational problem involved in the organization of social events. |
126 | A bayesian framework for estimating properties of network diffusions | Varun R. Embar, Rama Kumar Pasumarthi, Indrajit Bhattacharya | In this paper, we propose and study this novel problem in a Bayesian framework by capturing the posterior distribution of these hidden variables given the observed cascades, and computing the expectation of these properties under this posterior distribution. |
127 | Scalable diffusion-aware optimization of network topology | Elias Boutros Khalil, Bistra Dilkina, Le Song | In this paper, we focus on the widely studied linear threshold diffusion model, and prove, for the first time, that the network modification problems under this model have supermodular objective functions. |
128 | Probabilistic latent network visualization: inferring and embedding diffusion networks | Takeshi Kurashima, Tomoharu Iwata, Noriko Takaya, Hiroshi Sawada | This paper proposes a probabilistic model for inferring the diffusion network, which we call Probabilistic Latent Network Visualization (PLNV); it is based on cascade data, a record of observed times of node influence. |
129 | MMRate: inferring multi-aspect diffusion networks with multi-pattern cascades | Senzhang Wang, Xia Hu, Philip S. Yu, Zhoujun Li | In this paper, we investigate a novel problem of inferring multi-aspect diffusion networks with multi-pattern cascades. |
130 | Stability of influence maximization | Xinran He, David Kempe | In an attempt to fix the record, the present article combines the problem motivation, models, and experimental results sections from the original incorrect article with the new hardness result. |
131 | Who to follow and why: link prediction with explanations | Nicola Barbieri, Francesco Bonchi, Giuseppe Manco | In this paper we study link prediction with explanations for user recommendation in social networks. |
132 | Activity-edge centric multi-label classification for mining heterogeneous information networks | Yang Zhou, Ling Liu | In this paper, we present an activity-edge centric multi-label classification framework for analyzing heterogeneous information networks with three unique features. |
133 | Meta-path based multi-network collective link prediction | Jiawei Zhang, Philip S. Yu, Zhi-Hua Zhou | In this paper, we want to predict the formation of social links in multiple partially aligned social networks at the same time, which is formally defined as the multi-network link (formation) prediction problem. |
134 | Fast influence-based coarsening for large networks | Manish Purohit, B. Aditya Prakash, Chanhyun Kang, Yao Zhang, V.S. Subrahmanian | Using extensive experiments on multiple real datasets, we demonstrate the quality and scalability of COARSENET, enabling us to reduce the graph by 90% in some cases without much loss of information. |
135 | Minimizing seed set selection with probabilistic coverage guarantee in a social network | Peng Zhang, Wei Chen, Xiaoming Sun, Yajun Wang, Jialin Zhang | In this paper, we consider the task of selecting initial seed users of a topic with minimum size so that with a guaranteed probability the number of users discussing the topic would reach a given threshold. |
136 | Core decomposition of uncertain graphs | Francesco Bonchi, Francesco Gullo, Andreas Kaltenbrunner, Yana Volkovich | In this paper we provide an analogous tool for uncertain graphs, i.e., graphs whose edges are assigned a probability of existence. |
137 | Learning multifractal structure in large networks | Austin R. Benson, Carlos Riquelme, Sven Schmit | In this paper, we analyze and improve the multifractal network generators (MFNG) introduced by Palla et al. |
138 | Temporal skeletonization on sequential data: patterns, categorization, and visualization | Chuanren Liu, Kai Zhang, Hui Xiong, Geoff Jiang, Qiang Yang | To this end, in this paper, we propose a ‘temporal skeletonization’ approach to proactively reduce the representation of sequences to uncover significant, hidden temporal structures. |
139 | Focused clustering and outlier detection in large attributed graphs | Bryan Perozzi, Leman Akoglu, Patricia Iglesias Sánchez, Emmanuel Müller | In this work, we overcome this limitation and introduce a novel user-oriented approach for mining attributed graphs. |
140 | Inside the atoms: ranking on a network of networks | Jingchao Ni, Hanghang Tong, Wei Fan, Xiang Zhang | In this paper, we propose a new network data model, a Network of Networks (NoN), where each node of the main network itself can be further represented as another (domain-specific) network. |
141 | Community membership identification from small seed sets | Isabel M. Kloumann, Jon M. Kleinberg | We evaluate our methods across multiple domains, using publicly available datasets with labeled, ground-truth communities. |
142 | Community detection in graphs through correlation | Lian Duan, William Nick Street, Yanchi Liu, Haibing Lu | This paper connects modularity-based methods with correlation analysis by subtly reformatting their math formulas and investigates how to fully make use of correlation analysis to change the objective function of modularity-based methods, which provides a more natural and effective way to solve the resolution limit problem. |
143 | Heat kernel based community detection | Kyle Kloster, David F. Gleich | We present the first deterministic, local algorithm to compute this diffusion and use that algorithm to study the communities that it produces. |
144 | On the permanence of vertices in network communities | Tanmoy Chakraborty, Sriram Srinivasan, Niloy Ganguly, Animesh Mukherjee, Sanjukta Bhowmick | In this paper, we demonstrate that, compared to other metrics, permanence (i) provides a more accurate estimate of a derived community structure to the ground-truth community and (ii) is more sensitive to perturbations in the network. |
145 | The interplay between dynamics and networks: centrality, communities, and cheeger inequality | Rumi Ghosh, Shang-hua Teng, Kristina Lerman, Xiaoran Yan | As the first step towards this objective, we introduce an umbrella framework for defining and characterizing an ensemble of dynamic processes on a network. |
146 | Almost linear-time algorithms for adaptive betweenness centrality using hypergraph sketches | Yuichi Yoshida | In this paper, we present a method that directly solves the task, with an almost linear runtime no matter how large the value of k. |
147 | FAST-PPR: scaling personalized pagerank estimation for large graphs | Peter A. Lofgren, Siddhartha Banerjee, Ashish Goel, C. Seshadhri | We propose a new algorithm, FAST-PPR, for computing personalized PageRank: given start node s and target node t in a directed graph, and given a threshold δ, it computes the Personalized PageRank π_s(t) from s to t, guaranteeing that the relative error is small as long as π_s(t) > δ. |
148 | Graph sample and hold: a framework for big-graph analytics | Nesreen K. Ahmed, Nick Duffield, Jennifer Neville, Ramana Kompella | In this paper, we propose a generic stream sampling framework for big-graph analytics, called Graph Sample and Hold (gSH), which samples from massive graphs sequentially in a single pass, one edge at a time, while maintaining a small state in memory. |
149 | Balanced graph edge partition | Florian Bourse, Marc Lelarge, Milan Vojnovic | We report results of an extensive empirical evaluation on a set of real-world graphs, which quantifies the benefits of edge- vs. vertex-partition, and demonstrates efficiency of natural greedy online assignments for the balanced edge-partition problem with and without aggregation. |
150 | Using strong triadic closure to characterize ties in social networks | Stavros Sintos, Panayiotis Tsaparas | In this paper, we use the principle of Strong Triadic Closure to characterize the strength of relationships in social networks. |
151 | Analyzing expert behaviors in collaborative networks | Huan Sun, Mudhakar Srivatsa, Shulong Tan, Yang Li, Lance M. Kaplan, Shu Tao, Xifeng Yan | In this work, we attempt to deduce the cognitive process of task routing, and model the decision making of experts as a generative process where a routing decision is made based on mixed routing patterns. |
152 | Predicting long-term impact of CQA posts: a comprehensive viewpoint | Yuan Yao, Hanghang Tong, Feng Xu, Jian Lu | In this paper, we aim to predict the long-term impact of questions/answers shortly after they are posted in the CQA sites. |
153 | Who are experts specializing in landscape photography?: analyzing topic-specific authority on content sharing services | Bin Bi, Ben Kao, Chang Wan, Junghoo Cho | In this paper, we propose a novel model of Topic-specific Authority Analysis (TAA), which addresses the limitations of the previous approaches, to identify authorities specific to given query topic(s) on a content sharing service. |
154 | Frontiers in E-commerce personalization | Sri Subramaniam | This presentation will give insight into how Groupon manages to grapple with these challenges via a data-driven system in order to delight and surprise customers. |
155 | Predictive modeling in practice: a case study from sprint | Tracy De Poalo, Jeremy Howard | In this talk, Sprint’s Head of Predictive Modeling, Tracey De Poalo, will talk about the process she developed using SAS and logistic regression to build a wide range of models. |
156 | Medicine in the age of electronic health records | Nigam Shah | We will present approaches to identify novel off-label uses of drugs using the patient feature matrix along with prior knowledge about drugs, diseases, and known usage. |
157 | Algorithms for interpretable machine learning | Cynthia Rudin | I will describe several approaches, including an algorithm based on discrete optimization, and an algorithm based on Bayesian analysis. |
158 | Data science through the lens of social science | Drew Conway | In this talk, Drew will examine data science through the lens of the social scientist. |
159 | Information environment security | Rand Waltzman | The purpose of this talk is to help frame a new science of Information Environment Security (IES) whose goal is to create and apply the tools needed to discover and maintain fundamental models of our ever-changing information environment and to defend us in that environment, both as individuals and collectively, against intentional as well as unintentional attempts to deceive, misinform and otherwise manipulate us. |
160 | Big data for social good | Nathan Eagle | After providing an overview of the mobile and social media landscapes in emerging markets, we discuss a system that implements polls & mobile subscription compensation. |
161 | Bringing data science to the speakers of every language | Robert Munro | I will present examples of how natural language processing and distributed human computing are improving the lives of speakers of all the world’s languages, in areas including education, disaster-response, health and access to employment. |
162 | Guilt by association: large scale malware detection by mining file-relation graphs | Acar Tamersoy, Kevin Roundy, Duen Horng Chau | We present AESOP, a scalable algorithm that identifies malicious executable files by applying Aesop’s moral that "a man is known by the company he keeps." |
163 | Mining text snippets for images on the web | Anitha Kannan, Simon Baker, Krishnan Ramnath, Juliet Fiss, Dahua Lin, Lucy Vanderwende, Rizwan Ansary, Ashish Kapoor, Qifa Ke, Matt Uyttendaele, Xin-Jing Wang, Lei Zhang | We propose an algorithm to mine multiple diverse, relevant, and interesting text snippets for images on the web. |
164 | Predicting student risks through longitudinal analysis | Ashay Tamhane, Shajith Ikbal, Bikram Sengupta, Mayuri Duggirala, James Appleton | In this paper, we report on a large-scale study to identify students at risk of not meeting acceptable levels of performance in one state-level and one national standardized assessment in Grade 8 of a major US school district. |
165 | Novel geospatial interpolation analytics for general meteorological measurements | Bingsheng Wang, Jinjun Xiong | We propose a Bayesian compressed sensing based non-parametric statistical model to efficiently perform the spatial interpolation task. |
166 | Targeting direct cash transfers to the extremely poor | Brian Abelson, Kush R. Varshney, Joy Sun | In this work, we streamline an important step in the operations of the NGO by developing and deploying a data-driven system for locating villages with extreme poverty in Kenya and Uganda. |
167 | Scalable hands-free transfer learning for online advertising | Brian Dalessandro, Daizhuo Chen, Troy Raeder, Claudia Perlich, Melinda Han Williams, Foster Provost | This paper presents a combination of strategies, deployed by the online advertising firm Dstillery, for learning many models from extremely high-dimensional data efficiently and without human intervention. |
168 | Correlating events with time series for incident diagnosis | Chen Luo, Jian-Guang Lou, Qingwei Lin, Qiang Fu, Rui Ding, Dongmei Zhang, Zhe Wang | In this paper, we propose an approach to evaluate the correlation between time series data and event data. |
169 | Proactive workflow modeling by stochastic processes with application to healthcare operation and management | Chuanren Liu, Yong Ge, Hui Xiong, Keli Xiao, Wei Geng, Matt Perkins | To that end, in this paper, we provide a focused study of workflow modeling by the integrated analysis of indoor location traces in the hospital environment. |
170 | Activity ranking in LinkedIn feed | Deepak Agarwal, Bee-Chung Chen, Rupesh Gupta, Joshua Hartman, Qi He, Anand Iyer, Sumanth Kolar, Yiming Ma, Pannagadatta Shivaswamy, Ajit Singh, Liang Zhang | In this paper, we report our experience with the problem of ranking activities in the LinkedIn homepage feed. |
171 | Budget pacing for targeted online advertisements at LinkedIn | Deepak Agarwal, Souvik Ghosh, Kai Wei, Siyu You | We describe a method for improving such ad serving systems by including a budget pacing component that serves ads by being aware of global supply patterns. |
172 | Large scale predictive modeling for micro-simulation of 3G air interface load | Dejan Radosavljevik, Peter van der Putten | This paper outlines the approach developed together with the Radio Network Strategy & Design Department of a large European telecom operator in order to forecast the Air-Interface load in their 3G network, which is used for planning network upgrades and budgeting purposes. |
173 | Unveiling clusters of events for alert and incident management in large-scale enterprise IT | Derek Lin, Rashmi Raghu, Vivek Ramamurthy, Jin Yu, Regunathan Radhakrishnan, Joseph Fernandez | We propose a framework to cluster alerts and incident tickets based on the text in them, using unsupervised machine learning. |
174 | Style in the long tail: discovering unique interests with latent variable models in large scale social E-commerce | Diane J. Hu, Rob Hall, Josh Attenberg | In this paper, we describe our methods and experiments for deploying two new style-based recommender systems on the Etsy site. |
175 | Corporate residence fraud detection | Enric Junqué de Fortuny, Marija Stankova, Julie Moeyersoms, Bart Minnaert, Foster Provost, David Martens | This is the first data mining application specifically aimed at finding corporate residence fraud, where we show the predictive value of using both structured and fine-grained invoicing data. |
176 | Modeling mass protest adoption in social network communities using geometric brownian motion | Fang Jin, Rupinder Paul Khandpur, Nathan Self, Edward Dougherty, Sheng Guo, Feng Chen, B. Aditya Prakash, Naren Ramakrishnan | We propose a bispace model to capture propagation in the union of (exclusively) Twitter and non-Twitter environments. |
177 | Shallow semantic parsing of product offering titles | Gabor Melli | We present a case study of a deployed data-driven system that first chunks individual titles into semantically classified sub-segments, and then uses this information to improve a hyperlink insertion service. |
178 | A case study: privacy preserving release of spatio-temporal density in paris | Gergely Acs, Claude Castelluccia | In this paper, we present a new anonymization scheme to release the spatio-temporal density of Paris, in France, i.e., the number of individuals in 989 different areas of the city released every hour over a whole week. |
179 | Scalable near real-time failure localization of data center networks | Herodotos Herodotou, Bolin Ding, Shobana Balakrishnan, Geoff Outhred, Percy Fitter | Our key idea is to use statistical data mining techniques on large-scale active monitoring data to determine a ranked list of suspect causes, which we refine with passive monitoring signals. |
180 | Improving management of aquatic invasions by integrating shipping network, ecological, and environmental data: data mining for social good | Jian Xu, Thanuka L. Wickramarathne, Nitesh V. Chawla, Erin K. Grey, Karsten Steinhaeuser, Reuben P. Keller, John M. Drake, David M. Lodge | We present here an approach for addressing the problem at hand via creative use of computational techniques and multiple data sources, thus illustrating how data mining can be used for solving crucial, yet very complex problems towards social good. |
181 | FoodSIS: a text mining system to improve the state of food safety in singapore | Kiran Kate, Sneha Chaudhari, Andy Prapanca, Jayant Kalagnanam | In this paper, we present FoodSIS, a system for end-to-end web information gathering for food safety. |
182 | A hazard based approach to user return time prediction | Komal Kapoor, Mingxuan Sun, Jaideep Srivastava, Tao Ye | In this work, we address this problem by proposing a new retention metric for web services by concentrating on the rate of user return. |
183 | Predicting employee expertise for talent management in the enterprise | Kush R. Varshney, Vijil Chenthamarakshan, Scott W. Fancher, Jun Wang, Dongping Fang, Aleksandra Mojsilović | In this work, we deploy an analytics-driven solution that infers the expertise of employees through the mining of enterprise and social data that is not specifically generated and collected for expertise inference. |
184 | Applying data mining techniques to address critical process optimization needs in advanced manufacturing | Li Zheng, Chunqiu Zeng, Lei Li, Yexi Jiang, Wei Xue, Jingxuan Li, Chao Shen, Wubai Zhou, Hongtai Li, Liang Tang, Tao Li, Bing Duan, Ming Lei, Pengnian Wang | In this paper, we design, implement and deploy an integrated solution, named PDP-Miner, which is a data analytics platform customized for process optimization in Plasma Display Panel (PDP) manufacturing. |
185 | EARS (earthquake alert and report system): a real time decision support system for earthquake crisis management | Marco Avvenuti, Stefano Cresci, Andrea Marchetti, Carlo Meletti, Maurizio Tesconi | In this work we describe the design, implementation and deployment of a decision support system for the detection and the damage assessment of earthquakes in Italy. |
186 | Knock it off: profiling the online storefronts of counterfeit merchandise | Matthew F. Der, Lawrence K. Saul, Stefan Savage, Geoffrey M. Voelker | Our approach in this paper is to extract features that reveal when Web pages linked to the same affiliate program share a similar underlying structure. |
187 | Up next: retrieval methods for large scale related video suggestion | Michael Bendersky, Lluis Garcia-Pueyo, Jeremiah Harmsen, Vanja Josifovski, Dima Lepikhin | In this paper, we focus on the task of video suggestion, commonly found in many online applications. |
188 | Identifying tourists from public transport commuters | Mingqiang Xue, Huayu Wu, Wei Chen, Wee Siong Ng, Gin Howe Goh | In this joint work with Singapore’s Land Transport Authority (LTA), we innovatively apply machine learning techniques to identify the tourists among public commuters using the public transportation data provided by LTA. |
189 | Spatially embedded co-offence prediction using supervised learning | Mohammad A. Tayebi, Martin Ester, Uwe Glässer, Patricia L. Brantingham | Here we address this important problem by proposing a framework for co-offence prediction using supervised learning. Considering the available information about offenders, we introduce social, geographic, geo-social and similarity feature sets which are used for classifying potential negative and positive pairs of offenders. |
190 | ‘Beating the news’ with EMBERS: forecasting civil unrest using open source indicators | Naren Ramakrishnan, Patrick Butler, Sathappan Muthiah, Nathan Self, Rupinder Khandpur, Parang Saraf, Wei Wang, Jose Cadena, Anil Vullikanti, Gizem Korkmaz, Chris Kuhlman, Achla Marathe, Liang Zhao, Ting Hua, Feng Chen, Chang Tien Lu, Bert Huang, Aravind Srinivasan, Khoa Trinh, Lise Getoor, Graham Katz, Andy Doyle, Chris Ackermann, Ilya Zavorin, Jim Ford, Kristen Summers, Youssef Fayed, Jaime Arredondo, Dipak Gupta, David Mares | We describe the design, implementation, and evaluation of EMBERS, an automated, 24×7 continuous system for forecasting civil unrest across 10 countries of Latin America using open source indicators such as tweets, news sources, blogs, economic indicators, and other data sources. |
191 | LASTA: large scale topic assignment on multiple social networks | Nemanja Spasojevic, Jinyun Yan, Adithya Rao, Prantik Bhattacharyya | In this paper, we present ‘LASTA’ (Large Scale Topic Assignment), a full production system used at Klout, Inc., which mines topical interests from five social networks and assigns over 10,000 topics to hundreds of millions of users on a daily basis. |
192 | New algorithms for parking demand management and a city-scale deployment | Onno Zoeter, Christopher Dance, Stéphane Clinchant, Jean-Marc Andreoli | This paper introduces a novel demand management solution: using data from dedicated occupancy sensors, an iteration scheme updates parking rates to better match demand. |
193 | Reducing gang violence through network influence based targeting of social programs | Paulo Shakarian, Joseph Salmento, William Pulleyblank, John Bertetto | In this paper, we study a variant of the social network maximum influence problem and its application to intelligently approaching individual gang members with incentives to leave a gang. |
194 | Modeling impression discounting in large-scale recommender systems | Pei Lee, Laks V.S. Lakshmanan, Mitul Tiwari, Sam Shah | In this paper, we address modeling impression discounting of recommended items, that is, how to model user’s no-action feedback on impressed recommended items. |
195 | ISIS: a networked-epidemiology based pervasive web app for infectious disease pandemic planning and response | Richard Beckman, Keith R. Bisset, Jiangzhuo Chen, Bryan Lewis, Madhav Marathe, Paula Stretz | We describe ISIS, a high-performance-computing-based application to support computational epidemiology of infectious diseases. |
196 | Seven rules of thumb for web site experimenters | Ron Kohavi, Alex Deng, Roger Longbotham, Ya Xu | Some rules of thumb have previously been stated, such as ‘speed matters,’ but we describe the assumptions in the experimental design and share additional experiments that improved our understanding of where speed matters more: certain areas of the web page are more critical. |
197 | Log-based predictive maintenance | Ruben Sipos, Dmitriy Fradkin, Fabian Moerchen, Zhuang Wang | Log-based predictive maintenance |
198 | Automated hypothesis generation based on mining scientific literature | Scott Spangler, Angela D. Wilkins, Benjamin J. Bachman, Meena Nagarajan, Tajhal Dayaram, Peter Haas, Sam Regenbogen, Curtis R. Pickering, Austin Comer, Jeffrey N. Myers, Ioana Stanoi, Linda Kato, Ana Lelescu, Jacques J. Labrie, Neha Parikh, Andreas Martin Lisewski, Lawrence Donehower, Ying Chen, Olivier Lichtarge | We present an initial case study on KnIT, a prototype system that mines the information contained in the scientific literature, represents it explicitly in a queriable network, and then further reasons upon these data to generate novel and experimentally testable hypotheses. |
199 | A system to grade computer programming skills using machine learning | Shashank Srikant, Varun Aggarwal | In this paper, we present a system to grade computer programs automatically. |
200 | An empirical study of reserve price optimisation in real-time bidding | Shuai Yuan, Jun Wang, Bowei Chen, Peter Mason, Sam Seljan | In this paper, we report the first empirical study and live test of the reserve price optimisation problem in the context of Real-Time Bidding (RTB) display advertising from an operational environment. |
201 | Large-scale high-precision topic modeling on twitter | Shuang-Hong Yang, Alek Kolcz, Andy Schlaikjer, Pankaj Gupta | We present a spectrum of topic modeling techniques that contribute to a deployed system. |
202 | Large scale visual recommendations from street fashion images | Vignesh Jagadeesh, Robinson Piramuthu, Anurag Bhardwaj, Wei Di, Neel Sundaresan | We describe a completely automated large scale visual recommendation system for fashion. |
203 | We know what you want to buy: a demographic-based system for product recommendation on microblogs | Xin Wayne Zhao, Yanwei Guo, Yulan He, Han Jiang, Yuexin Wu, Xiaoming Li | In this paper, we develop a novel product recommender system called METIS, a MErchanT Intelligence recommender System, which detects users’ purchase intents from their microblogs in near real-time and makes product recommendation based on matching the users’ demographic information extracted from their public profiles with product demographics learned from microblogs and online reviews. |
204 | Modeling professional similarity by mining professional career trajectories | Ye Xu, Zang Li, Abhishek Gupta, Ahmet Bugdayci, Anmol Bhasin | Using this professional profile dataset, this paper attempts to model profiles of individuals as a sequence of positions held by them as a time-series of nodes, each of which represents one particular position or job experience in the individual’s career trajectory. |
205 | Filling context-ad vocabulary gaps with click logs | Yukihiro Tagami, Toru Hotta, Yusuke Tanaka, Shingo Ono, Koji Tsukamoto, Akira Tajima | In this work, we propose a translation method that learns the mapping of the contextual information to the textual features of ads by using past click data. |
206 | Does social good justify risking personal privacy? | Raghu Ramakrishnan, Geoffrey I. Webb | How should we approach this trade-off? |
207 | Scaling up deep learning | Yoshua Bengio | The tutorial will introduce some of the basic algorithms, both on the supervised and unsupervised sides, as well as discuss some of the guidelines for successfully using them in practice. |
208 | Constructing and mining web-scale knowledge graphs: KDD 2014 tutorial | Antoine Bordes, Evgeniy Gabrilovich | In this tutorial, we will present the state of the art in constructing, mining, and growing knowledge graphs. |
209 | Bringing structure to text: mining phrases, entities, topics, and hierarchies | Jiawei Han, Chi Wang, Ahmed El-Kishky | In this tutorial, we provide a comprehensive survey on the state of the art of data-driven methods that automatically mine phrases, extract and infer latent structures from a text corpus, and construct multi-granularity topical groupings and hierarchies of the underlying themes. |
210 | Computational epidemiology | Madhav V. Marathe, Anil Kumar S. Vullikanti | In this tutorial, we focus on an approach based on diffusion processes on complex networks. |
211 | Management and analytic of biomedical big data with cloud-based in-memory database and dynamic querying: a hands-on experience with real-world data | Mengling Feng, Mohammad Ghassemi, Thomas Brennan, John Ellenberger, Ishrar Hussain, Roger Mark | In this tutorial, the participants will learn the difference between in-memory DBMS and traditional DBMS through hands-on exercises using SAP’s cloud-based HANA in-memory DBMS in conjunction with the Multi-parameter Intelligent Monitoring in Intensive Care (MIMIC) dataset. |
212 | The recommender problem revisited: morning tutorial | Xavier Amatriain, Bamshad Mobasher | In this tutorial we will describe different components of modern recommender systems such as: personalized ranking, similarity, explanations, context-awareness, or search as recommendation. |
213 | Correlation clustering: from theory to practice | Francesco Bonchi, David Garcia-Soriano, Edo Liberty | The goal of this tutorial is to show how correlation clustering can be a powerful addition to the toolkit of the data mining researcher and practitioner, and to encourage discussions and further research in the area. |
214 | Deep learning | Ruslan Salakhutdinov | The goal of the tutorial is to introduce the recent developments of various deep learning methods to the KDD community. |
215 | Network mining and analysis for social applications | Feida Zhu, Huan Sun, Xifeng Yan | In this tutorial, we aim to examine some recent advances in network mining and analysis for social applications, covering a diverse collection of methodologies and applications from the perspectives of event, relationship, collaboration, and network pattern. |
216 | Sampling for big data: a tutorial | Graham Cormode, Nick Duffield | One response to the proliferation of large datasets has been to develop ingenious ways to throw resources at the problem, using massive fault tolerant storage architectures, parallel and graphical computation models such as MapReduce, Pregel and Giraph. |
217 | Statistically sound pattern discovery | Wilhelmiina Hämäläinen, Geoffrey I. Webb | We present the current state-of-the-art solutions and explore in detail how this approach to pattern discovery can deliver efficient and effective discovery of small sets of interesting patterns. |
218 | Recommendation in social media: recent advances and new frontiers | Jiliang Tang, Jie Tang, Huan Liu | In this tutorial, we aim to provide a comprehensive overview of various recommendation tasks in social media, especially their recent advances and new frontiers. |