Paper Digest: SIGMOD 2014 Highlights
The ACM SIGMOD International Conference on Management of Data (SIGMOD), organized by the ACM Special Interest Group on Management of Data, is one of the top conferences on database management systems and data management technology.
To help the community quickly catch up on the work presented at this conference, the Paper Digest Team processed all accepted papers and generated one highlight sentence (typically the main topic) for each paper. Readers are encouraged to read these machine-generated highlights/summaries to quickly get the main idea of each paper.
If you do not want to miss any interesting academic paper, you are welcome to sign up for our free daily paper digest service to get updates on new papers published in your area every day. You are also welcome to follow us on Twitter and LinkedIn to stay updated on new conference digests.
Paper Digest Team
team@paperdigest.org
TABLE 1: SIGMOD 2014 Papers
No. | Title | Authors | Highlight |
---|---|---|---|
1 | Edgar F. Codd Innovations Award Talk | Martin Kersten | Edgar F. Codd Innovations Award Talk |
2 | SIGMOD Jim Gray Doctoral Dissertation Award Talk | Aditya Parameswaran | SIGMOD Jim Gray Doctoral Dissertation Award Talk |
3 | SIGMOD Jim Gray Doctoral Dissertation Award Talk | Andy Pavlo | SIGMOD Jim Gray Doctoral Dissertation Award Talk |
4 | How I learned to stop worrying and love compilers | Eric Sedlar | The modern platforms that we want to use to manage our data are far more complex to program efficiently than the machines we used in the past. |
5 | PLANET: making progress with commit processing in unpredictable environments | Gene Pang, Tim Kraska, Michael J. Franklin, Alan Fekete | We propose Predictive Latency-Aware NEtworked Transactions (PLANET), a new transaction programming model and underlying system support to address this issue. |
6 | Lazy evaluation of transactions in database systems | Jose M. Faleiro, Alexander Thomson, Daniel J. Abadi | We introduce a lazy transaction execution engine, in which a transaction may be considered durably completed after only partial execution, while the bulk of its operations (notably all reads from the database and all execution of transaction logic) may be deferred until an arbitrary future time, such as when a user attempts to read some element of the transaction’s write-set, all without modifying the semantics of the transaction or sacrificing ACID guarantees. |
7 | Scalable atomic visibility with RAMP transactions | Peter Bailis, Alan Fekete, Joseph M. Hellerstein, Ali Ghodsi, Ion Stoica | In this work, we identify a new isolation model—Read Atomic (RA) isolation—that matches the requirements of these use cases by ensuring atomic visibility: either all or none of each transaction’s updates are observed by other transactions. |
8 | JECB: a join-extension, code-based approach to OLTP data partitioning | Khai Q. Tran, Jeffrey F. Naughton, Bruhathi Sundarmurthy, Dimitris Tsirogiannis | In this paper, we present a low-overhead data partitioning approach, termed JECB, that can reduce the number of distributed transactions in complex database workloads such as TPC-E. |
9 | HYDRA: large-scale social identity linkage via heterogeneous behavior modeling | Siyuan Liu, Shuhui Wang, Feida Zhu, Jinbo Zhang, Ramayya Krishnan | This paper proposes HYDRA, a solution framework which consists of three key steps: (I) modeling heterogeneous behavior by long-term behavior distribution analysis and multi-resolution temporal information matching; (II) constructing structural consistency graph to measure the high-order structure consistency on users’ core social structures across different platforms; and (III) learning the mapping function by multi-objective optimization composed of both the supervised learning on pair-wise ID linkage information and the cross-platform structure consistency maximization. |
10 | In search of influential event organizers in online social networks | Kaiyu Feng, Gao Cong, Sourav S. Bhowmick, Shuai Ma | Hence, we propose three algorithms to find approximate solutions to the problem. |
11 | Influence maximization: near-optimal time complexity meets practical efficiency | Youze Tang, Xiaokui Xiao, Yanchen Shi | This paper presents TIM, an algorithm that aims to bridge the theory and practice in influence maximization. |
12 | Efficient location-aware influence maximization | Guoliang Li, Shuo Chen, Jianhua Feng, Kian-Lee Tan, Wen-Syan Li | In this paper we study the location-aware influence maximization problem. |
13 | Density-based place clustering in geo-social networks | Jieming Shi, Nikos Mamoulis, Dingming Wu, David W. Cheung | In this paper, we show how the density-based clustering paradigm can be extended to apply on places which are visited by users of a geo-social network. |
14 | Hypersphere dominance: an optimal approach | Cheng Long, Raymond Chi-Wing Wong, Bin Zhang, Min Xie | In this paper, we propose an approach called Hyperbola which is optimal in the sense that it gives neither false positives nor false negatives and runs in linear time wrt the dimensionality. |
15 | Efficient algorithms for optimal location queries in road networks | Zitong Chen, Yubao Liu, Raymond Chi-Wing Wong, Jiamin Xiong, Ganglin Mai, Cheng Long | In this paper, we study the optimal location query problem based on road networks. |
16 | Robust set reconciliation | Di Chen, Christian Konrad, Ke Yi, Wei Yu, Qin Zhang | In this paper, we propose the robust set reconciliation problem, and take a principled approach to address this issue via the earth mover’s distance. |
17 | Storm@twitter | Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, Dmitriy Ryaboy | This paper describes the use of Storm at Twitter. |
18 | Druid: a real-time analytical data store | Fangjin Yang, Eric Tschetter, Xavier Léauté, Nelson Ray, Gian Merlino, Deep Ganguli | In this paper, we describe Druid’s architecture, and detail how it supports fast aggregations, flexible filters, and low latency data ingestion. |
19 | The next generation operational data historian for IoT based on informix | Sheng Huang, Yaoliang Chen, Xiaoyan Chen, Kai Liu, Xiaomin Xu, Chen Wang, Kevin Brown, Inge Halilovic | In this paper, we present the next-generation Operational Data Historian (ODH) system that is based on the IBM© Informix© system architecture. In addition, we present the first benchmark, IoT-X, to evaluate technologies on operational data management for IoT. |
20 | GenBase: a complex analytics genomics benchmark | Rebecca Taft, Manasi Vartak, Nadathur Rajagopalan Satish, Narayanan Sundaram, Samuel Madden, Michael Stonebraker | This paper introduces a new benchmark designed to test database management system (DBMS) performance on a mix of data management tasks (joins, filters, etc.) and complex analytics (regression, singular value decomposition, etc.). Such mixed workloads are prevalent in a number of application areas including most science workloads and web analytics. |
21 | How to stop under-utilization and love multicores | Anastasia Ailamaki, Erietta Liarou, Pinar Tözün, Danica Porobic, Iraklis Psaroudakis | In this tutorial, we shed light on the above three challenges and survey recent proposals to alleviate them. |
22 | AutoPlait: automatic mining of co-evolving time sequences | Yasuko Matsubara, Yasushi Sakurai, Christos Faloutsos | In this paper we present AutoPlait, a fully automatic mining algorithm for co-evolving time sequences. |
23 | Resource-oriented approximation for frequent itemset mining from bursty data streams | Yoshitaka Yamamoto, Koji Iwanuma, Shoshi Fukuda | Thus, we present resource-oriented approximation algorithms that fix an upper bound for memory consumption to tolerate bursty transactions. |
24 | On complexity and optimization of expensive queries in complex event processing | Haopeng Zhang, Yanlei Diao, Neil Immerman | This analysis allows us to identify performance bottlenecks in processing those expensive queries, and provides key insights for us to develop a series of optimizations to mitigate those bottlenecks. |
25 | Complex event analytics: online aggregation of stream sequence patterns | Yingmei Qi, Lei Cao, Medhabi Ray, Elke A. Rundensteiner | In this paper, we demonstrate that CEP aggregation can be pushed into the sequence construction process. |
26 | Towards indexing functions: answering scalar product queries | Arijit Khan, Pouya Yanki, Bojana Dimcheva, Donald Kossmann | We present a lightweight, yet scalable, dynamic, and generalized indexing scheme, called the planar index, for answering scalar product queries in an accurate manner, which is based on the idea of indexing function f(x) for each data point x using multiple sets of parallel hyperplanes. |
27 | LINVIEW: incremental view maintenance for complex analytical queries | Milos Nikolic, Mohammed ElSeidy, Christoph Koch | In this paper, we study the incremental view maintenance problem for such complex analytical queries. |
28 | Materialization optimizations for feature selection workloads | Ce Zhang, Arun Kumar, Christopher Ré | Analytics is one of the biggest topics in data management, and feature selection is widely regarded as the most critical step of analytics; thus, we argue that managing the feature selection process is a pressing data management challenge. |
29 | The analytical bootstrap: a new method for fast error estimation in approximate query processing | Kai Zeng, Shi Gao, Barzan Mozafari, Carlo Zaniolo | In this paper, we introduce a probabilistic relational model for the bootstrap process, along with rigorous semantics and a unified error model, which bridges the gap between these two traditional approaches. |
30 | TriAD: a distributed shared-nothing RDF engine based on asynchronous message passing | Sairam Gurajada, Stephan Seufert, Iris Miliaraki, Martin Theobald | We investigate a new approach to the design of distributed, shared-nothing RDF engines. |
31 | Querying big graphs within bounded resources | Wenfei Fan, Xin Wang, Yinghui Wu | We propose resource-bounded query answering via a dynamic scheme that reduces big G to GQ. |
32 | Natural language question answering over RDF: a graph data driven approach | Lei Zou, Ruizhe Huang, Haixun Wang, Jeffrey Xu Yu, Wenqiang He, Dongyan Zhao | In this paper, we propose a systematic framework to answer natural language questions over RDF repository (RDF Q/A) from a graph data-driven perspective. |
33 | Scalable similarity search for SimRank | Mitsuru Kusumoto, Takanori Maehara, Ken-ichi Kawarabayashi | We propose a very fast and scalable algorithm for this similarity search problem. |
34 | Orca: a modular query optimizer architecture for big data | Mohamed A. Soliman, Lyublena Antova, Venkatesh Raghavan, Amr El-Helw, Zhongxian Gu, Entong Shen, George C. Caragea, Carlos Garcia-Alvarado, Foyzur Rahman, Michalis Petropoulos, Florian Waas, Sivaramakrishnan Narayanan, Konstantinos Krikellas, Rhonda Baldwin | In this paper we present the architecture of Orca, the new query optimizer for all Pivotal data management products, including Pivotal Greenplum Database and Pivotal HAWQ. |
35 | Parallel I/O aware query optimization | Pedram Ghodsnia, Ivan T. Bowman, Anisoara Nica | We characterize the benefit of exploiting I/O parallelism in database scan operators in SAP SQL Anywhere and propose a novel general I/O cost model that considers the impact of device I/O queue depth in I/O cost estimation. |
36 | Exploiting ordered dictionaries to efficiently construct histograms with q-error guarantees in SAP HANA | Guido Moerkotte, David DeHaan, Norman May, Anisoara Nica, Alexander Boehm | In this paper we extend this concept with a threshold, i.e., an estimate or true cardinality θ, below which we do not care about the q-error because we still expect optimal plans. |
37 | Optimizing queries over partitioned tables in MPP systems | Lyublena Antova, Amr El-Helw, Mohamed A. Soliman, Zhongxian Gu, Michalis Petropoulos, Florian Waas | In this paper, we present optimization techniques for queries over partitioned tables as implemented in Pivotal Greenplum Database. |
38 | Parallel data analysis directly on scientific file formats | Spyros Blanas, Kesheng Wu, Surendra Byna, Bin Dong, Arie Shoshani | In this paper, we present the design of a new scientific data analysis system that efficiently processes queries directly over data stored in the HDF5 file format. |
39 | The PH-tree: a space-efficient storage structure and multi-dimensional index | Tilmann Zäschke, Christoph Zimmerli, Moira C. Norrie | We propose the PATRICIA-hypercube-tree, or PH-tree, a multi-dimensional data storage and indexing structure. |
40 | Incremental elasticity for array databases | Jennie Duggan, Michael Stonebraker | In both steps we propose incremental approaches, affecting a minimum set of data and nodes, while maintaining high performance. |
41 | Efficient summarization framework for multi-attribute uncertain data | Jie Xu, Dmitri V. Kalashnikov, Sharad Mehrotra | We propose a framework that models objects as a set of the corresponding information units and reduces the summarization problem to that of optimizing probabilistic coverage. |
42 | Fusing data with correlations | Ravali Pochampally, Anish Das Sarma, Xin Luna Dong, Alexandra Meliou, Divesh Srivastava | In this paper we present novel techniques for modeling correlations between sources and applying them in truth finding. We provide a comprehensive evaluation of our approach on three real-world datasets with different characteristics, as well as on synthetic data, showing that our algorithms outperform the existing state-of-the-art techniques. |
43 | Descriptive and prescriptive data cleaning | Anup Chalamalla, Ihab F. Ilyas, Mourad Ouzzani, Paolo Papotti | In this paper, we propose a system to address this decoupling. |
44 | Towards dependable data repairing with fixing rules | Jiannan Wang, Nan Tang | Towards dependable data repairing with fixing rules |
45 | A sample-and-clean framework for fast and accurate query processing on dirty data | Jiannan Wang, Sanjay Krishnan, Michael J. Franklin, Ken Goldberg, Tim Kraska, Tova Milo | In this paper, we explore an intriguing opportunity. |
46 | Knowing when you’re wrong: building fast and reliable approximate query processing systems | Sameer Agarwal, Henry Milner, Ariel Kleiner, Ameet Talwalkar, Michael Jordan, Samuel Madden, Barzan Mozafari, Ion Stoica | In this paper, we show that it is possible to implement a query approximation pipeline that produces approximate answers and reliable error bars at interactive speeds. |
47 | Discovering queries based on example tuples | Yanyan Shen, Kaushik Chakrabarti, Surajit Chaudhuri, Bolin Ding, Lev Novik | We propose novel algorithms to solve this problem. |
48 | Interactive data exploration using semantic windows | Alexander Kalinin, Ugur Cetintemel, Stan Zdonik | We present a new interactive data exploration approach, called Semantic Windows (SW), in which users query for multidimensional "windows" of interest via standard DBMS-style queries enhanced with exploration constructs. |
49 | Explore-by-example: an automatic query steering framework for interactive data exploration | Kyriaki Dimitriadou, Olga Papaemmanouil, Yanlei Diao | In this paper, we introduce AIDE, an Automatic Interactive Data Exploration framework, that iteratively steers the user towards interesting data areas and predicts a query that retrieves his objects of interest. |
50 | Durable write cache in flash memory SSD for relational and NoSQL databases | Woon-Hak Kang, Sang-Won Lee, Bongki Moon, Yang-Suk Kee, Moonwook Oh | This paper presents a new SSD prototype called DuraSSD equipped with tantalum capacitors. |
51 | Fast database restarts at facebook | Aakash Goel, Bhuwan Chopra, Ciprian Gerea, Dhruv Mátáni, Josh Metzler, Fahim Ul Haq, Janet Wiener | In this paper, we show that using shared memory provides a simple, effective, and fast solution to upgrading servers. |
52 | SpongeFiles: mitigating data skew in mapreduce using distributed memory | Khaled Elmeleegy, Christopher Olston, Benjamin Reed | We introduce SpongeFiles, a novel distributed-memory abstraction tailored to data processing environments like MapReduce. |
53 | Leveraging compression in the tableau data engine | Richard Michael Grantham Wesley, Pawel Terlecki | In this paper, we describe how the Tableau Data Engine (an internally developed column store) leverages a number of compression techniques to improve query performance. |
54 | Fun with hardware transactional memory | Maurice Herlihy | This talk will argue that HTM is not just a faster way of doing the same old latches and monitors. |
55 | CrowdFill: collecting structured data from the crowd | Hyunjung Park, Jennifer Widom | We present CrowdFill, a system for collecting structured data from the crowd. |
56 | OASSIS: query driven crowd mining | Yael Amsterdamer, Susan B. Davidson, Tova Milo, Slava Novgorodov, Amit Somech | In this paper, we explore a novel approach that broadens crowd data sourcing by enabling users to pose general questions, to mine the crowd for potentially relevant data, and to receive concise, relevant answers that represent frequent, significant data patterns. |
57 | Corleone: hands-off crowdsourcing for entity matching | Chaitanya Gokhale, Sanjib Das, AnHai Doan, Jeffrey F. Naughton, Narasimhan Rampalli, Jude Shavlik, Xiaojin Zhu | We describe Corleone, a HOC solution for EM, which uses the crowd in all major steps of the EM process. |
58 | Efficient cohesive subgraphs detection in parallel | Yingxia Shao, Lei Chen, Bin Cui | In this paper, we propose a novel parallel and efficient truss detection algorithm, called PeTa. |
59 | Parallel subgraph listing in a large-scale graph | Yingxia Shao, Bin Cui, Lei Chen, Lin Ma, Junjie Yao, Ning Xu | In this paper, we design a novel parallel subgraph listing framework, named PSgL. |
60 | OPT: a new framework for overlapped and parallel triangulation in large-scale graphs | Jinha Kim, Wook-Shin Han, Sangyeon Lee, Kyungyeol Park, Hwanjo Yu | In this paper, we propose an overlapped and parallel disk-based triangulation framework for billion-scale graphs, OPT, which achieves the ideal cost by (1) full overlap of the CPU and I/O operations and (2) full parallelism of multi-core CPU and FlashSSD I/O. |
61 | Knowledge expansion over probabilistic knowledge bases | Yang Chen, Daisy Zhe Wang | In this paper, we present ProbKB, a probabilistic knowledge base designed to infer missing facts in a scalable, probabilistic, and principled manner using a relational DBMS. |
62 | InsightNotes: summary-based annotation management in relational databases | Dongqing Xiao, Mohamed Y. Eltabakh | In this paper, we address the challenges that arise from the growing scale of annotations in scientific databases. |
63 | A pivotal prefix based filtering algorithm for string similarity search | Dong Deng, Guoliang Li, Jianhua Feng | To address this problem, we propose a novel pivotal prefix filter which significantly reduces the number of signatures. |
64 | Versatile optimization of UDF-heavy data flows with sofa | Astrid Rheinländer, Martin Beckmann, Anja Kunkel, Arvid Heise, Thomas Stoltmann, Ulf Leser | In this demonstration, we present Meteor, a declarative data flow language, and Sofa, a logical optimizer for UDF-heavy data flows, which are both part of the Stratosphere system. |
65 | ERIS live: a NUMA-aware in-memory storage engine for tera-scale multiprocessor systems | Tim Kiefer, Thomas Kissinger, Benjamin Schlegel, Dirk Habich, Daniel Molka, Wolfgang Lehner | In this demonstration, we present ERIS, our NUMA-aware in-memory storage engine. |
66 | Demonstrating efficient query processing in heterogeneous environments | Tomas Karnagel, Matthias Hille, Mario Ludwig, Dirk Habich, Wolfgang Lehner, Max Heimel, Volker Markl | In prior work, we presented a generic hardware-oblivious database system, where the operators can be executed on the main processor as well as on a large number of accelerator architectures. |
67 | One DBMS for all: the brawny few and the wimpy crowd | Tobias Mühlbauer, Wolf Rödiger, Robert Seilbeck, Angelika Reiser, Alfons Kemper, Thomas Neumann | One DBMS for all: the brawny few and the wimpy crowd |
68 | VQA: vertica query analyzer | Alkis Simitsis, Kevin Wilkinson, Jason Blais, Joe Walsh | We demonstrate VQA using TPC-DS queries which have a wide range of query duration and complexity. |
69 | Palette: enabling scalable analytics for big-memory, multicore machines | Fei Chen, Tere Gonzalez, Jun Li, Manish Marwah, Jim Pruyne, Krishnamurthy Viswanathan, Mijung Kim | In this demo, we present Palette, an analytics framework that exploits large memory to trade space for time while also addressing the challenges of multi-threaded, NUMA-aware programming. |
70 | NaLIR: an interactive natural language interface for querying relational databases | Fei Li, Hosagrahar V. Jagadish | In this demo, we present NaLIR, a generic interactive natural language interface for querying relational databases. |
71 | BabbleFlow: a translator for analytic data flow programs | Petar Jovanovic, Alkis Simitsis, Kevin Wilkinson | To address this problem, we present BabbleFlow, a system for enabling flow design at a logical level and automatic translation to physical flows. |
72 | Indexing on modern hardware: hekaton and beyond | Justin Levandoski, David Lomet, Sudipta Sengupta, Adrian Birka, Cristian Diaconu | Recent OLTP support exploits new techniques, running on modern hardware, to achieve unprecedented performance compared with prior approaches. |
73 | CrowdMatcher: crowd-assisted schema matching | Chen Jason Zhang, Ziyuan Zhao, Lei Chen, H. V. Jagadish, Chen Caleb Cao | Thus in this demo, we will show how to utilize the crowd to find the right matching. |
74 | Cloud-based RDF data management | Zoi Kaoudi, Ioana Manolescu | This tutorial presents the challenges faced in order to efficiently handle massive amounts of RDF data in a cloud environment. |
75 | Patience is a virtue: revisiting merge and sort on modern processors | Badrish Chandramouli, Jonathan Goldstein | We revisit the problem of sorting and merging data in main memory, and show that a long-forgotten technique called Patience Sort can, with some key modifications, be made competitive with today’s best comparison-based sorting techniques for both random and almost sorted data. |
76 | Morsel-driven parallelism: a NUMA-aware query evaluation framework for the many-core age | Viktor Leis, Peter Boncz, Alfons Kemper, Thomas Neumann | In response, we present the morsel-driven query execution framework, where scheduling becomes a fine-grained run-time task that is NUMA-aware. |
77 | A comprehensive study of main-memory partitioning and its application to large-scale comparison- and radix-sort | Orestis Polychroniou, Kenneth A. Ross | We revisit the pitfalls of in-cache partitioning, and utilizing the crucial performance factors, we introduce new variants for partitioning out-of-cache. |
78 | An application-specific instruction set for accelerating set-oriented database primitives | Oliver Arnold, Sebastian Haas, Gerhard Fettweis, Benjamin Schlegel, Thomas Kissinger, Wolfgang Lehner | In this paper, we show that the development of a database processor is much more feasible nowadays through the availability of customizable processors. |
79 | Which concepts are worth extracting? | Arash Termehchy, Ali Vakilian, Yodsawalai Chodpathumwan, Marianne Winslett | In this paper, we introduce the problem of cost effective conceptual design, where given a collection, a set of relevant concepts, and a fixed budget, one likes to find a conceptual design that improves the effectiveness of answering queries over the collection the most. |
80 | Querying virtual hierarchies using virtual prefix-based numbers | Curtis E. Dyreson, Sourav S. Bhowmick, Ryan Grapp | In this paper we present a novel strategy to virtually transform the data without instantiating and renumbering. |
81 | NLyze: interactive programming by natural language for spreadsheet data analysis and manipulation | Sumit Gulwani, Mark Marron | This paper describes the design and implementation of a robust natural language based interface to spreadsheet programming. |
82 | Sinew: a SQL system for multi-structured data | Daniel Tahara, Thaddeus Diamond, Daniel J. Abadi | In this paper, we discuss the design of a system that enables developers to continue to represent their data using self-describing formats without moving away from SQL and traditional relational database systems. |
83 | Scalable big graph processing in MapReduce | Lu Qin, Jeffrey Xu Yu, Lijun Chang, Hong Cheng, Chengqi Zhang, Xuemin Lin | In this paper, we study scalable big graph processing in MapReduce. |
84 | Anti-combining for MapReduce | Alper Okcan, Mirek Riedewald | We propose Anti-Combining, a novel optimization for MapReduce programs to decrease the amount of data transferred from mappers to reducers. |
85 | Opportunistic physical design for big data analytics | Jeff LeFevre, Jagan Sankaranarayanan, Hakan Hacigumus, Junichi Tatemura, Neoklis Polyzotis, Michael J. Carey | We present a semantic model for UDFs that enables effective reuse of views containing UDFs along with a rewrite algorithm that provably finds the minimum-cost rewrite under certain assumptions. |
86 | Stratified-sampling over social networks using mapreduce | Roy Levin, Yaron Kanza | In this paper we consider sampling of large-scale, distributed online social networks, and we show how to deal with cases where several surveys are conducted in parallel—in some surveys it may be desired to share individuals to reduce costs, while in other surveys, sharing should be minimized, e.g., to prevent survey fatigue. |
87 | Demonstration of the Myria big data management service | Daniel Halperin, Victor Teixeira de Almeida, Lee Lee Choo, Shumo Chu, Paraschos Koutris, Dominik Moritz, Jennifer Ortiz, Vaspol Ruamviboonsuk, Jingjing Wang, Andrew Whitaker, Shengliang Xu, Magdalena Balazinska, Bill Howe, Dan Suciu | Myria queries are executed on a scalable, parallel cluster that uses both state-of-the-art and novel methods for distributed query processing. |
88 | DataSift: a crowd-powered search toolkit | Aditya Parameswaran, Ming Han Teh, Hector Garcia-Molina, Jennifer Widom | We demonstrate DataSift, a crowd-powered search toolkit that can be instrumented over any corpus supporting a keyword search API, and supports efficient and accurate querying for a rich general class of queries, including those described previously. |
89 | Reactive and proactive sharing across concurrent analytical queries | Iraklis Psaroudakis, Manos Athanassoulis, Matthaios Olma, Anastasia Ailamaki | We show that pull-based sharing for SP eliminates the serialization point imposed by the original push-based approach. |
90 | SLQ: a user-friendly graph querying system | Shengqi Yang, Yanan Xie, Yinghui Wu, Tianyu Wu, Huan Sun, Jian Wu, Xifeng Yan | In this demo, we present SLQ, a user-friendly graph querying system enabling schemaless and structureless graph querying, where a user need not describe queries precisely as required by most databases. |
91 | TAREEG: a MapReduce-based web service for extracting spatial data from OpenStreetMap | Louai Alarabi, Ahmed Eldawy, Rami Alghamdi, Mohamed F. Mokbel | TAREEG employs MapReduce-based techniques to make it efficient and easy to extract OpenStreetMap data in a standard form with minimal effort. |
92 | Searching with XQ: the exemplar query search engine | Davide Mottin, Matteo Lissandrini, Yannis Velegrakis, Themis Palpanas | At the same time, we highlight the technical challenges for this type of query answering and illustrate the implementation approach we have materialized. |
93 | MeanKS: meaningful keyword search in relational databases with complex schema | Mehdi Kargar, Aijun An, Nick Cercone, Parke Godfrey, Jaroslaw Szlichta, Xiaohui Yu | We demonstrate MeanKS, a new system for meaningful keyword search over relational databases. |
94 | H | Nikolaos Papailiou, Dimitrios Tsoumakos, Ioannis Konstantinou, Panagiotis Karras, Nectarios Koziris | In this paper, we present its key scientific contributions and allow participants to interact with an H2RDF+ deployment over a Cloud infrastructure. |
95 | DoomDB: kill the query | Carsten Binnig, Abdallah Salama, Erfan Zamanian | For the demonstration, we present a computer game called DoomDB. |
96 | Should we all be teaching "intro to data science" instead of "intro to databases"? | Bill Howe, Michael J. Franklin, Juliana Freire, James Frew, Tim Kraska, Raghu Ramakrishnan | We consider how to bring these concepts front and center into the emerging wave of Data Science courses, degree programs and even departments. |
97 | Characterizing and selecting fresh data sources | Theodoros Rekatsinas, Xin Luna Dong, Divesh Srivastava | In this paper, we study the problem of source selection considering dynamic data sources whose content changes over time. |
98 | Sloth: being lazy is a virtue (when issuing database queries) | Alvin Cheung, Samuel Madden, Armando Solar-Lezama | In this paper, we present Sloth, a new system that extends traditional lazy evaluation to expose query batching opportunities during application execution, even across loops, branches, and method boundaries. |
99 | Dynamically optimizing queries over large scale data platforms | Konstantinos Karanasos, Andrey Balmin, Marcel Kutsch, Fatma Ozcan, Vuk Ercegovac, Chunyang Xia, Jesse Jackson | In this paper, we propose new techniques that take into account UDFs and correlations between relations for optimizing queries running on large scale clusters. |
100 | A software-defined networking based approach for performance management of analytical queries on distributed data stores | Pengcheng Xiong, Hakan Hacigumus, Jeffrey F. Naughton | More specifically, we present a group of methods to leverage SDN’s visibility into and control of the network’s state that enable distributed query processors to achieve performance improvements and differentiation for analytical queries. |
101 | The pursuit of a good possible world: extracting representative instances of uncertain graphs | Panos Parchas, Francesco Gullo, Dimitris Papadias, Francesco Bonchi | To overcome these problems, we propose algorithms for creating deterministic representative instances of uncertain graphs that maintain the underlying graph properties. |
102 | Navigating the maze of graph analytics frameworks using massive graph datasets | Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Jiwon Seo, Jongsoo Park, M. Amber Hassaan, Shubho Sengupta, Zhaoming Yin, Pradeep Dubey | In this work, we offer a quantitative roadmap for improving the performance of all these frameworks and bridging the "ninja gap". |
103 | Local search of communities in large graphs | Wanyun Cui, Yanghua Xiao, Haixun Wang, Wei Wang | In this paper, we propose a local search strategy, which searches in the neighborhood of a vertex to find the best community for the vertex. |
104 | Mining statistically significant connected subgraphs in vertex labeled graphs | Akhil Arora, Mayank Sachan, Arnab Bhattacharya | In this paper, we address the problem of finding statistically significant connected subgraphs where the nodes of the graph are labeled. |
105 | Complete yet practical search for minimal query reformulations under constraints | Ioana Ileana, Bogdan Cautis, Alin Deutsch, Yannis Katsis | We revisit the Chase&Backchase (C&B) algorithm for query reformulation under constraints, which provides a uniform solution to such particular-case problems as view-based rewriting under constraints, semantic query optimization, and physical access path selection in query optimization. |
106 | Query shredding: efficient relational evaluation of queries over nested multisets | James Cheney, Sam Lindley, Philip Wadler | We present a new approach to query shredding, which converts a query returning nested data to a fixed number of SQL queries. |
107 | Plan bouquets: query processing without selectivity estimation | Anshuman Dutt, Jayant R. Haritsa | We propose here a conceptually new approach to address this problem, wherein the compile-time estimation process is completely eschewed for error-prone selectivities. |
108 | Schema-free SQL | Fei Li, Tianyin Pan, Hosagrahar V. Jagadish | In this paper, we propose a query language, Schema-free SQL, which enables its users to query a relational database using whatever partial schema they know. |
109 | iCheck: computationally combating "lies, d–ned lies, and statistics" | You Wu, Brett Walenz, Peggy Li, Andrew Shim, Emre Sonmez, Pankaj K. Agarwal, Chengkai Li, Jun Yang, Cong Yu | For claims based on structured data, we present a system to automatically assess the quality of claims (beyond their correctness) and counter misleading claims that cherry-pick data to advance their conclusions. |
110 | ABS: a system for scalable approximate queries with accuracy guarantees | Kai Zeng, Shi Gao, Jiaqi Gu, Barzan Mozafari, Carlo Zaniolo | Our recently introduced Analytical Bootstrap method combines the strengths of both approaches and provides the basis for our ABS system, which will be demonstrated at the conference. |
111 | NADEEF/ER: generic and interactive entity resolution | Ahmed Elmagarmid, Ihab F. Ilyas, Mourad Ouzzani, Jorge-Arnulfo Quiané-Ruiz, Nan Tang, Si Yin | NADEEF/ER: generic and interactive entity resolution |
112 | SerpentTI: flexible analytics of users, boards and domains for pinterest | Alex Cheng, Mary Malit, Chuanxi Zhang, Nick Koudas | We provide a description of SerpentTI, a system that currently crawls, indexes and aggregates more than 31 million users, 96 million boards and 3.1 billion pins from Pinterest to enable flexible and deep analytics. |
113 | Interactive redescription mining | Esther Galbrun, Pauli Miettinen | We present Siren, a tool for interactive redescription mining. |
114 | ONTOCUBO: cube-based ontology construction and exploration | Carlos Garcia-Alvarado, Carlos Ordonez | In this paper, we present ONTOCUBO, a novel system based on our research for text summarization using ontologies and automatic extraction of concepts for building ontologies using Online Analytical Processing (OLAP) cubes. |
115 | An extendable framework for managing uncertain spatio-temporal data | Tobias Emrich, Maximilian Franzke, Hans-Peter Kriegel, Johannes Niedermayer, Matthias Renz, Andreas Züfle | This demonstration presents our Uncertain-Spatio-Temporal (UST) framework that we have developed in recent years. |
116 | NewsNetExplorer: automatic construction and exploration of news information networks | Fangbo Tao, George Brova, Jiawei Han, Heng Ji, Chi Wang, Brandon Norick, Ahmed El-Kishky, Jialu Liu, Xiang Ren, Yizhou Sun | Much knowledge can be derived and explored with such an information network if we systematically develop effective and scalable data-intensive information network analysis technologies. Further, we develop a set of news information network exploration and mining mechanisms that explore news in multi-dimensional space, which include (i) OLAP-based operations on the hierarchical dimensional and topical structures and rich-text, such as cell summary, single dimension analysis, and promotion analysis, (ii) a set of network-based operations, such as similarity search and ranking-based clustering, and (iii) a set of hybrid operations or network-OLAP operations, such as entity ranking at different granularity levels. |
117 | IQR: an interactive query relaxation system for the empty-answer problem | Davide Mottin, Alice Marascu, Senjuti Basu Roy, Gautam Das, Themis Palpanas, Yannis Velegrakis | We present IQR, a system that demonstrates optimization based interactive relaxations for queries that return an empty answer. |
118 | OceanRT: real-time analytics over large temporal data | Shiming Zhang, Yin Yang, Wei Fan, Liang Lan, Mingxuan Yuan | We demonstrate OceanRT, a novel cloud-based infrastructure that performs online analytics in real time, over large-scale temporal data such as call logs from a telecommunication company. |
119 | H2O: a hands-free adaptive store | Ioannis Alagiannis, Stratos Idreos, Anastasia Ailamaki | In this paper, we present the H2O system which introduces two novel concepts. |
120 | Fine-grained partitioning for aggressive data skipping | Liwen Sun, Michael J. Franklin, Sanjay Krishnan, Reynold S. Xin | In this paper, we propose a fine-grained blocking technique that reorganizes the data tuples into blocks with a goal of enabling queries to skip blocks aggressively. |
121 | DSH: data sensitive hashing for high-dimensional k-nn search | Jinyang Gao, Hosagrahar Visvesvaraya Jagadish, Wei Lu, Beng Chin Ooi | In this paper, we propose a new and efficient method called Data Sensitive Hashing (DSH) to address this drawback. |
122 | Fast and unified local search for random walk based k-nearest-neighbor query in large graphs | Yubao Wu, Ruoming Jin, Xiang Zhang | In this paper, we present FLoS (Fast Local Search), a unified local search method for efficient and exact top-k proximity query in large graphs. |
123 | Global immutable region computation | Jilian Zhang, Kyriakos Mouratidis, HweeHwa Pang | In this paper we propose an auxiliary feature to standard top-k query processing. |
124 | Answering top-k representative queries on graph databases | Sayan Ranu, Minh Hoang, Ambuj Singh | In this paper, we solve the problem of top-k representative queries on graph databases. |
125 | Modeling entity evolution for temporal record matching | Yueh-Hsuan Chiang, AnHai Doan, Jeffrey F. Naughton | In our work, we propose and evaluate a more detailed model that focuses on the probability that a given attribute value reappears over time. |
126 | Resolving conflicts in heterogeneous data by truth discovery and source reliability estimation | Qi Li, Yaliang Li, Jing Gao, Bo Zhao, Wei Fan, Jiawei Han | In this paper, we propose to resolve conflicts among multiple sources of heterogeneous data types. |
127 | A probabilistic model for linking named entities in web text with heterogeneous information networks | Wei Shen, Jiawei Han, Jianyong Wang | We propose an effective iterative approach to automatically learning the weights for each meta-path based on the expectation-maximization (EM) algorithm without requiring any training data. |
128 | Matching heterogeneous event data | Xiaochen Zhu, Shaoxu Song, Xiang Lian, Jianmin Wang, Lei Zou | We prove the convergence of iterative similarity computation, and propose several pruning and estimation methods. |
129 | HAWQ: a massively parallel processing SQL engine in hadoop | Lei Chang, Zhanwei Wang, Tao Ma, Lirong Jian, Lili Ma, Alon Goldshuv, Luke Lonergan, Jeffrey Cohen, Caleb Welton, Gavin Sherry, Milind Bhandarkar | This paper presents the novel design of HAWQ, including query processing, the scalable software interconnect based on UDP protocol, transaction management, fault tolerance, read optimized storage, the extensible framework for supporting various popular Hadoop based data stores and formats, and various optimization choices we considered to enhance the query performance. |
130 | Major technical advancements in apache hive | Yin Huai, Ashutosh Chauhan, Alan Gates, Gunther Hagleitner, Eric N. Hanson, Owen O’Malley, Jitendra Pandey, Yuan Yuan, Rubao Lee, Xiaodong Zhang | In this paper, we present a community-based effort on technical advancements in Hive. |
131 | JSON data management: supporting schema-less development in RDBMS | Zhen Hua Liu, Beda Hammerschmidt, Doug McMahon | In this paper, we analyze the way in which requirements differ between management of relational data and management of JSON data. |
132 | Querying encrypted data | Arvind Arasu, Ken Eguro, Raghav Kaushik, Ravishankar Ramamurthy | We cover both classic client-server approaches and approaches that use a trusted hardware module where data can be securely decrypted. |
133 | Towards unified ad-hoc data processing | Xiaogang Shi, Bin Cui, Gillian Dobbie, Beng Chin Ooi | In this paper, we present UniAD, a system designed to simplify the programming of data processing tasks and provide efficient execution for user programs. |
134 | Partial results in database systems | Willis Lang, Rimma V. Nehme, Eric Robinson, Jeffrey F. Naughton | We explore ways to characterize and classify these partial results, and describe an analytical framework that allows the system to perform coarse to fine-grained analysis to determine the semantics of a partial result. |
135 | Parallel in-situ data processing with speculative loading | Yu Cheng, Florin Rusu | In this paper, we propose SCANRAW, a novel database physical operator for in-situ processing over raw files that integrates data loading and external tables seamlessly while preserving their advantages: optimal performance across a query workload and zero time-to-query. |
136 | Approximation schemes for many-objective query optimization | Immanuel Trummer, Christoph Koch | This is why we propose several approximation schemes for MOQO that generate guaranteed near-optimal plans in seconds where exhaustive optimization takes hours. |
137 | Querying k-truss community in large and dynamic graphs | Xin Huang, Hong Cheng, Lu Qin, Wentao Tian, Jeffrey Xu Yu | We propose a novel community model based on the k-truss concept, which brings nice structural and computational properties. |
138 | Reachability queries on large dynamic graphs: a total order approach | Andy Diwen Zhu, Wenqing Lin, Sibo Wang, Xiaokui Xiao | To address this deficiency, this paper presents a novel study on reachability indices for large dynamic graphs. |
139 | EAGr: supporting continuous ego-centric aggregate queries over large dynamic graphs | Jayanta Mondal, Amol Deshpande | In this paper, we present EAGr, a system for supporting large numbers of continuous neighborhood-based ("ego-centric") aggregate queries over large, highly dynamic, rapidly evolving graphs. |
140 | Localizing anomalous changes in time-evolving graphs | Kumar Sricharan, Kamalika Das | In this paper, we use the term ‘localization’ to refer to the problem of identifying abnormal changes in node relationships (edges) that cause anomalous changes in graph structure. |
141 | Online optimization and fair costing for dynamic data sharing in a cloud data market | Ziyang Liu, Hakan Hacigümüs | In this paper, we study a data market framework that enables the sale or sharing of dynamic data, where each sharing is specified by an ad-hoc query. We propose an intuitive online algorithm for sharing plan selection, as well as a set of fair costing criteria and an algorithm that maximizes the fairness. |
142 | A comparison of platforms for implementing and running very large scale machine learning algorithms | Zhuhua Cai, Zekai J. Gao, Shangyu Luo, Luis L. Perez, Zografoula Vagena, Christopher Jermaine | We describe an extensive benchmark of platforms available to a user who wants to run a machine learning (ML) inference algorithm over a very large data set, but cannot find an existing implementation and thus must "roll her own" ML code. |
143 | Re-evaluating designs for multi-tenant OLTP workloads on SSD-based I/O subsystems | Ning Zhang, Junichi Tatemura, Jignesh Patel, Hakan Hacigumus | In this paper, we compare three designs using both open-source and proprietary DBMSs on SSD-based I/O subsystems. |
144 | Secure query processing with data interoperability in a cloud database environment | Wai Kit Wong, Ben Kao, David Wai Lok Cheung, Rongbin Li, Siu Ming Yiu | We propose and analyze a secure query processing system (SDB) on relational tables and a set of elementary operators on encrypted data that allow data interoperability, which allows a wide range of SQL queries to be processed by the SP on encrypted information. |
145 | Are we experiencing a big data bubble? | Fatma Özcan, Nesime Tatbul, Daniel J. Abadi, Marcel Kornacker, C. Mohan, Karthik Ramasamy, Janet Wiener | Are we experiencing a big data bubble? |
146 | Mining latent entity structures from massive unstructured and interconnected data | Jiawei Han, Chi Wang | In this tutorial, we summarize the closely related literature in database systems, data mining, Web, information extraction, information retrieval, and natural language processing, overview a spectrum of data-driven methods that extract and infer such latent structures, from an interdisciplinary point of view, and demonstrate how these structures support entity discovery and management, data understanding, and some new database applications. |
147 | Explainable security for relational databases | Gabriel Bender, Lucja Kot, Johannes Gehrke | To encourage developers and administrators to use security mechanisms more effectively, we propose a novel security model in which all security decisions are formally explainable. |
148 | PrivBayes: private data release via bayesian networks | Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, Xiaokui Xiao | To address the deficiency of the existing methods, this paper presents PrivBayes, a differentially private method for releasing high-dimensional data. |
149 | PriView: practical differentially private release of marginal contingency tables | Wahbeh Qardaji, Weining Yang, Ninghui Li | We consider the problem of publishing a differentially private synopsis of a d-dimensional dataset so that one can reconstruct any k-way marginal contingency tables from the synopsis. |
150 | Blowfish privacy: tuning privacy-utility trade-offs using policies | Xi He, Ashwin Machanavajjhala, Bolin Ding | In this paper, we present Blowfish, a class of privacy definitions inspired by the Pufferfish framework, that provides a rich interface for this trade-off. |
151 | Overlap interval partition join | Anton Dignös, Michael H. Böhlen, Johann Gamper | We propose Overlap Interval Partitioning (OIP), a new partitioning approach for data with an interval. |
152 | Similarity joins for uncertain strings | Manish Patil, Rahul Shah | We propose various filtering techniques that give upper and (or) lower bound on Pr(ed(R,S) ≤ k) without instantiating possible worlds for either of the strings. |
153 | Track join: distributed joins with minimal network traffic | Orestis Polychroniou, Rajkumar Sen, Kenneth A. Ross | We introduce track join, a novel distributed join algorithm that minimizes network traffic by generating an optimal transfer schedule for each distinct join key. |
154 | On-the-fly token similarity joins in relational databases | Nikolaus Augsten, Armando Miraglia, Thomas Neumann, Alfons Kemper | Our goal is to efficiently compute token similarity joins on-the-fly, i.e., without any precomputed tokens or indexes. |
155 | Tracking set correlations at large scale | Foteini Alvanaki, Sebastian Michel | In this work, we consider the continuous computation of correlations between co-occurring tags that appear in messages published in social media streams. |
156 | Aggregate estimation over a microblog platform | Saravanan Thirumuruganathan, Nan Zhang, Vagelis Hristidis, Gautam Das | In this paper, we consider a novel problem of estimating aggregate queries over microblogs, e.g., "how many users mentioned the word ‘privacy’ in 2013?" |
157 | Tripartite graph clustering for dynamic sentiment analysis on social media | Linhong Zhu, Aram Galstyan, James Cheng, Kristina Lerman | In this work, we propose an unsupervised tri-clustering framework, which analyzes both user-level and tweet-level sentiments through co-clustering of a tripartite graph. |
158 | A temporal context-aware model for user behavior modeling in social media systems | Hongzhi Yin, Bin Cui, Ling Chen, Zhiting Hu, Zi Huang | This paper focuses on analyzing user behaviors in social media systems and designing a latent class statistical mixture model, named temporal context-aware mixture model (TCAM), to account for the intentions and preferences behind user behaviors. |
159 | Indexing for interactive exploration of big data series | Kostas Zoumpatianos, Stratos Idreos, Themis Palpanas | In this paper, we present the first adaptive indexing mechanism, specifically tailored to solve the problem of indexing and querying very large data series collections. |
160 | Histograms as a side effect of data movement for big data | Zsolt Istvan, Louis Woods, Gustavo Alonso | In this paper, we show how to calculate statistics as a side effect of data movement within a DBMS using a hardware accelerator in the data path. |
161 | A formal approach to finding explanations for database queries | Sudeepa Roy, Dan Suciu | In this paper we introduce a principled approach to provide explanations for answers to SQL queries based on intervention: removal of tuples from the database that significantly affect the query answers. |
162 | MISO: souping up big data query processing with a multistore system | Jeff LeFevre, Jagan Sankaranarayanan, Hakan Hacigumus, Junichi Tatemura, Neoklis Polyzotis, Michael J. Carey | In this work, we provide what we believe to be the first method to tune the physical design of a multistore system, by focusing on which store to place data. |
163 | Efficient top-K SimRank-based similarity join | Wenbo Tao, Guoliang Li | In this paper we study the problem of top-k SimRank-based similarity join, which finds k pairs of nodes with the largest SimRank values. |
164 | Multi-dimensional data statistics for columnar in-memory databases | Curtis Kroetsch | The research presented here studies the multi-dimensional data statistics in the context of columnar in-memory database systems. |
165 | A user interaction based community detection algorithm for online social networks | Himel Dev | To alleviate the limitations of existing approaches, we propose a novel solution of community detection in OSNs. |
166 | EDS: a segment-based distance measure for sub-trajectory similarity search | Min Xie | In this paper, we study a sub-trajectory similarity search problem which returns for a query trajectory some trajectories from the trajectory database each of which contains a sub-trajectory similar to the query trajectory. |
167 | Spatio-temporal visual analysis for event-specific tweets | Mashaal Musleh | In this poster, we present our on-going work on this module and discuss three of its use cases. |
168 | PackageBuilder: querying for packages of tuples | Kevin Fernandes, Matteo Brucato, Rahul Ramakrishna, Azza Abouzied, Alexandra Meliou | PackageBuilder introduces simple extensions to the SQL language to support package-level predicates, and includes a simple interface that allows users to load datasets and interactively specify package queries. |
169 | Privacy preserving social graphs for high precision community detection | Himel Dev | To resolve this issue, we address the problem of privacy preserving community detection in social networks. |