Paper Digest: COLT 2019 Highlights
The Annual Conference on Learning Theory (COLT) focuses on theoretical aspects of machine learning and related topics.
To help the community quickly catch up on the work presented at this conference, the Paper Digest Team processed all accepted papers and generated one highlight sentence (typically the main topic) for each paper. Readers are encouraged to read these machine-generated highlights/summaries to quickly grasp the main idea of each paper.
If you do not want to miss any interesting academic paper, you are welcome to sign up for our free daily paper digest service to get updates on new papers published in your area every day. You are also welcome to follow us on Twitter and LinkedIn to receive new conference digests.
Paper Digest Team
team@paperdigest.org
TABLE 1: COLT 2019 Papers
No. | Title | Authors | Highlight |
---|---|---|---|
1 | Conference on Learning Theory 2019: Preface | Alina Beygelzimer, Daniel Hsu | Conference on Learning Theory 2019: Preface |
2 | Inference under Information Constraints: Lower Bounds from Chi-Square Contraction | Jayadev Acharya, Clément L Canonne, Himanshu Tyagi | We propose a unified framework to study such distributed inference problems under local information constraints. |
3 | Learning in Non-convex Games with an Optimization Oracle | Naman Agarwal, Alon Gonen, Elad Hazan | In this paper we show that by slightly strengthening the oracle model, the online and the statistical learning models become computationally equivalent. |
4 | Learning to Prune: Speeding up Repeated Computations | Daniel Alabi, Adam Tauman Kalai, Katrina Ligett, Cameron Musco, Christos Tzamos, Ellen Vitercik | We present an algorithm that learns to maximally prune the search space on repeated computations, thereby reducing runtime while provably outputting the correct solution each period with high probability. |
5 | Towards Testing Monotonicity of Distributions Over General Posets | Maryam Aliakbarpour, Themis Gouleakis, John Peebles, Ronitt Rubinfeld, Anak Yodpinyanee | In this work, we consider the sample complexity required for testing the monotonicity of distributions over partial orders. |
6 | Testing Mixtures of Discrete Distributions | Maryam Aliakbarpour, Ravi Kumar, Ronitt Rubinfeld | In this work, we present a noise model that on one hand is more tractable for the testing problem, and on the other hand represents a rich class of noise families. |
7 | Normal Approximation for Stochastic Gradient Descent via Non-Asymptotic Rates of Martingale CLT | Andreas Anastasiou, Krishnakumar Balasubramanian, Murat A. Erdogdu | We provide non-asymptotic convergence rates of the Polyak-Ruppert averaged stochastic gradient descent (SGD) to a normal random vector for a class of twice-differentiable test functions. |
8 | Adaptively Tracking the Best Bandit Arm with an Unknown Number of Distribution Changes | Peter Auer, Pratik Gajane, Ronald Ortner | For this setting, we propose an algorithm called ADSWITCH and provide performance guarantees for the regret evaluated against the optimal non-stationary policy. |
9 | Achieving Optimal Dynamic Regret for Non-stationary Bandits without Prior Information | Peter Auer, Yifang Chen, Pratik Gajane, Chung-Wei Lee, Haipeng Luo, Ronald Ortner, Chen-Yu Wei | This joint extended abstract introduces and compares the results of (Auer et al., 2019) and (Chen et al., 2019), both of which resolve the problem of achieving optimal dynamic regret for non-stationary bandits without prior information on the non-stationarity. |
10 | A Universal Algorithm for Variational Inequalities Adaptive to Smoothness and Noise | Francis Bach, Kfir Y Levy | We present a universal algorithm for these inequalities based on the Mirror-Prox algorithm. |
11 | Learning Two Layer Rectified Neural Networks in Polynomial Time | Ainesh Bakshi, Rajesh Jayaram, David P Woodruff | We consider the following fundamental problem in the study of neural networks: given input examples $x \in \mathbb{R}^d$ and their vector-valued labels, as defined by an underlying generative neural network, recover the weight matrices of this network. |
12 | Private Center Points and Learning of Halfspaces | Amos Beimel, Shay Moran, Kobbi Nissim, Uri Stemmer | We present a private agnostic learner for halfspaces over an arbitrary finite domain $X\subset \mathbb{R}^d$ with sample complexity $\mathsf{poly}(d,2^{\log^*|X|})$. |
13 | Lower bounds for testing graphical models: colorings and antiferromagnetic Ising models | Ivona Bezáková, Antonio Blanca, Zongchen Chen, Daniel Štefankovič, Eric Vigoda | For the ferromagnetic (attractive) Ising model, Daskalakis et al. (2018) presented a polynomial time algorithm for identity testing. |
14 | Approximate Guarantees for Dictionary Learning | Aditya Bhaskara, Wai Ming Tai | The goal of our work is to understand what can be said in the absence of such assumptions. |
15 | The Optimal Approximation Factor in Density Estimation | Olivier Bousquet, Daniel Kane, Shay Moran | We develop two approaches to achieve the optimal approximation factor of $2$: an adaptive one and a static one. |
16 | Sorted Top-k in Rounds | Mark Braverman, Jieming Mao, Yuval Peres | We consider the sorted top-$k$ problem whose goal is to recover the top-$k$ items with the correct order out of $n$ items using pairwise comparisons. |
17 | Multi-armed Bandit Problems with Strategic Arms | Mark Braverman, Jieming Mao, Jon Schneider, S. Matthew Weinberg | Our goal is to design an algorithm for the principal incentivizing these arms to pass on as much of their private rewards as possible. |
18 | Universality of Computational Lower Bounds for Submatrix Detection | Matthew Brennan, Guy Bresler, Wasim Huleihel | Universality of Computational Lower Bounds for Submatrix Detection |
19 | Optimal Average-Case Reductions to Sparse PCA: From Weak Assumptions to Strong Hardness | Matthew Brennan, Guy Bresler | We give a reduction from $\textsc{pc}$ that yields the first full characterization of the computational barrier in the spiked covariance model, providing tight lower bounds at all sparsities $k$. |
20 | Learning rates for Gaussian mixtures under group action | Victor-Emmanuel Brunel | We provide an algebraic description and a geometric interpretation of these facts. |
21 | Near-optimal method for highly smooth convex optimization | Sébastien Bubeck, Qijia Jiang, Yin Tat Lee, Yuanzhi Li, Aaron Sidford | We propose a near-optimal method for highly smooth convex optimization. |
22 | Improved Path-length Regret Bounds for Bandits | Sébastien Bubeck, Yuanzhi Li, Haipeng Luo, Chen-Yu Wei | We study adaptive regret bounds in terms of the variation of the losses (the so-called path-length bounds) for both multi-armed bandits and, more generally, linear bandits. |
23 | Optimal Learning of Mallows Block Model | Robert Busa-Fekete, Dimitris Fotakis, Balázs Szörényi, Manolis Zampetakis | The Mallows model, introduced in the seminal paper of Mallows (1957), is one of the most fundamental ranking distributions over the symmetric group $S_m$. |
24 | Gaussian Process Optimization with Adaptive Sketching: Scalable and No Regret | Daniele Calandriello, Luigi Carratino, Alessandro Lazaric, Michal Valko, Lorenzo Rosasco | In this paper, we introduce BKB (\textit{budgeted kernelized bandit}), a new approximate GP algorithm for optimization under bandit feedback that achieves near-optimal regret (and hence near-optimal convergence rate) with near-constant per-iteration complexity and remarkably no assumption on the input space or covariance of the GP. |
25 | Disagreement-Based Combinatorial Pure Exploration: Sample Complexity Bounds and an Efficient Algorithm | Tongyi Cao, Akshay Krishnamurthy | We design new algorithms for the combinatorial pure exploration problem in the multi-arm bandit framework. |
26 | A Rank-1 Sketch for Matrix Multiplicative Weights | Yair Carmon, John C Duchi, Aaron Sidford, Kevin Tian | We show that a simple randomized sketch of the matrix multiplicative weight (MMW) update enjoys (in expectation) the same regret bounds as MMW, up to a small constant factor. |
27 | On the Computational Power of Online Gradient Descent | Vaggos Chatziafratis, Tim Roughgarden, Joshua R. Wang | We prove that the evolution of weight vectors in online gradient descent can encode arbitrary polynomial-space computations, even in very simple learning settings. |
28 | Active Regression via Linear-Sample Sparsification | Xue Chen, Eric Price | We present an approach that improves the sample complexity for a variety of curve fitting problems, including active learning for linear regression, polynomial regression, and continuous sparse Fourier transforms. |
29 | A New Algorithm for Non-stationary Contextual Bandits: Efficient, Optimal and Parameter-free | Yifang Chen, Chung-Wei Lee, Haipeng Luo, Chen-Yu Wei | We propose the first contextual bandit algorithm that is parameter-free, efficient, and optimal in terms of dynamic regret. |
30 | Faster Algorithms for High-Dimensional Robust Covariance Estimation | Yu Cheng, Ilias Diakonikolas, Rong Ge, David P. Woodruff | Our main contribution is to develop faster algorithms for this problem whose running time nearly matches that of computing the empirical covariance. |
31 | Testing Symmetric Markov Chains Without Hitting | Yeshwanth Cherapanamjeri, Peter L. Bartlett | In this paper, we propose an algorithm that avoids this dependence on hitting time, thus enabling efficient testing of Markov chains even in cases where it is infeasible to observe every state in the chain. |
32 | Fast Mean Estimation with Sub-Gaussian Rates | Yeshwanth Cherapanamjeri, Nicolas Flammarion, Peter L. Bartlett | We propose an estimator for the mean of a random vector in $\mathbb{R}^d$ that can be computed in time $O(n^{3.5}+n^2d)$ for $n$ i.i.d. samples and that has error bounds matching the sub-Gaussian case. |
33 | Vortices Instead of Equilibria in MinMax Optimization: Chaos and Butterfly Effects of Online Learning in Zero-Sum Games | Yun Kuen Cheung, Georgios Piliouras | We establish that algorithmic experiments in zero-sum games “fail miserably” to confirm the unique, sharp prediction of maxmin equilibration. |
34 | Pure entropic regularization for metrical task systems | Christian Coester, James R. Lee | We show that on every $n$-point HST metric, there is a randomized online algorithm for metrical task systems (MTS) that is $1$-competitive for service costs and $O(\log n)$-competitive for movement costs. |
35 | A near-optimal algorithm for approximating the John Ellipsoid | Michael B. Cohen, Ben Cousins, Yin Tat Lee, Xin Yang | We develop a simple and efficient algorithm for approximating the John Ellipsoid of a symmetric polytope. |
36 | Artificial Constraints and Hints for Unbounded Online Learning | Ashok Cutkosky | We provide algorithms that guarantee regret $R_T(u)\le \tilde O(G\|u\|^3 + G(\|u\|+1)\sqrt{T})$ or $R_T(u)\le \tilde O(G\|u\|^3T^{1/3} + GT^{1/3}+ G\|u\|\sqrt{T})$ for online convex optimization with $G$-Lipschitz losses for any comparison point $u$, without prior knowledge of either $G$ or $\|u\|$. |
37 | Combining Online Learning Guarantees | Ashok Cutkosky | We show how to take any two parameter-free online learning algorithms with different regret guarantees and obtain a single algorithm whose regret is the minimum of the two base algorithms. |
38 | Learning from Weakly Dependent Data under Dobrushin's Condition | Yuval Dagan, Constantinos Daskalakis, Nishanth Dikkala, Siddhartha Jayanti | Statistical learning theory has largely focused on learning and generalization given independent and identically distributed (i.i.d.) samples. |
39 | Space lower bounds for linear prediction in the streaming model | Yuval Dagan, Gil Kur, Ohad Shamir | We show that fundamental learning tasks, such as finding an approximate linear separator or linear regression, require memory at least \emph{quadratic} in the dimension, in a natural streaming setting. |
40 | Computationally and Statistically Efficient Truncated Regression | Constantinos Daskalakis, Themis Gouleakis, Christos Tzamos, Manolis Zampetakis | We provide a computationally and statistically efficient estimator for the classical problem of truncated linear regression, where the dependent variable $y = \vec{w}^{\rm T} \vec{x}+{\varepsilon}$ and its corresponding vector of covariates $\vec{x} \in \mathbb{R}^k$ are only revealed if the dependent variable falls in some subset $S \subseteq \mathbb{R}$; otherwise the existence of the pair $(\vec{x},y)$ is hidden. |
41 | Reconstructing Trees from Traces | Sami Davies, Miklos Z. Racz, Cyrus Rashtchian | We study the problem of learning a node-labeled tree given independent traces from an appropriately defined deletion channel. |
42 | Is your function low dimensional? | Anindya De, Elchanan Mossel, Joe Neeman | In this paper, we study the problem of testing whether a given $n$ variable function $f : \mathbb{R}^n \to \{0,1\}$, is a linear $k$-junta or $\epsilon$-far from all linear $k$-juntas, where the closeness is measured with respect to the Gaussian measure on $\mathbb{R}^n$. |
43 | Computational Limitations in Robust Classification and Win-Win Results | Akshay Degwekar, Preetum Nakkiran, Vinod Vaikuntanathan | In this work, we extend their work in three directions. |
44 | Fast determinantal point processes via distortion-free intermediate sampling | Michal Derezinski | To that end, we propose a new determinantal point process algorithm which has the following two properties, both of which are novel: (1) a preprocessing step which runs in time $O\big(\text{number-of-non-zeros}(\mathbf{X})\cdot\log n\big)+\text{poly}(d)$, and (2) a sampling step which runs in $\text{poly}(d)$ time, independent of the number of rows $n$. |
45 | Minimax experimental design: Bridging the gap between statistical and worst-case approaches to least squares regression | Michal Derezinski, Kenneth L. Clarkson, Michael W. Mahoney, Manfred K. Warmuth | In the process, we develop a new algorithm for a joint sampling distribution called volume sampling, and we propose a new i.i.d. importance sampling method: inverse score sampling. |
46 | Communication and Memory Efficient Testing of Discrete Distributions | Ilias Diakonikolas, Themis Gouleakis, Daniel M. Kane, Sankeerth Rao | In both these models, we provide efficient algorithms for uniformity/identity testing (goodness of fit) and closeness testing (two sample testing). |
47 | Testing Identity of Multidimensional Histograms | Ilias Diakonikolas, Daniel M. Kane, John Peebles | We investigate the problem of identity testing for multidimensional histogram distributions. |
48 | Lower Bounds for Parallel and Randomized Convex Optimization | Jelena Diakonikolas, Cristóbal Guzmán | Prior to our work, lower bounds for parallel convex optimization algorithms were only known in a small fraction of the settings considered in this paper, mainly applying to Euclidean ($\ell_2$) and $\ell_\infty$ spaces. |
49 | On the Performance of Thompson Sampling on Logistic Bandits | Shi Dong, Tengyu Ma, Benjamin Van Roy | We study the logistic bandit, in which rewards are binary with success probability $\exp(\beta a^\top \theta) / (1 + \exp(\beta a^\top \theta))$ and actions $a$ and coefficients $\theta$ are within the $d$-dimensional unit ball. |
50 | Lower Bounds for Locally Private Estimation via Communication Complexity | John Duchi, Ryan Rogers | We develop lower bounds for estimation under local privacy constraints—including differential privacy and its relaxations to approximate or Rényi differential privacy—by showing an equivalence between private estimation and communication-restricted estimation problems. |
51 | Sharp Analysis for Nonconvex SGD Escaping from Saddle Points | Cong Fang, Zhouchen Lin, Tong Zhang | In this paper, we give a sharp analysis for Stochastic Gradient Descent (SGD) and prove that SGD is able to efficiently escape from saddle points and find an $(\epsilon, O(\epsilon^{0.5}))$-approximate second-order stationary point in $\tilde{O}(\epsilon^{-3.5})$ stochastic gradient computations for generic nonconvex optimization problems, when the objective function satisfies gradient-Lipschitz, Hessian-Lipschitz, and dispersive noise assumptions. |
52 | Achieving the Bayes Error Rate in Stochastic Block Model by SDP, Robustly | Yingjie Fei, Yudong Chen | We study the statistical performance of the semidefinite programming (SDP) relaxation approach for clustering under the binary symmetric Stochastic Block Model (SBM). |
53 | High probability generalization bounds for uniformly stable algorithms with nearly optimal rate | Vitaly Feldman, Jan Vondrak | Our proof technique is new and we introduce several analysis tools that might find additional applications. |
54 | Sum-of-squares meets square loss: Fast rates for agnostic tensor completion | Dylan J. Foster, Andrej Risteski | For agnostic learning of third-order tensors with the square loss, we give the first polynomial time algorithm that obtains a “fast” (i.e., $O(1/n)$-type) rate improving over the rate obtained by reduction to matrix completion. |
55 | The Complexity of Making the Gradient Small in Stochastic Convex Optimization | Dylan J. Foster, Ayush Sekhari, Ohad Shamir, Nathan Srebro, Karthik Sridharan, Blake Woodworth | We give nearly matching upper and lower bounds on the oracle complexity of finding $\epsilon$-stationary points ($\|\nabla F(x)\|\leq\epsilon$) in stochastic convex optimization. |
56 | Statistical Learning with a Nuisance Component | Dylan J. Foster, Vasilis Syrgkanis | We provide excess risk guarantees for statistical learning in a setting where the population risk with respect to which we evaluate the target model depends on an unknown model that must be estimated from data (a “nuisance model”). |
57 | On the Regret Minimization of Nonconvex Online Gradient Ascent for Online PCA | Dan Garber | In this paper we focus on the problem of Online Principal Component Analysis in the regret minimization framework. |
58 | Optimal Tensor Methods in Smooth Convex and Uniformly Convex Optimization | Alexander Gasnikov, Pavel Dvurechensky, Eduard Gorbunov, Evgeniya Vorontsova, Daniil Selikhanovych, César A. Uribe | We propose a new tensor method, which closes the gap between the lower $\Omega\left(\epsilon^{-\frac{2}{3p+1}} \right)$ and upper $O\left(\epsilon^{-\frac{1}{p+1}} \right)$ iteration complexity bounds for this class of optimization problems. |
59 | Near Optimal Methods for Minimizing Convex Functions with Lipschitz $p$-th Derivatives | Alexander Gasnikov, Pavel Dvurechensky, Eduard Gorbunov, Evgeniya Vorontsova, Daniil Selikhanovych, César A. Uribe, Bo Jiang, Haoyue Wang, Shuzhong Zhang, Sébastien Bubeck, Qijia Jiang, Yin Tat Lee, Yuanzhi Li, Aaron Sidford | In this merged paper, we consider the problem of minimizing a convex function with Lipschitz-continuous $p$-th order derivatives. |
60 | Stabilized SVRG: Simple Variance Reduction for Nonconvex Optimization | Rong Ge, Zhize Li, Weiyao Wang, Xiang Wang | In this paper, we show that Stabilized SVRG (a simple variant of SVRG) can find an $\epsilon$-second-order stationary point using only $\widetilde{O}(n^{2/3}/\epsilon^2+n/\epsilon^{1.5})$ stochastic gradients. |
61 | Learning Ising Models with Independent Failures | Surbhi Goel, Daniel M. Kane, Adam R. Klivans | We give the first efficient algorithm for learning the structure of an Ising model that tolerates independent failures; that is, each entry of the observed sample is missing with some unknown probability $p$. |
62 | Learning Neural Networks with Two Nonlinear Layers in Polynomial Time | Surbhi Goel, Adam R. Klivans | We give a polynomial-time algorithm for learning neural networks with one layer of sigmoids feeding into any Lipschitz, monotone activation function (e.g., sigmoid or ReLU). |
63 | When can unlabeled data improve the learning rate? | Christina Göpfert, Shai Ben-David, Olivier Bousquet, Sylvain Gelly, Ilya Tolstikhin, Ruth Urner | Our analysis focuses on improvements in the \emph{minimax} learning rate in terms of the number of labeled examples (with the number of unlabeled examples being allowed to depend on the number of labeled ones). |
64 | Sampling and Optimization on Convex Sets in Riemannian Manifolds of Non-Negative Curvature | Navin Goyal, Abhishek Shetty | In this paper, we study sampling and convex optimization problems over manifolds of non-negative curvature proving polynomial running time in the dimension and other relevant parameters. |
65 | Better Algorithms for Stochastic Bandits with Adversarial Corruptions | Anupam Gupta, Tomer Koren, Kunal Talwar | We present a new algorithm for this problem whose regret is nearly optimal, substantially improving upon previous work. |
66 | Tight analyses for non-smooth stochastic gradient descent | Nicholas J. A. Harvey, Christopher Liaw, Yaniv Plan, Sikander Randhawa | We prove that after $T$ steps of stochastic gradient descent, the error of the final iterate is $O(\log(T)/T)$ \emph{with high probability}. |
67 | Reasoning in Bayesian Opinion Exchange Networks Is PSPACE-Hard | Jan Hazla, Ali Jadbabaie, Elchanan Mossel, M. Amin Rahimian | We study the Bayesian model of opinion exchange of fully rational agents arranged on a network. |
68 | How Hard is Robust Mean Estimation? | Samuel B. Hopkins, Jerry Li | In this work we give worst-case complexity-theoretic evidence that improving on the error rates of current polynomial-time algorithms for robust mean estimation may be computationally intractable in natural settings. |
69 | A Robust Spectral Algorithm for Overcomplete Tensor Decomposition | Samuel B. Hopkins, Tselil Schramm, Jonathan Shi | We give a spectral algorithm for decomposing overcomplete order-4 tensors, so long as their components satisfy an algebraic non-degeneracy condition that holds for nearly all (all but an algebraic set of measure $0$) tensors over $(\mathbb{R}^d)^{\otimes 4}$ with rank $n \le d^2$. |
70 | Sample-Optimal Low-Rank Approximation of Distance Matrices | Piotr Indyk, Ali Vakilian, Tal Wagner, David P Woodruff | In this work we study algorithms for low-rank approximation of distance matrices. |
71 | Making the Last Iterate of SGD Information Theoretically Optimal | Prateek Jain, Dheeraj Nagaraj, Praneeth Netrapalli | The main contribution of this work is to design new step size sequences that enjoy information theoretically optimal bounds on the suboptimality of \emph{last point} of SGD as well as GD. |
72 | Accuracy-Memory Tradeoffs and Phase Transitions in Belief Propagation | Vishesh Jain, Frederic Koehler, Jingbo Liu, Elchanan Mossel | We prove a conjecture of Evans, Kenyon, Peres, and Schulman (2000) which states that any bounded memory message passing algorithm is statistically much weaker than Belief Propagation for the reconstruction problem. |
73 | The implicit bias of gradient descent on nonseparable data | Ziwei Ji, Matus Telgarsky | The implicit bias of gradient descent on nonseparable data |
74 | An Optimal High-Order Tensor Method for Convex Optimization | Bo Jiang, Haoyue Wang, Shuzhong Zhang | In this paper, we propose a new high-order tensor algorithm for the general composite case, with the iteration complexity of $O(1/k^{(3d+1)/2})$, which matches the lower bound for $d$-th order methods as established in Nesterov (2018) and Shamir et al. (2018), and hence is optimal. |
75 | Parameter-Free Online Convex Optimization with Sub-Exponential Noise | Kwang-Sung Jun, Francesco Orabona | We consider the problem of unconstrained online convex optimization (OCO) with sub-exponential noise, a strictly more general problem than the standard OCO. |
76 | Sample complexity of partition identification using multi-armed bandits | Sandeep Juneja, Subhashini Krishnasamy | Given a vector of probability distributions, or arms, each of which can be sampled independently, we consider the problem of identifying the partition to which this vector belongs from a finitely partitioned universe of such vectors of distributions. |
77 | Privately Learning High-Dimensional Distributions | Gautam Kamath, Jerry Li, Vikrant Singhal, Jonathan Ullman | We present novel, computationally efficient, and differentially private algorithms for two fundamental high-dimensional learning problems: learning a multivariate Gaussian and learning a product distribution over the Boolean hypercube in total variation distance. |
78 | On Communication Complexity of Classification Problems | Daniel Kane, Roi Livni, Shay Moran, Amir Yehudayoff | This work studies distributed learning in the spirit of Yao’s model of communication complexity: consider a two-party setting, where each of the players gets a list of labelled examples and they communicate in order to jointly perform some learning task. |
79 | Non-asymptotic Analysis of Biased Stochastic Approximation Scheme | Belhal Karimi, Blazej Miasojedow, Eric Moulines, Hoi-To Wai | These restrictions are all essentially relaxed in this work. |
80 | Discrepancy, Coresets, and Sketches in Machine Learning | Zohar Karnin, Edo Liberty | We provide general techniques for bounding the class discrepancy of machine learning problems. |
81 | Bandit Principal Component Analysis | Wojciech Kotlowski, Gergely Neu | Based on the classical observation that this decision-making problem can be lifted to the space of density matrices, we propose an algorithm that is shown to achieve a regret of $O(d^{3/2}\sqrt{T})$ after $T$ rounds in the worst case. |
82 | Contextual bandits with continuous actions: Smoothing, zooming, and adapting | Akshay Krishnamurthy, John Langford, Aleksandrs Slivkins, Chicheng Zhang | We study contextual bandit learning for any competitor policy class and continuous action space. |
83 | Distribution-Dependent Analysis of Gibbs-ERM Principle | Ilja Kuzborskij, Nicolò Cesa-Bianchi, Csaba Szepesvári | In this work we study the excess risk suffered by a Gibbs-ERM learner that uses non-convex, regularized empirical risk with the goal to understand the interplay between the data-generating distribution and learning in large hypothesis spaces. |
84 | Global Convergence of the EM Algorithm for Mixtures of Two Component Linear Regression | Jeongyeol Kwon, Wei Qian, Constantine Caramanis, Yudong Chen, Damek Davis | Our analysis reveals that EM exhibits very different behavior in Mixed Linear Regression from its behavior in Gaussian Mixture Models, and hence our proofs require the development of several new ideas. |
85 | An Information-Theoretic Approach to Minimax Regret in Partial Monitoring | Tor Lattimore, Csaba Szepesvári | We prove a new minimax theorem connecting the worst-case Bayesian regret and minimax regret under finite-action partial monitoring with no assumptions on the space of signals or decisions of the adversary. |
86 | Solving Empirical Risk Minimization in the Current Matrix Multiplication Time | Yin Tat Lee, Zhao Song, Qiuyi Zhang | In this paper, we give an algorithm that runs in time \begin{align*} O^* ( ( n^{\omega} + n^{2.5 - \alpha/2} + n^{2+ 1/6} ) \log (n / \delta) ) \end{align*} where $\omega$ is the exponent of matrix multiplication, $\alpha$ is the dual exponent of matrix multiplication, and $\delta$ is the relative accuracy. |
87 | On Mean Estimation for General Norms with Statistical Queries | Jerry Li, Aleksandar Nikolov, Ilya Razenshteyn, Erik Waingarten | We study the problem of mean estimation for high-dimensional distributions given access to a statistical query oracle. |
88 | Nearly Minimax-Optimal Regret for Linearly Parameterized Bandits | Yingkai Li, Yining Wang, Yuan Zhou | When the problem dimension is $d$, the time horizon is $T$, and there are $n \leq 2^{d/2}$ candidate actions per time period, we (1) show that the minimax expected regret is $\Omega(\sqrt{dT \log T \log n})$ for every algorithm, and (2) introduce a Variable-Confidence-Level (VCL) SupLinUCB algorithm whose regret matches the lower bound up to iterated logarithmic factors. |
89 | Sharp Theoretical Analysis for Nonparametric Testing under Random Projection | Meimei Liu, Zuofeng Shang, Guang Cheng | In this paper, we develop computationally efficient nonparametric testing by employing a random projection strategy. |
90 | Combinatorial Algorithms for Optimal Design | Vivek Madan, Mohit Singh, Uthaipon Tantipongpipat, Weijun Xie | In this paper, we bridge this gap and prove approximation guarantees for the local search algorithms for D-optimal design and A-optimal design problems. |
91 | Nonconvex sampling with the Metropolis-adjusted Langevin algorithm | Oren Mangoubi, Nisheeth K Vishnoi | Our main technical contribution is an analysis of the Metropolis acceptance probability of MALA in terms of its “energy-conservation error,” and a bound for this error in terms of third- and fourth-order regularity conditions. |
92 | Beyond Least-Squares: Fast Rates for Regularized Empirical Risk Minimization through Self-Concordance | Ulysse Marteau-Ferey, Dmitrii Ostrovskii, Francis Bach, Alessandro Rudi | We consider learning methods based on the regularization of a convex empirical risk by a squared Hilbertian norm, a setting that includes linear predictors and non-linear predictors through positive-definite kernels. |
93 | Planting trees in graphs, and finding them back | Laurent Massoulié, Ludovic Stephan, Don Towsley | In this paper we study the two inference problems of detection and reconstruction in the context of planted structures in sparse Erdős-Rényi random graphs $\mathcal G(n,\lambda/n)$ with fixed average degree $\lambda>0$. |
94 | Uniform concentration and symmetrization for weak interactions | Andreas Maurer, Massimiliano Pontil | The method to derive uniform bounds with Gaussian and Rademacher complexities is extended to the case where the sample average is replaced by a nonlinear statistic. |
95 | Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit | Song Mei, Theodor Misiakiewicz, Andrea Montanari | In this paper we establish stronger and more general approximation guarantees. |
96 | Batch-Size Independent Regret Bounds for the Combinatorial Multi-Armed Bandit Problem | Nadav Merlis, Shie Mannor | To overcome this problem, we introduce a new smoothness criterion, which we term \emph{Gini-weighted smoothness}, that takes into account both the nonlinearity of the reward and concentration properties of the arms. |
97 | Lipschitz Adaptivity with Multiple Learning Rates in Online Learning | Zakaria Mhammedi, Wouter M Koolen, Tim Van Erven | In the present work we remove this Lipschitz hyperparameter by designing new versions of MetaGrad and Squint that adapt to its optimal value automatically. |
98 | VC Classes are Adversarially Robustly Learnable, but Only Improperly | Omar Montasser, Steve Hanneke, Nathan Srebro | We study the question of learning an adversarially robust predictor. |
99 | Affine Invariant Covariance Estimation for Heavy-Tailed Distributions | Dmitrii M. Ostrovskii, Alessandro Rudi | In this work we provide an estimator for the covariance matrix of a heavy-tailed multivariate distribution. |
100 | Stochastic Gradient Descent Learns State Equations with Nonlinear Activations | Samet Oymak | We study discrete time dynamical systems governed by the state equation $h_{t+1}=\phi(Ah_t+Bu_t)$. |
101 | A Theory of Selective Prediction | Mingda Qiao, Gregory Valiant | Our goal is to accurately predict the average observation, and we are allowed to choose the window over which the prediction is made: for some $t < n$ and $m \le n - t$, after seeing $t$ observations we predict the average of $x_{t+1}, \ldots, x_{t+m}$. |
102 | Consistency of Interpolation with Laplace Kernels is a High-Dimensional Phenomenon | Alexander Rakhlin, Xiyu Zhai | We show that minimum-norm interpolation in the Reproducing Kernel Hilbert Space corresponding to the Laplace kernel is not consistent if input dimension is constant. |
103 | Classification with unknown class-conditional label noise on non-compact feature spaces | Henry Reeve, Ata Kabán | We investigate the problem of classification in the presence of unknown class-conditional label noise in which the labels observed by the learner have been corrupted with some unknown class dependent probability. |
104 | The All-or-Nothing Phenomenon in Sparse Linear Regression | Galen Reeves, Jiaming Xu, Ilias Zadik | We study the problem of recovering a hidden binary $k$-sparse $p$-dimensional vector $\beta$ from $n$ noisy linear observations $Y=X\beta+W$ where $X_{ij}$ are i.i.d. $\mathcal{N}(0,1)$ and $W_i$ are i.i.d. $\mathcal{N}(0,\sigma^2)$. |
105 | Depth Separations in Neural Networks: What is Actually Being Separated? | Itay Safran, Ronen Eldan, Ohad Shamir | In this paper, we study whether such depth separations might still hold in the natural setting of $\mathcal{O}(1)$-Lipschitz radial functions, when $\epsilon$ does not scale with $d$. |
106 | How do infinite width bounded norm networks look in function space? | Pedro Savarese, Itay Evron, Daniel Soudry, Nathan Srebro | We consider the question of what functions can be captured by ReLU networks with an unbounded number of units (infinite width), but where the overall network Euclidean norm (sum of squares of all weights in the system, except for an unregularized bias term for each unit) is bounded; or equivalently what is the minimal norm required to approximate a given function. |
107 | Exponential Convergence Time of Gradient Descent for One-Dimensional Deep Linear Neural Networks | Ohad Shamir | We study the dynamics of gradient descent on objective functions of the form $f(\prod_{i=1}^{k} w_i)$ (with respect to scalar parameters $w_1,\ldots,w_k$), which arise in the context of training depth-$k$ linear neural networks. |
108 | Learning Linear Dynamical Systems with Semi-Parametric Least Squares | Max Simchowitz, Ross Boczar, Benjamin Recht | We analyze a simple prefiltered variation of the least squares estimator for the problem of estimation with biased, \emph{semi-parametric} noise, an error model studied more broadly in causal statistics and active learning. |
109 | Finite-Time Error Bounds For Linear Stochastic Approximation and TD Learning | R. Srikant, Lei Ying | We consider the dynamics of a linear stochastic approximation algorithm driven by Markovian noise, and derive finite-time bounds on the moments of the error, i.e., deviation of the output of the algorithm from the equilibrium point of an associated ordinary differential equation (ODE). |
110 | Robustness of Spectral Methods for Community Detection | Ludovic Stephan, Laurent Massoulié | In the sparse case, where edge probabilities are in $O(1/n)$, we introduce a new spectral method based on the distance matrix $D^{(\ell)}$, where $D^{(\ell)}_{ij} = 1$ iff the graph distance between $i$ and $j$, denoted $d(i, j)$, is equal to $\ell$. |
111 | Maximum Entropy Distributions: Bit Complexity and Stability | Damian Straszak, Nisheeth K. Vishnoi | Here we show that these questions are related and resolve both of them. |
112 | Adaptive Hard Thresholding for Near-optimal Consistent Robust Regression | Arun Sai Suggala, Kush Bhatia, Pradeep Ravikumar, Prateek Jain | We study the problem of robust linear regression with response variable corruptions. |
113 | Model-based RL in Contextual Decision Processes: PAC bounds and Exponential Improvements over Model-free Approaches | Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford | We design new algorithms for RL with a generic model class and analyze their statistical properties. |
114 | Stochastic first-order methods: non-asymptotic and computer-aided analyses via potential functions | Adrien Taylor, Francis Bach | We provide a novel computer-assisted technique for systematically analyzing first-order methods for optimization. |
115 | The Relative Complexity of Maximum Likelihood Estimation, MAP Estimation, and Sampling | Christopher Tosh, Sanjoy Dasgupta | By way of illustration, we show how hardness results for ML estimation of mixtures of Gaussians and topic models carry over to MAP estimation and approximate sampling under commonly used priors. |
116 | The Gap Between Model-Based and Model-Free Methods on the Linear Quadratic Regulator: An Asymptotic Viewpoint | Stephen Tu, Benjamin Recht | We show that for policy evaluation, a simple model-based plugin method requires asymptotically fewer samples than the classical least-squares temporal difference (LSTD) estimator to reach the same quality of solution; the sample complexity gap between the two methods can be at least a factor of the state dimension. |
117 | Theoretical guarantees for sampling and inference in generative models with latent diffusions | Belinda Tzen, Maxim Raginsky | We introduce and study a class of probabilistic generative models, where the latent object is a finite-dimensional diffusion process on a finite time interval and the observed variable is drawn conditionally on the terminal point of the diffusion. |
118 | Gradient Descent for One-Hidden-Layer Neural Networks: Polynomial Convergence and SQ Lower Bounds | Santosh Vempala, John Wilmes | We study the complexity of training neural network models with one hidden nonlinear activation layer and an output weighted sum layer. |
119 | Estimation of smooth densities in Wasserstein distance | Jonathan Weed, Quentin Berthet | We prove the first minimax rates for estimation of smooth densities for general Wasserstein distances, thereby showing how the curse of dimensionality can be alleviated for sufficiently regular measures. |
120 | Estimating the Mixing Time of Ergodic Markov Chains | Geoffrey Wolfer, Aryeh Kontorovich | Our key insight is to estimate the pseudo-spectral gap instead, which allows us to overcome the loss of self-adjointness and to achieve a polynomial dependence on $d$ and the minimal stationary probability $\pi_\star$. |
121 | Stochastic Approximation of Smooth and Strongly Convex Functions: Beyond the $O(1/T)$ Convergence Rate | Lijun Zhang, Zhi-Hua Zhou | In this paper, we make use of smoothness and strong convexity simultaneously to boost the convergence rate. |
122 | Open Problem: Is Margin Sufficient for Non-Interactive Private Distributed Learning? | Amit Daniely, Vitaly Feldman | Open Problem: Is Margin Sufficient for Non-Interactive Private Distributed Learning? |
123 | Open Problem: How fast can a multiclass test set be overfit? | Vitaly Feldman, Roy Frostig, Moritz Hardt | Open Problem: How fast can a multiclass test set be overfit? |
124 | Open Problem: Do Good Algorithms Necessarily Query Bad Points? | Rong Ge, Prateek Jain, Sham M. Kakade, Rahul Kidambi, Dheeraj M. Nagaraj, Praneeth Netrapalli | Building on these folklore results and some recent developments, this manuscript considers a more subtle question: does any algorithm necessarily (information theoretically) have to query iterates that are sub-optimal infinitely often? |
125 | Open Problem: Risk of Ruin in Multiarmed Bandits | Filipo S. Perotto, Mathieu Bourgais, Bruno C. Silva, Laurent Vercouter | We formalize a particular class of problems called \textit{survival multiarmed bandits} (S-MAB), which constitutes a modified version of \textit{budgeted multiarmed bandits} (B-MAB) where a true \textit{risk of ruin} must be considered, bringing it closer to \textit{risk-averse multiarmed bandits} (RA-MAB). |
126 | Open Problem: Monotonicity of Learning | Tom Viering, Alexander Mey, Marco Loog | We pose the question to what extent a learning algorithm behaves monotonically in the following sense: does it perform better, in expectation, when adding one instance to the training set? |
127 | Open Problem: The Oracle Complexity of Convex Optimization with Limited Memory | Blake Woodworth, Nathan Srebro | We note that known methods achieving the optimal oracle complexity for first order convex optimization require quadratic memory, and ask whether this is necessary, and more broadly seek to characterize the minimax number of first order queries required to optimize a convex Lipschitz function subject to a memory constraint. |