# Time-Data Trade-Offs in Resource-Constrained Information and Inference Systems

## Periodic Reporting for period 1 - time-data (Time-Data Trade-Offs in Resource-Constrained Information and Inference Systems)

Reporting period: 2017-09-01 to 2019-02-28

Machine learning systems face a deluge of data in scientific and industrial applications under the promise of potentially huge technological and societal benefits. Massive data, however, presents a fundamental challenge to the back-end learning algorithms, which is captured by the following computational dogma: Running time of a learning algorithm increases with the size of the data. Since the available computational power is growing slowly relative to data sizes, large-scale problems of practical interest require increasingly more time to solve.

Our recent research [1,2] has led us to mathematically demonstrate that this dogma is false in general, and supports an emerging perspective: Data should be treated as a resource that can be traded off with other resources, such as running time. For data acquisition and communications, we have now shown related sampling, power, latency, and circuit area trade-offs in hardware [3-5]. A detailed understanding of time-data and other analogous trade-offs, however, requires our interdisciplinary approach, which is taken by our ERC project (time-data).

Our goal of systematically understanding and expanding on this emerging perspective is ambitious, but promises potentially huge impacts within and beyond the data sciences. To this end, we pursue three closely interrelated research objectives:

1. Fundamental trade-offs in convex optimization: This thrust proposes scalable and universal convex optimization algorithms that not only rigorously trade off their convergence rates with their per-iteration time for these templates, but also match the theoretical lower bounds on their runtime efficiency.

2. Theory and methods for information and computation trade-offs: This thrust rethinks how we formulate learning problems by providing a new hierarchy of estimators and dimensionality reduction techniques for non-linear models, and characterizing their sample complexities and computational trade-offs.

3. Time-data trade-offs in scientific discovery: This thrust demonstrates our rigorous theory by applying it to real massive and complex data problems, such as super-resolved fluorescence microscopy to understand how cells respond to diseases, materials science in automating discovery, and neuroscience to develop energy efficient neural interfaces for bypassing spinal cord injuries.

[1] J. Bruer, J. Tropp, V. Cevher, and S. Becker, “Time-data tradeoffs by aggressive smoothing,” Neural Information Processing Systems (NIPS), Montreal, Quebec, Canada, 2014 [19 pages with supplementary material, 24.7% acceptance rate, http://infoscience.epfl.ch/record/202682].

[2] J. Bruer, J. Tropp, V. Cevher, and S. Becker, “Designing Statistical Estimators That Balance Sample Size, Risk, and Computational Cost,” Journal of Selected Topics in Signal Processing, vol. 9, no. 4, pp. 612–624, June 2015 [14 pages, http://infoscience.epfl.ch/record/204988].

[3] C. Aprile, A. Cevrero, P. A. Francese, C. Menolfi, M. Braendli, M. Kossel, T. Morf, L. Kull, I. Oezkaya, Y. Leblebici, V. Cevher, and T. Toifl, “An Eight lanes 7Gb/s/pin Source Synchronous Single-Ended RX with Equalization and Far-End Crosstalk Cancellation for Backplane Channels,” IEEE Journal of Solid State Circuits, 2018 [12 pages, https://infoscience.epfl.ch/record/233712/files/08246724.pdf].

[4] C. Aprile, K. Ture, L. Baldassarre, M. Shoaran, G. Yilmaz, F. Maloberti, C. Dehollain, Y. Leblebici, and V. Cevher, “Adaptive Learning-Based Compressive Sampling for Low-power Wireless Implants,” IEEE Transactions on Circuits and Systems-I, 2018 [13 pages, https://infoscience.epfl.ch/record/257098].

[5] C. Aprile, J. Wuthrich, L. Baldassarre, Y. Leblebici, and V. Cevher, “An area and power efficient on-the-fly LBCS transformation for implantable neuronal signal acquisition systems,” ACM International Conference on Computing Frontiers ’2018, Ischia, Italy, May 2018 [4 pages, https://infoscience.epfl.ch/record/257097].
To describe our project achievements, we will still use the original proposal outline below and relate the results to the individual work packages. Our full set of publications can be found at http://lions.epfl.ch/publications.

Executive summary: We have closely followed the research agenda described in the proposal. Our overall results, however, have gone well beyond the goals anticipated in the proposal, with important implications for data science and computation. Along with the ERC team, the PI has complemented the project with personnel funded from his own resources (in particular, Mr. Ya-Ping Hsieh, Mr. Alp Yurtsever, and Mr. Ilija Bogunovic, who are funded not by any other agency but by the PI's own running budget at EPFL). However, moving into the second phase, the computational needs of the project may require diverting some funds from personnel costs towards the purchase of a cluster.

====
Objective 1: Fundamental trade-offs in convex optimization
====
We have made significant theoretical and algorithmic progress in convex analysis and optimization, which has direct computational implications in machine learning and artificial intelligence. This claim is backed up by several key publications at premier conferences in different disciplines, such as the International Conference on Machine Learning (ICML) and Neural Information Processing Systems (NeurIPS), as well as in top journals, such as the Journal of Optimization Theory and IEEE Transactions on Signal Processing. Along the way, we resolved an open problem in game theory by developing a simple algorithm that reaches the Nash equilibrium while simultaneously obtaining optimal regret rates.

------
WP1
------
We made a strong start on this work package, which aims to tackle optimization problems in an extremely scalable fashion. Indeed, optimization formulations naturally arise in high-dimensional statistical estimation, for which we develop scalable optimization methods that exploit dimensionality reduction, adaptivity, and stochastic approximation at their core to overcome bottlenecks in the machine learning pipeline.

As promised in Task 1, we studied the fundamental trade-offs in primal-dual optimization, resulting in new algorithms that achieve unparalleled scaling:

A. Alacaoglu, Q. Tran Dinh, O. Fercoq, and V. Cevher, “Smooth Primal-Dual Coordinate Descent Algorithms for Nonsmooth Convex Optimization,” Neural Information Processing Systems (NIPS) ’2017, Long Beach, CA, USA, December 2017 [9 pages, 20.9% acceptance rate, https://infoscience.epfl.ch/record/232391/files/SMOOTH-CD_MAIN.pdf].

Our analysis relies on a novel combination of four ideas applied--in a non-trivial fashion--to the primal-dual gap function: smoothing, acceleration, homotopy, and coordinate descent with non-uniform sampling. As a result, our method features the first convergence rate guarantees among the coordinate descent methods.

Our new primal-dual techniques found a valuable application in game theory, which is perhaps not surprising in retrospect. For instance, simple zero-sum games have been studied extensively, often from the standpoint of analyzing the convergence to the Nash equilibrium. At the equilibrium, the players employ a min-max pair of strategies where no player can improve their pay-off by a unilateral deviation. When the behavior of each player is explained by a no-regret algorithm, it is possible to significantly improve convergence rates beyond the so-called black-box, adversarial dynamics. This observation was first made by Daskalakis et al. in 2011, which left open the question of whether there exists a simple algorithm that converges at optimal rates for both regret and the value of the game in an uncoupled manner, both against honest (i.e., cooperative) and dishonest (i.e., arbitrarily adversarial) behavior.
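To make the no-regret setting above concrete, the following is a minimal sketch of two players running plain Hedge (multiplicative weights) against each other in a small zero-sum matrix game. This is the classical baseline dynamics with an O(1/sqrt(T)) rate for the averaged strategies, not the algorithm developed in the project; the payoff matrix, the function names, and the step-size choice are illustrative assumptions.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    top = max(z)
    w = [math.exp(v - top) for v in z]
    s = sum(w)
    return [v / s for v in w]


def hedge_self_play(A, rounds=20000):
    """Self-play with Hedge in the game x^T A y.

    The row player maximizes and the column player minimizes.
    Returns the time-averaged strategies of both players.
    """
    m, n = len(A), len(A[0])
    span = max(max(r) for r in A) - min(min(r) for r in A)
    # Standard fixed Hedge step sizes for payoffs with range `span`.
    eta_x = math.sqrt(8 * math.log(m) / rounds) / span
    eta_y = math.sqrt(8 * math.log(n) / rounds) / span
    gain = [0.0] * m  # cumulative payoff of each row action
    loss = [0.0] * n  # cumulative loss of each column action
    x_avg, y_avg = [0.0] * m, [0.0] * n
    for _ in range(rounds):
        x = softmax([eta_x * g for g in gain])
        y = softmax([-eta_y * l for l in loss])
        for i in range(m):
            gain[i] += sum(A[i][j] * y[j] for j in range(n))
        for j in range(n):
            loss[j] += sum(A[i][j] * x[i] for i in range(m))
        x_avg = [a + xi / rounds for a, xi in zip(x_avg, x)]
        y_avg = [a + yj / rounds for a, yj in zip(y_avg, y)]
    return x_avg, y_avg


def duality_gap(A, x, y):
    """Best-response gain of the row player minus that of the column player."""
    m, n = len(A), len(A[0])
    best_row = max(sum(A[i][j] * y[j] for j in range(n)) for i in range(m))
    best_col = min(sum(A[i][j] * x[i] for i in range(m)) for j in range(n))
    return best_row - best_col


# Hypothetical 2x2 game whose unique equilibrium is x = y = (0.4, 0.6).
A = [[2.0, -1.0], [-1.0, 1.0]]
x_avg, y_avg = hedge_self_play(A)
gap = duality_gap(A, x_avg, y_avg)
```

The duality gap of the averaged strategies equals the sum of the two players' average regrets, which is why driving regret down faster translates directly into faster convergence to the value of the game.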

Along with Master's project and intern students, the PI and his PhD student Ya-Ping Hsieh resolved this open problem.
In addition to resolving some open problems, our results went well beyond the state of the art. Below, I provide context for the results and elaborate under three headings.

Estimation:
Optimization formulations naturally arise in high-dimensional statistical estimation, for which we develop scalable optimization methods that exploit dimensionality reduction, adaptivity, and stochastic approximation at their core to overcome bottlenecks in the machine learning pipeline.

For instance, our results now indicate that storage, rather than arithmetic computation, can be the critical bottleneck that prevents us from solving many large-scale optimization problems. In particular, semidefinite programs, which are at the top of the convex optimization hierarchy, often have low-rank solutions that can be represented with $\mathcal{O}(n)$ storage, yet semidefinite programming (SDP) algorithms require us to store a matrix decision variable of size $\mathcal{O}(n^2)$.
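A back-of-the-envelope calculation illustrates the gap between the two storage costs mentioned above. The problem size, the rank, and the 8-byte entry size are assumptions chosen only for illustration.

```python
def dense_bytes(n, bytes_per_entry=8):
    """Storage for a full n-by-n matrix decision variable."""
    return n * n * bytes_per_entry


def factor_bytes(n, r, bytes_per_entry=8):
    """Storage for an n-by-r factor of a rank-r solution."""
    return n * r * bytes_per_entry


n, r = 10**6, 10  # hypothetical problem size and solution rank
full = dense_bytes(n)        # 8 * 10**12 bytes, about 8 TB
factor = factor_bytes(n, r)  # 8 * 10**7 bytes, about 80 MB
ratio = full // factor       # 10**5
```

For a fixed rank, the ratio between the two grows linearly in $n$, so the storage saving becomes more pronounced exactly where problems become large.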

In our recent research (Yurtsever AISTATS/ICML), we developed new algorithms that can solve convex optimization problems in the space required to specify the problem and its solution. Our key insight is to design algorithms that maintain only a small sketch of the decision variable or exploit the complementary slackness of the primal-dual formulations. For the SDP formulations, we obtain an approximate solution within an $\epsilon$-error region in the objective residual and distance to the feasible set, after a total of $\texttt{Const}\cdot \epsilon^{-5/2}\log(n/\epsilon)$ matrix-vector multiplications for the linear minimization oracle (approximate eigenvalue calculation), and an additional $\mathcal{O}(\max(n,d)/\epsilon^2)$ arithmetic operations for the remaining computations. Here, $\texttt{Const}$ is problem independent.

On the sketching side, we have proposed new single-view low-rank matrix sketching methods for positive-semidefinite matrices. Our research produces informative error bounds that predict algorithm parameters (hence our approach is essentially tuning-free), and guides the user towards efficient and stable implementations as well as the right ways to preserve structural properties.
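As an illustration of the single-view idea, the sketch below maintains a textbook Nyström sketch $Y = A\Omega$ of a positive-semidefinite matrix $A$ that arrives as a stream of rank-one updates $v v^\top$, and reconstructs $\hat{A} = Y(\Omega^\top Y)^{-1}Y^\top$ afterwards. This is the basic construction without the numerical stabilization and error estimation that a careful implementation would add; the function names and the test data are assumptions, not the project's code.

```python
import random


def solve(B, R):
    """Solve B X = R (B is k-by-k, R is k-by-n) by Gauss-Jordan elimination."""
    k = len(B)
    aug = [B[i][:] + R[i][:] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(aug[r][c]))  # partial pivoting
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [a / pivot for a in aug[c]]
        for r in range(k):
            if r != c:
                f = aug[r][c]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[c])]
    return [row[k:] for row in aug]


def stream_sketch(updates, n, k, seed=0):
    """Accumulate Y = A @ Omega for A = sum of v v^T, never forming A."""
    rng = random.Random(seed)
    omega = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(n)]
    Y = [[0.0] * k for _ in range(n)]
    for v in updates:
        proj = [sum(v[i] * omega[i][j] for i in range(n)) for j in range(k)]
        for i in range(n):
            for j in range(k):
                Y[i][j] += v[i] * proj[j]  # rank-one update: v (v^T Omega)
    return omega, Y


def reconstruct(omega, Y):
    """Nystrom approximation A_hat = Y (Omega^T Y)^{-1} Y^T."""
    n, k = len(Y), len(Y[0])
    B = [[sum(omega[i][a] * Y[i][b] for i in range(n)) for b in range(k)]
         for a in range(k)]
    Yt = [[Y[i][a] for i in range(n)] for a in range(k)]
    X = solve(B, Yt)
    return [[sum(Y[i][a] * X[a][j] for a in range(k)) for j in range(n)]
            for i in range(n)]


# Hypothetical stream of two rank-one updates in dimension 6.
updates = [[1.0, 0.0, 2.0, -1.0, 0.5, 3.0], [0.0, 1.0, -1.0, 2.0, 1.0, 0.0]]
omega, Y = stream_sketch(updates, n=6, k=2)
A_hat = reconstruct(omega, Y)
```

The sketch stores only $n \times k$ numbers. When the sketch size $k$ is at least the rank of $A$ and the core matrix $\Omega^\top Y$ is invertible, the reconstruction is exact up to rounding; for larger ranks it yields a low-rank approximation instead.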

We have designed scalable primal-dual (sub)gradient methods for convex minimization with affine constraints. For instance, one of these methods, AUniPD (from our earlier Universal Primal Dual work, funded by our previous ERC Future-Proof), is universal in the sense that it adapts to the unknown smoothness level of the underlying problem. In a similar vein, we have designed a universal accelerated gradient method in this ERC (time-data), called AcceleGrad, for unconstrained convex minimization. AcceleGrad leverages an adaptive step-size rule based on the gradient norms, as in AdaGrad, and can exploit stochastic gradients without requiring the line-search step that AUniPD needs.
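The step-size idea referred to above can be sketched with the scalar ("norm") variant of AdaGrad, in which the step size shrinks with the accumulated squared gradient norms and no smoothness constant or line search is needed. This is plain, unaccelerated gradient descent with an AdaGrad-style step, shown only to illustrate the rule; it is not AcceleGrad itself, and the test function and the parameter `D` are assumptions.

```python
import math


def adagrad_norm(grad, x0, D=1.0, steps=2000):
    """Gradient descent with step size D / sqrt(sum of squared gradient norms)."""
    x = list(x0)
    sq_sum = 0.0
    for _ in range(steps):
        g = grad(x)
        sq_sum += sum(gi * gi for gi in g)
        if sq_sum == 0.0:  # already at a stationary point
            break
        eta = D / math.sqrt(sq_sum)
        x = [xi - eta * gi for xi, gi in zip(x, g)]
    return x


# Hypothetical ill-conditioned quadratic: f(x) = 0.5 * sum_i a_i (x_i - c_i)^2.
curvature = [1.0, 10.0]
target = [3.0, -2.0]


def grad_quadratic(x):
    return [a * (xi - c) for a, xi, c in zip(curvature, x, target)]


x_final = adagrad_norm(grad_quadratic, [0.0, 0.0])
```

The same code runs unchanged if the curvature values are rescaled, which is the practical appeal of adapting the step size to the observed gradients rather than to a known smoothness constant.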

In this setting, we have obtained additional results, such as the first convergence result for composite convex minimization as well as the Frank-Wolfe method for non-Lipschitz objectives, the first stochastic forward Douglas-Rachford splitting framework, the first coordinate descent framework for three-composite minimization, and non-Euclidean training methods for neural networks.

Decisions:
Many decision problems can be tackled using the diminishing returns properties in the underlying formulations. On this front, we focus on submodular optimization (a discrete analog of convexity) and develop methodology in the Bayesian optimization framework (BO) for optimal decision making.

The BO problem consists of sequentially optimizing a black-box function based on noisy feedback, where the function is modeled via a Gaussian process. This problem not only has far-reaching applications such as robotics and sensor networks, but can also boost the performance of virtually any machine learning algorithm, including deep neural networks, by providing an automatic procedure for tuning the hyperparameters.
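The sequential loop described above can be sketched with a standard upper-confidence-bound rule on a one-dimensional grid: fit a Gaussian process posterior to the noisy observations, then query the point that maximizes mean plus a multiple of the standard deviation. The kernel, its length scale, the exploration weight `beta`, the noise level, and the toy objective are all assumptions made for illustration; this is a generic GP-UCB loop, not a method from the project.

```python
import math
import random


def rbf(a, b, length=0.2):
    """Squared-exponential kernel with unit prior variance."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [M[i][:] + [rhs[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            aug[r] = [a - f * b for a, b in zip(aug[r], aug[c])]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - tail) / aug[i][i]
    return z


def gp_ucb(f, grid, rounds=15, beta=2.0, noise_sd=0.01, seed=0):
    """Sequentially query a noisy black-box f on a grid with the UCB rule."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(rounds):
        if not xs:
            x_next = grid[len(grid) // 2]  # prior is flat: start in the middle
        else:
            K = [[rbf(a, b) + (noise_sd ** 2 if i == j else 0.0)
                  for j, b in enumerate(xs)] for i, a in enumerate(xs)]
            alpha = solve(K, ys)  # K^{-1} y, shared by all candidates
            scores = []
            for x in grid:
                k_star = [rbf(a, x) for a in xs]
                mean = sum(k * a for k, a in zip(k_star, alpha))
                w = solve(K, k_star)
                var = max(1.0 - sum(k * wi for k, wi in zip(k_star, w)), 0.0)
                scores.append(mean + beta * math.sqrt(var))
            x_next = grid[max(range(len(grid)), key=scores.__getitem__)]
        xs.append(x_next)
        ys.append(f(x_next) + rng.gauss(0.0, noise_sd))  # noisy feedback
    return xs, ys


# Hypothetical black-box objective with its maximum at x = 0.3.
grid = [i / 50 for i in range(51)]
xs, ys = gp_ucb(lambda x: -(x - 0.3) ** 2, grid)
```

In a hyperparameter-tuning application, `f` would be the validation score of a model trained with setting `x`, so each query is expensive and the number of rounds is the resource being economized.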

While a v