Training example order in SGD has long been known to affect convergence rate. Recent results show that accelerated rates are possible in a variety of cases for permutation-based sample orders, in which each example from the training set is used once before any example is reused. In this paper, we develop a broad condition on the sequence of examples used by SGD that is sufficient to prove tight convergence rates in both strongly convex and non-convex settings. We show that our approach suffices to recover, and in some cases improve upon, previous state-of-the-art analyses for four known example-selection schemes: (1) shuffle once, (2) random reshuffling, (3) random reshuffling with data echoing, and (4) Markov Chain Gradient Descent. Motivated by our theory, we propose two new example-selection approaches. First, using quasi-Monte-Carlo methods, we achieve unprecedented accelerated convergence rates for learning with data augmentation. Second, we greedily choose a fixed scan-order to minimize the metric used in our condition and show that we can obtain more accurate solutions from the same number of epochs of SGD. We conclude by empirically demonstrating the utility of our approach for both convex linear-model and deep learning tasks.
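The orderings named above are easy to contrast on a toy problem. Below is a minimal NumPy sketch comparing shuffle-once and random-reshuffling epochs with ordinary with-replacement sampling for SGD on a small least-squares instance; the problem sizes, step size, and decay schedule are arbitrary placeholders, and the sketch illustrates only the orderings themselves, not the paper's analysis or its proposed quasi-Monte-Carlo and greedy scan-order schemes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: minimize (1/n) * sum_i 0.5 * (a_i . x - b_i)^2
n, d = 200, 10
A = rng.normal(size=(n, d))
x_true = rng.normal(size=d)
b = A @ x_true + 0.01 * rng.normal(size=n)

def sgd(order_fn, epochs=50, lr=0.05):
    """Run SGD where order_fn(epoch) returns the example order for that epoch."""
    x = np.zeros(d)
    for epoch in range(epochs):
        for i in order_fn(epoch):
            grad = (A[i] @ x - b[i]) * A[i]   # gradient of one example's loss
            x -= lr * grad
        lr *= 0.95                            # simple step-size decay
    return 0.5 * np.mean((A @ x - b) ** 2)    # final training loss

# (1) Shuffle once: a single fixed permutation reused every epoch.
perm = rng.permutation(n)
loss_so = sgd(lambda epoch: perm)

# (2) Random reshuffling: a fresh permutation each epoch.
loss_rr = sgd(lambda epoch: rng.permutation(n))

# (3) With-replacement sampling, for contrast with the permutation-based orders.
loss_wr = sgd(lambda epoch: rng.integers(0, n, size=n))

print(f"shuffle once: {loss_so:.6f}  reshuffling: {loss_rr:.6f}  "
      f"with-replacement: {loss_wr:.6f}")
```

On small runs like this, the permutation-based orders typically reach a lower loss than with-replacement sampling for the same number of epochs, which is the qualitative effect the accelerated-rate results describe.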
In this paper we consider large-scale smooth optimization problems with multiple linear coupled constraints. Due to the non-separability of the constraints, arbitrary random sketching would not be guaranteed to work. Thus, we first investigate necessary and sufficient conditions for the sketch sampling to have well-defined algorithms. Based on these sampling conditions we develop new sketch descent methods for solving general smooth linearly constrained problems, in particular, random sketch descent (RSD) and accelerated random sketch descent (A-RSD) methods. To our knowledge, this is the first convergence analysis of RSD algorithms for optimization problems with multiple non-separable linear constraints. For the general case, when the objective function is smooth and non-convex, we prove a sublinear rate in expectation for an appropriate optimality measure for the non-accelerated variant. In the smooth convex case, we derive sublinear convergence rates in the expected values of the objective function for both algorithms, non-accelerated and A-RSD. Additionally, if the objective function satisfies a strong convexity type condition, both algorithms converge linearly in expectation. In special cases where complexity bounds are known for particular sketching algorithms, such as coordinate descent methods for optimization problems with a single linear coupled constraint, our theory recovers the best known bounds. Finally, we present several numerical examples to illustrate the performance of our new algorithms.
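To make the sketch-descent idea concrete, here is a small NumPy illustration of a non-accelerated update of this general kind for a toy quadratic objective with coupled equality constraints. It samples a block of coordinates larger than the number of constraints (one simple way to keep the sampled subproblem well defined, in the spirit of the sampling conditions mentioned above) and projects the restricted gradient onto the null space of the sampled constraint block so that feasibility is preserved. The objective, dimensions, sketch size, and step length are illustrative assumptions rather than the paper's algorithm parameters or experiments.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy instance: minimize f(x) = 0.5 * ||x - c||^2  subject to  A x = b,
# with m coupled linear constraints (sizes are arbitrary placeholders).
n, m = 50, 3
A = rng.normal(size=(m, n))
c = rng.normal(size=n)
x = rng.normal(size=n)   # take the starting point as feasible by construction
b = A @ x

def grad_f(x):
    return x - c

tau = 8       # sketch size; chosen larger than m so A_S has a nontrivial null space
step = 1.0    # f has a gradient with Lipschitz constant 1, so a unit step is safe

for _ in range(2000):
    S = rng.choice(n, size=tau, replace=False)   # random coordinate sketch
    A_S = A[:, S]                                # constraint block restricted to S
    g_S = grad_f(x)[S]                           # restricted gradient
    # Project -g_S onto the null space of A_S so the update keeps A x = b.
    d_S = -(g_S - np.linalg.pinv(A_S) @ (A_S @ g_S))
    x[S] += step * d_S

print("constraint violation:", np.linalg.norm(A @ x - b))
print("objective:", 0.5 * np.linalg.norm(x - c) ** 2)
```

Because every update direction lies in the null space of the sampled columns of A, the iterates stay feasible throughout; that is exactly the property that arbitrary sketching does not guarantee when the constraints are non-separable.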
The Neuronix high-performance computing cluster allows us to conduct extensive machine learning experiments on big data. This heterogeneous cluster uses innovative scheduling technology, Slurm, that manages a network of CPUs and graphics processing units (GPUs). The GPU farm consists of a variety of processors ranging from low-end consumer-grade devices such as the Nvidia GTX 970 to higher-end devices such as the GeForce RTX 2080. These GPUs are essential to our research since they allow extremely compute-intensive deep learning tasks to be executed on massive data resources such as the TUH EEG Corpus. We use TensorFlow as the core machine learning library for our deep learning systems, and routinely employ multiple GPUs to accelerate the training process. Reproducible results are essential to machine learning research. Reproducibility in this context means the ability to replicate an existing experiment – performance metrics such as error rates should be identical and floating-point calculations should match closely. Three examples of ways we typically expect an experiment to be replicable are: (1) The same job run on the same processor should produce the same results each time it is run. (2) A job run on a CPU and GPU should produce identical results. (3) A job should produce comparable results if the data is presented in a different order.
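For expectation (1) above, most run-to-run variability on a single processor comes from unseeded random number generators, non-deterministic GPU kernels, and data-pipeline shuffling. The snippet below is a minimal TensorFlow 2.x configuration sketch of the kind typically used to make such runs repeatable; it is an illustrative assumption about setup rather than the actual Neuronix experiment code, and it targets repeatability on one device, since bitwise-identical results across different CPU and GPU models are generally not achievable and only close floating-point agreement can be expected.

```python
import os
import random

import numpy as np
import tensorflow as tf

SEED = 1234

# Pin every source of randomness we control: Python, NumPy, and TensorFlow.
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

# Ask TensorFlow to use deterministic kernels where they exist (TF >= 2.9);
# on older versions the TF_DETERMINISTIC_OPS=1 environment variable plays a
# similar role.  Some GPU ops may run slower, or fail if no deterministic
# implementation is available.
tf.config.experimental.enable_op_determinism()

# Make the input pipeline's shuffling repeatable as well, since data order is
# one of the reproducibility expectations listed above.
def make_dataset(features, labels, batch_size=32):
    ds = tf.data.Dataset.from_tensor_slices((features, labels))
    ds = ds.shuffle(buffer_size=len(features), seed=SEED,
                    reshuffle_each_iteration=False)
    return ds.batch(batch_size)
```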