This document is a study summary of "QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding" [1].
- Introduction
- Preliminaries
- Corollary 2.2
- Quantized Stochastic Gradient Descent (QSGD)
- QSGD Variants
- Experiments
- Conclusion
- Reference
Introduction
- Objective function $f(x) : \mathbb{R}^n \rightarrow \mathbb{R}$
- Define a stochastic gradient $\tilde{g}$ such that $\mathbb{E}[\tilde{g}(x)] = \nabla f(x)$
- Standard SGD (see the sketch after this list)
\(x_{t+1} = x_t - \eta_t \tilde{g}(x_t)\)
- $\mathcal{F}_m = \{X_1, \cdots, X_m \}$ is drawn i.i.d. from the distribution $D$
- The loss function $l(X, \theta)$
- We want to find a model $\theta^*$ which minimizes $f(\theta) = \mathbb{E}_{X \sim D}[l(X, \theta)]$
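As a concrete reference, here is a minimal sketch of the plain SGD loop above; `stochastic_gradient` is a hypothetical callback returning an unbiased estimate of $\nabla f(x)$ (not part of the paper):

```python
import numpy as np

def sgd(x0, stochastic_gradient, eta, T):
    """Plain SGD: x_{t+1} = x_t - eta * g~(x_t), returning the averaged iterate."""
    x = np.asarray(x0, dtype=float).copy()
    iterates = [x.copy()]
    for _ in range(T):
        g = stochastic_gradient(x)      # unbiased estimate of grad f(x)
        x = x - eta * g
        iterates.append(x.copy())
    # Theorem 2.1 below bounds the suboptimality of the averaged iterate.
    return np.mean(iterates, axis=0)
```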
Preliminaries
Let $\mathcal{X} \subseteq \mathbb{R}^n$ be a known convex set, and let $f:\mathcal{X} \rightarrow \mathbb{R}$ be differentiable, convex, smooth, and unknown.
Definition 2.1
Fix $f:\mathcal{X} \rightarrow \mathbb{R}$. A stochastic gradient for $f$ is a random function $\tilde{g}(x)$ such that $\mathbb{E}[\tilde{g}(x)] = \nabla f(x)$. We say the stochastic gradient has second moment at most $B$ if $\mathbb{E}[\| \tilde{g}(x)\|_2^2] \leq B$ for all $x \in \mathcal{X}$. We say it has variance at most $\sigma^2$ if $\mathbb{E}[\| \tilde{g}(x) - \nabla f(x) \|_2^2 ] \leq \sigma^2$ for all $x \in \mathcal{X}$.
Theorem 2.1
Let $\mathcal{X} \subseteq \mathbb{R}^n$ be convex, and let $f:\mathcal{X} \rightarrow \mathbb{R}$ be unknown, convex, and $L$-smooth. Let $x_0 \in \mathcal{X}$ be given, and let $R^2 = \sup_{x \in \mathcal{X}} \| x - x_0 \|^2$. Let $T > 0$ be fixed. Given repeated independent access to stochastic gradients with variance bound $\sigma^2$ for $f$, SGD with initial point $x_0$ and constant step sizes $\eta_t = \frac{1}{L + 1/\gamma}$, where $\gamma = \frac{R}{\sigma} \sqrt{\frac{2}{T}}$, achieves \(\mathbb{E}\left[ f \left( \frac{1}{T} \sum_{t=0}^T x_t \right) \right] - \min_{x \in \mathcal{X}} f(x) \leq R \sqrt{ \frac{2 \sigma^2}{T} } + \frac{LR^2}{T}\)
Minibatched SGD
Update equation for minibatched SGD: \(x_{t+1} = \prod_{\mathcal{X}} \left( x_t - \eta_t \tilde{G}_t (x_t) \right)\), where $\tilde{G}_t(x_t) = \frac{1}{m} \sum_{i=1}^m \tilde{g}_{t, i}$ and $\prod_{\mathcal{X}}$ denotes the Euclidean projection onto $\mathcal{X}$.
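A one-step sketch of the minibatched update, with the projection onto $\mathcal{X}$ omitted for simplicity (the helper name `stochastic_gradient` is again a placeholder):

```python
import numpy as np

def minibatch_sgd_step(x, stochastic_gradient, eta, m):
    """One minibatched step: average m independent stochastic gradients at x."""
    G = np.mean([stochastic_gradient(x) for _ in range(m)], axis=0)
    return x - eta * G   # projection onto the convex set X is omitted here
```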
Corollary 2.2
Let $\mathcal{X}, f, L, x_0$ and $R$ be as in Theorem 2.1. Fix $\epsilon > 0$. Suppose we run parallel SGD on K processors, each with access to independent stochastic gradients with the second moment bound $B$, with step size $\eta_t = 1/(L + \sqrt{K}/\gamma)$, where $\gamma$ is as in Theorem 2.1. Then if
\[T = O \left( R^2 \cdot \max \left(\frac{2B}{K \epsilon^2}, \frac{L}{\epsilon} \right) \right), \;\;\text{then} \;\; \mathbb{E} \left[ f\left( \frac{1}{T} \sum_{t=0}^T x_t \right)\right] - \min_{x \in \mathcal{X}} f(x) \leq \epsilon\]
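For intuition (my own reading of the corollary, not a statement from the paper): when the first term dominates, the required iteration count scales inversely with the number of processors, which is the linear parallel speedup that motivates reducing per-round communication cost.
\[T \approx \frac{2 R^2 B}{K \epsilon^2} \;\propto\; \frac{1}{K}\]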
Quantized Stochastic Gradient Descent (QSGD)
Let $\|v\|_0$ denote the number of nonzero entries of $v \in \mathbb{R}^n$, and let $sgn(x) \in \{ -1, 1\}$ with $sgn(0) = 1$.
Generalized Stochastic Quantization and Coding
For any $v \in \mathbb{R}^n$ with $v \neq 0$, $Q_s(v)$ is defined componentwise as
\(Q_s(v_i) = \| v \|_2 \cdot sgn (v_i) \cdot \xi _i (v,s)\)
where the $\xi_i (v,s)$'s are independent random variables defined as follows.
- Let $0 \leq l < s$ be an integer such that \(\frac{|v_i|}{\|v\|_2} \in [ l/s, (l+1)/s]\).
- That is, $[ l/s, (l+1)/s]$ is the quantization interval. Then \(\xi_i(v,s) = \begin{cases} l/s & \text{ with probability } 1 - p\left( \frac{|v_i|}{\|v\|_2}, s \right) \\ (l+1)/s & \text{ otherwise } \end{cases}\)
Here, $p(a,s) = as - l$ for any $a \in [0,1]$. If $v = 0$, we define $Q(v,s)=0$.
- Plugging the endpoints of the quantization interval into $p(a,s) = as - l$ gives \(\frac{|v_i|}{\|v\|_2} \cdot s - l \Rightarrow \frac{l}{s} \cdot s - l = 0 \text{ or } \frac{l+1}{s} \cdot s - l = 1\)
- Therefore, $p\left( \frac{|v_i|}{\|v\|_2}, s \right)$ can be read as the probability that $\xi_i(v,s)$ rounds up to the upper level $(l+1)/s$; it grows linearly from 0 to 1 across the quantization interval.
- The expectation is \(\mathbb{E}[\xi_i(v,s)] = \frac{ \left| v_i \right| }{ \|v\|_2 }\), so $Q_s$ is unbiased.
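A minimal NumPy sketch of the stochastic quantizer $Q_s$ defined above (function and argument names are mine, not from the paper):

```python
import numpy as np

def quantize(v, s, rng=None):
    """Stochastic quantization: Q_s(v)_i = ||v||_2 * sgn(v_i) * xi_i(v, s)."""
    rng = rng or np.random.default_rng()
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)          # Q_s(0) = 0 by definition
    a = np.abs(v) / norm                 # a = |v_i| / ||v||_2, in [0, 1]
    l = np.floor(a * s)                  # lower quantization level index
    p = a * s - l                        # P[round up] = p(a, s) = a*s - l
    xi = (l + (rng.random(v.shape) < p)) / s
    return norm * np.sign(v) * xi
```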
Lemma 3.1
For any $v \in \mathbb{R}^n$, we have that
- $\mathbb{E}(Q_s(v)) = v$ (unbiasedness)
- $\mathbb{E}( \| Q_s(v)- v \|_2^2) \leq \min\left(\frac{n}{s^2}, \frac{\sqrt{n}}{s}\right)\| v \|_2^2$ (variance bound)
- $\mathbb{E}( \| Q_s(v) \|_0 ) \leq s(s + \sqrt{n})$ (sparsity)
- Here, $s$ is called the quantization level.
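A quick empirical sanity check of the unbiasedness and sparsity claims, using the `quantize` sketch above (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
n, s = 1024, 4
v = rng.standard_normal(n)
samples = np.stack([quantize(v, s, rng) for _ in range(20000)])

print("max bias:", np.abs(samples.mean(axis=0) - v).max())   # should be close to 0
print("avg nonzeros:", (samples != 0).sum(axis=1).mean(),
      "vs sparsity bound:", s * (s + np.sqrt(n)))
```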
Theorem 3.2
Let $f : \mathbb{R}^n \rightarrow \mathbb{R}$ be fixed, and let $x \in \mathbb{R}^n$ be arbitrary. Fix $s \geq 2$ quantization levels. If $\tilde{g}(x)$ is a stochastic gradient for $f$ at $x$ with second moment bound $B$, then $Q_s (\tilde{g}(x))$ is a stochastic gradient for $f$ at $x$ with variance bound $\min \left( \frac{n}{s^2}, \frac{\sqrt{n}}{s}\right)B$. Moreover, there is an encoding scheme so that, in expectation, the number of bits to communicate $Q_s(\tilde{g}(x))$ is upper bounded by \(\left( 3+ \left( \frac{3}{2} + o(1)\right) \log \left( \frac{2(s^2 + n)}{s(s+ \sqrt{n})}\right ) \right) s(s+ \sqrt{n}) + 32\)
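Roughly, the encoding sends $\|v\|_2$ as a 32-bit float plus, for each nonzero entry, a sign bit and Elias-type integer codes for its location and level (see [1] for the exact scheme). As a small illustration of the integer-coding ingredient, here is Elias gamma coding of positive integers, a simpler relative of the recursive Elias code the paper analyzes:

```python
def elias_gamma_encode(n):
    """Elias gamma code of a positive integer: (len-1) zeros, then binary(n)."""
    assert n >= 1
    b = bin(n)[2:]                      # binary representation, leading bit is '1'
    return "0" * (len(b) - 1) + b

def elias_gamma_decode(bits):
    """Decode one Elias gamma codeword from the front of a bit string."""
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    value = int(bits[zeros:2 * zeros + 1], 2)
    return value, bits[2 * zeros + 1:]  # decoded value and remaining bits

# Example: small integers cost few bits, e.g. elias_gamma_encode(5) == "00101".
```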
3.2 QSGD Guarantees
Theorem 3.4 (Smooth convex QSGD)
Let $\mathcal{X}, f, L, x_0$ and $R$ be as in Theorem 2.1. Fix $\epsilon > 0$. Suppose we run parallel QSGD with $s$ quantization levels on $K$ processors accessing independent stochastic gradients with second moment bound $B$, with step size $\eta_t = \frac{1}{L + \sqrt{K}/\gamma}$, where $\gamma$ is as in Theorem 2.1 with $\sigma = B'$, where $B'= \min \left( \frac{n}{s^2}, \frac{\sqrt{n}}{s}\right)B$. Then, if $T= O\left( R^2 \cdot \max \left( \frac{2B'}{K\epsilon^2}, \frac{L}{\epsilon} \right) \right)$, then $\mathbb{E} \left[ f \left( \frac{1}{T} \sum_{t=0}^T x_t \right) \right] - \min_{x \in \mathcal{X}} f(x) \leq \epsilon$. Moreover, QSGD requires $\left( 3+ \left( \frac{3}{2} + o(1)\right) \log \left( \frac{2(s^2 + n)}{s(s+ \sqrt{n})}\right ) \right) s(s+ \sqrt{n}) + 32$ bits of communication per round. In the special case $s=\sqrt{n}$, this can be reduced to $2.8n + 32$ bits (and then $\min(n/s^2, \sqrt{n}/s) = 1$, so $B' = B$).
Theorem 3.5 (QSGD for non-convex optimization)
Let $f : \mathbb{R}^n \rightarrow \mathbb{R}$ be an $L$-smooth (possibly non-convex) function, and let $x_1$ be an arbitrary initial point. Let $T > 0$ be fixed, and $s > 0$. Then there is a random stopping time $R$ supported on $\{ 1, \cdots , N \}$ so that QSGD with quantization level $s$, constant step sizes $\eta = O(1/L)$, and access to stochastic gradients of $f$ with second moment bound $B$ satisfies $\frac{1}{L}\mathbb{E} [\| \nabla f(x) \|_2^2 ] \leq O \left( \frac{\sqrt{L(f(x_1) - f^*) }}{N} + \frac{\min(n/s^2, \sqrt{n}/s)B}{L}\right)$. Moreover, the communication cost is the same as in Theorem 3.4.
3.3 Quantized Variance-Reduced SGD
- Assume the following:
- There are $K$ processors and $m$ functions, and each processor $i$ has access to the function family $\{ f_{im/K}, \cdots, f_{(i+1)m/K-1} \}$.
- For processor $i$, let $h_i = \frac{1}{m}\sum_{j=im/K}^{(i+1)m/K-1} f_j$ be its portion of $f$, so that $f = \sum_{i=1}^K h_i$.
- The goal is then to minimize $f = \sum_{i=1}^K h_i$.
Algorithm Description
Let $\tilde{Q}(v) = Q(v, \sqrt{n})$ where $Q(v,s)$ is defined as in Section 3.1
- Begin at $y^{(1)} = x_0$, where $x_0$ is an arbitrary point
- At the beginning of epoch $p$, each processor broadcasts $\nabla h_i(y^{(p)})$, i.e., its unquantized local full gradient, from which each processor aggregates $\nabla f (y^{(p)}) = \sum_{i=1}^K \nabla h_i (y^{(p)})$.
- Within each epoch, for each iteration $t=1, \cdots, T$ and each processor $i=1, \cdots, K$, let $j_{i, t}^{(p)}$ be a uniformly random integer from $[m]$, independent of everything else.
- Then, at iteration $t$ of epoch $p$, processor $i$ broadcasts the update vector \(u_{t, i}^{(p)} = \tilde{Q} \left( \nabla f_{j_{i,t}^{(p)}} (x_t^{(p)}) - \nabla f_{j_{i,t}^{(p)}} (y^{(p)}) + \nabla f(y^{(p)}) \right)\)
- Each processor computes the total update
\(u_t^{(p)} = \frac{1}{K}\sum_{i=1}^K u_{t,i}^{(p)}\)
- Sets $x_{t+1}^{(p)} = x_t^{(p)} - \eta u_t^{(p)}$
- At the end of epoch $p$ each processor sets $y^{(p+1)} = \frac{1}{T} \sum_{t=1}^T x_t^{(p)}$
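A single-process sketch of one QSVRG epoch following the description above. The $K$ processors are simulated sequentially; `grad_f` is a hypothetical callback returning $\nabla f_j(x)$, and `quantize` is the sketch from Section 3.1 (so $\tilde{Q}(v) = Q(v, \sqrt{n})$):

```python
import numpy as np

def qsvrg_epoch(y, grad_f, m, K, T, eta, rng=None):
    """One epoch of quantized SVRG (K processors simulated sequentially)."""
    rng = rng or np.random.default_rng()
    s = int(np.sqrt(y.size))                   # Q~(v) = Q(v, sqrt(n))
    # Aggregating the processors' unquantized local full gradients gives
    # the exact full gradient at the epoch anchor y.
    full_grad = np.mean([grad_f(j, y) for j in range(m)], axis=0)
    x = y.copy()
    iterates = []
    for _ in range(T):
        updates = []
        for _ in range(K):
            j = rng.integers(m)                # uniform sample for this processor
            v = grad_f(j, x) - grad_f(j, y) + full_grad
            updates.append(quantize(v, s, rng))
        x = x - eta * np.mean(updates, axis=0)
        iterates.append(x.copy())
    return np.mean(iterates, axis=0)           # y^(p+1): average of the epoch's iterates
```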
Theorem 3.6
Let $f(x) = \frac{1}{m} \sum_{i=1}^m f_i(x)$, where $f$ is $\ell$-strongly convex and the $f_i$ are convex and $L$-smooth for all $i$. Let $x^* $ be the unique minimizer of $f$ over $\mathbb{R}^n$. Then, if $\eta = O(1/L)$ and $T = O(L/\ell)$, QSVRG with initial point $y^{(1)}$ ensures $\mathbb{E}[f(y^{(p+1)})] - f(x^* )\leq 0.9^p \left( f(y^{(1)}) - f(x^* ) \right)$ for any epoch $p \geq 1$. Moreover, QSVRG with $T$ iterations per epoch requires at most $(F + 2.8n) (T+1) + Fn$ bits of communication per epoch, where $F$ is the number of bits of a full-precision float.
- The bound $\mathbb{E}[f(y^{(p+1)})] - f(x^* ) \leq 0.9^p \left( f(y^{(1)}) - f(x^* ) \right)$ shows that the error shrinks geometrically as the number of epochs grows.
- The required number of communication bits is also given explicitly.
- If this were an algorithm of my own design, the bit budget per epoch would be constant, so the condition above could be used as a kind of threshold.
Discussion
- The full version [1] contains the proofs of the theorems above as well as the SGD application examples.
QSGD Variants
Experiments
- The experiments apply the algorithm to vision models (AlexNet [2], VGG [3], ResNet [4]) and to an LSTM [5].
Accuracy
Even looking only at accuracy relative to full precision, the results can be considered clearly meaningful.
- The communication benefit (i.e., reduced bit volume) and the computation benefit (faster arithmetic) already exist, so it suffices to check how well accuracy holds up.
- With 4- or 8-bit gradient quantization, performance is close to full precision (and sometimes slightly better, presumably because quantization itself has the effect of injecting stochastic noise).
- The best performance was obtained with 8 bits and a bucket size of 512.
- It was even better than full precision.
- Quantization itself appears to act as a source of stochasticity.
- Using an algorithm that adds explicit noise gave no additional benefit.
- This is because quantization already has the effect of adding zero-mean noise.
- Among the various attempts to increase accuracy, the largest gains were obtained on the LSTM.
Conclusion
The experimental results show that QSGD is fully competitive with full-precision SGD.
Reference
[1] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, Milan Vojnovic, “QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding”, https://arxiv.org/abs/1610.02132
[2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[3] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[5] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.