- Newton Method
- Steepest Descent
- Conjugate Method
- Quasi Newton
- Quasi Newton : Variable metric method
- Generating $\beta_j$
Consider the following quadratic problem: \(\min_{x \in \mathbb{R}^n} f(x) \;\;\text{ where }\;\; f(x) = \frac{1}{2} \langle x, Hx \rangle + \langle d, x \rangle\)
Newton Method
\(x_{i+1} = x_i - \lambda_i H^{-1} g_i \;\;\; \text{where }\;\; g_i = \nabla f(x_i) = Hx_i + d\)
- We need only 1 iteration to find the minimizer
- However, computing $H$ and inverting it is expensive (a small sketch of the one-step property is given below)
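Below is a minimal NumPy sketch (not from the original notes; $H$, $d$, and $x_0$ are arbitrary illustrative values) showing that a single Newton step with $\lambda = 1$ lands on the minimizer of a quadratic:

```python
import numpy as np

# Minimal sketch: for f(x) = 1/2 <x, Hx> + <d, x>, one exact Newton step
# x1 = x0 - H^{-1} g0 (step size 1) lands on the minimizer x* = -H^{-1} d.
# H, d, and x0 below are arbitrary illustrative values.
H = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite
d = np.array([1.0, -2.0])
x0 = np.array([5.0, 5.0])

g0 = H @ x0 + d                          # gradient at x0
x1 = x0 - np.linalg.solve(H, g0)         # Newton step with lambda = 1

x_star = -np.linalg.solve(H, d)          # true minimizer
print(np.allclose(x1, x_star))           # True: one iteration suffices
```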
Steepest Descent
\(x_{i+1} = x_i - \lambda_i g_i \;\;\; \text{where }\;\; g_i = \nabla f(x_i), \;\; \lambda_i = \arg \min f(x_i - \lambda g_i)\)
Conjugate Method
\(x_{i+1} = x_i - \lambda_i h_i \;\;\; \text{update } h_i \text{ so that } h_i \text{ is } H \text{-Conjugate}\)
- It needs at most $n$ iterations to find the minimizer
Quasi Newton
Estimate the Hessian $H$, i.e., construct a sequence $\{ H_i \}$ so that $H_i \rightarrow H$ (or construct a sequence $\{ \beta_i \}$ so that $\beta_i \rightarrow \beta = H^{-1}$). Update rule: \(x_{i+1} = x_i - \lambda_i \beta_i g_i\)
- For a quadratic problem as above, at most $n$ iterations are needed to find the minimizer (i.e. $\beta_n = H^{-1}$).
Quasi Newton : Variable metric method
Idea
The idea is to approximate $\min_{x \in \mathbb{R}^n} f(x)$ by a quadratic, i.e., to work with the approximation \(f(x) = \langle d, x \rangle + \frac{1}{2} \langle x , Hx \rangle\) for a given positive definite $n \times n$ matrix $H$.
Now suppose the search direction is chosen as follows (in the case below, the point is optimal once $df(x, h) = 0$), and analyze the properties of the search direction through the following condition.
Search direction : $h(x) = \arg \min \{ \frac{1}{2} \| h \|_H^2 + df(x, h) \}$
For steepest descent:
\(h(x) = \arg \min \{ \frac{1}{2} \| h \|^2 + \langle \nabla f(x), h \rangle \}, \;\;\; \frac{\partial h(x)}{\partial h} = h + \nabla f(x) = 0 \Rightarrow h(x) = - \nabla f(x)\)
For the Newton method:
\(h(x) = \arg \min \{ \frac{1}{2} \| h \|_H^2 + df(x, h) \}\) where $H$ is the Hessian of $f(x)$ at $x$
\[\frac{\partial h(x)}{\partial h} = Hh + \nabla f(x) = 0 \Rightarrow h = -H^{-1} \nabla f(x)\]
The Quasi-Newton idea
Replace the Hessian $H$ by a matrix $Q$: \(h(x) = \arg \min \{ \frac{1}{2} \| h \|_Q^2 + df(x, h) \}\)
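The following small sketch (an arbitrary illustrative quadratic; the helper name `search_direction` is ours) shows how the choice of metric determines the direction: $Q = I$ recovers steepest descent, $Q = H$ recovers the Newton direction:

```python
import numpy as np

def search_direction(grad, Q):
    """Minimizer of 1/2 <h, Qh> + <grad, h>, i.e. h = -Q^{-1} grad."""
    return -np.linalg.solve(Q, grad)

# Arbitrary illustrative quadratic f(x) = 1/2 <x, Hx> + <d, x>
H = np.array([[4.0, 1.0], [1.0, 3.0]])
d = np.array([1.0, -2.0])
x = np.array([5.0, 5.0])
g = H @ x + d                            # gradient at x

h_sd = search_direction(g, np.eye(2))    # Q = I  -> steepest descent: -grad
h_nt = search_direction(g, H)            # Q = H  -> Newton direction: -H^{-1} grad
print(np.allclose(h_sd, -g), np.allclose(h_nt, -np.linalg.solve(H, g)))
```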
Examining the Newton method: let $f(x) = \langle d, x \rangle + \frac{1}{2} \langle x , Hx \rangle$, let $x_0, \cdots, x_n$ be a sequence of distinct vectors, and let $B = H^{-1}$. Set \(\begin{align} g_i &= \nabla f(x_i) = Hx_i + d, \\ \Delta x_i &= x_{i+1} - x_i, \\ \Delta g_i &= g_{i+1} - g_i = Hx_{i+1} + d - Hx_i - d = H\Delta x_i \\ & \Rightarrow H^{-1}\Delta g_i = \Delta x_i \;\;\; \text{or} \;\;\; B \Delta g_i = \Delta x_i \;\; \forall i \\ & \Rightarrow B [\Delta g_0 \cdots \Delta g_{n-1}] = [\Delta x_0 \cdots \Delta x_{n-1}] \\ & \Rightarrow B \Delta G = \Delta X \end{align}\)
If $\Delta G$ is non-singular, then \(B = H^{-1} = \Delta X \Delta G^{-1}\)
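A quick numerical check of this identity, assuming an arbitrary SPD Hessian and $n+1$ random points (so that the $\Delta x_i$ are generically independent):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)              # symmetric positive definite Hessian

# n+1 distinct points give n generically independent steps
X = rng.standard_normal((n, n + 1))
dX = np.diff(X, axis=1)                  # columns are Delta x_i
dG = H @ dX                              # Delta g_i = H Delta x_i for a quadratic

B = dX @ np.linalg.inv(dG)               # Delta X * Delta G^{-1}
print(np.allclose(B, np.linalg.inv(H)))  # True: B = H^{-1}
```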
One starts with $\beta_0$, a symmetric positive definite $n \times n$ matrix which is an initial guess of $H^{-1}$, and with $x_0$. Consider the update rule \(x_{i+1} = x_i - \lambda_i \beta_i g_i\) where $\lambda_i$ is defined by \(\lambda_i = \arg \min f(x_i - \lambda \beta_i g_i)\) and where $\beta_{i+1}$ must satisfy the following Quasi-Newton property: \(\beta_{i+1} \Delta g_k = \Delta x_k, \;\;\; k=0, 1, \cdots, i\)
**Example** \(\beta_0 = I = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} , \;\;\; \Delta g_0 = \begin{bmatrix} 2 \\ 0 \\ 0 \end{bmatrix} , \;\;\; \Delta x_0 = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}\) The problem is to find a correction $\Delta \beta$ satisfying the Quasi-Newton condition: \(\beta_1 = \beta_0 + \Delta \beta \;\;\;\text{ so that }\;\;\; \beta_1 \cdot \begin{bmatrix} 2 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}\) A simple choice is \(\beta_1 = \begin{bmatrix} \frac{1}{2} & 0 & 0 \\ \frac{1}{2} & 1 & 0 \\ \frac{1}{2} & 0 & 1 \end{bmatrix} \;\;\; \Delta \beta = \beta_1 - \beta_0= \begin{bmatrix} -\frac{1}{2} & 0 & 0 \\ \frac{1}{2} & 0 & 0 \\ \frac{1}{2} & 0 & 0 \end{bmatrix}\) Hence \(x_2 = x_1 - \lambda_1 \beta_1 g_1, \;\;\; g_2 = \nabla f(x_2)\) Now suppose \(\Delta g_1 = g_2 - g_1 = \begin{bmatrix} 0 \\ 1 \\ 1 \end{bmatrix} ,\;\;\; \Delta x_1 = x_2 - x_1 = \begin{bmatrix} 2 \\ 3 \\ 1 \end{bmatrix}\) Then, with $\beta_2 = \beta_1 + \Delta \beta_1$, find $\Delta \beta_1$ so that $\beta_2 \Delta g_i = \Delta x_i$ for $i=0,1$.
In other words, \(\beta_2 \Delta g_0 = \Delta x_0 \Rightarrow (\beta_1 + \Delta \beta_1) \Delta g_0 = \Delta x_0 \Rightarrow \Delta \beta_1 \Delta g_0 = 0 \\ \therefore \Delta g_0 \in \mathcal{N}(\Delta \beta_1)\) Also \(\beta_2 \Delta g_1 = \Delta x_1 \Rightarrow (\beta_1 + \Delta \beta_1) \Delta g_1 = \Delta x_1 \Rightarrow \beta_1 \Delta g_1 - \Delta x_1 = -\Delta \beta_1 \Delta g_1 \\ \therefore \Delta x_1 - \beta_1 \Delta g_1 \in \mathcal{R}(\Delta \beta_1)\)
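As a sanity check of the example above, the following sketch verifies that the chosen $\beta_1$ satisfies the condition for $k=0$, and that any correction $\Delta\beta_1$ with $\Delta g_0$ in its null space preserves it (the particular $\Delta\beta_1$ below is an arbitrary illustration, not the one solving the full example):

```python
import numpy as np

beta1 = np.array([[0.5, 0.0, 0.0],
                  [0.5, 1.0, 0.0],
                  [0.5, 0.0, 1.0]])
dg0 = np.array([2.0, 0.0, 0.0])
dx0 = np.array([1.0, 1.0, 1.0])

print(np.allclose(beta1 @ dg0, dx0))     # True: beta1 satisfies the condition for k = 0

# Any correction whose null space contains dg0 keeps that condition intact.
dbeta1 = np.outer(np.array([0.0, 1.0, -1.0]), np.array([0.0, 1.0, 0.0]))  # arbitrary; dbeta1 @ dg0 = 0
beta2 = beta1 + dbeta1
print(np.allclose(beta2 @ dg0, dx0))     # still True
```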
Generating $\beta_j$
Quasi Newton Property \(\beta_{i+1} \Delta g_k = \Delta x_k, \;\;\; k=0,1, \cdots i\)
Proposition : Rank-One Method
Suppose that $H$ is an $n \times n$ nonsingular matrix and let $\beta = H^{-1}$. Let $H^* = H + ab^T$ for some $a, b \in \mathbb{R}^n$. If $\beta^* = {H^*}^{-1}$ exists, then it is given by \(\beta^* = \beta - \frac{\beta ab^T \beta}{1 + b^T \beta a}\)
Note : properties of $ab^T$
- Range space of $ab^T$: spanned by $a$, since $b^T x$ is a scalar
- Null space of $ab^T$: the set of $x$ orthogonal to $b$
proof
Since $\beta^*$ exists we can set $\beta^* = \beta + \Delta \beta$, then
\(\begin{align} \beta^* H^* = I &= (\beta + \Delta \beta)(H + ab^T) = \beta H + \beta ab^T + \Delta \beta ab^T + \Delta \beta H \\ &= I + \beta ab^T + \Delta \beta H^* \end{align}\) Therefore the following must hold: \(0 = \beta ab^T + \Delta \beta H^* \tag{1}\) Hence \(\Delta \beta H^* = -\beta ab^T \Rightarrow \Delta \beta = -\beta ab^T {H^*}^{-1} \triangleq \beta a C^T, \;\;\; \text{where} \;\;\; C^T \triangleq -b^T {H^*}^{-1}, \; C \in \mathbb{R}^n \tag{2}\) Substituting (2) into (1): \(\begin{align} 0 &= \beta ab^T + \beta aC^T H^* = \beta a (b^T + C^T H + C^T ab^T) \Rightarrow b^T + C^T H + C^T ab^T = 0 \\ C^T H &= -(1 + C^T a)b^T \Rightarrow C^T = -(1+C^T a)b^T H^{-1} = -(1+C^T a)b^T \beta \tag{3} \end{align}\)
Rearranging (3): \(C^T = -(1+C^T a)b^T \beta \Rightarrow C^T = -b^T \beta - C^T ab^T \beta\) Multiplying on the right by $a$: \(C^T a = -b^T \beta a- C^T ab^T \beta a \in \mathbb{R} \Rightarrow C^T a (1 + b^T \beta a) = -b^T \beta a\) Therefore \(C^T a = \frac{-b^T \beta a}{1+ b^T \beta a} \tag{4}\) Substituting (4) into (3): \(\begin{align} C^T = -b^T \beta + \frac{b^T \beta a b^T \beta}{1+ b^T \beta a} &= \frac{-b^T\beta(1+b^T \beta a) + b^T \beta a b^T \beta}{1 + b^T \beta a} \\ &= \frac{-b^T\beta -b^T\beta a b^T \beta + b^T \beta a b^T \beta}{1 + b^T \beta a} \\ &= \frac{-b^T \beta}{1 + b^T \beta a} \end{align}\) Thus \(\beta^* = \beta + \Delta \beta = \beta + \beta aC^T = \beta - \frac{\beta ab^T \beta }{1 + b^T \beta a}\) Q.E.D.
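A numerical check of the proposition, with arbitrary illustrative $H$, $a$, $b$ (for generic data the denominator $1 + b^T \beta a$ is nonzero and $\beta^*$ exists):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
H = rng.standard_normal((n, n)) + n * np.eye(n)   # nonsingular for generic data
a = rng.standard_normal(n)
b = rng.standard_normal(n)

beta = np.linalg.inv(H)
H_star = H + np.outer(a, b)                        # rank-one modification

# beta* = beta - (beta a b^T beta) / (1 + b^T beta a)
denom = 1.0 + b @ beta @ a
beta_star = beta - np.outer(beta @ a, b @ beta) / denom

print(np.allclose(beta_star, np.linalg.inv(H_star)))   # True
```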
Compute $\beta_{i+1}$
Let $\beta_0 = I$ and use the update rule $\beta_{i+1} = \beta_i + \alpha_i z_i z_i^T$ (this is called the rank-one property) for $i=0, 1, 2, \cdots$, with $\beta_{i+1}$ required to satisfy $\beta_{i+1} \Delta g_k = \Delta x_k$, $k=0, 1, \cdots, i$, where $x_i, \Delta x_i, \Delta g_i$ are computed by \(\begin{align} x_{i+1} &= x_i - \lambda_i \beta_i g_i \\ \lambda_i &= \arg \min f(x_i - \lambda \beta_i g_i) \\ g_i &= \nabla f(x_i) \\ \Delta x_i &= x_{i+1} - x_i \\ \Delta g_i &= g_{i+1} - g_i \end{align}\)
Find $\alpha_i, z_i$
The Quasi-Newton property has to be satisfied, i.e. $\beta_{i+1} \Delta g_i = \Delta x_i$: \(\begin{align} (\beta_i + \alpha_i z_i z_i^T) \Delta g_i = \Delta x_i &\Rightarrow \Delta x_i = \beta_i \Delta g_i + \alpha_i z_i \langle z_i, \Delta g_i \rangle \\ &\Rightarrow \alpha_i z_i = \frac{\Delta x_i - \beta_i \Delta g_i}{\langle z_i, \Delta g_i \rangle} \end{align} \tag{5}\) Since \(\begin{align} \langle \Delta x_i, \Delta g_i \rangle &= \langle \Delta g_i, \beta_i \Delta g_i \rangle + \alpha_i \langle z_i, \Delta g_i \rangle^2 \\ \alpha_i \langle z_i, \Delta g_i \rangle^2 &= -\langle \Delta g_i, \beta_i \Delta g_i - \Delta x_i \rangle = \langle \Delta g_i, \Delta x_i - \beta_i \Delta g_i \rangle \end{align} \tag{6}\) Substituting (5), (6) into $\beta_{i+1} = \beta_i + \alpha_i z_i z_i^T$: \(\begin{align} \beta_{i+1} &= \beta_i + \frac{\Delta x_i - \beta_i \Delta g_i}{\langle z_i, \Delta g_i \rangle} \left( \frac{\Delta x_i - \beta_i \Delta g_i}{\alpha_i \langle z_i, \Delta g_i \rangle} \right)^T \\ &= \beta_i + \frac{(\Delta x_i - \beta_i \Delta g_i)(\Delta x_i - \beta_i \Delta g_i)^T}{\langle \Delta g_i, \Delta x_i - \beta_i \Delta g_i \rangle} \end{align}\)
Note
\(\beta_{i+1} = \beta_i + \frac{(\Delta x_i - \beta_i \Delta g_i)(\Delta x_i - \beta_i \Delta g_i)^T}{\langle \Delta g_i, \Delta x_i - \beta_i \Delta g_i \rangle}\) In the above, it is required that \(\langle \Delta g_i, \Delta x_i - \beta_i \Delta g_i \rangle \neq 0\) However, this has so far been derived only for $k=i$; it remains to show that the property also holds for $k=i-1, \cdots, 0$.
Theorem
The matrix $\beta_{i+1}$ satisfies the Quasi-Newton property, i.e. \(\beta_{i+1} \Delta g_k = \Delta x_k, \;\;\; k=0, 1, \cdots, i-1, i\)
proof
By Induction,
- $\beta_1 \Delta g_0 = \Delta x_0$, by construction
- Suppose that $\beta_k \Delta g_j = \Delta x_j$ holds for $j=0, 1, \cdots, k-1$.
- Need to prove that
\(\beta_{k+1} \Delta g_j = \Delta x_j, \;\;\; j=0, 1, \cdots, k\) If $j=k$, then it is satisfied by construction.
For $j \leq k-1$: since \(\beta_{k+1} = \beta_k + \frac{(\Delta x_k - \beta_k \Delta g_k)(\Delta x_k - \beta_k \Delta g_k)^T}{\langle \Delta g_k, \Delta x_k - \beta_k \Delta g_k \rangle}\) let \(y_k \triangleq \frac{\Delta x_k - \beta_k \Delta g_k}{\langle \Delta g_k, \Delta x_k - \beta_k \Delta g_k \rangle}\) so that \(\beta_{k+1} \Delta g_j = \beta_k \Delta g_j + y_k \langle \Delta x_k - \beta_k \Delta g_k, \Delta g_j \rangle\) Since $\beta_k \Delta g_j = \Delta x_j$ (by hypothesis), $\beta_k$ is symmetric, and for the quadratic $\Delta g_j = H \Delta x_j$ for all $j$, we can conclude that \(\begin{align} \langle \Delta x_k - \beta_k \Delta g_k, \Delta g_j \rangle &= \langle \Delta x_k, \Delta g_j \rangle - \langle \Delta g_k, \beta_k \Delta g_j \rangle = \langle \Delta x_k, \Delta g_j \rangle - \langle \Delta g_k, \Delta x_j \rangle \\ &= \langle \Delta x_k, H\Delta x_j \rangle - \langle H\Delta x_k, \Delta x_j \rangle = 0 \end{align}\) Thus, \(\beta_{k+1} \Delta g_j = \beta_k \Delta g_j = \Delta x_j\) Q.E.D.
The rank-one method requires that $\langle \Delta g_k, \Delta x_k - \beta_k \Delta g_k \rangle \neq 0$. To compensate for this drawback, the rank-two update was developed.
Note
In other words, the rank-one method uses $\beta_{i+1} = \beta_i + \alpha_i z_i z_i^T$ with
- $z_i = \Delta x_i - \beta_i \Delta g_i$
- $\alpha_i = (\langle \Delta g_i, \Delta x_i - \beta_i \Delta g_i \rangle)^{-1}$
A runnable sketch of this rank-one iteration is given below.
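The sketch below (not from the original notes; the quadratic and starting point are arbitrary) runs the rank-one iteration with an exact line search, skipping the update whenever the denominator $\langle \Delta g_i, \Delta x_i - \beta_i \Delta g_i \rangle$ is near zero; convergence within $n+1$ steps is the generic behaviour, not a guarantee:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)              # illustrative SPD Hessian
d = rng.standard_normal(n)
grad = lambda x: H @ x + d               # gradient of f(x) = 1/2 <x,Hx> + <d,x>

x = rng.standard_normal(n)
beta = np.eye(n)                         # beta_0 = I
for _ in range(n + 1):
    g = grad(x)
    if np.linalg.norm(g) < 1e-12:
        break
    p = beta @ g
    lam = (g @ p) / (p @ H @ p)          # exact line search for a quadratic
    x_new = x - lam * p
    dx, dg = x_new - x, grad(x_new) - g
    r = dx - beta @ dg
    denom = dg @ r                       # rank-one update needs <dg, dx - beta dg> != 0
    if abs(denom) > 1e-12:
        beta = beta + np.outer(r, r) / denom
    x = x_new

print(np.allclose(x, -np.linalg.solve(H, d)))   # typically True for generic data
```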
Rank Two Method
In the rank-one method,
\[\begin{align} \beta_{i+1} &= \beta_i + \alpha_i (\Delta x_i - \beta_i \Delta g_i)(\Delta x_i - \beta_i \Delta g_i)^T, \;\;\; \alpha_i \in \mathbb{R} \\ &= \beta_i + \alpha_i \Delta x_i \Delta x_i^T - \alpha_i (\beta_i \Delta g_i \Delta x_i ^T + \Delta x_i (\beta_i \Delta g_i)^T) + \alpha_i \beta_i \Delta g_i (\beta_i \Delta g_i)^T \end{align}\]
Here, the following terms are symmetric:
- $\beta_i$
- $\Delta x_i \Delta x_i^T$
- $(\beta_i \Delta g_i \Delta x_i ^T + \Delta x_i (\beta_i \Delta g_i)^T)$
- $\beta_i \Delta g_i (\beta_i \Delta g_i)^T$
However, the following term is not symmetric on its own:
- $\beta_i \Delta g_i \Delta x_i ^T$ : this term becomes symmetric only because its transpose is added to it (the third item in the list above)
Idea
Suppress the non-symmetric terms, and let
\[\beta_{i+1}^{DFP} \triangleq \beta_i + \alpha_i \Delta x_i \Delta x_i^T + \gamma_i (\beta_i \Delta g_i)(\beta_i \Delta g_i)^T\]
This update was invented by Davidon, Fletcher, and Powell, hence the name DFP method.
Since it has to satisfy the Quasi-Newton property, i.e. $\beta_{i+1} \Delta g_i = \Delta x_i$,
\[\Delta x_i = \beta_i \Delta g_i + \alpha_i \Delta x_i \Delta x_i^T \Delta g_i + \gamma_i (\beta_i \Delta g_i)(\beta_i \Delta g_i)^T \Delta g_i\]
Pick
\[\alpha_i = \frac{1}{\langle \Delta x_i, \Delta g_i \rangle}, \;\;\; \gamma_i = \frac{-1}{\langle \beta_i \Delta g_i, \Delta g_i \rangle}\]
Here $\langle \beta_i \Delta g_i, \Delta g_i \rangle$ cannot be zero (if $\Delta g_i = 0$, we are already at the optimum), and $\langle \Delta x_i, \Delta g_i \rangle$ is extremely unlikely to be zero (it vanishes only when $\Delta x_i$ and $\Delta g_i$ are orthogonal).
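A quick check (with arbitrary data; the helper name `dfp_update` is ours) that the DFP update built from these coefficients satisfies the Quasi-Newton property for the current step:

```python
import numpy as np

def dfp_update(beta, dx, dg):
    """DFP: beta + dx dx^T / <dx,dg> - (beta dg)(beta dg)^T / <dg, beta dg>."""
    bdg = beta @ dg
    return beta + np.outer(dx, dx) / (dx @ dg) - np.outer(bdg, bdg) / (dg @ bdg)

rng = np.random.default_rng(3)
beta = np.eye(4)
dx, dg = rng.standard_normal(4), rng.standard_normal(4)
beta_next = dfp_update(beta, dx, dg)
print(np.allclose(beta_next @ dg, dx))   # True: Quasi-Newton property for this step
```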
DFP : Variable Metric Algorithm
Procedure | Processing |
---|---|
Data | $x_0 \in \mathbb{R}^n$ |
 | $\beta_0$: an $n \times n$ symmetric positive definite matrix, e.g. $I$ |
Step 0 | Set $i=0$ |
Step 1 | If $g_i = 0$, stop; else continue |
 | Step size rule: $\lambda_i = \arg \min f(x_i - \lambda \beta_i g_i)$ |
Step 2 | Update |
 | $x_{i+1} = x_i - \lambda_i \beta_i g_i$ |
 | $\Delta x_i = x_{i+1} - x_i$ |
 | $\Delta g_i = g_{i+1} - g_i$ |
 | $\beta_{i+1} = \beta_i + \frac{\Delta x_i \Delta x_i^T}{\langle \Delta x_i , \Delta g_i \rangle} - \frac{(\beta_i \Delta g_i)(\beta_i \Delta g_i)^T}{\langle \Delta g_i, \beta_i \Delta g_i \rangle}$ |
Step 3 | Set $i=i+1$ and go to Step 1 |
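Below is a minimal runnable sketch of the algorithm in the table, applied to an arbitrary illustrative quadratic with the exact quadratic step size; it also checks the claims listed in the Results section that follows (generic behaviour on random data, not a proof):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
A = rng.standard_normal((n, n))
H = A @ A.T + n * np.eye(n)              # illustrative SPD Hessian
d = rng.standard_normal(n)
grad = lambda x: H @ x + d

x = rng.standard_normal(n)
beta = np.eye(n)                         # Data: beta_0 = I; Step 0: i = 0
steps = []
for _ in range(n):
    g = grad(x)
    if np.linalg.norm(g) < 1e-12:        # Step 1: stop if g_i = 0
        break
    p = beta @ g
    lam = (g @ p) / (p @ H @ p)          # step size rule (exact for a quadratic)
    x_new = x - lam * p                  # Step 2: update x and beta
    dx, dg = x_new - x, grad(x_new) - g
    bdg = beta @ dg
    beta = beta + np.outer(dx, dx) / (dx @ dg) - np.outer(bdg, bdg) / (dg @ bdg)
    steps.append(dx)
    x = x_new                            # Step 3: i <- i + 1

print(np.allclose(beta, np.linalg.inv(H)))      # typically True: beta_n = H^{-1}
print(np.allclose(x, -np.linalg.solve(H, d)))   # typically True: minimizer reached
print(abs(steps[0] @ H @ steps[1]) < 1e-8)      # typically True: steps H-conjugate
```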
Results
Suppose that $f(x) = \frac{1}{2} \langle x, Hx \rangle + \langle d, x \rangle $ with $H$ a symmetric positive definite $n \times n$ matrix.
- If $\beta_i$ is a symmetric positive definite matrix then $\beta_{i+1}$ is a symmetric positive definite matrix
- $\beta_{i+1} \Delta g_k = \Delta x_k$, $k=0, 1, \cdots, i$ ($\beta_n = H^{-1}$)
- $x_{n+1}$ is a minimizer of $f(\cdot)$
- $\{ \Delta x_i \}_{i=0}^{n-1}$ are $H$-conjugate, i.e. $\langle \Delta x_i, H\Delta x_j \rangle = 0$ for $i \neq j$
1. \(\beta \Delta G = \Delta X \;\;\; \because \beta = H^{-1}\)
2. \(\Delta G = \beta^{-1} \Delta X = H \Delta X\)
The difference between 1 and 2: \(\beta \leftrightarrow H, \;\;\; \Delta g_i \leftrightarrow \Delta x_i \;\;\text{(duality holds)}\) Since we have \(\beta_{i+1} = \beta_i + \frac{\Delta x_i \Delta x_i^T}{\langle \Delta x_i, \Delta g_i \rangle} - \frac{(\beta_i \Delta g_i)(\beta_i \Delta g_i)^T}{\langle \Delta g_i, \beta_i \Delta g_i \rangle}\) by duality, we can estimate the Hessian $H$ by
\[H_{i+1} = H_i + \frac{\Delta g_i \Delta g_i^T}{\langle \Delta g_i, \Delta x_i \rangle} - \frac{(H_i \Delta x_i)(H_i \Delta x_i)^T}{\langle \Delta x_i , H_i \Delta x_i \rangle}\]
Duality
\[\beta_i \rightarrow H_i \;\;\; \Delta g_i \rightarrow \Delta x_i \;\;\; \Delta x_i \rightarrow \Delta g_i\]
But $x_{i+1}$ is updated by \(x_{i+1} = x_i - \lambda_i H_i^{-1} g_i\), i.e. $H_i^{-1} = \beta_i$ is needed.
Since \((A + ab^T)^{-1} = A^{-1} - \frac{(A^{-1}a)(b^T A^{-1})}{1 + b^T A^{-1} a}\) we can obtain the inverse of $H_{i+1}$ above by taking the inverse of \(H_{i+1} = C_i - \frac{(H_i \Delta x_i)(H_i \Delta x_i)^T}{\langle \Delta x_i, H_i \Delta x_i \rangle}\) together with the inverse of \(C_i = H_i + \frac{\Delta g_i \Delta g_i^T}{\langle \Delta g_i, \Delta x_i \rangle}\)
(By the matrix inversion lemma … applying it to each rank-one term in turn)
We can obtain $\beta_{i+1} = H_{i+1}^{-1}$ so that \(\beta_{i+1}^{BFGS} = \beta_i + \left( 1 + \frac{\langle \Delta g_i, \beta_i \Delta g_i \rangle}{\langle \Delta x_i, \Delta g_i \rangle}\right) \frac{\Delta x_i \Delta x_i^T}{\langle \Delta x_i, \Delta g_i \rangle} - \frac{\Delta x_i \Delta g_i^T \beta_i + \beta_i \Delta g_i \Delta x_i^T}{\langle \Delta x_i, \Delta g_i \rangle}\)
… invented by Broyden, Fletcher, Goldfarb, and Shanno, hence the BFGS method.
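The following numerical sketch (arbitrary data, with $\langle \Delta x_i, \Delta g_i \rangle > 0$ enforced) checks that directly inverting the dual Hessian update agrees with the BFGS formula for $\beta_{i+1}$ above:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
A = rng.standard_normal((n, n))
H_i = A @ A.T + n * np.eye(n)            # current Hessian estimate (SPD)
beta_i = np.linalg.inv(H_i)
dx, dg = rng.standard_normal(n), rng.standard_normal(n)
if dx @ dg < 0:                          # enforce the curvature condition <dx, dg> > 0
    dg = -dg

# Dual (Hessian) update, then invert it directly.
H_next = (H_i + np.outer(dg, dg) / (dg @ dx)
              - np.outer(H_i @ dx, H_i @ dx) / (dx @ H_i @ dx))
beta_direct = np.linalg.inv(H_next)

# BFGS formula for the inverse, in the notation above.
s = dx @ dg
beta_bfgs = (beta_i
             + (1.0 + dg @ beta_i @ dg / s) * np.outer(dx, dx) / s
             - (np.outer(dx, dg @ beta_i) + np.outer(beta_i @ dg, dx)) / s)

print(np.allclose(beta_bfgs, beta_direct))   # True
```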
Broyden Family
\(\beta_{i+1}^{\phi} = \phi_i \beta_{i+1}^{DFP} + (1 - \phi_i) \beta_{i+1}^{BFGS}\)
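A quick check (arbitrary data) that every member of the Broyden family inherits the Quasi-Newton (secant) condition $\beta_{i+1}^{\phi} \Delta g_i = \Delta x_i$, since both $\beta_{i+1}^{DFP}$ and $\beta_{i+1}^{BFGS}$ satisfy it:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4
beta = np.eye(n)
dx, dg = rng.standard_normal(n), rng.standard_normal(n)
if dx @ dg < 0:                          # keep <dx, dg> > 0
    dg = -dg
s = dx @ dg
bdg = beta @ dg

beta_dfp = beta + np.outer(dx, dx) / s - np.outer(bdg, bdg) / (dg @ bdg)
beta_bfgs = (beta + (1.0 + dg @ bdg / s) * np.outer(dx, dx) / s
                  - (np.outer(dx, dg @ beta) + np.outer(bdg, dx)) / s)

for phi in (0.0, 0.3, 1.0):              # any combination satisfies the secant condition
    beta_phi = phi * beta_dfp + (1.0 - phi) * beta_bfgs
    print(np.allclose(beta_phi @ dg, dx))   # True for every phi
```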
Comments