This blog post only records the term definitions and formulas from each chapter that I think need memorizing. The slide deck was written by my advisor, drawing on the strengths of many sources, so the coverage is broader and closer to what research actually requires.

Note: these notes are only meant to organize the knowledge structure as a memo. Some formulas cannot be reproduced perfectly in Markdown, for example bold vectors and matrices, so many quantities in the formulas below are vectors even though they are not shown in bold. Keep this in mind when using the notes, and double-check against the book.

Note: the `|` character in formulas was mangled during the md-to-html conversion, so formulas containing it did not display correctly and the layout became garbled; you can download the corresponding md file and view it offline.

Convex set

  • Affine set : $x_1,x_2 \in C,\theta \in R\Rightarrow x=\theta x_1+(1-\theta) x_2\in C$ (Line)
  • Convex set : $x_1,x_2 \in C,\theta \in [0,1]\Rightarrow x=\theta x_1+(1-\theta) x_2\in C$ (Line segment)
  • Hyperplane : $\{x \mid a^Tx=b\}$
  • Polyhedra : $\{\mathbf{x} \mid \mathbf{A}\mathbf{x}\preceq\mathbf{b},\ \mathbf{C}\mathbf{x}=\mathbf{d}\}$ ($A\in R^{m\times n},C\in R^{p\times n},\preceq$ is componentwise inequality)
  • Euclidean Ball : $B(x_c,r)=\{x \mid \left\| x - x_c \right\|_2 \le r\}=\{x_c+ru \mid \left\| u\right\|_2 \le 1\}$
  • Ellipsoid : $E(x_c,P)=\{x \mid (x-x_c)^TP^{-1}(x-x_c) \le 1\}=\{x_c+Au \mid \left\| u\right\|_2 \le 1\}$, $P\in S^n_{++}$
  • Norm: a function $\left\| \cdot \right\|$ that satisfies
    • $\left\| x \right\| \ge 0$; $\left\| x \right\| = 0$ if and only if $x=0$
    • $\left\| tx \right\| = |t| \left\| x \right\|$ for $t\in R$
    • $\left\| x+y \right\| \le \left\| x \right\|+\left\| y \right\|$
    • Norm ball : $\{x \mid \left\| x-x_c \right\|\le r\}$
    • Norm cone: $\{(x,t)\in R^{n+1} \mid \left\| x \right\|\le t\}$ (Euclidean norm cone, second-order cone, ice-cream cone)
  • Positive semidefinite
    • $S^n$ is set of symmetric $n\times n$ matrices
    • $S^n_+=\{\mathbf{X}\in S^n \mid \mathbf{X} \succeq 0\}$: positive semidefinite $n \times n$ matrices. $\mathbf{X}\in S^n_+\Leftrightarrow z^T \mathbf{X} z \ge 0 {\rm \ for\ all\ } z$
    • $S^n_{++}=\{\mathbf{X}\in S^n \mid \mathbf{X} \succ 0\}$: positive definite $n \times n$ matrices.
  • Operation that Preserve Convexity
    • intersection
    • affine function
    • perspective function
    • linear-fractional function
  • Proper cone:
    • $K$ is closed (contains its boundary)
    • $K$ is solid (has nonempty interior)
    • $K$ is pointed (contains no line: $x\in K,-x \in K\Rightarrow x=0$)
  • Dual Cones $K^*=\{y \mid y^Tx \ge0 {\rm \ for\ all\ }x \in K\}$
    • self-dual
    • $y \succeq_{K^*}0 \Leftrightarrow y^Tx \ge 0 {\rm\ for\ all\ }x \succeq_K0$

Convex function

  • Convex function : $f(\theta x+(1-\theta)y)\le \theta f(x)+(1-\theta)f(y)$ for $0\le\theta\le1$
  • Example on $R^n$
    • Affine function $f(\mathbf{x})=\mathbf{a}^T\mathbf{x}+b$ is convex and concave
    • Norms $\left\|\mathbf{x} \right\|$ are convex
    • Quadratic function $f(\mathbf{x})=\mathbf{x}^T\mathbf{P}\mathbf{x}+2\mathbf{q}^T\mathbf{x}+r$ is convex if and only if $\mathbf{P}\succeq0$
    • Geometric mean $f(\mathbf{x})=(\prod\nolimits_{i = 1}^n x_i )^{1/n}$ is concave
    • log-sum-exp $f(\mathbf{x})=\log \sum_ie^{x_i}$ is convex
    • Quadratic over linear $f(\mathbf{x},y)=\mathbf{x}^T\mathbf{x}/y$ is convex on $R^n \times R_{++}$
  • Epigraph ${\rm epi\ } f =\{(x,t)\in R^{n+1} \mid x\in {\rm dom \ }f, f(x)\le t\}$
  • Restriction of a Convex Function to a Line
    • $f:R^n \rightarrow R$ is convex if and only if the function $g:R \rightarrow R, g(t)=f(\mathbf{x}+t\mathbf{v}), {\rm dom\ }g=\{t \mid \mathbf{x}+t\mathbf{v}\in {\rm dom\ }f\}$ is convex for any $\mathbf{x}\in {\rm dom\ }f,\mathbf{v} \in R^n$
    • arbitrary line
  • First-order condition:$f(\mathbf{y})\ge f(\mathbf{x})+\nabla f(\mathbf{x})^T(\mathbf{y}-\mathbf{x})\ \ \ \forall \mathbf{x},\mathbf{y}\in {\rm dom} f$
  • Second-order condition: $\nabla^2f(\mathbf{x})\succeq0 \ \ \ \forall \mathbf{x}\in {\rm dom} f $
  • Operations that preserve convexity
    • nonnegative weighted sum $\alpha_1f_1+\alpha_2f_2$
    • composition with affine function $f(\mathbf{A}\mathbf{x}+\mathbf{b})$
    • composition with scalar function $f(x)=h(g(\mathbf{x}))$
    • pointwise maximum $f:=\max\{f_1,\cdots,f_n\}$
    • supremum, infimum $g(x)=\mathop {\rm sup}\limits_{\mathbf{y}\in A}f(x,y),\ g(x)=\mathop {\rm inf}\limits_{\mathbf{y}\in A}f(x,y)$
    • perspective $g(x,t) =tf(x/t), {\rm dom\ } g= \{(x,t)\in R^{n+1} \mid x/t \in {\rm dom\ }f,t >0\}$
  • Quasi-convexity : the sublevel sets $S_\alpha=\{\mathbf{x}\in {\rm dom\ }f \mid f(\mathbf{x})\le \alpha\}$ are all convex
  • Log-convexity : $\log f$ is convex

Convex Optimization Problems

  • stationary point $\nabla f(\mathbf{x})=0$
    • local minimum and global minimum
    • Saddle point: if $\mathbf{x}$ is a stationary point and for any neighborhood $B \subseteq R^n$ there exist $\mathbf{y},\mathbf{z} \in B$ such that $f(\mathbf{z})\le f(\mathbf{x})\le f(\mathbf{y})$ and $\lambda_{\min}(\nabla^2f(x))\le0$
  • Convex Optimization problem
    • $\begin{array}{ll} {\rm{minimize}} & f_0(\mathbf{x}) \\ {\rm{subject\ to}} & f_i(\mathbf{x}) \le 0,\quad i = 1, \cdots ,m \\ & \mathbf{Ax} = \mathbf{b} \end{array}$
    • $f_0,f_1,\cdots,f_m$ are convex and equality constraints are affine
  • Optimality Criteria
    • Minimum principle: $\nabla f_0(\mathbf{x}^*)^T(\mathbf{y}-\mathbf{x}^*)\ge0$ for all feasible $\mathbf{y}$
    • Unconstrained problem: $\nabla f_0(\mathbf{x}^*)=0,\ \mathbf{x}^*\in {\rm dom\ }f$
    • Equality constrained problem: $\mathbf{A}\mathbf{x}^*=\mathbf{b},\ \nabla f_0(\mathbf{x}^*)+\mathbf{A}^T\mathbf{v}=0$
    • minimization over nonnegative orthant: $x \succeq0,\ \left\{ \begin{array}{ll} \nabla_i f_0(x) \ge 0 & x_i = 0 \\ \nabla_i f_0(x) = 0 & x_i > 0 \end{array} \right.$
  • Equivalent Reformulations
    • Introducing slack variables for linear inequalities: $\begin{array}{ll} \mathop{\rm{minimize}}\limits_{x,s} & f_0(x) \\ {\rm{subject\ to}} & \mathbf{a}_i^T\mathbf{x} +s_i=b_i,\quad i = 1, \cdots ,m \\ & s_i\ge0 \end{array}$
    • epigraph form: $\begin{array}{ll} \mathop{\rm{minimize}}\limits_{x,t} & t \\ {\rm{subject\ to}} & f_0(x)-t\le0 \\ & f_i(\mathbf{x}) \le 0,\quad i = 1, \cdots ,m \\ & \mathbf{Ax} = \mathbf{b} \end{array}$
    • minimizing over some variables: $\begin{array}{ll} \mathop{\rm{minimize}}\limits_{x} & {\tilde f}_0(\mathbf{x}) \\ {\rm{subject\ to}} & f_i(\mathbf{x}) \le 0,\quad i = 1, \cdots ,m \end{array}$ where ${\tilde f}_0(\mathbf{x})={\rm inf}_y f_0(\mathbf{x},\mathbf{y})$
    • Quasi-convex optimization: $f_0$ is quasiconvex, and $f_1,\cdots,f_m$ are convex
  • Classes of Convex Problem
    • LP(Linear Programming) : objective and constraint functions are affine
    • QP(Quadratic Programming): convex quadratic objective and affine constraint function
    • QCQP(Quadratically Constrained Quadratic Programming): QP whose inequality constraint functions are also convex quadratic
    • SOCP(Second-Order Cone Programming): linear objective and second-order cone inequality constraints $\left\| A_ix+b_i\right\|_2\le c_i^Tx+d_i,\ \ i=1,\cdots,m$
    • SDP(Semi-Definite Programming): $\begin{array}{ll} \mathop{\rm{minimize}}\limits_{x} & \mathbf{c}^T\mathbf{x} \\ {\rm{subject\ to}} & x_1\mathbf{F}_1+\cdots+x_n\mathbf{F}_n\preceq\mathbf{G} \\ & \mathbf{A}\mathbf{x}=\mathbf{b} \end{array}$

Lagrange Duality

  • Primal problem: $\begin{array}{ll} \mathop{\rm{minimize}}\limits_{x} & f_0(\mathbf{x}) \\ {\rm{subject\ to}} & f_i(x)\le0,\quad i = 1, \cdots ,m \\ & h_i(\mathbf{x}) = 0,\quad i = 1, \cdots ,p \end{array}$

  • Lagrangian $L(\mathbf{x},\mathbf{\lambda},\mathbf{v})=f_0(\mathbf{x})+\sum\limits_{i = 1}^m \lambda_i f_i(\mathbf{x})+\sum\limits_{i = 1}^p v_i h_i(\mathbf{x})$

  • Dual function: $g(\mathbf{\lambda},\mathbf{v})=\mathop{\rm inf}\limits_{x\in D}L(\mathbf{x},\mathbf{\lambda},\mathbf{v})$, concave

    $f_0(x)\ge L(\mathbf{x},\mathbf{\lambda},\mathbf{v}) \ge g(\mathbf{\lambda},\mathbf{v})$

  • Lagrange dual problem: $\begin{array}{ll} \mathop{\rm{maximize}}\limits_{\mathbf{\lambda},\mathbf{v}} & g(\mathbf{\lambda},\mathbf{v}) \\ {\rm{subject\ to}} & \mathbf{\lambda}\succeq 0 \end{array}$

  • Duality

    • weak duality: $d^*\le p^*$ (the optimal value of the dual problem is no larger than that of the primal problem)
    • strong duality: $d^*= p^*$; very desirable, does not hold in general, usually holds for convex problems
    • duality gap: $p^*-d^*$
  • Slater's Constraint Qualification: a condition that guarantees strong duality for convex problems

    • $\exists x \in {\rm int\ }D$ such that $f_i(x)<0,\ i=1,\cdots,m$
  • Complementary slackness: $\lambda_i^*f_i(x^*)=0$

  • KKT condition

    1. primal feasibility: $f_i(x)\le0,i=1,\cdots,m,\ h_i(x)=0,i=1,\cdots,p$
    2. dual feasibility: $\lambda\succeq0$
    3. complementary slackness: $\lambda_i^*f_i(x^*)=0$ for $i=1,\cdots,m$
    4. zero gradient of Lagrangian with respect to $\mathbf{x}$ : $\nabla f_0(\mathbf{x})+\sum\limits_{i = 1}^m \lambda_i\nabla f_i(\mathbf{x})+\sum\limits_{i = 1}^p v_i\nabla h_i(\mathbf{x})=0$
    • We already know that if strong duality holds and $x,\lambda,v$ are optimal, then they must satisfy the KKT conditions.
    • If $x^*,\lambda^*,v^*$ satisfy the KKT conditions for a convex problem, then they are optimal.
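
A quick numerical illustration of the KKT conditions (my own toy example, not from the slides): solve a small inequality-constrained QP with CVXPY, assuming it is installed, and check complementary slackness and the zero-gradient condition using the dual variables returned by the solver.

```python
# Hypothetical QP: minimize 1/2 ||x||^2 - q^T x  subject to  G x <= h, then verify the KKT conditions.
import cvxpy as cp
import numpy as np

np.random.seed(0)
n, m = 4, 3
q = np.random.randn(n)
G = np.random.randn(m, n)
h = np.abs(np.random.randn(m))

x = cp.Variable(n)
constraints = [G @ x <= h]
prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(x) - q @ x), constraints)
prob.solve()

x_star, lam = x.value, constraints[0].dual_value       # primal and dual optimal points
print("primal feasibility :", np.all(G @ x_star <= h + 1e-8))
print("dual feasibility   :", np.all(lam >= -1e-8))
print("compl. slackness   :", lam * (G @ x_star - h))  # ~ 0 entrywise
print("stationarity       :", x_star - q + G.T @ lam)  # gradient of the Lagrangian ~ 0
```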

Constructive Convex Analysis and Disciplined Convex Programming

  • Conic program: $\begin{array}{ll} {\rm{minimize}} & c^Tx \\ {\rm{subject\ to}} & Ax=b,\ x\in K \end{array}$, where $K$ is a convex cone.

    • the modern canonical form
    • there are well developed solvers for cone programs

    How to solve a convex problem:

    1. use an existing, customized solver for the specific problem
    2. develop a new solver for your problem based on existing popular methods
    3. transform your problem into a cone program (CVX), then use a standard cone-program solver (SDPT3, MOSEK)
  • Basic examples

    | Convex | Concave |
    | --- | --- |
    | $x^p$ ($p\ge1$ or $p\le0$) | $x^p$ ($0\le p\le1$) |
    | $e^x$ | $\log x$ |
    | $x\log x$ | $\sqrt{xy}$ |
    | $a^Tx+b$ | $a^Tx+b$ |
    | $x^TPx$ ($P\succeq0$) | $x^TPx$ ($P\preceq0$) |
    | $\Vert x\Vert$ (any norm) | $\min\{x_1,\dots,x_n\}$ |
    | $\max\{x_1,\dots,x_n\}$ | $\log\det X$, $(\det X)^{1/n}$ ($X\succ 0$) |
    | $x^2/y$ ($y>0$), $x^Tx/y$ ($y>0$), $x^TY^{-1}x$ ($Y\succ0$) | $\log\Phi(x)$ ($\Phi$ is the Gaussian CDF) |
    | $\log(e^{x_1}+\cdots+e^{x_n})$ | $\lambda_{\min}(X)$ ($X^T=X$) |
    | $f(x)=x_{[1]}+\cdots+x_{[k]}$ (sum of largest $k$ entries) |  |
    | $f(x,y)=x\log(x/y)$ ($x,y>0$) |  |
    | $\lambda_{\max}(X)$ ($X^T=X$) |  |
  • Calculus rules

    • nonnegative scaling: $\alpha f$
    • sum : $f+g$
    • affine composition : $f(Ax+b)$
    • pointwise maximum : $\max_if_i(x)$
    • composition: $h(f(x))$, $h$ convex increasing, $f$ convex
  • Disciplined convex programing(DCP)

    A framework for describing convex optimization problems; based on constructive convex analysis; the DCP rules are sufficient but not necessary for convexity; implemented via domain specific languages (DSLs) and tools

    • zero or one objective, with form
      • minimize {scalar convex expression} or
      • maximize {scalar convex expression}
    • zero or more constraints, with form
      • {convex expression} <= {concave expression} or
      • {concave expression} >= {convex expression} or
      • {affine expression} == {affine expression}

    It is easy to build a DCP analyzer; it is not hard to transform a DCP problem into a cone problem

  • Modeling languages

    • CVX Matlab Grant,Boyd 2005
    • CVXPY Python Diamond, Boyd 2013

    The slides contain example code, which is not repeated here.
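
For concreteness, a minimal CVXPY sketch of a DCP-compliant problem (my own toy example rather than the slides' code); the objective follows the "minimize {scalar convex expression}" form and the constraint follows the "{convex expression} <= {concave expression}" rule.

```python
# Toy DCP example (assumes cvxpy and numpy are installed): least squares with an l1-norm constraint.
import cvxpy as cp
import numpy as np

np.random.seed(1)
A, b = np.random.randn(30, 10), np.random.randn(30)

x = cp.Variable(10)
objective = cp.Minimize(cp.sum_squares(A @ x - b))   # scalar convex expression
constraints = [cp.norm1(x) <= 2]                     # convex expression <= concave (constant) expression
prob = cp.Problem(objective, constraints)
print(prob.is_dcp())    # True: the problem passes the DCP rules
prob.solve()            # CVXPY canonicalizes it to a cone program and calls a solver
print(prob.value)
```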

Gradient method

Gradient methods for unconstrained problems

  • Gradient descent (GD) $x^{t+1}=x^{t}-\eta_t\nabla f(x^t)$, a.k.a. steepest descent

  • Exact Line Search : $\eta_t=\arg \mathop \min \limits_{\eta \ge 0} f(x^t- \eta \nabla f(x^t))$

  • strongly convex and smooth
    • A twice-differentiable function $f$ is said to be $\mu$-strongly convex and $L$-smooth if $0 \preceq \mu I \preceq\nabla^2f(x)\preceq LI, \forall x$
    • more on strong convexity
      1. $f(y)\ge f(x)+\nabla f(x)^T(y-x)+\frac{\mu}{2}\left\|x-y \right\|^2_2, \forall x,y$
      2. $f(\lambda x+(1-\lambda)y)\le \lambda f(x)+(1-\lambda)f(y)-\frac{\mu}{2}\lambda(1-\lambda)\left\|x-y \right\|^2_2, \forall x,y,0\le\lambda\le1$
      3. $\langle \nabla f(x)-\nabla f(y),x-y\rangle \ge\mu\left\|x-y \right\|^2_2, \forall x,y$
      4. $\nabla^2 f(x)\succeq \mu I, \forall x$
    • more on smoothness
      1. $f(y)\le f(x)+\nabla f(x)^T(y-x)+\frac{L}{2}\left\|x-y \right\|^2_2, \forall x,y$
      2. $f(\lambda x+(1-\lambda)y)\ge \lambda f(x)+(1-\lambda)f(y)-\frac{L}{2}\lambda(1-\lambda)\left\|x-y \right\|^2_2, \forall x,y,0\le\lambda\le1$
      3. $\langle \nabla f(x)-\nabla f(y),x-y\rangle \ge \frac{1}{L}\left\|\nabla f(x)-\nabla f(y)\right\|^2_2, \forall x,y$
      4. $\left\|\nabla f(x)-\nabla f(y)\right\|_2\le L\left\|x-y \right\|_2, \forall x,y$ (L-Lipschitz gradient)
      5. $\left\| \nabla^2 f(x)\right\| \le L, \forall x$
  • Backtracking line search

    1. Initialize $\eta=1,0<\alpha\le1/2,0<\beta<1$
    2. while $f(x^t-\eta\nabla f(x^t))>f(x^t)-\alpha\eta\left\| \nabla f(x^t)\right\|^2_2$ do:
    3. $\eta \leftarrow\beta\eta$
    • $f(x^t)-f(x^*) \le \left( 1-\min\{2\mu\alpha,2\beta\alpha\mu/L\}\right)^{t}(f(x^0)-f(x^*))$
  • Local strong convexity

    Let $f$ be locally $\mu$-strongly convex and $L$-smooth such that $\mu I \preceq\nabla^2f(x)\preceq LI, \forall x\in B_0$, where $B_0:=\{x \mid \left\|x-x^*\right\|_2\le \left\|x^0-x^*\right\|_2\}$
  • Regularity condition: $\langle \nabla f(x),x-x^*\rangle \ge\frac{\mu}{2}\left\|x-x^* \right\|^2_2+\frac{1}{2L}\left\|\nabla f(x)\right\|^2_2, \forall x$

    • $\left\|x^t-x^*\right\|^2_2\le \left( 1- \frac{\mu}{L}\right)^t \left\|x^0-x^*\right\|^2_2$
| problem | algorithm | stepsize rule | convergence rate |
| --- | --- | --- | --- |
| Quadratic minimization ${\rm minimize}_x\ f(x):=\frac{1}{2}(x-x^*)^TQ(x-x^*)$ | GD | $\eta_t=\frac{2}{\lambda_1(Q)+\lambda_n(Q)}$ (constant stepsize) | $\Vert x^t-x^*\Vert_2\le \left( \frac{\lambda_1(Q)-\lambda_n(Q)}{\lambda_1(Q)+\lambda_n(Q)}\right)^t \Vert x^0-x^*\Vert_2$ |
| Quadratic minimization ${\rm minimize}_x\ f(x):=\frac{1}{2}(x-x^*)^TQ(x-x^*)$ | GD | exact line search | $f(x^t)-f(x^*) \le \left( \frac{\lambda_1(Q)-\lambda_n(Q)}{\lambda_1(Q)+\lambda_n(Q)}\right)^{2t}(f(x^0)-f(x^*))$ |
| Strongly convex and smooth functions | GD | $\eta_t=\frac{2}{\mu+L}$ (constant stepsize) | $\Vert x^t-x^*\Vert_2\le \left( \frac{\kappa-1}{\kappa+1}\right)^t \Vert x^0-x^*\Vert_2$, $\kappa:=L/\mu$ |
| Convex and smooth functions | GD | $\eta_t=\frac{1}{L}$ | $f(x^t)-f(x^*) \le \frac{2L\Vert x^0-x^*\Vert_2^2}{t}$ |
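
A minimal sketch of gradient descent with the backtracking line search above, on a made-up least-squares objective (my own example; the data and parameters are arbitrary).

```python
# Gradient descent with backtracking line search on f(x) = 1/2 ||Ax - b||_2^2.
import numpy as np

np.random.seed(0)
A, b = np.random.randn(50, 10), np.random.randn(50)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)

x = np.zeros(10)
alpha, beta = 0.25, 0.5
for t in range(200):
    g = grad(x)
    eta = 1.0
    # shrink eta until the sufficient-decrease condition holds
    while f(x - eta * g) > f(x) - alpha * eta * np.sum(g ** 2):
        eta *= beta
    x = x - eta * g
print(f(x))   # close to the least-squares optimum
```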

Gradient methods for constrained problems

  • Frank-Wolfe/ conditional gradient algorithm

    1. for $t=0,1,\cdots$ do
    2. ​ $y^t:=\arg\min_{x\in C}\langle \nabla f(x^t),x \rangle$ (direction finding)
    3. ​ $x^{t+1}=(1-\eta_t)x^t+\eta_ty^t$ (line search and update)
    • stepsize $\eta_t$ determined by line search or $\eta_t=\frac{2}{t+2}$
    • Frank-Wolfe can also be applied to nonconvex problems
  • Optimizing over Atomic Sets

    • Gauge function $U_D(x):={\rm inf}_{t\ge0}\{t \mid x\in tD\}$
    • Support function $U^*_D(y):={\rm sup}_{s\in D}\langle s,y \rangle$
  • Projected gradient methods

    1. for $t=0,1,\cdots$ do

    2. ​ $x^{t+1}=\mathcal{P}_{\mathcal{C} }(x^t-\eta_t\nabla f(x^t))$

      where $\mathcal{P}_{\mathcal{C}}(x)=\arg\min_{z\in \mathcal{C}}\left\|x-z\right\|_2$ is the Euclidean projection onto $\mathcal{C}$

    • Projection theorem
      • Let $\mathcal{C}$ be closed convex set, Then $x_\mathcal{C}$ is projection of $x$ onto $\mathcal{C}$ iff $(x-x_\mathcal{C})^T(z-x_\mathcal{C})\le0,\forall z\in \mathcal{C}$
      • $-\nabla f(x^t)^T(x^{t+1}-x^t)\ge 0$
      • Nonexpansiveness of projection: $\left\|\mathcal{P}_{\mathcal{C}}(x)- \mathcal{P}_{\mathcal{C}}(z)\right\|_2\le \left\|x-z\right\|_2$
| problem | algorithm | stepsize rule | convergence rate | iteration complexity |
| --- | --- | --- | --- | --- |
| convex & smooth | Frank-Wolfe | $\eta_t=\frac{2}{t+2}$ | $f(x^t)-f(x^*) \le \frac{2Ld_C^2}{t+2}$, where $d_C={\rm sup}_{x,y\in C}\Vert x-y\Vert_2$ | $O(\frac{1}{\epsilon})$ |
| strongly convex & smooth ($x^*\in {\rm int}(\mathcal{C})$) | Projected GD | $\eta_t=\frac{2}{\mu+L}$ | $\Vert x^t-x^*\Vert_2\le \left( \frac{\kappa-1}{\kappa+1}\right)^t \Vert x^0-x^*\Vert_2$, $\kappa:=L/\mu$ |  |
| strongly convex & smooth | Projected GD | $\eta_t=\frac{1}{L}$ | $\Vert x^t-x^*\Vert_2\le \left( 1- \frac{\mu}{L}\right)^t \Vert x^0-x^*\Vert_2$, i.e. $O((1-\frac{1}{\kappa})^t)$ | $O(\kappa\log\frac{1}{\epsilon})$ |
| convex & smooth | Projected GD | $\eta_t=\frac{1}{L}$ | $f(x^t)-f(x^*) \le \frac{3L\Vert x^0-x^*\Vert_2^2+f(x^0)-f(x^*)}{t+1}$, i.e. $O(\frac{1}{t})$ | $O(\frac{1}{\epsilon})$ |
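
A small sketch of the Frank-Wolfe method above when $C$ is the $\ell_1$ ball $\{x:\Vert x\Vert_1\le R\}$ (my own example): the direction-finding step has a closed form, namely a signed vertex of the ball, so no projection is ever needed.

```python
# Frank-Wolfe for min f(x) = 1/2 ||Ax - b||^2 over the l1 ball of radius R.
import numpy as np

np.random.seed(0)
A, b, R = np.random.randn(40, 20), np.random.randn(40), 1.0
grad = lambda x: A.T @ (A @ x - b)

x = np.zeros(20)
for t in range(300):
    g = grad(x)
    # direction finding: argmin_{||y||_1 <= R} <g, y> is the vertex -R * sign(g_i) * e_i, i = argmax |g_i|
    i = np.argmax(np.abs(g))
    y = np.zeros(20)
    y[i] = -R * np.sign(g[i])
    eta = 2.0 / (t + 2)              # standard Frank-Wolfe stepsize
    x = (1 - eta) * x + eta * y      # convex combination stays inside C
print(np.sum(np.abs(x)), 0.5 * np.sum((A @ x - b) ** 2))
```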

Subgradient methods

  • generalizing steepest descent

    1. $d^t\in \arg\min_{\left\|d \right\|_2\le1}f'(x^t,d)$, where $f'(x^t,d)=\lim_{\alpha \downarrow0} \frac{f(x^t+\alpha d)-f(x^t)}{\alpha}$
    2. $x^{t+1}=x^t+\eta_td^t$

    Issues: 1. finding the steepest descent direction involves expensive computation; 2. the stepsize rule is hard to choose, and the method may fail to converge to the optimum, especially at non-differentiable points.

  • (projected) subgradient method

    Mainly introduced to handle points where the function is not differentiable.

    • $x^{t+1}=\mathcal{P}_{\mathcal{C} }(x^t-\eta_tg^t)$ , where $g^t$ is any subgradient of $f$ at $x^t$
  • Subgradient

    • We say $g$ is subgradient of $f$ at point $x$ if $f(z)\ge f(x)+g^T(z-x), \forall z$
      • set of all subgradients of $f$ at $x$ is called subdifferential of $f$ at $x$, denoted by $\partial f(x)$
    • Basic rules
      • scaling: $\partial (\alpha f)=\alpha\partial f$ (for $\alpha>0$)
      • summation: $\partial (f_1+f_2)=\partial f_1+\partial f_2$
      • affine transformation : $\partial (f(Ax+b))=A^T\partial f(Ax+b)$
      • chain rule: $\partial (g\circ f)(x)=g'(f(x))\partial f(x)$
      • composition: suppose $f(x)=h(f_1(x),\cdots,f_n(x))$, $q=\nabla h(y)\big|_{y=[f_1(x),\cdots,f_n(x)]}$, $g_i\in \partial f_i(x)$; then $q_1g_1+\cdots+q_ng_n\in \partial f(x)$
      • pointwise maximum: $f(x)=\max_{1\le i\le k}f_i(x)$, then $\partial f(x)={\rm conv}\{ \cup \{\partial f_i(x) \mid f_i(x)=f(x)\}\}$
      • pointwise supremum : $f(x)=\sup_{\alpha\in \mathcal{F}}f_\alpha(x)$, then $\partial f(x)={\rm closure}({\rm conv}\{ \cup \{\partial f_\alpha(x) \mid f_\alpha(x)=f(x)\}\})$
    • a negative subgradient is not necessarily a descent direction (lack of continuity)
      • $f^{ {\rm best},t}:=\mathop \min\limits_{1\le i\le t} f(x^i)$
      • $f^{\rm opt}:=\min_x f(x)$
  • Lipschitz function : $|f(x)-f(z)| \le L_f\|x-z\|_2, \forall x,z$
  • Projected subgradient update:$x^{t+1}=\mathcal{P}_{\mathcal{C} }(x^t-\eta_tg^t)$

  • majorizing function: $\|x^{t+1}-x^*\|_2^2\le\|x^{t}-x^*\|_2^2-2\eta_t(f(x^t)-f^{\rm opt})+\eta^2_t\|g^t\|_2^2$

    • Polyak's stepsize rule: $\eta_t=\frac{f(x^t)-f^{\rm opt}}{\|g^t\|_2^2}\Rightarrow \|x^{t+1}-x^*\|_2^2\le\|x^{t}-x^*\|_2^2-\frac{(f(x^t)-f^{\rm opt})^2}{\|g^t\|_2^2}$

    • requires knowing $f^{\rm opt}$, so it is not practical

| problem | algorithm | stepsize rule | convergence rate | iteration complexity |
| --- | --- | --- | --- | --- |
| convex & Lipschitz problem | projected subgradient method | $\eta_t=\frac{1}{\sqrt{t}}$ | $f^{{\rm best},t}-f^{\rm opt}\lesssim \frac{\Vert x^0-x^*\Vert_2^2+L_f\log t}{\sqrt{t}}$, i.e. $O(\frac{1}{\sqrt{t}})$ | $O(\frac{1}{\epsilon^2})$ |
| strongly convex & Lipschitz problem | projected subgradient method | $\eta_t=\frac{2}{\mu(t+1)}$ | $f^{{\rm best},t}-f^{\rm opt}\le \frac{2L_f^2}{\mu}\cdot\frac{1}{t+1}$, i.e. $O(\frac{1}{t})$ | $O(\frac{1}{\epsilon})$ |
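
A sketch of the subgradient method on a nonsmooth objective, here the unconstrained problem $\min_x \Vert Ax-b\Vert_1$ (my own example): $A^T{\rm sign}(Ax-b)$ is a valid subgradient, and $f^{{\rm best},t}$ tracks the best value seen because the iterates do not decrease monotonically.

```python
# Subgradient method for min ||Ax - b||_1 with diminishing stepsize eta_t = 1/sqrt(t+1).
import numpy as np

np.random.seed(0)
A, b = np.random.randn(60, 20), np.random.randn(60)
f = lambda x: np.sum(np.abs(A @ x - b))

x = np.zeros(20)
f_best = np.inf
for t in range(2000):
    g = A.T @ np.sign(A @ x - b)     # a subgradient of ||Ax - b||_1 at x
    x = x - g / np.sqrt(t + 1)
    f_best = min(f_best, f(x))       # keep f^{best,t}
print(f_best)
```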

Proximal gradient methods

  • composite models : $\begin{array}{ll} {\rm{minimize}}_x & F(x):=f(x)+h(x) \\ {\rm{subject\ to}} & x\in R^n \end{array}$, $f$ convex and smooth, $h$ convex; $F^{\rm opt}:=\min_x F(x)$

    • $\ell_1$ regularized minimization ${\rm{minimize}}_x\ f(x)+\|x\|_1$: use $\ell_1$ regularization to promote sparsity.
    • nuclear norm regularized minimization ${\rm{minimize}}_x\ f(x)+\|x\|_*$: use nuclear norm regularization to promote low-rank structure.
  • Proximal gradient descent

    • $x^{t+1}=\arg \min_x \left\{ f(x^t)+\langle \nabla f(x^t),x-x^t\rangle+\frac{1}{2\eta_t}\|x-x^t\|_2^2\right\}$
  • projected proximal gradient descent

    • $x^{t+1}=\arg \min_x \left\{ f(x^t)+\langle \nabla f(x^t),x-x^t\rangle+\frac{1}{2\eta_t}\|x-x^t\|_2^2+\mathbb{1}_{\mathcal{C}}(x)\right\}$

      $=\arg \min _x \left\{ \frac{1}{2}\|x-(x^t-\eta_t \nabla f(x^t))\|_2^2+\eta_t \mathbb{1}_{\mathcal{C}}(x) \right\}$

      where $\mathbb{1}_\mathcal{C}(x)=\left\{ \begin{array}{ll} 0 & {\rm{if\ }}x \in \mathcal{C} \\ \infty & {\rm else} \end{array} \right.$

  • proximal operator

    • ${\rm prox}_h(x):=\arg\min_z \left\{ \frac{1}{2}\|z-x\|_2^2 +h(z)\right\}$ , for any convex function $h$, which may be non-smooth

    • Projected GD update $x^{t+1}={\rm prox}_{\eta_t \mathbb{1}_{\mathcal{C}}}(x^t-\eta_t \nabla f(x^t))$

    • accommodate more general $h$

      1. for $t=0,1,\cdots$ do
      2. $x^{t+1}={\rm prox}_{\eta_t h}(x^t-\eta_t \nabla f(x^t))$
    • well defined in fairly general settings (e.g. non-smooth convex functions); can be evaluated efficiently for commonly used functions (e.g. regularizers); a simple concept that covers many well-known optimization algorithms

    • basic rules

      • if $f(x)=ag(x)+b$ with $a>0$, then ${\rm prox}_f(x)={\rm prox}_{ag}(x)$
      • affine addition: $f(x)=g(x)+a^Tx+b$ , then ${\rm prox}_f(x)={\rm prox}_{g}(x-a)$
      • quadratic addition: $f(x)=g(x)+\frac{\rho}{2}\|x-a\|_2^2$, then ${\rm prox}_f(x)={\rm prox}_{\frac{1}{1+\rho}g}(\frac{1}{1+\rho}x+\frac{\rho}{1+\rho}a)$
      • scaling and translation : $f(x)=g(ax+b)$ with $a \ne 0$ , then ${\rm prox}_f(x)=\frac{1}{a}({\rm prox}_{a^2g}(ax+b)-b)$
      • orthogonal mapping: $f(x)=g(Qx)$ with $Q^TQ=QQ^T=I$, then ${\rm prox}_f(x)=Q^T{\rm prox}_{g}(Qx)$
      • orthogonal affine mapping: $f(x)=g(Qx+b)$ with $QQ^T=\alpha^{-1}I$, then ${\rm prox}_f(x)=(I-\alpha Q^TQ)x+\alpha Q^T({\rm prox}_{\alpha^{-1}g}(Qx+b)-b)$
      • norm composition: $f(x)=g(\|x\|_2)$, then ${\rm prox}_f(x)={\rm prox}_{g}(\|x\|_2)\frac{x}{\|x\|_2}, \forall x \ne 0$
    • firm nonexpansiveness: $\langle {\rm prox}_h(x_1)-{\rm prox}_{h}(x_2),x_1-x_2\rangle \ge \|{\rm prox}_h(x_1)-{\rm prox}_{h}(x_2)\|^2_2$

    • nonexpansiveness: $\|{\rm prox}_h(x_1)-{\rm prox}_{h}(x_2)\|_2\le \|x_1-x_2\|_2$

    • interpret prox via the resolvent of the subdifferential operator: $z={\rm prox}_f(x) \Leftrightarrow z=(\mathcal{I}+ \partial f)^{-1}(x)$ , where $\mathcal{I}$ is the identity mapping ($\mathcal{I}(z)=z$)

    • Moreau decomposition

      • Suppose $f$ is closed convex, and $f^*(x):= \sup_z\{ \langle x,z\rangle -f(z)\}$ is the convex conjugate of $f$, then $x={\rm prox}_f(x)+{\rm prox}_{f^*}(x)$

      the key link between proximal mappings and duality

      generalization of orthogonal decomposition

      • $x=\mathcal{P}_\mathcal{K}(x)+\mathcal{P}_{\mathcal{K}^{\circ}}(x)$, where $\mathcal{K}^{\circ}:=\{x \mid \langle x,z\rangle \le 0, \forall z \in \mathcal{K}\}$
      • extended Moreau decomposition

        Suppose $f$ is closed convex and $\lambda >0$, then $x={\rm prox}_{\lambda f}(x)+\lambda {\rm prox}_{\frac{1}{\lambda}f^*}(x/\lambda)$

    • the slides list prox decompositions for many functions; I have not gone through them carefully yet

  • Backtracking line search for proximal gradient methods

    Let $\tau_L(x):={\rm prox}_{\frac{1}{L}h}(x-\frac{1}{L}\nabla f(x))$

    1. Initialize $\eta=1,0<\alpha\le 1/2,0<\beta<1$
    2. while $f(\tau_{L_t}(x^t))>f(x^t)-\langle \nabla f(x^t),x^t-\tau_{L_t}(x^t)\rangle+\frac{L_t}{2}\|\tau_{L_t}(x^t)-x^t\|_2^2$ do
    3. $L_t\leftarrow \frac{1}{\beta}L_t$ (or $\eta_t\leftarrow \beta \eta_t$)

    here, $\frac{1}{L_t}$ corresponds to $\eta_t$, and $\tau_{L_t}(x^t)$ generalizes $x^{t+1}$
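
As a small numerical check of the prox facts above (my own sketch): for $f=\Vert\cdot\Vert_1$, ${\rm prox}_f$ is soft-thresholding, the conjugate $f^*$ is the indicator of the $\ell_\infty$ unit ball whose prox is the projection onto that ball, and the two prox outputs sum back to $x$ as the Moreau decomposition predicts.

```python
# Moreau decomposition check for f = ||.||_1:  x = prox_f(x) + prox_{f*}(x).
import numpy as np

def prox_l1(x):
    # prox of ||.||_1 is soft-thresholding with threshold 1
    return np.sign(x) * np.maximum(np.abs(x) - 1.0, 0.0)

def prox_l1_conj(x):
    # f* is the indicator of {z : ||z||_inf <= 1}; its prox is projection onto that box
    return np.clip(x, -1.0, 1.0)

np.random.seed(0)
x = 3 * np.random.randn(5)
print(np.allclose(x, prox_l1(x) + prox_l1_conj(x)))   # True
```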

| problem | algorithm | stepsize rule | convergence rate | iteration complexity |
| --- | --- | --- | --- | --- |
| convex & smooth problem | proximal GD | $\eta_t=\frac{1}{L}$ | $F(x^t)-F^{\rm opt}\le \frac{L\Vert x^0-x^*\Vert_2^2}{2t}$, i.e. $O(\frac{1}{t})$ | $O(\frac{1}{\epsilon})$ |
| strongly convex & smooth problem | proximal GD | $\eta_t=\frac{1}{L}$ | $\Vert x^t-x^*\Vert_2^2\le \left( 1- \frac{\mu}{L}\right)^t \Vert x^0-x^*\Vert_2^2$, i.e. $O((1-\frac{1}{\kappa})^t)$ | $O(\kappa\log\frac{1}{\epsilon})$ |
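
A sketch of proximal gradient descent for the lasso, $F(x)=\frac{1}{2}\Vert Ax-b\Vert_2^2+\lambda\Vert x\Vert_1$ (my own example, often called ISTA); the prox of $\eta\lambda\Vert\cdot\Vert_1$ is soft-thresholding, so each iteration is a gradient step followed by shrinkage.

```python
# Proximal GD (ISTA) for the lasso with constant stepsize eta = 1/L.
import numpy as np

np.random.seed(0)
A, b, lam = np.random.randn(50, 100), np.random.randn(50), 0.1
L = np.linalg.norm(A, 2) ** 2                     # Lipschitz constant of the gradient of f
soft = lambda z, t: np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

x = np.zeros(100)
for t in range(500):
    grad = A.T @ (A @ x - b)
    x = soft(x - grad / L, lam / L)               # x <- prox_{(lam/L)||.||_1}(x - grad/L)
print(np.count_nonzero(x))                        # the l1 term promotes sparsity
```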

Accelerated gradient methods

Issues: 1. GD focuses on decreasing the objective at every single iteration, which can be myopic; 2. GD sometimes exhibits zigzagging transients.

Remedies: 1. exploit information from past iterates; 2. add a buffer (e.g. momentum) to produce smoother trajectories.

  • Heavy-ball method

    • $x^{t+1}=x^{t}-\eta_t \nabla f(x^t)+\theta_t(x^t-x^{t-1})$
    • add inertia to the “ball ” (i.e. include momentum term) to mitigate zigzagging
    • System matrix: $\left[ \begin{array}{c} x^{t+1}-x^* \\ x^t-x^* \end{array} \right] = \left[ \begin{array}{cc} (1+\theta_t)\mathbf{I}-\eta_t\int_0^1 \nabla^2 f(x_\tau){\rm d}\tau & -\theta_t\mathbf{I} \\ \mathbf{I} & 0 \end{array} \right]\left[ \begin{array}{c} x^t-x^* \\ x^{t-1}-x^* \end{array} \right]$
  • Nesterov’s accelerated gradient methods

    • $x^{t+1}=y^t-\eta_t\nabla f(y^t)$

      $y^{t+1}=x^{t+1}+\frac{t}{t+3}(x^{t+1}-x^t)$

    • alternates between gradient updates and proper extrapolation (not a descent method(i.e. may not have $f(x^{t+1})\le f(x^t)$ ))

    • one of most beautiful and mysterious results in optimization

    • can be interpreted via an ordinary differential equation; see the slides for details

  • FISTA (Fast iterative shrinkage-thresholding algorithm(快速迭代收缩阈值算法))

    • $x^{t+1}={\rm prox}_{\eta_t h}(y^t-\eta_t\nabla f(y^t))$

      $y^{t+1}=x^{t+1}+\frac{\theta_t-1}{\theta_{t+1} }(x^{t+1}-x^t)$

      where $y^0=x^0,\theta_0=1$ and $\theta_{t+1}=\frac{1+\sqrt{1+4\theta_t^2} }{2}$ (momentum coefficient)

    • Rippling behavior: take $y^{t+1}=x^{t+1}+\frac{1-\sqrt{q} }{1+\sqrt{q} }(x^{t+1}-x^t)$ $q^*=1/\kappa$

      • when $q>q^*$ : we underestimate momentum $\rightarrow$ slower convergence

      • when $q<q^*$ : we overestimate momentum ($\frac{1-\sqrt{q} }{1+\sqrt{q} }$ is large) $\rightarrow$ overshooting/ rippling behavior

      • $q=q^*$ is best, but since $\mu,L$ are hard to compute, $\kappa$ and $q^*$ are hard to compute too, so in practice $q$ has to be tuned by hand

    • Adaptive restart

      • When a certain criterion is met, restart FISTA with $x^0 \leftarrow x^t,\ y^0 \leftarrow x^t,\ \theta_0 \leftarrow 1$
      • function scheme: restart when $f(x^t)>f(x^{t-1})$
      • gradient scheme: restart when $\langle\nabla f(y^{t-1}),x^t-x^{t-1}\rangle>0$
| problem | algorithm | stepsize rule | convergence rate | iteration complexity |
| --- | --- | --- | --- | --- |
| strongly convex & smooth problem | Heavy-ball method | $\eta_t=\frac{4}{(\sqrt{L}+\sqrt{\mu})^2}$, $\theta_t=\max\{|1-\sqrt{\eta_tL}|, |1-\sqrt{\eta_t\mu}|\}^2$ | $\left\Vert \begin{bmatrix} x^{t+1}-x^* \\ x^t-x^* \end{bmatrix} \right\Vert_2 \le \left( \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1} \right)^t \left\Vert \begin{bmatrix} x^1-x^* \\ x^0-x^* \end{bmatrix} \right\Vert_2$ | $O(\sqrt{\kappa}\log\frac{1}{\epsilon})$ |
| convex & smooth problem | Nesterov's accelerated gradient method | $\eta_t=\frac{1}{L}$ | $f(x^t)-f^{\rm opt}\le \frac{2L\Vert x^0-x^*\Vert_2^2}{(t+1)^2}$, i.e. $O(\frac{1}{t^2})$ | $O(\frac{1}{\sqrt{\epsilon}})$ |
| convex & smooth problem | FISTA | $\eta_t=\frac{1}{L}$ | $F(x^t)-F^{\rm opt}\le \frac{2L\Vert x^0-x^*\Vert_2^2}{(t+1)^2}$, i.e. $O(\frac{1}{t^2})$ | $O(\frac{1}{\sqrt{\epsilon}})$ |
| strongly convex & smooth problem | FISTA | $\eta_t=\frac{1}{L}$ | $F(x^t)-F^{\rm opt}\le\left( 1-\frac{1}{\sqrt{\kappa}}\right)^t\left( F(x^0)-F^{\rm opt}+\frac{\mu \Vert x^0-x^*\Vert_2^2}{2}\right)$, i.e. $O((1-\frac{1}{\sqrt{\kappa}})^t)$ | $O(\sqrt{\kappa}\log\frac{1}{\epsilon})$ |
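
The same lasso example as before with the FISTA momentum sequence added (my own sketch): only the extrapolation point $y$ and the $\theta_t$ recursion change relative to plain proximal gradient descent.

```python
# FISTA for the lasso: proximal gradient steps at an extrapolated point plus momentum.
import numpy as np

np.random.seed(0)
A, b, lam = np.random.randn(50, 100), np.random.randn(50), 0.1
L = np.linalg.norm(A, 2) ** 2
soft = lambda z, t: np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

x = np.zeros(100)
y, theta = x.copy(), 1.0
for t in range(500):
    x_new = soft(y - A.T @ (A @ y - b) / L, lam / L)      # prox-gradient step at y
    theta_new = (1 + np.sqrt(1 + 4 * theta ** 2)) / 2     # momentum coefficient recursion
    y = x_new + (theta - 1) / theta_new * (x_new - x)     # extrapolation
    x, theta = x_new, theta_new
print(np.count_nonzero(x))
```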

Smoothing for nonsmooth optimization

Non-smoothness means a first-order oracle can only return subgradients (a black-box model), so the convergence rate cannot be improved.

Remedy: approximate the nonsmooth objective function by a smooth function.

  • Smooth approximation

    A convex function $f$ is called $(\alpha,\beta)$-smoothable if, for any $\mu>0$, there exists a convex function $f_\mu$ s.t.

    • $f_\mu(x)\le f(x)\le f_\mu(x)+\beta\mu, \forall x$
    • $f_\mu$ is $\frac{\alpha}{\mu}$-smooth

    here, $f_\mu$ is called a $\frac{1}{\mu}$-smooth approximation of $f$ with parameters $(\alpha,\beta)$; $\mu$ trades off approximation accuracy against smoothness

    • Examples
      • $\|x\|_1 \Rightarrow f_\mu(x):=\sum\nolimits_{i = 1}^n h_\mu(x_i),\ h_\mu(z) = \left\{ \begin{array}{ll} z^2/(2\mu) & {\rm{if\ }}|z|\le \mu \\ |z| - \mu /2 & {\rm else} \end{array} \right.$ (Huber function)
      • $\|x\|_2 \Rightarrow f_\mu(x):=\sqrt{\|x\|_2^2+\mu^2}-\mu$
      • $\max_i x_i\Rightarrow f_\mu(x):=\mu \log (\sum\nolimits_{i = 1}^ne^{x_i/\mu})-\mu \log n$
    • basic rules
      • addition $\lambda_1f_1+\lambda_2 f_2 \Rightarrow \lambda_1f_{\mu,1}+\lambda_2 f_{\mu,2}$ with parameters $(\lambda_1\alpha_1+\lambda_2 \alpha_2,\lambda_1\beta_1+\lambda_2 \beta_2)$
      • affine transformation $h(Ax+b) \Rightarrow h_\mu(Ax+b)$ with parameters $(\alpha\|A\|^2,\beta)$
  • Smoothing via Moreau envelope

    Moreau envelope (or Moreau-Yosida regularization) of convex function $f$ with parameter $\mu>0$ is defined as

    $M_{\mu f}(x):=\inf_z\left\{f(z)+\frac{1}{2\mu}\|x-z\|_2^2\right\}$

    • $M_{\mu f}$ is smoothed or regularized form of $f$
    • minimizers of $f$ = minimizers of $M_{\mu f}$ $\Rightarrow$ minimizing $f$ and $M_{\mu f}$ are equivalent
    • Connection with the proximal operator
      • $M_{\mu f}(x)=f({\rm prox}_{\mu f}(x))+\frac{1}{2\mu}\|x-{\rm prox}_{\mu f}(x)\|_2^2$
      • ${\rm prox_{\mu f}(x)}=x-\mu\nabla M_{\mu f}(x)$
    • Properties of Moreau envelope
      • $M_{\mu f}$ is convex
      • $M_{\mu f}$ is $\frac{1}{\mu}$-smooth
      • if $f$ is $L_f$-Lipschitz, then $M_{\mu f}$ is $\frac{1}{\mu}$-smooth approximation of $f$ with parameters $(1,L_f^2/2)$
  • Smoothing via conjugation

    Suppose $f=g^*$, namely, $f(x)=\mathop \sup \limits_{z}\{\langle z,x\rangle-g(z)\}:=g^*(x)$

    $f_\mu(x)=\mathop \sup \limits_{z}\{\langle z,x\rangle-g(z)-\mu d(z)\}=(g+\mu d)^*(x)$

    for some 1-strongly convex and continuous function $d$, called the proximity function

    • Properties
      • $g+\mu d$ is $\mu$-strongly convex $\Rightarrow$ $f_\mu$ is $\frac{1}{\mu}$-smooth
      • $f_\mu(x)\le f(x)\le f_\mu(x)+\mu D$ with $D:=\sup_xd(x)$ $\Rightarrow$ $f_\mu$ is a $\frac{1}{\mu}$-smooth approximation of $f$ with parameters $(1,D)$

| problem | algorithm | parameter | iteration complexity |
| --- | --- | --- | --- |
| $\frac{1}{\mu}$-smooth approximation with parameters $(\alpha, \beta)$ | FISTA | $\mu=\frac{\epsilon}{2\beta}$ | $O(\frac{1}{\epsilon})$ |

The iteration complexity improves from the subgradient method's $O(\frac{1}{\epsilon^2})$ to $O(\frac{1}{\epsilon})$.
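
A tiny numerical check of the Huber smoothing of $\Vert x\Vert_1$ (my own sketch): each coordinate is approximated within $\mu/2$, so $f_\mu(x)\le \Vert x\Vert_1 \le f_\mu(x)+ n\mu/2$, matching the smoothable picture with $\beta=n/2$ here.

```python
# Huber smoothing of ||x||_1 and the sandwich bound f_mu(x) <= ||x||_1 <= f_mu(x) + n*mu/2.
import numpy as np

def huber(z, mu):
    # quadratic near zero, linear in the tails
    return np.where(np.abs(z) <= mu, z ** 2 / (2 * mu), np.abs(z) - mu / 2)

np.random.seed(0)
x, mu = np.random.randn(10), 0.1
f_mu = np.sum(huber(x, mu))
f = np.sum(np.abs(x))
print(f_mu <= f <= f_mu + len(x) * mu / 2)   # True
```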

Dual and primal-dual methods

  • Dual formulation

    $\begin{array}{ll} \mathop {\rm{minimize}}\limits_{x} & f(x)+h(Ax) \end{array} \Rightarrow \begin{array}{ll} \mathop {\rm{minimize}}\limits_{x,z} & f(x)+h(z) \\ {\rm{subject\ to}} & Ax=z \end{array} \Rightarrow \begin{array}{ll} \mathop {\rm{maximize}}\limits_{\lambda}\ \mathop {\min}\limits_{x,z} & f(x)+h(z)+\langle \lambda, Ax-z\rangle \end{array}$

    $\Rightarrow \mathop {\rm{maximize}}\limits_{\lambda}\ \mathop {\min}\limits_{x}\{\langle A^T\lambda, x\rangle+f(x)\}+\mathop {\min}\limits_{z}\{h(z)-\langle \lambda, z\rangle\}$

    $\Rightarrow \begin{array}{ll} \mathop {\rm{minimize}}\limits_{\lambda} & f^*(-A^T\lambda)+h^*(\lambda) \end{array}$

    • primal : $\begin{array}{ll} \mathop {\rm{minimize}}\limits_{x} & f(x)+h(Ax) \end{array}$
    • dual: $\begin{array}{ll} \mathop {\rm{minimize}}\limits_{\lambda} & f^*(-A^T\lambda)+h^*(\lambda) \end{array}$, attractive when
      • if $f^*$ is smooth or strongly convex
      • proximal operator w.r.t. h is cheap
  • Dual proximal gradient algorithm

    1. for $t=0,1,\cdots$ do
    2. $\lambda^{t+1}={\rm prox}_{\eta_th^*}(\lambda^t+\eta_tA\nabla f^*(-A^T\lambda^t))$
  • Primal representation of dual proximal gradient algorithm

    1. for $t=0,1,\cdots$ do
    2. $x^t=\arg \min_x\{f(x)+\langle A^T\lambda^t, x\rangle\}$
    3. $\lambda^{t+1}=\lambda^t+\eta_tAx^t-\eta_t{\rm prox}_{\eta_t^{-1}h}(\eta_t^{-1}\lambda^t+Ax^t)$
  • Accelerated dual proximal gradient algorithm

    1. for $t=0,1,\cdots$ do
    2. $\lambda^{t+1}={\rm prox}_{\eta_th^*}(w^t+\eta_tA\nabla f^*(-A^Tw^t))$
    3. $\theta_{t+1}=\frac{1+\sqrt{1+4\theta^2_t}}{2}$
    4. ​ $w^{t+1}=\lambda^{t+1}+\frac{\theta_t-1}{\theta_{t+1} }(\lambda^{t+1}-\lambda^{t})$
  • Primal representation of accelerated dual proximal gradient algorithm

    1. for $t=0,1,\cdots$ do
    2. $x^t=\arg \min_x\{f(x)+\langle A^Tw^t, x\rangle\}$
    3. $\lambda^{t+1}=w^t+\eta_tAx^t-\eta_t{\rm prox}_{\eta_t^{-1}h}(\eta_t^{-1}w^t+Ax^t)$
    4. $\theta_{t+1}=\frac{1+\sqrt{1+4\theta^2_t}}{2}$
    5. ​ $w^{t+1}=\lambda^{t+1}+\frac{\theta_t-1}{\theta_{t+1} }(\lambda^{t+1}-\lambda^{t})$
  • Primal-dual proximal gradient method

    $\begin{array}{ll} \mathop {\rm{minimize}}\limits_{x} & f(x)+h(Ax) \end{array}$ where $f$ and $h$ are closed and convex, both $f$ and $h$ might be non-smooth, and both $f$ and $h$ admit inexpensive proximal operators

    So far we have only discussed the proximal method (resp. dual proximal method), which updates only the primal (resp. dual) variables. Can we exploit ${\rm prox}_f$ and ${\rm prox}_h$ to update the primal and dual variables simultaneously?

    • Saddle-point problem ${\rm minimize}_x\ \max_\lambda\ f(x)+\langle \lambda ,Ax\rangle-h^*(\lambda)$

      saddle points : $\forall x\in X,y\in Y,\ f(x^*,y)\le f(x^*,y^*)\le f(x,y^*)$

      • optimality condition $0 \in \left[ \begin{array}{cc} 0 & A^T \\ -A & 0 \end{array} \right]\left[ \begin{array}{c} x \\ \lambda \end{array} \right] + \left[ \begin{array}{c} \partial f(x) \\ \partial h^*(\lambda) \end{array} \right] =:\mathcal{F}(x,\lambda)$

      • fixed-point iteration/resolvent iteration $(x^{t+1},\lambda^{t+1})=(\mathcal{I}+\eta\mathcal{F})^{-1}(x^t,\lambda^t)$

        issue: computing $(\mathcal{I}+\eta\mathcal{F})^{-1}$ is too hard

      • $\mathcal{A}(x,\lambda):=\left[ \begin{array}{cc} 0 & A^T \\ -A & 0 \end{array} \right]\left[ \begin{array}{c} x \\ \lambda \end{array} \right]$, $\mathcal{B}(x,\lambda):=\left[ \begin{array}{c} \partial f(x) \\ \partial h^*(\lambda) \end{array} \right]$

        solution: split $(\mathcal{I}+\eta\mathcal{F})^{-1}$ into $(\mathcal{I}+\eta\mathcal{A})^{-1}$ and $(\mathcal{I}+\eta\mathcal{B})^{-1}$, which amount to solving a linear system and evaluating a $\rm prox$, respectively

      • operator splitting via Cayley operators

        Let $\mathcal{R}_\mathcal{A}:=(\mathcal{I}+\eta\mathcal{A})^{-1}$, $\mathcal{R}_\mathcal{B}:=(\mathcal{I}+\eta\mathcal{B})^{-1}$ be resolvents and $\mathcal{C}_\mathcal{A}:=2\mathcal{R}_\mathcal{A}-\mathcal{I}$, $\mathcal{C}_\mathcal{B}:=2\mathcal{R}_\mathcal{B}-\mathcal{I}$ be Cayley operators

        then

        $0\in \mathcal{A}(x)+\mathcal{B}(x) \Leftrightarrow \mathcal{C}_\mathcal{A}\mathcal{C}_\mathcal{B}(z)=z$ with $x=\mathcal{R}_\mathcal{B}(z)$

        I did not fully follow the proof of this.

        issue: how do we solve $\mathcal{C}_\mathcal{A}\mathcal{C}_\mathcal{B}(z)=z$? Directly iterating $z^{t+1}=\mathcal{C}_\mathcal{A}\mathcal{C}_\mathcal{B}(z^t)$ may not converge

      • Douglas-Rachford splitting (damped fixed-point iteration)

        $z^{t+1}=\frac{1}{2}(\mathcal{I}+\mathcal{C}_\mathcal{A}\mathcal{C}_\mathcal{B})(z^t)$

        explicit expression

        $x^{t+\frac{1}{2} }=\mathcal{R}_\mathcal{B}(z^t)$

        $z^{t+\frac{1}{2} }=2x^{t+\frac{1}{2} }-z^t$

        $x^{t+1}=\mathcal{R}_\mathcal{A}(z^{t+\frac{1}{2} })$

        $z^{t+1}=\frac{1}{2}(z^t+2x^{t+1}-z^{t+\frac{1}{2} })=z^t+x^{t+1}-x^{t+\frac{1}{2} }$

        where $x^{t+\frac{1}{2}},z^{t+\frac{1}{2}}$ are auxiliary variables

        applying Douglas-Rachford splitting to optimality condition

        $x^{t+\frac{1}{2} }={\rm prox}_{\eta f}(p^t)$

        $\lambda^{t+\frac{1}{2} }={\rm prox}_{\eta h^*}(q^t)$

        $\left[ \begin{array}{c} x^{t+1} \\ \lambda^{t+1} \end{array} \right] = \left[ \begin{array}{cc} I & \eta A^T \\ -\eta A & I \end{array} \right]^{-1} \left[ \begin{array}{c} 2x^{t+\frac{1}{2}}-p^t \\ 2\lambda^{t+\frac{1}{2}}-q^t \end{array} \right]$

        $p^{t+1}=p^t+x^{t+1}-x^{t+\frac{1}{2} }$

        $q^{t+1}=q^t+\lambda^{t+1}-\lambda^{t+\frac{1}{2} }$
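
A minimal sketch of the Douglas-Rachford iteration above for a two-term sum, with $\mathcal{A}=\partial f$ and $\mathcal{B}=\partial g$ so that the resolvents are proximal operators (my own example: $f(x)=\frac{1}{2}\Vert x-a\Vert_2^2$ and $g=\lambda\Vert\cdot\Vert_1$; the minimizer of $f+g$ is soft-thresholding of $a$, which the iteration recovers).

```python
# Douglas-Rachford splitting for min f(x) + g(x), with f = 1/2||x - a||^2 and g = lam * ||x||_1.
import numpy as np

np.random.seed(0)
a, lam, eta = np.random.randn(8), 0.5, 1.0
soft = lambda z, t: np.sign(z) * np.maximum(np.abs(z) - t, 0.0)
R_A = lambda z: (z + eta * a) / (1 + eta)    # resolvent of A, i.e. prox_{eta f}
R_B = lambda z: soft(z, eta * lam)           # resolvent of B, i.e. prox_{eta g}

z = np.zeros(8)
for t in range(200):
    x_half = R_B(z)                  # x^{t+1/2} = R_B(z^t)
    z_half = 2 * x_half - z          # z^{t+1/2} = 2 x^{t+1/2} - z^t
    x = R_A(z_half)                  # x^{t+1}   = R_A(z^{t+1/2})
    z = z + x - x_half               # z^{t+1}   = z^t + x^{t+1} - x^{t+1/2}
print(np.allclose(x, soft(a, lam), atol=1e-6))   # True: minimizer of f + g
```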

Alternating direction method of multipliers (ADMM)

  • Two-block problem

    $\begin{array}{ll} \mathop {\rm{minimize}}\limits_{x,z} & F(x,z):=f_1(x)+f_2(z) \\ {\rm{subject\ to}} & Ax+Bz=b \end{array}$

    This can also be solved with Douglas-Rachford splitting.

  • Augmented Lagrangian method

    • Dual problem : ${\rm minimize}_\lambda\ f_1^*(-A^T\lambda)+f_2^*(-B^T\lambda)+\langle\lambda,b\rangle$

    • Proximal point method $\lambda^{t+1}=\arg \min_\lambda \left\{ f_1^*(-A^T\lambda)+f_2^*(-B^T\lambda)+\langle\lambda,b\rangle +\frac{1}{2\rho}\|\lambda -\lambda^t\|_2^2\right\}$

    • Augmented Lagrangian method (or method for multipliers)

      $(x^{t+1},z^{t+1})=\arg \min_{x,z}\left\{ f_1(x)+f_2(z)+\frac{\rho}{2}\|Ax+Bz-b+\frac{1}{\rho}\lambda^t\|_2^2\right\}$ (primal step)

      $\lambda^{t+1}=\lambda^{t}+\rho(Ax^{t+1}+Bz^{t+1}-b)$ (dual step)

      The primal step is usually expensive, essentially as hard as solving the original problem.

      The minimization over $x$ and $z$ cannot be carried out separately.

  • Alternating direction method of multipliers

    $x^{t+1}=\arg \min_{x}\left\{ f_1(x)+\frac{\rho}{2}\|Ax+Bz^t-b+\frac{1}{\rho}\lambda^t\|_2^2\right\}$

    $z^{t+1}=\arg \min_{z}\left\{f_2(z)+\frac{\rho}{2}\|Ax^{t+1}+Bz-b+\frac{1}{\rho}\lambda^t\|_2^2\right\}$

    $\lambda^{t+1}=\lambda^{t}+\rho(Ax^{t+1}+Bz^{t+1}-b)$

    Combines the advantages of dual decomposition and the augmented Lagrangian method.

    $x$ and $z$ are treated almost symmetrically, but updated alternately rather than jointly.

  • robust PCA

    $M=L$ (low-rank) $+S$(sparse)

    How can a matrix $M$ be decomposed into the superposition of a low-rank matrix and a sparse matrix?

    • convex programing

      $\begin{array}{ll} \mathop {\rm{minimize}}\limits_{L,S} & \|L\|_*+\lambda\|S\|_1 \\ {\rm{subject\ to}} & L+S=M \end{array}$

      where $\|L\|_*:=\sum\nolimits_{i = 1}^n \sigma_i(L)$ is the nuclear norm, and $\|S\|_1:=\sum\nolimits_{i,j} |S_{i,j}|$ is the entrywise $\ell_1$ norm
    • ADMM for solving

      $L^{t+1}=\arg \min_{L}\left\{ \|L\|_*+\frac{\rho}{2}\|L+S^t-M+\frac{1}{\rho}\Lambda^t\|_F^2\right\}$

      $S^{t+1}=\arg \min_{S}\left\{\lambda\|S\|_1+\frac{\rho}{2}\|L^{t+1}+S-M+\frac{1}{\rho}\Lambda^t\|_F^2\right\}$

      $\Lambda^{t+1}=\Lambda^{t}+\rho(L^{t+1}+S^{t+1}-M)$

      • is equivalent to

        $ L^{t+1}={\rm SVT}_{\rho^{-1} }\left(M-S^t-\frac{1}{\rho}\Lambda^t\right) $ (singular value thresholding)

        $ S^{t+1}={\rm ST}_{\lambda \rho^{-1} }\left(M-L^{t+1}-\frac{1}{\rho}\Lambda^t\right) $ (soft thresholding)

        $\Lambda^{t+1}=\Lambda^{t}+\rho(L^{t+1}+S^{t+1}-M)$

        where for any $X$ with ${\rm SVD} X=U \Sigma V^T (\Sigma={\rm diag}({\sigma_i}))$, one has

        ${\rm SVT}_{\tau}(X)=U{\rm diag}(\{(\sigma_i-\tau)_+\})V^T$ and

        $({\rm ST}_{\tau}(X))_{i,j} = \left\{ \begin{array}{ll} X_{i,j} - \tau & {\rm{if\ }}X_{i,j} > \tau \\ 0 & {\rm{if\ }}|X_{i,j}| \le \tau \\ X_{i,j} + \tau & {\rm{if\ }}X_{i,j} < -\tau \end{array} \right.$

    • other examples

      • graphical lasso $\begin{array}{ll} \mathop {\rm{minimize}}\limits_{\Theta} & -\log \det\Theta+\langle\Theta ,S\rangle+\lambda\|\Theta\|_1 \\ {\rm{subject\ to}} & \Theta \succeq 0 \end{array}$ (estimating sparse Gaussian graphical models)
      • consensus optimization
    • convergence rate: $O(1/t)$

      iteration complexity $O(1/\epsilon)$

      TODO: work through the convergence proof step by step when time permits.
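
A sketch of the robust PCA ADMM updates above, implementing the SVT and soft-thresholding operators directly (my own toy instance; the choices of $\lambda$, $\rho$ and the planted low-rank/sparse components are illustrative).

```python
# ADMM for robust PCA: decompose M into L (low-rank) + S (sparse) via SVT and soft thresholding.
import numpy as np

def svt(X, tau):
    # singular value thresholding: shrink singular values by tau
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def st(X, tau):
    # entrywise soft thresholding
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

np.random.seed(0)
n = 50
L_true = np.random.randn(n, 2) @ np.random.randn(2, n)               # rank-2 component
S_true = 5 * np.random.randn(n, n) * (np.random.rand(n, n) < 0.05)   # sparse component
M = L_true + S_true

lam, rho = 1.0 / np.sqrt(n), 1.0
L, S, Lam = np.zeros((n, n)), np.zeros((n, n)), np.zeros((n, n))
for t in range(200):
    L = svt(M - S - Lam / rho, 1.0 / rho)       # L-update: SVT_{1/rho}
    S = st(M - L - Lam / rho, lam / rho)        # S-update: ST_{lam/rho}
    Lam = Lam + rho * (L + S - M)               # dual update
print(np.linalg.matrix_rank(L, tol=1e-6), np.linalg.norm(L + S - M))
```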

Quasi-Newton methods

  • Newton’s method
    • $x^{t+1}=x^t-(\nabla^2f(x^t))^{-1}\nabla f(x^t)$
    • Quadratic convergence $O(\log\log\frac{1}{\epsilon})$
  • Quasi-Newton method $x^{t+1}=x^t-\eta_tH_t\nabla f(x^t)$

  • BFGS(Broyden-Fletcher-Goldfarb-Shanno) method

    1. for $t=0,1,\cdots$ do

    2. $x^{t+1}=x^t-\eta_tH_t\nabla f(x^t)$ (line search to determine $\eta_t$)

    3. ​ $H_{t+1}=(I-\rho_ts_ty_t^T)H_t(I-\rho_ty_ts_t^T)+\rho_ts_ts_t^T$

      where $s_t=x^{t+1}-x^t,y_t=\nabla f(x^{t+1})-\nabla f(x^{t}),\rho_t=\frac{1}{y_t^T s_t}$

    • iteration cost: gradient method $<$ BFGS $O(n^2)$ $<$ Newton's method $O(n^3)$
    • no magic formula for initialization; possible choices: approximate inverse Hessian at $x^0$, or the identity matrix.
  • Computational complexity = iteration cost $\times$ iteration complexity

    iteration complexity : gradient method $O(\log(\frac{1}{\epsilon}))$ > BFGS > Newton's method $O(\log\log(\frac{1}{\epsilon}))$

  • L-BFGS (Limited-memory BFGS)

    $H_t^L=V_{t-1}^T\cdots V_{t-m}^TH^L_{t,0}V_{t-m}\cdots V_{t-1}$

    ​ $+\rho_{t-m}V_{t-1}^T\cdots V_{t-m+1}^Ts_{t-m}s_{t-m}^TV_{t-m+1}\cdots V_{t-1}$

    ​ $+\rho_{t-m+1}V_{t-1}^T\cdots V_{t-m+2}^Ts_{t-m+1}s_{t-m+1}^TV_{t-m+2}\cdots V_{t-1}$

    ​ $+\cdots+\rho_{t-1}s_{t-1}s_{t-1}^T$

    ​ where $V_t=(I-\rho_t y_ts_t^T)$

    • can be computed recursively
    • initialization $H^L_{t,0}$ may vary from iteration to iteration.
    • only needs to store $\{(s_i,y_i)\}_{t-m\le i<t}$
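
A sketch of the BFGS update above on a strongly convex quadratic, using the explicit inverse-Hessian recursion and an exact line search (my own example; practical implementations would use a Wolfe-type line search and, for large $n$, the L-BFGS two-loop recursion instead of storing $H$).

```python
# BFGS on f(x) = 1/2 x^T Q x - q^T x with exact line search along each search direction.
import numpy as np

np.random.seed(0)
n = 10
B = np.random.randn(n, n)
Q = B @ B.T + np.eye(n)                      # positive definite Hessian
q = np.random.randn(n)
grad = lambda x: Q @ x - q

x = np.zeros(n)
H = np.eye(n)                                # initial inverse-Hessian approximation
for t in range(50):
    g = grad(x)
    if np.linalg.norm(g) < 1e-10:
        break
    d = -H @ g
    eta = -(g @ d) / (d @ Q @ d)             # exact line search for a quadratic
    x_new = x + eta * d
    s, y = x_new - x, grad(x_new) - g
    rho = 1.0 / (y @ s)
    V = np.eye(n) - rho * np.outer(s, y)
    H = V @ H @ V.T + rho * np.outer(s, s)   # H_{t+1} = (I - rho s y^T) H_t (I - rho y s^T) + rho s s^T
    x = x_new
print(np.linalg.norm(grad(x)))               # ~ 0 at the minimizer
```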

Stochastic gradient methods

  • stochastic approximation / stochastic gradient descent (SGD)

    $x^{t+1}=x^t-\eta_t g(x^t;\xi^t)$

    where $g(x^t;\xi^t)$ is unbiased estimate of $\nabla F(x^t)$, i.e. $\mathbb{E}[g(x^t;\xi^t)]=\nabla F(x^t)$

    • $\nabla F(x^t)=0 \Rightarrow$ finding roots of $G(x):=\mathbb{E}[g(x^t;\xi^t)]$
    • Examples
      • SGD for empirical risk minimization $x^{t+1}=x^t-\eta_t \nabla_xf_{i_t}(x^t;\{a_{i_t},y_{i_t}\})$ (choose $i_t$ uniformly at random)
      • temporal difference (TD) learning: Reinforcement learning studies Markov decision process (MDP) with unknown model.
      • Q-learning: solve Bellman equation
    • $\mathbb{E}[\| g(x^t;\xi^t)\|_2^2]\le \sigma_g^2+c_g \|\nabla F (x^t)\|_2^2$
  • iterate averaging

    I did not understand what the derived result is used for; the slides say it mitigates oscillation and reduces variance, but I did not follow how. The next set of slides should explain.

| problem | algorithm | stepsize rule | convergence rate | iteration complexity |
| --- | --- | --- | --- | --- |
| strongly convex & smooth problem | SGD | $\eta_t\equiv\eta<\frac{1}{Lc_g}$ (constant) | $\mathbb{E}[F(x^t)-F(x^*)] \le \frac{\eta L \sigma_g^2}{2\mu}+(1-\eta \mu)^t(F(x^0)-F(x^*))$ | $O(\sqrt{\kappa}\log\frac{1}{\epsilon})$ |
| strongly convex problem | SGD | $\eta_t=\frac{\theta}{t+1}$ for $\theta >\frac{1}{2\mu}$ | $\mathbb{E}[\Vert x^t-x^*\Vert_2^2] \le \frac{c_\theta}{t+1}$, where $c_\theta=\max\{\frac{2\theta^2 \sigma_g^2}{2\mu\theta-1},\Vert x^0-x^*\Vert_2^2\}$ | $O(\frac{1}{\sqrt{\epsilon}})$ |
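
A sketch of SGD for least-squares empirical risk minimization (my own example), sampling one index per iteration; with a small constant stepsize the iterates converge to a neighborhood ("noise ball") of the solution, as the first row of the table suggests.

```python
# SGD for F(x) = (1/m) * sum_i 1/2 (a_i^T x - y_i)^2, one uniformly random sample per step.
import numpy as np

np.random.seed(0)
m, n = 200, 10
A = np.random.randn(m, n)
x_true = np.random.randn(n)
y = A @ x_true + 0.1 * np.random.randn(m)

x = np.zeros(n)
eta = 0.01                                   # small constant stepsize
for t in range(50000):
    i = np.random.randint(m)                 # choose i_t uniformly at random
    g = (A[i] @ x - y[i]) * A[i]             # unbiased estimate of grad F(x)
    x = x - eta * g
print(np.linalg.norm(x - x_true))            # small, up to the noise level
```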

To be organized:

  1. norm
  2. linear convergence, logarithmic convergence rate, algorithmic complexity

$\max_i x_i \le f(\mathbf{x})\le \max_i x_i+\log n$ for the log-sum-exp function $f(\mathbf{x})=\log \sum_ie^{x_i}$; geometric mean: $f(\mathbf{x})=(\prod\nolimits_{i = 1}^n x_i )^{1/n}$

Gradient: $\nabla f(\mathbf{x})=[\frac{\partial f(\mathbf{x})}{\partial x_1} \cdots \frac{\partial f(\mathbf{x})}{\partial x_n}]^T\in R^n$

Hessian: $\nabla^2f(\mathbf{x})=\left (\frac{\partial^2 f(\mathbf{x})}{\partial x_i\partial x_j}\right )_{ij}\in R^{n\times n}$

Taylor series: $f(\mathbf{x}+ \mathbf{\delta})=f(\mathbf{x})+\nabla f(\mathbf{x})^T\mathbf{\delta}+\frac{1}{2}\mathbf{\delta}^T\nabla^2f(\mathbf{x})\mathbf{\delta}+o(\left\|\mathbf{\delta}\right\|^2)$

$\left\| I-\eta Q\right\|=\max\{ |1-\eta\lambda_1(Q)| , |1-\eta\lambda_n(Q)| \}$

w.r.t. = with respect to; iff = if and only if; a.k.a. = also known as; resp. = respectively