Optimization
Motivation
Unconstrained Optimization - Newton’s Method
Video transcript:
The focus is on unconstrained optimization, meaning we are given a function $f$ of $n$ variables with no constraints on the variables. We want to find the point $x$ that minimizes the function $f(x)$.
$$ f: \mathbb{R}^n \to \mathbb{R} $$ $$ \min_{x \in \mathbb{R}^n} f(x) $$
One way of finding the minimum could be to eyeball it after drawing the function; however, this only works when the function has at most two variables, and even then it is not precise. In some applications like Machine Learning or Robotics, the functions we want to minimize can have thousands or even millions of variables, making it impractical to visualize them.
Some of the most useful and widely used algorithms belong to the family of iterative optimization algorithms, which progressively refine their estimate of the minimum. Newton's method is one such algorithm, as we will see.
These algorithms proceed as follows (a generic sketch in code follows the list):
- Initialization: Start with an initial guess $x_0$ for the minimum (often chosen randomly or based on prior knowledge).
- Iteration: Pick a direction, i.e. a vector $d_0$, and follow it to obtain $x_1 = x_0 + d_0$. To track how much progress has been made, compute the value $f(x_1)$. Hopefully $f(x_1) \leq f(x_0)$, meaning we have descended on the graph traced by $f(x)$; such a direction $d_0$ is usually called a descent direction. Then, iteratively, at step $k$ we compute $x_{k+1} = x_k + d_k$.
- Convergence Check: Repeat the iteration until convergence criteria are met (e.g., the change in function value is below a threshold).
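A minimal sketch of this loop in Python (the `descent_direction` callback is a hypothetical placeholder for whatever rule a particular algorithm uses to pick $d_k$):

```python
import numpy as np

def iterative_minimize(f, descent_direction, x0, tol=1e-8, max_iters=1000):
    """Generic iterative descent loop; the choice of descent_direction
    is what distinguishes one algorithm from another."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iters):
        d = descent_direction(x)        # e.g. negative gradient, Newton step, ...
        x_new = x + d                   # x_{k+1} = x_k + d_k
        if abs(f(x) - f(x_new)) < tol:  # convergence check on the decrease of f
            return x_new
        x = x_new
    return x
```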
The way the direction $d_k$ is chosen in the iteration step is crucial to achieving convergence to the optimal solution in a reasonable time frame. Different algorithms correspond to different choices of this direction.
To pick a good descent direction, we need to understand how the function $f(x)$ behaves in the vicinity of the current point $x_k$. This is typically done using the gradient $\nabla f(x_k)$, which points in the direction of steepest increase. The negative gradient $-\nabla f(x_k)$, therefore, points in the direction of steepest descent.
Example ($n = 1$): $$ f(x) = \frac{1}{20} x^4 - \frac{2}{5} x + 1 $$
$$ f'(x_k) \approx \frac{f(x) - f(x_k)}{x - x_k} \Rightarrow f(x) \approx f(x_k) + \color{orange}{f'(x_k)} \color{black}{(x - x_k)} \tag{1} $$
We can observe that locally around the point $x_k$, the function $f(x)$ can be approximated by a linear function. This is the essence of gradient-based optimization methods: they use the local linear approximation to guide the search for the minimum as linear functions are simpler to optimize.
We have then an iteration step using the gradient information:
$$ x_{k+1} = x_k - \alpha f'(x_k) $$
Where $\alpha$ is a step size (also called learning rate). This step size determines how far we move along the descent direction. If $\alpha$ is too large, we might overshoot the minimum; if it is too small, convergence will be slow. Trying to tune $\alpha$ in an optimal way is what leads us to Newton's method.
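Before moving on, here is a minimal sketch of this gradient step applied to the one-dimensional example above (the starting point and the step size $\alpha = 0.1$ are arbitrary choices):

```python
def f_prime(x):
    return x**3 / 5 - 2 / 5      # f'(x) for f(x) = x^4/20 - 2x/5 + 1

x = 3.0        # initial guess x_0 (arbitrary)
alpha = 0.1    # step size / learning rate (hand-tuned)
for k in range(100):
    x = x - alpha * f_prime(x)   # x_{k+1} = x_k - alpha * f'(x_k)

print(x)       # approaches the stationary point x = 2**(1/3) ≈ 1.26
```

With a much larger $\alpha$ the iterates would overshoot and oscillate; with a much smaller one they would crawl toward the minimum.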
Newton's method comes from the observation that using the second-order derivative (the Hessian $\nabla^2 f(x_k)$) can provide a more accurate estimate of the function's curvature, allowing for more informed step sizes and directions. From calculus, we know that we can use the Taylor expansion to refine the approximation made in equation (1).
$$ f(x) \approx f(x_k) + f'(x_k) (x - x_k) + \frac{1}{2} f''(x_k) (x - x_k)^2 \tag{2} $$
Minimizing the right-hand side of (2) with respect to $x$ leads to the following iteration step, which uses both first- and second-derivative information and amounts to choosing $\alpha = \frac{1}{f''(x_k)}$:
$$ x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)} $$
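Continuing the same one-dimensional example, a sketch of the Newton iteration (same arbitrary starting point as the gradient-descent sketch above):

```python
def f_prime(x):
    return x**3 / 5 - 2 / 5      # f'(x) for f(x) = x^4/20 - 2x/5 + 1

def f_second(x):
    return 3 * x**2 / 5          # f''(x)

x = 3.0
for k in range(10):
    x = x - f_prime(x) / f_second(x)   # x_{k+1} = x_k - f'(x_k) / f''(x_k)

print(x)   # ≈ 2**(1/3), reached in far fewer iterations than the gradient step above
```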
Convex Optimization
Convex Set
A set $\mathcal{S}$ is convex if, for any two points $x_1, x_2 \in \mathcal{S}$ and any scalar $\lambda \in [0, 1]$, the point $\lambda x_1 + (1 - \lambda) x_2$ is also in $\mathcal{S}$. i.e. the line segment connecting any two points in the set lies entirely within the set.
Convex combination of $x_1, \cdots, x_k$: any point $z$ of the form: $$ z = \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_k x_k \quad \text{with} \quad \theta_1 + \cdots + \theta_k = 1, \ \theta_i \geq 0 $$
A hyperplane is a flat affine subspace of one dimension less than its ambient space. Formally, in \(\mathbb{R}^n\), a hyperplane can be defined as the set of points \(\{x \in \mathbb{R}^n: a^T x = b\}\) for some \(a \in \mathbb{R}^n\) (\(a \neq 0\)) and \(b \in \mathbb{R}\).
A halfspace is the set of points on one side of a hyperplane. Formally, in \(\mathbb{R}^n\), a halfspace can be defined as the set of points \(\{x \in \mathbb{R}^n: a^T x \leq b\}\) for some \(a \in \mathbb{R}^n\) (\(a \neq 0\)) and \(b \in \mathbb{R}\).
A polyhedron is the intersection of a finite number of halfspaces. Formally, in \(\mathbb{R}^n\), a polyhedron can be defined as the set of points \(\{x \in \mathbb{R}^n: a_i^T x \leq b_i, \, i = 1, \ldots, m\}\) for some \(a_i \in \mathbb{R}^n\) (\(a_i \neq 0\)) and \(b_i \in \mathbb{R}\).
A polytope is a bounded polyhedron. Formally, in \(\mathbb{R}^n\), a polytope can be defined as the set of points \(\{x \in \mathbb{R}^n: a_i^T x \leq b_i, \, i = 1, \ldots, m\}\) for some \(a_i \in \mathbb{R}^n\) (\(a_i \neq 0\)) and \(b_i \in \mathbb{R}\), with the additional constraint that the feasible region is bounded.
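As a small illustration (a sketch, with made-up numbers): the unit square in $\mathbb{R}^2$ is a polytope, the intersection of four halfspaces $a_i^T x \leq b_i$, and membership can be checked by stacking the $a_i^T$ into a matrix:

```python
import numpy as np

# Unit square [0, 1]^2 written as {x : A x <= b}, one halfspace per row of A.
A = np.array([[ 1.0,  0.0],    #  x1 <= 1
              [-1.0,  0.0],    # -x1 <= 0, i.e. x1 >= 0
              [ 0.0,  1.0],    #  x2 <= 1
              [ 0.0, -1.0]])   # -x2 <= 0, i.e. x2 >= 0
b = np.array([1.0, 0.0, 1.0, 0.0])

def in_polyhedron(x):
    """True iff x satisfies every halfspace constraint a_i^T x <= b_i."""
    return bool(np.all(A @ x <= b))

print(in_polyhedron(np.array([0.5, 0.5])))   # True: inside the square
print(in_polyhedron(np.array([1.5, 0.5])))   # False: violates x1 <= 1
```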
Convex functions
A function $f:\mathcal{S} \to \mathbb{R}$ is convex if $\mathcal{S}$ is a convex set and for any two points $x_1, x_2 \in \mathcal{S}$ and any scalar $\lambda \in [0, 1]$, the following inequality holds: $$ f(\lambda x_1 + (1 - \lambda) x_2) \leq \lambda f(x_1) + (1 - \lambda) f(x_2) $$ The function is strictly convex if the inequality is strict whenever $x_1 \neq x_2$ and $\lambda \in (0, 1)$.
A function $f:\mathcal{S} \to \mathbb{R}$ is concave if $\mathcal{S}$ is a convex set and $-f$ is convex.
First-order condition for convexity: Differentiable $f$ with convex domain is convex iff $$ f(y) \geq f(x) + \nabla f(x)^T (y - x), \quad \forall x, y \in \textbf{dom} f $$
Second-order condition for convexity: Twice differentiable $f$ with open convex domain is convex iff $$\nabla^2 f(x) \succeq 0, \quad \forall x \in \textbf{dom} f$$ i.e. the Hessian matrix is positive semidefinite for all $x$ in the domain of $f$.
Examples of convex functions (each can be verified with the second-order condition, as shown after the list):
- \(f(x) = e^{ax}\), for any \(a \in \mathbb{R}\).
- \(f(x) = x^a\) on \(\mathbb{R}_{++}\), for \(a \geq 1\) or \(a \leq 0\) (concave for \(0 \leq a \leq 1\)).
- \(f(x) = -\log(x)\) on \(\mathbb{R}_{++}\).
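For instance, the second derivatives of these examples are
$$
\frac{d^2}{dx^2} e^{ax} = a^2 e^{ax} \geq 0, \qquad
\frac{d^2}{dx^2} x^a = a(a-1)\, x^{a-2} \geq 0 \ \text{on } \mathbb{R}_{++} \text{ when } a \geq 1 \text{ or } a \leq 0, \qquad
\frac{d^2}{dx^2} \bigl(-\log x\bigr) = \frac{1}{x^2} > 0,
$$
so the second-order condition confirms convexity in each case.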
Convex optimization problem
A convex optimization problem is an optimization problem where the objective function is convex, and the feasible region (the set of points that satisfy the constraints) is also a convex set. This means that any local minimum is also a global minimum, making these problems easier to solve than general optimization problems.
The standard form is: $$ \begin{aligned} & \text{min} && f(x) \\ & \text{s.t.} && g_i(x) \leq 0, \quad i = 1, \ldots, m \\ & && c_i^T x = b_i \end{aligned} $$ with $f, g_1, \cdots, g_m$ convex functions and $c_i^T x = b_i$ affine equality constraints (equality constraints must be affine).
This is often rewritten more compactly as: $$ \begin{aligned} & \text{min} && f(x) \\ & \text{s.t.} && g(x) \leq 0 \\ & && Cx = b \end{aligned} $$
Where $C$ is the matrix whose rows are the $c_i^T$ and $g: \mathbb{R}^n \to \mathbb{R}^m$ stacks the inequality constraint functions $g_i$.
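As a concrete (hedged) illustration: a linear program, where $f$ and $g$ are affine, is a special case of this standard form, and small instances can be solved with SciPy's `scipy.optimize.linprog` (assuming SciPy is available; the numbers below are made up):

```python
import numpy as np
from scipy.optimize import linprog

# min  c^T x              (linear, hence convex, objective)
# s.t. A_ub x <= b_ub     (affine inequality constraints, i.e. g(x) <= 0)
#      A_eq x == b_eq     (affine equality constraints, i.e. C x = b)
c = np.array([1.0, 2.0])
A_ub = np.array([[-1.0,  0.0],
                 [ 0.0, -1.0]])   # encodes x1 >= 0 and x2 >= 0
b_ub = np.array([0.0, 0.0])
A_eq = np.array([[1.0, 1.0]])     # x1 + x2 = 1
b_eq = np.array([1.0])

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq)
print(res.x)   # the (global) minimizer; here it is [1, 0]
```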
Important properties: the feasible set of a convex optimization problem is convex, and any locally optimal point is globally optimal.
Proof (local implies global): Assume $x$ is locally optimal and suppose, for contradiction, that there is a feasible $y$ such that $f(y) < f(x)$. Local optimality of $x$ means there exists an $R > 0$ such that every feasible $z$ with $\|z - x\|_2 \leq R$ satisfies $f(z) \geq f(x)$; in particular $\|y - x\|_2 > R$. Consider $z = \theta y + (1 - \theta) x$ with $\theta = \frac{R}{2\|y - x\|_2} \in (0, 1)$. Since the feasible set is convex, $z$ is feasible; moreover $\|z - x\|_2 = R/2 \leq R$ and, by convexity of $f$, $f(z) \leq \theta f(y) + (1 - \theta) f(x) < f(x)$, contradicting the local optimality of $x$.
Exercises
Exercise 1.1: Applying Newton’s method
Consider the problem of minimizing $f(x)=x^{\frac{4}{3}}=(\sqrt[3]{x})^4$. Note that 0 is a global minimizer of $f$.
- Write down the algorithm for Newton's method applied to this problem.
- Show that as long as the starting point is not 0, the algorithm from part 1 does not converge to 0 (no matter how close to 0 we start).
Solution
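A sketch of a possible solution: for $f(x) = x^{4/3}$ we have $f'(x) = \frac{4}{3} x^{1/3}$ and $f''(x) = \frac{4}{9} x^{-2/3}$, so the Newton iteration is
$$
x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)} = x_k - \frac{\frac{4}{3} x_k^{1/3}}{\frac{4}{9} x_k^{-2/3}} = x_k - 3 x_k = -2 x_k .
$$
Hence $|x_{k+1}| = 2 |x_k|$: unless $x_0 = 0$, the iterates double in magnitude (alternating in sign) at every step, so the method diverges and cannot converge to the minimizer $0$, no matter how close to $0$ we start.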
Exercise 1.2: Convex Set
Show that the set $\{x \in \mathbb{R}^n \mid \|x\| \leq r\}$ is convex, where $r > 0$ is a given real number, and $\|x\| = \sqrt{x^\top x}$ is the Euclidean norm of $x \in \mathbb{R}^n$.
Solution
Let $u, v \in \Theta = \{x \in \mathbb{R}^n : \|x\| \leq r\}$ and $\alpha \in [0, 1]$, and set $z = \alpha u + (1 - \alpha) v$. To show that $\Theta$ is convex, we need to show that $z \in \Theta$, i.e., $\|z\| \leq r$. To this end, $$ \begin{aligned} \|z\|^2 &= (\alpha u + (1 - \alpha) v)^\top (\alpha u + (1 - \alpha) v) \\ &= \alpha^2 \|u\|^2 + 2 \alpha (1 - \alpha) u^\top v + (1 - \alpha)^2 \|v\|^2 \end{aligned} $$ Since $u, v \in \Theta$, we have $\|u\|^2 \leq r^2$ and $\|v\|^2 \leq r^2$. Furthermore, by the Cauchy-Schwarz inequality, $u^\top v \leq \|u\| \|v\| \leq r^2$. Therefore, $$ \|z\|^2 \leq \alpha^2 r^2 + 2 \alpha (1 - \alpha) r^2 + (1 - \alpha)^2 r^2 = r^2 $$ Hence $z \in \Theta$, which shows that $\Theta$ is a convex set, i.e., any point on the line segment joining $u$ and $v$ is also in $\Theta$.