Optimization

Motivation

Unconstrained Optimization - Newton’s Method

This video offers a visual and intuitive explanation of Newton's Method, a fundamental optimization technique used to find local minima or maxima of functions. Through clear illustrations and step-by-step demonstrations, it delves into how this method accelerates convergence compared to gradient descent, especially near the optimum.
Video from 0:15 to 7:40 | Source: Visually Explained - YouTube Watch here

Video transcript :

The focus is on unconstrained optimization, meaning we are given a multivariable function in $n$ variables $f$ without any constraints on the variables. We want to find the point $x$ that minimizes the function $f(x)$.

$$ f: \mathbb{R}^n \to \mathbb{R} $$ $$ \min_{x \in \mathbb{R}^n} f(x) $$

One way of finding the minimum could be to eyeball it after plotting the function; however, this only works when the function has at most two variables, and even then it is not precise. In some applications, like machine learning or robotics, the functions we want to minimize can have thousands or even millions of variables, making it impractical to visualize them.

Some of the most useful and widely used algorithms belong to the family of iterative optimization algorithms, which progressively refine their estimate of the minimum. Newton’s method, as we will see, is one such algorithm.

The process of those algorithms is as follows:

  • Initialization: Start with an initial guess $x_0$ for the minimum (often chosen randomly or based on prior knowledge).
Initialization step
  • Iteration: Pick a direction, i.e. a vector $d_0$, and follow it to obtain $x_1$. To track how much progress has been made, calculate the value $f(x_1)$. Hopefully $f(x_1) \leq f(x_0)$, meaning we have descended on the graph traced by $f(x)$ in space. Such a direction $d_0$ is usually called a descent direction. Iterating in this way, at step $k$ we calculate $x_{k+1} = x_k + d_k$.
1st step after direction $d_0$
k-th step after direction $d_k$
  • Convergence Check: Repeat the iteration until convergence criteria are met (e.g., the change in function value is below a threshold).
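The three steps above can be sketched in code. This is a minimal sketch: the function `iterative_minimize`, its parameters, and the toy quadratic objective are illustrative choices, not from the source.

```python
import numpy as np

def iterative_minimize(f, direction, x0, tol=1e-8, max_iter=1000):
    """Generic iterative scheme: x_{k+1} = x_k + d_k.

    `direction(x)` returns the step d_k at the current point; the loop
    stops once the decrease in f falls below `tol` (convergence check).
    """
    x = np.asarray(x0, dtype=float)  # initialization: initial guess x_0
    for _ in range(max_iter):
        x_next = x + direction(x)    # iteration: follow direction d_k
        if abs(f(x) - f(x_next)) < tol:
            return x_next            # convergence criterion met
        x = x_next
    return x

# Toy quadratic f(x) = ||x||^2 with the steepest-descent step -0.5 * grad f
x_min = iterative_minimize(lambda x: float(x @ x),
                           lambda x: -0.5 * (2 * x),
                           x0=[3.0, -2.0])
```

Any concrete algorithm of this family only differs in how `direction` is computed.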

How the direction $d_k$ is chosen at each iteration is crucial to achieving convergence to the optimal solution in a reasonable time frame; different algorithms correspond to different choices of this direction.

To pick a good descent direction, we need to understand how the function $f(x)$ behaves in the vicinity of the current point $x_k$. This is typically done using the gradient $\nabla f(x_k)$, which points in the direction of steepest increase. The negative gradient $-\nabla f(x_k)$, therefore, points in the direction of steepest descent.

Example : ($n=1$) $$ f(x) = \frac{1}{20} x^4 - \frac{2}{5} x + 1 $$

$$ f’(x_k) \approx \frac{f(x) - f(x_k)}{x - x_k} \Rightarrow f(x) \approx f(x_k) + \color{orange}{f’(x_k)} \color{black}{(x - x_k)} \tag{1} $$

Graphical representation of $f(x)$
Graphical representation of $f'(x)$

We can observe that locally around the point $x_k$, the function $f(x)$ can be approximated by a linear function. This is the essence of gradient-based optimization methods: they use the local linear approximation to guide the search for the minimum as linear functions are simpler to optimize.

We have then an iteration step using the gradient information:

$$ x_{k+1} = x_k - \alpha f’(x_k) $$

Where $\alpha$ is a step size (also called learning rate). This step size determines how far we move along the descent direction. If $\alpha$ is too large, we might overshoot the minimum; if it is too small, convergence will be slow. Trying to tune $\alpha$ in an optimal way is what leads us to Newton’s method.
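For the one-variable example above, a plain gradient-descent loop might look like this (the starting point and step size are arbitrary choices for illustration):

```python
def f_prime(x):
    """Derivative of f(x) = x^4/20 - 2x/5 + 1."""
    return x**3 / 5 - 2 / 5

# Gradient descent x_{k+1} = x_k - alpha * f'(x_k) with a fixed step size
x, alpha = 3.0, 0.5
for _ in range(200):
    x -= alpha * f_prime(x)

# The true minimizer solves f'(x) = 0, i.e. x = 2^(1/3) ~ 1.2599
```

With this fixed $\alpha$ the iterates creep toward the minimizer; a poorly chosen $\alpha$ would oscillate or crawl, which is exactly the tuning problem Newton's method addresses.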

Newton’s method comes from the observation that the second-order derivative (the Hessian $\nabla^2 f(x_k)$) provides a more accurate estimate of the function’s curvature, allowing for more informed step sizes and directions. From calculus, we know that we can use the Taylor expansion to refine the approximation made in equation (1).

$$ f(x) \approx f(x_k) + f’(x_k) (x - x_k) + \frac{1}{2} f''(x_k) (x - x_k)^2 \tag{2} $$

Graphical representation of $f''(x)$
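One can check numerically how closely the quadratic model in equation (2) tracks $f$ near the expansion point; the evaluation points below are arbitrary choices for illustration:

```python
def f(x):
    return x**4 / 20 - 2 * x / 5 + 1

def taylor2(x, xk):
    """Second-order Taylor model of f around xk, as in equation (2)."""
    fp = xk**3 / 5 - 2 / 5      # f'(xk)
    fpp = 3 * xk**2 / 5         # f''(xk)
    return f(xk) + fp * (x - xk) + 0.5 * fpp * (x - xk) ** 2

# Close to xk the model error shrinks like |x - xk|^3; far away it degrades
err_near = abs(f(1.01) - taylor2(1.01, 1.0))
err_far = abs(f(2.0) - taylor2(2.0, 1.0))
```

This local accuracy is why minimizing the quadratic model gives such a good next iterate.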

This leads us to the following iteration step using both the first- and second-derivative information, with $\alpha = \frac{1}{f''(x_k)}$:

$$ x_{k+1} = x_k - \frac{f’(x_k)}{f''(x_k)} $$

Newton's method can be generalized in higher dimensions ($n\geq1$) as: \[ x_{k+1} = x_k - \nabla^2 f(x_k)^{-1} \nabla f(x_k) \]
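Applied to the earlier one-dimensional example, Newton's iteration converges in a handful of steps (the starting point is an arbitrary choice for illustration):

```python
def f_prime(x):
    return x**3 / 5 - 2 / 5     # f'(x) for f(x) = x^4/20 - 2x/5 + 1

def f_second(x):
    return 3 * x**2 / 5         # f''(x)

x = 3.0
for _ in range(20):
    # Newton step: a gradient step with the curvature-aware size 1/f''(x)
    x -= f_prime(x) / f_second(x)

# Converges to the minimizer 2^(1/3) far faster than fixed-step descent
```

Compared with the gradient-descent sketch above, the curvature term removes the need to hand-tune a step size.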

Convex Optimization

Convex Set

A set $\mathcal{S}$ is convex if, for any two points $x_1, x_2 \in \mathcal{S}$ and any scalar $\lambda \in [0, 1]$, the point $\lambda x_1 + (1 - \lambda) x_2$ is also in $\mathcal{S}$. i.e. the line segment connecting any two points in the set lies entirely within the set.

Mathematical definition of a convex set \( \mathcal{S} \): \[ \lambda x_1 + (1 - \lambda) x_2 \in \mathcal{S}, \quad \forall x_1, x_2 \in \mathcal{S}, \forall \lambda \in [0, 1] \]

Convex combination of $x_1, \cdots, x_k$: any point $z$ of the form: $$ z= \theta _1 x_1 + \theta _2 x_2 + \cdots + \theta _k x_k \quad \text{with} \quad \theta _1 + \cdots + \theta_k = 1, \theta _i \geq 0 $$
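Numerically, a convex combination is just a weighted average with nonnegative weights summing to 1; the points and weights below are an arbitrary example:

```python
import numpy as np

X = np.array([[0.0, 0.0],
              [2.0, 0.0],
              [0.0, 2.0]])            # the points x_1, x_2, x_3
theta = np.array([0.5, 0.25, 0.25])   # nonnegative weights summing to 1

z = theta @ X                         # z = sum_i theta_i x_i
is_convex_comb = bool(theta.min() >= 0 and abs(theta.sum() - 1.0) < 1e-12)
```

Here $z$ lands inside the triangle spanned by the three points, as it must for any valid choice of weights.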

Convex Set Examples

A hyperplane is a flat affine subspace of one dimension less than its ambient space. Formally, in \(\mathbb{R}^n\), a hyperplane can be defined as the set of points \(\{x \in \mathbb{R}^n: a^T x = b\}\) for some \(a \in \mathbb{R}^n\) (\(a \neq 0\)) and \(b \in \mathbb{R}\).

Hyperplane

A halfspace is the set of points on one side of a hyperplane. Formally, in \(\mathbb{R}^n\), a halfspace can be defined as the set of points \(\{x \in \mathbb{R}^n: a^T x \leq b\}\) for some \(a \in \mathbb{R}^n\) (\(a \neq 0\)) and \(b \in \mathbb{R}\).

Halfspace

A polyhedron is the intersection of a finite number of halfspaces. Formally, in \(\mathbb{R}^n\), a polyhedron can be defined as the set of points \(\{x \in \mathbb{R}^n: a_i^T x \leq b_i, \, i = 1, \ldots, m\}\) for some \(a_i \in \mathbb{R}^n\) (\(a_i \neq 0\)) and \(b_i \in \mathbb{R}\).

Polyhedron

A polytope is a bounded polyhedron. Formally, in \(\mathbb{R}^n\), a polytope can be defined as the set of points \(\{x \in \mathbb{R}^n: a_i^T x \leq b_i, \, i = 1, \ldots, m\}\) for some \(a_i \in \mathbb{R}^n\) (\(a_i \neq 0\)) and \(b_i \in \mathbb{R}\), with the additional constraint that the feasible region is bounded.

Polytope
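The halfspace description $\{x : Ax \leq b\}$ makes polyhedron membership a matrix-vector product. The unit square below is a hand-picked example, not from the source:

```python
import numpy as np

# Unit square [0, 1]^2 written as the polyhedron {x : A x <= b}
A = np.array([[ 1.0,  0.0],
              [-1.0,  0.0],
              [ 0.0,  1.0],
              [ 0.0, -1.0]])
b = np.array([1.0, 0.0, 1.0, 0.0])

def in_polyhedron(x):
    return bool(np.all(A @ np.asarray(x) <= b + 1e-12))

# Convexity in action: a convex combination of two members stays inside
x1, x2, lam = np.array([0.2, 0.9]), np.array([1.0, 0.0]), 0.3
inside = in_polyhedron(lam * x1 + (1 - lam) * x2)
```

Each row of $A$ encodes one halfspace, so the polyhedron is the intersection of the four constraints, exactly as in the definition above.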

Convex functions

A function \( f: \mathcal{S} \to \mathbb{R} \) is convex if \(\mathcal{S}\) is a convex set and \[ f(\lambda x_1 + (1 - \lambda) x_2) \leq \lambda f(x_1) + (1 - \lambda) f(x_2) \] \[ \forall x_1, x_2 \in \mathcal{S}, \lambda \in [0,1] \]
Convex function definition

A function $f:\mathcal{S} \to \mathbb{R}$ is strictly convex if $\mathcal{S}$ is a convex set and for any two distinct points $x_1, x_2 \in \mathcal{S}$ and any scalar $\lambda \in (0, 1)$, the following inequality holds: $$ f(\lambda x_1 + (1 - \lambda) x_2) < \lambda f(x_1) + (1 - \lambda) f(x_2) $$

A function $f:\mathcal{S} \to \mathbb{R}$ is concave if $\mathcal{S}$ is a convex set and $-f$ is convex.

First-order condition for convexity: Differentiable $f$ with convex domain is convex iff $$ f(y) \geq f(x) + \nabla f(x)^T (y - x), \quad \forall x, y \in \textbf{dom} f $$

Second-order condition for convexity: Twice differentiable $f$ with open convex domain is convex iff $$\nabla^2 f(x) \succeq 0, \quad \forall x \in \textbf{dom} f$$ i.e. the Hessian matrix is positive semidefinite for all $x$ in the domain of $f$.
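The second-order condition can be checked numerically via the Hessian's eigenvalues. The function $f(x, y) = x^2 + xy + y^2$ below is a hand-picked example whose Hessian is constant:

```python
import numpy as np

# Hessian of f(x, y) = x^2 + x*y + y^2 (constant for this quadratic)
H = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Positive semidefinite <=> all eigenvalues are >= 0
eigvals = np.linalg.eigvalsh(H)
is_convex = bool(np.all(eigvals >= 0))
```

For non-quadratic functions the Hessian depends on $x$, so the check would have to hold at every point of the domain, not just one.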

Convex Function Examples
Exponential Function Example:

\(f(x) = e^{ax}\), for any \(a \in \mathbb{R}\).

Exponential
Powers Function Example:

\(f(x) = x^a\) on \(\mathbb{R}_{++}\), convex for \(a \geq 1\) or \(a \leq 0\) (concave for \(0 \leq a \leq 1\)).

Powers
Logarithm Function Example:

\(f(x) = -\log(x)\) on \(\mathbb{R}_{++}\).

Logarithm

Convex optimization problem

A convex optimization problem is an optimization problem where the objective function is convex, and the feasible region (the set of points that satisfy the constraints) is also a convex set. This means that any local minimum is also a global minimum, making these problems easier to solve than general optimization problems.

A convex optimization problem has the form: \[ \begin{aligned} & \text{minimize} && f(x) \\ & \text{subject to} && g_i(x) \leq 0, \quad i = 1, \ldots, m \\ & && h_i(x) = 0, \quad i = 1, \ldots, p \end{aligned} \] where \(f\) and \(g_i\) are convex functions, and \(h_i\) are affine functions.

With $f, g_1, \cdots, g_m$ convex functions and $h_i(x) = c_i^T x - b_i$ affine functions (equality constraints must be affine).

Often rewritten as follows: $$ \begin{aligned} & \text{min} && f(x) \\ & \text{s.t.} && g(x) \leq 0 \\ & && Cx = b \end{aligned} $$

Where $C \in \mathbb{R}^{p \times n}$ is a matrix, $b \in \mathbb{R}^p$, and $g: \mathbb{R}^n \to \mathbb{R}^m$ stacks the constraints $g_1, \ldots, g_m$.

Important property: the feasible set of a convex optimization problem is convex.

Lemma - Convex problems: Local optima are global optima
Any locally optimal point of a convex problem is globally optimal.

Proof: Assume $x$ is locally optimal but that there exists a feasible $y$ such that $f(y) < f(x)$. Local optimality of $x$ means there exists $R > 0$ such that every feasible $z$ with $\left\| z - x \right\|_2 \leq R$ satisfies $f(z) \geq f(x)$; in particular, $\left\| y - x \right\|_2 > R$. Consider $z = \theta y + (1 - \theta) x$ with $\theta = \frac{R}{2 \left\| y - x \right\|_2} \in (0, 1)$. Then $z$ is feasible (the feasible set is convex), $\left\| z - x \right\|_2 = R/2 \leq R$, and by convexity of $f$, $f(z) \leq \theta f(y) + (1 - \theta) f(x) < f(x)$, contradicting the local optimality of $x$. $\square$

Exercises

Exercise 1.1: Applying Newton’s method

Consider the problem of minimizing $f(x)=x^{\frac{4}{3}}=(\sqrt[3]{x})^4$. Note that 0 is a global minimizer of $f$.

  1. Write down the algorithm for Newton's method applied to this problem.
  2. Show that as long as the starting point is not 0, the algorithm in part 1. does not converge to 0 (no matter how close to 0 we start).
Solution
1. We compute $f'(x)$ and $f''(x)$: \[ \begin{align} f'(x) &= \frac{4}{3} x^{\frac{1}{3}} \\ f''(x) &= \frac{4}{9} x^{-\frac{2}{3}} \end{align} \] Therefore Newton's algorithm for this problem takes the form \[ x_{k+1} = x_k - \frac{\frac{4}{3} x_k^{\frac{1}{3}}}{\frac{4}{9} x_k^{-\frac{2}{3}}} = x_k - 3 x_k = -2 x_k \] 2. From part 1., we have $x_{k+1} = -2 x_k$, so $|x_{k+1}| = 2 |x_k|$. Therefore, as long as $x_0 \neq 0$, we get $|x_k| = 2^k |x_0| \to \infty$: the sequence $\{x_k\}$ diverges and does not converge to 0.
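The divergence in part 2 is easy to confirm numerically from the recurrence $x_{k+1} = -2 x_k$ derived above (the starting point is an arbitrary small value):

```python
x = 1e-6          # start arbitrarily close to the minimizer 0
trajectory = [x]
for _ in range(10):
    x = -2 * x    # the Newton update derived in part 1
    trajectory.append(x)

# |x_k| = 2^k |x_0| doubles at every step while the sign alternates
```

No matter how small the starting point, the iterates double in magnitude each step instead of approaching 0.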

Exercise 1.2: Convex Set

Show that the set $\{x\in \mathbb{R}^n \mid |x| \leq r\}$ is convex, where $r>0$ is a given real number, and $|x| = \sqrt{x^\top x}$ is the Euclidean norm of $x\in \mathbb{R}^n$.

Solution

Let $u, v \in \Theta = \{x\in \mathbb{R}^n: |x| \leq r\}$, and $\alpha \in [0, 1]$. Suppose $z=\alpha u + (1-\alpha)v$. To show that $\Theta$ is convex, we need to show that $z\in\Theta$, i.e., $|z| \leq r$. To this end, $$ \begin{aligned} |z|^2 &= (\alpha u + (1-\alpha)v)^\top (\alpha u + (1-\alpha)v) \\ &= \alpha^2 |u|^2 + 2 \alpha (1-\alpha)u^\top v + (1-\alpha)^2|v|^2 \end{aligned} $$ Since $u,v \in \Theta$, we have $|u|^2 \leq r^2$ and $|v|^2 \leq r^2$. Furthermore, by the Cauchy-Schwarz inequality, $u^\top v \leq |u| |v| \leq r^2$. Therefore, $$ |z|^2\leq\alpha^2 r^2 + 2 \alpha(1-\alpha)r^2 +(1-\alpha)^2 r^2 = (\alpha + (1-\alpha))^2 r^2 = r^2 $$ Hence, $z\in\Theta$, which shows that $\Theta$ is a convex set, i.e., any point on the line segment joining $u$ and $v$ is also in $\Theta$.
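A quick randomized sanity check of this result (the dimension, sample count, and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
r = 2.0

def in_ball(x):
    return bool(np.linalg.norm(x) <= r + 1e-12)

# Draw random pairs inside the ball and test convex combinations of them
all_inside = True
for _ in range(100):
    u = rng.normal(size=3)
    u *= r * rng.random() / np.linalg.norm(u)   # random point, ||u|| <= r
    v = rng.normal(size=3)
    v *= r * rng.random() / np.linalg.norm(v)   # random point, ||v|| <= r
    lam = rng.random()
    all_inside = all_inside and in_ball(lam * u + (1 - lam) * v)
```

Every sampled segment point stays inside the ball, matching what the proof guarantees for all pairs.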