
Local Optimization

These methods are good at finding local minima. They are often used when you have a reasonable starting point or when the objective function is relatively smooth (ideally, convex).

Line search is a method for finding the minimum of a function along a given direction. It’s often used in conjunction with other optimization methods that require a search direction (e.g., gradient descent, Newton’s method). An example of a line search algorithm is the golden section search.

Golden section search is a method for finding the minimum (or maximum) of a unimodal function within a specified interval. It is a derivative-free optimization technique that relies on the golden ratio to reduce the search interval at each step.

Algorithm

  1. Define the interval $[a, b]$ where the minimum lies.

  2. Compute two interior points:

    $$x_1 = b - \frac{b - a}{\phi}, \quad x_2 = a + \frac{b - a}{\phi}$$

    where $\phi = \frac{1 + \sqrt{5}}{2}$ is the golden ratio.

  3. Evaluate the function at $x_1$ and $x_2$.

  4. Narrow the interval:

    • If $f(x_1) < f(x_2)$, the new interval is $[a, x_2]$.

    • Otherwise, the new interval is $[x_1, b]$.

  5. Repeat steps 2-4 until the interval is sufficiently small.

  6. The minimum is approximately at the midpoint of the final interval.
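The steps above can be sketched in Python as follows (a minimal implementation; the test function $(x - 2)^2$ and the interval $[0, 5]$ are illustrative):

```python
import math

def golden_section_search(f, a, b, tol=1e-8):
    """Minimize a unimodal function f on [a, b] via golden section search."""
    phi = (1 + math.sqrt(5)) / 2  # golden ratio
    x1 = b - (b - a) / phi
    x2 = a + (b - a) / phi
    f1, f2 = f(x1), f(x2)
    while b - a > tol:
        if f1 < f2:
            # Minimum lies in [a, x2]; reuse x1 as the new x2.
            b, x2, f2 = x2, x1, f1
            x1 = b - (b - a) / phi
            f1 = f(x1)
        else:
            # Minimum lies in [x1, b]; reuse x2 as the new x1.
            a, x1, f1 = x1, x2, f2
            x2 = a + (b - a) / phi
            f2 = f(x2)
    return (a + b) / 2  # midpoint of the final interval

# Example: the minimum of (x - 2)^2 on [0, 5] is at x = 2.
x_min = golden_section_search(lambda x: (x - 2) ** 2, 0.0, 5.0)
```

Note that only one new function evaluation is needed per iteration, because one of the two interior points is reused.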

Advantages and Disadvantages

Golden section search is derivative-free and guaranteed to converge for a unimodal function, but each iteration shrinks the interval only by the constant factor $1/\phi \approx 0.618$, so convergence is linear, and the method applies only to one-dimensional problems. It is often used in line search procedures within larger optimization algorithms.

Gradient Descent

This is the workhorse of many optimization problems. It works by iteratively taking steps in the direction of the negative gradient.

The algorithm is simple:

  1. Choose an initial guess, $x_0$.

  2. Calculate the gradient, $\nabla f(x_k)$, at the current point.

  3. Update the position: $x_{k+1} = x_k - \alpha \nabla f(x_k)$, where $\alpha$ is the learning rate (a small positive number that controls the step size).

  4. Repeat steps 2 and 3 until a convergence criterion is met (e.g., the gradient becomes very small, or the change in the objective function is negligible).

The learning rate $\alpha$ must be chosen carefully: if $\alpha$ is too small, convergence will be very slow; if $\alpha$ is too large, the algorithm may overshoot the minimum and oscillate or even diverge. There are various strategies for choosing and adapting the learning rate (e.g., using a fixed value, decreasing it over time, or using more sophisticated methods like Adam).
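The algorithm can be sketched as follows, assuming the gradient is supplied as a function (the fixed learning rate and the test function $f(x, y) = x^2 + 3y^2$ are illustrative):

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Minimize a function given its gradient, using a fixed step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:  # convergence: gradient is very small
            break
        x = x - alpha * g
    return x

# Example: f(x, y) = x^2 + 3y^2 has gradient (2x, 6y) and its minimum at the origin.
x_min = gradient_descent(lambda x: np.array([2 * x[0], 6 * x[1]]), [1.0, 1.0])
```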

Gradient descent is simple to implement, but it can be slow for high-dimensional problems or functions with many local minima, and it is sensitive to the choice of learning rate.

Variants: Stochastic Gradient Descent (SGD, for large datasets, updates with a subset of the data), Momentum (adds a “momentum” term to accelerate convergence and escape shallow local minima), Adam (combines momentum and adaptive learning rates).

Conjugate Gradient

*Figure: Comparison of gradient descent and conjugate gradient methods. Conjugate gradient methods can converge more quickly than gradient descent, especially for quadratic functions.*

Gradient descent can sometimes “zig-zag” inefficiently, especially in valleys with steep sides. Conjugate gradient methods improve upon this by choosing search directions that are conjugate to each other. This ensures that each step makes progress in a new, independent direction. CG converges more quickly than GD for many problems, particularly when the objective function is quadratic.

The search direction, $d_k$, is calculated as:

$$d_k = -\nabla f(x_k) + \beta_k d_{k-1}$$

where $\beta_k$ is a scalar that ensures conjugacy. Different formulas for $\beta_k$ exist (e.g., Fletcher-Reeves, Polak-Ribière). The key idea is that $d_k$ is a linear combination of the current negative gradient and the previous search direction. The update rule is then:

$$x_{k+1} = x_k + \alpha_k d_k$$

where $\alpha_k$ is again chosen (often via line search) to minimize the function along the search direction.
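For a quadratic objective $f(x) = \frac{1}{2} x^T A x - b^T x$ with $A$ symmetric positive definite, the optimal step length along each direction has a closed form, $\alpha_k = r_k^T r_k / (d_k^T A d_k)$, which gives the classic linear conjugate gradient method. A minimal sketch (the matrix $A$ and vector $b$ are illustrative):

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10):
    """Minimize f(x) = 0.5 x^T A x - b^T x with A symmetric positive definite,
    i.e. solve A x = b, via the linear conjugate gradient method."""
    x = np.asarray(x0, dtype=float)
    r = b - A @ x          # residual = negative gradient of f
    d = r.copy()           # first search direction: steepest descent
    for _ in range(len(b)):
        if np.linalg.norm(r) < tol:
            break
        Ad = A @ d
        alpha = (r @ r) / (d @ Ad)        # exact line search for a quadratic
        x = x + alpha * d
        r_new = r - alpha * Ad
        beta = (r_new @ r_new) / (r @ r)  # scalar ensuring conjugate directions
        d = r_new + beta * d              # new direction: mix of -gradient and d
        r = r_new
    return x

# Example: a 2x2 symmetric positive definite system; the minimizer solves A x = b.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x_min = conjugate_gradient(A, b, np.zeros(2))
```

In exact arithmetic this converges in at most $n$ iterations for an $n$-dimensional quadratic, which is the sense in which CG outpaces gradient descent on such problems.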

Newton’s Method

Newton’s method uses both the gradient and the Hessian matrix (second derivatives) to find the minimum of a function. It’s based on approximating the function locally by a quadratic function and then finding the minimum of that quadratic.

We want to find a point where the gradient is zero, $\nabla f(x) = 0$. This is a necessary condition for a minimum (assuming the function is differentiable). Newton’s method uses a Taylor series expansion of the gradient around the current point, $x_k$:

$$\nabla f(x_k + \Delta x) \approx \nabla f(x_k) + H(x_k) \Delta x$$

where $H(x_k)$ is the Hessian matrix at $x_k$. We want to find $\Delta x$ such that $\nabla f(x_k + \Delta x) = 0$. Setting the right-hand side to zero and solving for $\Delta x$, we get:

$$0 = \nabla f(x_k) + H(x_k) \Delta x$$

$$\Delta x = -H(x_k)^{-1} \nabla f(x_k)$$

Algorithm:

  1. Choose an initial guess, $x_0$.

  2. Calculate the gradient, $\nabla f(x_k)$, and the Hessian, $H(x_k)$, at the current point.

  3. Calculate the update step: $\Delta x = -H(x_k)^{-1} \nabla f(x_k)$.

  4. Update the position: $x_{k+1} = x_k + \Delta x$.

  5. Repeat steps 2-4 until a convergence criterion is met.
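The steps above can be sketched in Python; in practice one solves the linear system $H \Delta x = -\nabla f$ rather than forming the inverse Hessian explicitly (the test function $\cosh x + y^2$ is illustrative; its minimum is at the origin):

```python
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    """Minimize a function given its gradient and Hessian via Newton's method."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve H dx = -g instead of computing H^{-1} explicitly.
        dx = np.linalg.solve(hess(x), -g)
        x = x + dx
    return x

# Example: f(x, y) = cosh(x) + y^2, minimum at (0, 0).
grad = lambda v: np.array([np.sinh(v[0]), 2 * v[1]])
hess = lambda v: np.array([[np.cosh(v[0]), 0.0], [0.0, 2.0]])
x_min = newton_minimize(grad, hess, [1.0, 1.0])
```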

Near a minimum, Newton’s method converges very rapidly (much faster than gradient descent). The number of correct digits roughly doubles with each iteration. However, the Hessian can be computationally expensive to calculate, and the method can be sensitive to the choice of starting point. In addition, the Hessian must be nonsingular for the Newton step to be defined, and it should be positive definite if we want the step to be a local descent direction toward a minimum. If the Hessian is not positive definite, the method may diverge or move toward a saddle point or maximum.

BFGS (Broyden-Fletcher-Goldfarb-Shanno)

BFGS is a quasi-Newton method. Newton’s method uses the Hessian (matrix of second derivatives) to find the optimum. However, calculating the Hessian can be computationally expensive. BFGS approximates the Hessian iteratively, making it more practical for many problems.

BFGS iteratively updates an approximation to the inverse Hessian, denoted as $H_k$. The update rule uses the change in the gradient and the change in the position to improve the approximation:

$$H_{k+1} = \left(I - \rho_k s_k y_k^T\right) H_k \left(I - \rho_k y_k s_k^T\right) + \rho_k s_k s_k^T$$

where

$$s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k), \qquad \rho_k = \frac{1}{y_k^T s_k}.$$

The search direction is then calculated as:

$$d_k = -H_k \nabla f(x_k)$$

And the position update is:

$$x_{k+1} = x_k + \alpha_k d_k$$

(Again, $\alpha_k$ is often found via line search.)
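A minimal sketch of these updates with a simple backtracking (Armijo) line search, not a production implementation (the Rosenbrock test function and constants are illustrative; robust codes use Wolfe-condition line searches):

```python
import numpy as np

def bfgs(f, grad, x0, tol=1e-8, max_iter=500):
    """Minimize f via BFGS with a backtracking (Armijo) line search."""
    x = np.asarray(x0, dtype=float)
    n = x.size
    H = np.eye(n)               # initial inverse-Hessian approximation
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        d = -H @ g              # search direction d_k = -H_k grad f(x_k)
        # Backtracking until the Armijo sufficient-decrease condition holds.
        alpha, fx = 1.0, f(x)
        while f(x + alpha * d) > fx + 1e-4 * alpha * (g @ d):
            alpha *= 0.5
        x_new = x + alpha * d
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        sy = s @ y
        if sy > 1e-12:          # curvature condition; skip the update otherwise
            rho = 1.0 / sy
            I = np.eye(n)
            H = (I - rho * np.outer(s, y)) @ H @ (I - rho * np.outer(y, s)) \
                + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# Example: Rosenbrock function, minimum at (1, 1).
f = lambda v: (1 - v[0]) ** 2 + 100 * (v[1] - v[0] ** 2) ** 2
grad = lambda v: np.array([
    -2 * (1 - v[0]) - 400 * v[0] * (v[1] - v[0] ** 2),
    200 * (v[1] - v[0] ** 2),
])
x_min = bfgs(f, grad, [-1.2, 1.0])
```

Skipping the update when $y_k^T s_k$ is not sufficiently positive keeps $H_k$ positive definite, so $d_k$ remains a descent direction.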

Other Local Optimization Methods

The Nelder-Mead (simplex) method doesn’t require derivatives. It constructs a simplex (a shape with $n+1$ vertices in $n$ dimensions) around the current point and iteratively shrinks, expands, or reflects the simplex to find the minimum.

Powell’s method is another derivative-free method that uses conjugate directions to find the minimum. It’s often more efficient than Nelder-Mead for high-dimensional problems.
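Both methods are available through SciPy’s `scipy.optimize.minimize`; a quick illustration on the Rosenbrock function (the starting point is illustrative):

```python
from scipy.optimize import minimize

# Derivative-free minimization of the Rosenbrock function, minimum at (1, 1).
rosen = lambda v: (1 - v[0]) ** 2 + 100 * (v[1] - v[0] ** 2) ** 2

res_nm = minimize(rosen, x0=[-1.2, 1.0], method="Nelder-Mead")
res_pw = minimize(rosen, x0=[-1.2, 1.0], method="Powell")
```

Both calls return an `OptimizeResult` whose `x` attribute holds the estimated minimizer and whose `fun` attribute holds the objective value there.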