
Local Optimization

These methods are good at finding local minima. They are often used when you have a reasonable starting point or when the objective function is relatively smooth (ideally, convex).

Line search is a method for finding the minimum of a function along a given direction. It’s often used in conjunction with other optimization methods that require a search direction (e.g., gradient descent, Newton’s method). An example of a line search algorithm is the golden section search.

Golden section search is a method for finding the minimum (or maximum) of a unimodal function within a specified interval. It is a derivative-free optimization technique that relies on the golden ratio to reduce the search interval at each step.

Algorithm

  1. Define the interval [a, b] where the minimum lies.

  2. Compute two interior points:

    x_1 = b - \frac{b - a}{\phi}, \quad x_2 = a + \frac{b - a}{\phi}

    where \phi = \frac{1 + \sqrt{5}}{2} is the golden ratio.

  3. Evaluate the function at x_1 and x_2.

  4. Narrow the interval:

    • If f(x_1) < f(x_2), the new interval is [a, x_2].
    • Otherwise, the new interval is [x_1, b].
  5. Repeat steps 2-4 until the interval is sufficiently small. Because of the golden ratio spacing, one interior point from the previous iteration can be reused, so each iteration requires only one new function evaluation.

  6. The minimum is approximately at the midpoint of the final interval.
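
A minimal sketch of the procedure in Python; the test function, interval, and tolerance below are illustrative choices, not part of the algorithm itself:

```python
import math

def golden_section_search(f, a, b, tol=1e-6):
    """Minimize a unimodal function f on [a, b] via golden section search."""
    phi = (1 + math.sqrt(5)) / 2
    x1 = b - (b - a) / phi
    x2 = a + (b - a) / phi
    f1, f2 = f(x1), f(x2)
    while (b - a) > tol:
        if f1 < f2:
            # Minimum lies in [a, x2]; reuse x1 as the new x2.
            b, x2, f2 = x2, x1, f1
            x1 = b - (b - a) / phi
            f1 = f(x1)
        else:
            # Minimum lies in [x1, b]; reuse x2 as the new x1.
            a, x1, f1 = x1, x2, f2
            x2 = a + (b - a) / phi
            f2 = f(x2)
    return (a + b) / 2

# Example: the minimum of (x - 2)^2 on [0, 5] is at x = 2.
print(golden_section_search(lambda x: (x - 2) ** 2, 0, 5))
```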

Advantages and Disadvantages

Its main advantages are that it requires no derivatives and shrinks the interval by a constant factor each iteration, guaranteeing convergence for unimodal functions. Its main disadvantages are that it only handles one-dimensional problems, converges only linearly, and assumes the function is unimodal on the interval. Golden section search is often used in line search procedures within larger optimization algorithms.

Gradient Descent

This is the workhorse of many optimization problems. It works by iteratively taking steps in the direction of the negative gradient.

The algorithm is simple:

  1. Choose an initial guess, x_0.
  2. Calculate the gradient, \nabla f(x_k), at the current point.
  3. Update the position: x_{k+1} = x_k - \alpha \nabla f(x_k), where α is the learning rate (a small positive number that controls the step size).
  4. Repeat steps 2 and 3 until a convergence criterion is met (e.g., the gradient becomes very small, or the change in the objective function is negligible).
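
A minimal sketch in Python/NumPy, assuming a simple quadratic objective; the starting point, learning rate, and stopping tolerance are illustrative:

```python
import numpy as np

def gradient_descent(grad, x0, alpha=0.1, tol=1e-8, max_iter=10_000):
    """Minimize a function given its gradient via fixed-step gradient descent."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # convergence: gradient nearly zero
            break
        x = x - alpha * g             # step in the negative gradient direction
    return x

# Example: f(x, y) = x^2 + 10*y^2 has its minimum at the origin.
grad_f = lambda x: np.array([2 * x[0], 20 * x[1]])
print(gradient_descent(grad_f, [3.0, -2.0], alpha=0.05))
```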

The learning rate α must be chosen carefully: if α is too small, convergence will be very slow; if α is too large, the algorithm may overshoot the minimum and oscillate or even diverge. There are various strategies for choosing and adapting the learning rate (e.g., using a fixed value, decreasing it over time, or using more sophisticated methods like Adam).

GD is simple to implement, but it can be slow for high-dimensional or ill-conditioned problems, can get stuck in local minima of non-convex functions, and is sensitive to the choice of learning rate.

Variants: Stochastic Gradient Descent (SGD, for large datasets, updates with a subset of the data), Momentum (adds a “momentum” term to accelerate convergence and escape shallow local minima), Adam (combines momentum and adaptive learning rates).
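
As a brief illustration of the momentum idea, a sketch under the same illustrative setup as above (the momentum coefficient 0.9 is a common but arbitrary choice):

```python
import numpy as np

def momentum_descent(grad, x0, alpha=0.05, beta=0.9, max_iter=1000):
    """Gradient descent with momentum: the velocity accumulates past gradients."""
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(max_iter):
        v = beta * v - alpha * grad(x)  # blend previous velocity with the new gradient step
        x = x + v
    return x

# Same quadratic bowl as before; the minimum is at the origin.
grad_f = lambda x: np.array([2 * x[0], 20 * x[1]])
print(momentum_descent(grad_f, [3.0, -2.0]))
```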

Conjugate Gradient

Figure: Comparison of gradient descent and conjugate gradient methods. Conjugate gradient methods can converge more quickly than gradient descent, especially for quadratic functions.

Gradient descent can sometimes “zig-zag” inefficiently, especially in valleys with steep sides. Conjugate gradient methods improve upon this by choosing search directions that are conjugate to each other. This ensures that each step makes progress in a new, independent direction. CG converges more quickly than GD for many problems, particularly when the objective function is quadratic.

The search direction, d_k, is calculated as:

d_k = -\nabla f(x_k) + \beta_k d_{k-1}

where \beta_k is a scalar that ensures conjugacy. Different formulas for \beta_k exist (e.g., Fletcher-Reeves, Polak-Ribière). The key idea is that d_k is a linear combination of the current negative gradient and the previous search direction. The update rule is then:

x_{k+1} = x_k + \alpha_k d_k

where \alpha_k is again chosen (often via line search) to minimize the function along the search direction.
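
A minimal sketch of nonlinear conjugate gradient using the Fletcher-Reeves formula for \beta_k and a backtracking (Armijo) line search for \alpha_k; the test function, starting point, and line search constants are illustrative assumptions:

```python
import numpy as np

def conjugate_gradient(f, grad, x0, tol=1e-6, max_iter=200):
    """Nonlinear conjugate gradient (Fletcher-Reeves) with backtracking line search."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = -g                                      # first direction: steepest descent
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        if g.dot(d) >= 0:                       # safeguard: restart if d is not a descent direction
            d = -g
        # Backtracking (Armijo) line search for alpha along d.
        alpha, fx, slope = 1.0, f(x), g.dot(d)
        while f(x + alpha * d) > fx + 1e-4 * alpha * slope:
            alpha *= 0.5
        x_new = x + alpha * d
        g_new = grad(x_new)
        beta = g_new.dot(g_new) / g.dot(g)      # Fletcher-Reeves coefficient
        d = -g_new + beta * d                   # new direction, built from the previous one
        x, g = x_new, g_new
    return x

# Example: an elongated quadratic bowl with its minimum at the origin.
f = lambda x: x[0] ** 2 + 10 * x[1] ** 2
grad_f = lambda x: np.array([2 * x[0], 20 * x[1]])
print(conjugate_gradient(f, grad_f, [3.0, -2.0]))
```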

Newton’s Method

Newton’s method uses both the gradient and the Hessian matrix (second derivatives) to find the minimum of a function. It’s based on approximating the function locally by a quadratic function and then finding the minimum of that quadratic.

We want to find a point where the gradient is zero, \nabla f(x) = 0. This is a necessary condition for a minimum (assuming the function is differentiable). Newton’s method uses a Taylor series expansion of the gradient around the current point, x_k:

\nabla f(x_k + \Delta x) \approx \nabla f(x_k) + H(x_k) \Delta x

where H(x_k) is the Hessian matrix at x_k. We want to find \Delta x such that \nabla f(x_k + \Delta x) = 0. Setting the right-hand side to zero and solving for \Delta x, we get:

0 = \nabla f(x_k) + H(x_k) \Delta x
\Delta x = -H(x_k)^{-1} \nabla f(x_k)

Algorithm:

  1. Choose an initial guess, x_0.
  2. Calculate the gradient, \nabla f(x_k), and the Hessian, H(x_k), at the current point.
  3. Calculate the update step: \Delta x = -H(x_k)^{-1} \nabla f(x_k).
  4. Update the position: x_{k+1} = x_k + \Delta x.
  5. Repeat steps 2-4 until a convergence criterion is met.
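
A minimal sketch in Python/NumPy; the test function, its hand-coded gradient and Hessian, and the starting point are illustrative. Note that the code solves H(x_k) \Delta x = -\nabla f(x_k) rather than forming the inverse explicitly:

```python
import numpy as np

def newtons_method(grad, hess, x0, tol=1e-10, max_iter=50):
    """Newton's method: solve H(x_k) dx = -grad f(x_k) at each step."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        dx = np.linalg.solve(hess(x), -g)   # avoids explicitly inverting the Hessian
        x = x + dx
    return x

# Example: f(x, y) = (x - 1)^2 + 5*(y + 2)^2, minimum at (1, -2).
grad_f = lambda x: np.array([2 * (x[0] - 1), 10 * (x[1] + 2)])
hess_f = lambda x: np.array([[2.0, 0.0], [0.0, 10.0]])
print(newtons_method(grad_f, hess_f, [5.0, 5.0]))
```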

Near a minimum, Newton’s method converges very rapidly (much faster than gradient descent). The number of correct digits roughly doubles with each iteration. However, the Hessian can be computationally expensive to calculate, and the method can be sensitive to the choice of starting point. In addition, the Hessian matrix must be positive definite (which is true near a minimum) for the inverse to exist and for the method to move towards a minimum. If the Hessian is not positive definite, the method may diverge or move towards a saddle point or maximum.

BFGS (Broyden-Fletcher-Goldfarb-Shanno)

BFGS is a quasi-Newton method. Newton’s method uses the Hessian (matrix of second derivatives) to find the optimum. However, calculating the Hessian can be computationally expensive. BFGS approximates the Hessian iteratively, making it more practical for many problems.

BFGS iteratively updates an approximation to the Hessian, denoted B_k. The update rule looks complex, but the key idea is that it uses the change in the gradient and the change in the position to improve the approximation:

B_{k+1} = B_k + \frac{y_k y_k^T}{y_k^T s_k} - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k}

where s_k = x_{k+1} - x_k and y_k = \nabla f(x_{k+1}) - \nabla f(x_k).

The search direction is then calculated as:

d_k = -B_k^{-1} \nabla f(x_k)

And the position update is:

x_{k+1} = x_k + \alpha_k d_k

(Again, \alpha_k is often found via line search. In practice, implementations usually maintain an approximation to the inverse Hessian directly, so the search direction can be computed with a matrix-vector product instead of solving a linear system at every step.)
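
In practice BFGS is rarely hand-coded; a minimal usage sketch with SciPy's implementation on the Rosenbrock function (the starting point is arbitrary):

```python
import numpy as np
from scipy.optimize import minimize, rosen, rosen_der

# Minimize the Rosenbrock function starting from an arbitrary point.
x0 = np.array([-1.2, 1.0])
result = minimize(rosen, x0, method="BFGS", jac=rosen_der)

print(result.x)    # should be close to the true minimum at (1, 1)
print(result.nit)  # number of iterations taken
```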

Other Local Optimization Methods

The Nelder-Mead (simplex) method doesn’t require derivatives. It constructs a simplex (a shape with n+1 vertices in n dimensions) around the current point and iteratively shrinks, expands, or reflects the simplex to find the minimum.

Powell’s method is another derivative-free method that uses conjugate directions to find the minimum. It’s often more efficient than Nelder-Mead for high-dimensional problems.
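
Both methods are available through SciPy; a minimal usage sketch on the Rosenbrock function (the objective and starting point are illustrative):

```python
import numpy as np
from scipy.optimize import minimize, rosen

x0 = np.array([-1.2, 1.0])

# Nelder-Mead: derivative-free simplex search.
nm = minimize(rosen, x0, method="Nelder-Mead")

# Powell: derivative-free search along conjugate directions.
pw = minimize(rosen, x0, method="Powell")

print(nm.x, nm.nfev)   # solution and number of function evaluations
print(pw.x, pw.nfev)
```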