Review of gradient methods in mathematical optimization problems

Gradient methods for unconstrained optimization use only the first derivatives of the objective function and are linear approximation methods at each step: the objective function is replaced at each step by the hyperplane tangent to its graph at the current point.

At the k-th stage of gradient methods, the transition from the point Xk to the point Xk+1 is described by the relation

Xk+1 = Xk + λk Sk,   (1.2)

where λk is the step size and Sk is a vector in the direction Xk+1 − Xk.

Steepest descent methods

For the first time, such a method was considered and applied by A. Cauchy in the 19th century. Its idea is simple: the gradient of the objective function f(X) at any point is a vector in the direction of the greatest increase in the value of the function. Therefore, the antigradient is directed towards the greatest decrease of the function and is the direction of steepest descent. The antigradient (and the gradient) is orthogonal to the level surface of f(X) at the point X. If in (1.2) we introduce the direction

Sk = −grad f(Xk),

then this will be the direction of steepest descent at the point Xk.

We get the transition formula from Xk to Xk+1:

Xk+1 = Xk − λk grad f(Xk).

The anti-gradient only gives the direction of descent, not the step size. In general, one step does not give a minimum point, so the descent procedure must be applied several times. At the minimum point, all components of the gradient are equal to zero.

All gradient methods use the above idea and differ from each other in technical details: the derivatives are calculated by an analytical formula or by finite-difference approximation; the step size can be constant, change according to some rules, or be selected by applying one-dimensional optimization methods in the direction of the antigradient, and so on.

We will not dwell on the details, because the steepest descent method is generally not recommended as a serious optimization procedure.

One of the disadvantages of this method is that it converges to an arbitrary stationary point, including a saddle point, which cannot be a solution.

But the most important thing is the very slow convergence of steepest descent in the general case. The point is that the descent is "fastest" only in a local sense. If the level surfaces of the function are strongly elongated (a "ravine"), then the antigradient is directed almost orthogonally to the bottom of the "ravine", that is, to the best direction for reaching the minimum. In this sense, the literal meaning of the English term "steepest descent" (descent along the steepest slope) reflects the state of affairs better than the term "fastest descent" adopted in the Russian-language specialized literature. One way out of this situation is to use the information contained in the second partial derivatives. Another way out is to change the scales of the variables.
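To make the "ravine" effect concrete, here is a minimal sketch (the quadratic and starting point are illustrative assumptions, not from the source) of steepest descent with exact line search on the elongated quadratic f(x) = x1² + 100·x2²; NumPy is assumed. The iterates zigzag across the valley and approach the minimum slowly.

```python
import numpy as np

# Sketch: steepest descent with exact line search on the "ravine"
# quadratic f(x) = x1^2 + 100*x2^2 (Hessian H = diag(2, 200)).
def grad(x):
    return np.array([2.0 * x[0], 200.0 * x[1]])

x = np.array([10.0, 1.0])
for k in range(10):
    g = grad(x)
    Hg = np.array([2.0 * g[0], 200.0 * g[1]])  # H g
    lam = g.dot(g) / g.dot(Hg)                 # exact step for a quadratic
    x = x - lam * g                            # move along the antigradient
    print(k, x, x[0] ** 2 + 100 * x[1] ** 2)   # slow, zigzagging progress
```

Rescaling x2 (the second way out mentioned above) makes the level lines circular, and the same iteration then converges in a single step.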


Fletcher-Reeves conjugate gradient method

The conjugate gradient method constructs a sequence of search directions that are linear combinations of the current steepest descent direction and the previous search directions, i.e.

Sk+1 = −grad f(Xk+1) + βk Sk,

and the coefficients βk are chosen so as to make the search directions conjugate. It has been proved that

βk = ‖grad f(Xk+1)‖² / ‖grad f(Xk)‖²,

and this is a very valuable result that makes it possible to build a fast and efficient optimization algorithm.

Fletcher-Reeves algorithm

1. At X0, the initial direction S0 = −grad f(X0) is calculated.

2. At the k-th step, a one-dimensional search in the direction Sk finds the minimum of f(X), which determines the point Xk+1.

3. Calculate f(Xk+1) and grad f(Xk+1).

4. The direction Sk+1 is determined from the relation

Sk+1 = −grad f(Xk+1) + βk Sk,   βk = ‖grad f(Xk+1)‖² / ‖grad f(Xk)‖².

5. After the (n+1)-th iteration (i.e., with k = n), a restart is performed: X0 = Xn+1 is assumed and the transition to step 1 is made.

6. The algorithm stops when

‖grad f(Xk)‖ < ε,

where ε is an arbitrary small positive constant.

The advantage of the Fletcher-Reeves algorithm is that it does not require matrix inversion and saves computer memory, since it does not need the matrices used in Newtonian methods, yet it is almost as efficient as quasi-Newtonian algorithms. Since the search directions are mutually conjugate, a quadratic function will be minimized in no more than n steps. In the general case, a restart is used, which makes it possible to obtain the result.

The Fletcher-Reeves algorithm is sensitive to the accuracy of a one-dimensional search, so any rounding errors that may occur must be corrected when using it. Also, the algorithm may fail in situations where the Hessian becomes ill-conditioned. The algorithm has no guarantee of convergence always and everywhere, although practice shows that the algorithm almost always gives a result.
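A compact sketch of the scheme above, assuming the objective f and its gradient g are supplied as callables and that NumPy and SciPy are available; minimize_scalar plays the role of the one-dimensional search, and the test function is an illustrative assumption.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fletcher_reeves(f, g, x0, n_restart, eps=1e-6, max_iter=200):
    """Sketch of the scheme above; f, g are callables for f(X) and grad f(X)."""
    x = np.asarray(x0, dtype=float)
    grad = g(x)
    s = -grad                                   # step 1: S0 = -grad f(X0)
    for k in range(max_iter):
        if np.linalg.norm(grad) < eps:          # step 6: stopping criterion
            break
        lam = minimize_scalar(lambda a: f(x + a * s)).x   # step 2: 1D search
        x = x + lam * s
        grad_new = g(x)                         # step 3
        beta = grad_new.dot(grad_new) / grad.dot(grad)    # step 4: FR coefficient
        s = -grad_new + beta * s
        grad = grad_new
        if (k + 1) % n_restart == 0:            # step 5: periodic restart
            s = -grad
    return x

# Illustrative quadratic: minimized in at most n = 2 conjugate-gradient steps.
f = lambda x: (x[0] - 2) ** 2 + 10 * (x[1] + 1) ** 2
g = lambda x: np.array([2 * (x[0] - 2), 20 * (x[1] + 1)])
print(fletcher_reeves(f, g, [0.0, 0.0], n_restart=3))
```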

Newtonian methods

The direction of search corresponding to the steepest descent is associated with a linear approximation of the objective function. Methods using second derivatives arose from a quadratic approximation of the objective function, i.e. when expanding the function in a Taylor series, terms of the third and higher orders are discarded.

f(X) ≈ f(Xk) + grad f(Xk)ᵀ (X − Xk) + ½ (X − Xk)ᵀ H(Xk) (X − Xk),

where H(Xk) is the Hessian matrix.

The minimum of the right-hand side (if it exists) is reached at the same point as the minimum of the quadratic form. Let us write the formula for determining the search direction:

H(Xk) Sk = −grad f(Xk).

The minimum is reached at

Sk = −H(Xk)⁻¹ grad f(Xk).

An optimization algorithm in which the search direction is determined from this relation is called Newton's method, and the direction is Newton's direction.

In problems of finding the minimum of an arbitrary quadratic function with a positive-definite matrix of second derivatives, Newton's method gives the solution in one iteration, regardless of the choice of the starting point.
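A quick numerical check of this statement (a sketch with an arbitrarily chosen quadratic, not an example from the source; NumPy assumed):

```python
import numpy as np

# One Newton step from any starting point lands exactly at the minimizer
# of a quadratic with a positive-definite Hessian.
A = np.array([[4.0, 1.0], [1.0, 3.0]])    # positive-definite Hessian
b = np.array([1.0, 2.0])                  # f(x) = 0.5 x^T A x - b^T x
grad = lambda x: A @ x - b

x0 = np.array([10.0, -7.0])               # arbitrary starting point
x1 = x0 - np.linalg.solve(A, grad(x0))    # Newton step: x - H^{-1} grad f(x)
print(x1, np.linalg.solve(A, b))          # both arrays coincide: the minimizer
```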

Classification of Newtonian Methods

Newton's method proper consists of a single application of the Newtonian direction to optimize a quadratic function. If the function is not quadratic, the following theorem holds.

Theorem 1.4. If the Hessian matrix of a general non-linear function f at the minimum point X* is positive definite, the starting point is chosen close enough to X*, and the step lengths are chosen correctly, then Newton's method converges to X* at a quadratic rate.

Newton's method is considered to be the reference one, and all developed optimization procedures are compared with it. However, Newton's method works only with a positive-definite and well-conditioned Hessian matrix (its determinant must be substantially greater than zero; more precisely, the ratio of the largest and smallest eigenvalues should be close to one). To eliminate this shortcoming, modified Newtonian methods are used, which use Newtonian directions as far as possible and deviate from them only when necessary.

The general principle of modifications of Newton's method is as follows: at each iteration, a positive-definite matrix H̃k "related" to the Hessian H(Xk) is first constructed, and the search direction is then calculated by the formula

Sk = −H̃k⁻¹ grad f(Xk).

Since H̃k is positive definite, −H̃k⁻¹ grad f(Xk) will necessarily be a direction of descent. The construction procedure is organized so that H̃k coincides with the Hessian matrix if the latter is positive definite. These procedures are built on the basis of certain matrix factorizations.

Another group of methods, almost as fast as Newton's method, is based on approximating the Hessian matrix by finite differences, since exact values of the derivatives are not necessary for optimization. These methods are useful when the analytical calculation of derivatives is difficult or simply impossible. Such methods are called discrete Newton methods.
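A sketch of the discrete-Newton idea under the assumption that only the gradient is available as a callable: each column of the Hessian is approximated by a forward difference of the gradient (the test gradient below is an illustrative assumption).

```python
import numpy as np

def fd_hessian(g, x, h=1e-5):
    """Approximate the Hessian column by column from gradient differences."""
    n = x.size
    H = np.empty((n, n))
    g0 = g(x)
    for i in range(n):
        e = np.zeros(n)
        e[i] = h
        H[:, i] = (g(x + e) - g0) / h   # i-th column of the Hessian
    return 0.5 * (H + H.T)              # symmetrize

# Gradient of f = (x1-1)^2 + x1*x2 + 3*x2^2 (Hessian [[2,1],[1,6]]).
g = lambda x: np.array([2 * (x[0] - 1) + x[1], x[0] + 6 * x[1]])
x = np.array([3.0, 3.0])
for _ in range(5):                      # Newton iteration with the FD Hessian
    x = x - np.linalg.solve(fd_hessian(g, x), g(x))
print(x)                                # approx (12/11, -2/11)
```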

The key to the effectiveness of Newtonian-type methods is taking into account information about the curvature of the function being minimized, which is contained in the Hessian matrix and makes it possible to build locally exact quadratic models of the objective function. But it is possible to collect and accumulate information about the curvature of a function based on observing the change in the gradient during iterations of the descent.

The corresponding methods based on the possibility of approximating the curvature of a non-linear function without the explicit formation of its Hessian matrix are called quasi-Newtonian methods.

Note that when constructing an optimization procedure of the Newtonian type (including the quasi-Newtonian one), it is necessary to take into account the possibility of encountering a saddle point. In this case, the vector of the "best" search direction will keep pointing toward the saddle point instead of moving away from it in the downhill direction.

Newton-Raphson method

This method consists in repeated use of the Newtonian direction when optimizing functions that are not quadratic.

The basic iterative formula of multivariate optimization

Xk+1 = Xk + λk Sk

is used in this method with the direction chosen from the relation

Sk = −H(Xk)⁻¹ grad f(Xk),   λk = 1.

The real step length is hidden in the non-normalized Newtonian direction.

Since this method does not require the value of the objective function at the current point, it is sometimes called an indirect or analytical optimization method. Its ability to determine the minimum of a quadratic function in a single calculation looks extremely attractive at first glance. However, this "single calculation" is costly. First of all, it is necessary to calculate n first-order partial derivatives and n(n+1)/2 second-order ones. In addition, the Hessian matrix must be inverted, which requires about n³ computational operations. For the same cost, conjugate direction methods or conjugate gradient methods can take about n steps, i.e., achieve practically the same result. Thus, an iteration of the Newton-Raphson method does not provide advantages in the case of a quadratic function.

If the function is not quadratic, then

  • the initial direction, generally speaking, no longer points to the actual minimum point, which means the iterations must be repeated many times;
  • a step of unit length can lead to a point with a worse value of the objective function, and the search can give the wrong direction if, for example, the Hessian is not positive definite;
  • the Hessian can become ill-conditioned, making it impossible to invert it, i.e., to determine the direction for the next iteration.

The strategy itself does not distinguish which stationary point (minimum, maximum, or saddle point) the search is approaching, and the values of the objective function, by which one could track whether the function is increasing, are not computed. So everything depends on which stationary point's zone of attraction the starting point of the search falls into. The Newton-Raphson strategy is rarely used on its own, without modification of one kind or another.

Pearson methods

Pearson proposed several methods for approximating the inverse Hessian without explicitly calculating the second derivatives, i.e. by observing changes in the direction of the antigradient. In this case, conjugate directions are obtained. These algorithms differ only in details. Here are those that are most widely used in applied fields.

Pearson's Algorithm #2.

In this algorithm, the inverse Hessian is approximated by the matrix Hk calculated at each step by the formula

An arbitrary positive-definite symmetric matrix is chosen as the initial matrix H0.

This Pearson algorithm often leads to situations where the matrix Hk becomes ill-conditioned: it begins to oscillate between positive definite and non-positive definite, while the determinant of the matrix is close to zero. To avoid this situation, the matrix must be re-initialized every n steps by setting it equal to H0.

Pearson's Algorithm #3.

In this algorithm, the matrix Hk+1 is determined from the formula

Hk+1 = Hk +

The descent trajectory generated by the algorithm is similar to the behavior of the Davidon-Fletcher-Powell algorithm, but the steps are slightly shorter. Pearson also proposed a variant of this algorithm with cyclic re-initialization of the matrix.

Projective Newton-Raphson algorithm

Pearson proposed the idea of an algorithm in which the matrix is calculated from the relation

H0=R0, where the matrix R0 is the same as the initial matrices in the previous algorithms.

When k is a multiple of the number of independent variables n, the matrix Hk is replaced by the matrix Rk+1 calculated as the sum

The value Hk(grad f(Xk+1) − grad f(Xk)) is the projection of the gradient increment vector grad f(Xk+1) − grad f(Xk), orthogonal to all gradient increment vectors of the previous steps. After every n steps, Rk is an approximation of the inverse Hessian H⁻¹(Xk), so that in essence an (approximate) Newton search is performed.

Davidon-Fletcher-Powell Method

This method has other names, the variable metric method and the quasi-Newton method, because it uses both of these approaches.

The Davidon-Fletcher-Powell (DFP) method is based on the use of Newtonian directions, but does not require the calculation of the inverse Hessian at each step.

The search direction at step k is

Sk = −Hk grad f(Xk),

where Hk is a positive-definite symmetric matrix that is updated at each step and, in the limit, becomes equal to the inverse Hessian. The identity matrix is usually chosen as the initial matrix H0. The iterative DFP procedure can be represented as follows:

1. At step k, there is a point Xk and a positive-definite matrix Hk.

2. Select the new search direction

Sk = −Hk grad f(Xk).

3. A one-dimensional search (usually by cubic interpolation) along Sk determines λk minimizing the function.

4. Set Vk = λk Sk.

5. Set Xk+1 = Xk + Vk.

6. Determine f(Xk+1) and grad f(Xk+1). If Vk or grad f(Xk+1) are small enough, the procedure terminates.

7. Set Uk = grad f(Xk+1) − grad f(Xk).

8. The matrix Hk is updated according to the formula

Hk+1 = Hk + Ak + Bk,   Ak = Vk Vkᵀ / (Vkᵀ Uk),   Bk = −(Hk Uk)(Hk Uk)ᵀ / (Ukᵀ Hk Uk).

9. Increase k by one and return to step 2.

The method is effective in practice if the gradient calculation error is small and the matrix Hk does not become ill-conditioned.

The matrix Ak ensures the convergence of Hk to G⁻¹, where G is the Hessian; the matrix Bk ensures the positive definiteness of Hk+1 at all stages and, in the limit, eliminates the influence of H0.

In the case of a quadratic function, the generated directions satisfy

Viᵀ G Vj = 0,   i ≠ j,

i.e., the DFP algorithm uses conjugate directions.

Thus, the DFP method uses both the ideas of the Newtonian approach and the properties of conjugate directions, and when minimizing a quadratic function it converges in no more than n iterations. If the function being optimized is close to quadratic in form, the DFP method is efficient due to a good approximation of G⁻¹ (Newton's method). If the objective function has a general form, the DFP method is effective due to the use of conjugate directions.
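A minimal sketch of the DFP iteration as listed above (steps 2 through 8), assuming NumPy and SciPy; the test function and starting point are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def dfp(f, g, x0, eps=1e-8, max_iter=100):
    """Sketch of steps 1-9 above; f, g are callables for f(X) and grad f(X)."""
    x = np.asarray(x0, dtype=float)
    H = np.eye(x.size)                     # H0: identity, positive definite
    for _ in range(max_iter):
        grad = g(x)
        if np.linalg.norm(grad) < eps:     # step 6: stopping criterion
            break
        s = -H @ grad                      # step 2: search direction
        lam = minimize_scalar(lambda a: f(x + a * s)).x   # step 3
        v = lam * s                        # step 4: Vk = Xk+1 - Xk
        u = g(x + v) - grad                # step 7: gradient difference Uk
        x = x + v                          # step 5
        Hu = H @ u                         # step 8: DFP update Hk + Ak + Bk
        H = H + np.outer(v, v) / v.dot(u) - np.outer(Hu, Hu) / u.dot(Hu)
    return x

f = lambda x: (x[0] - 2) ** 2 + 10 * (x[1] + 1) ** 2
g = lambda x: np.array([2 * (x[0] - 2), 20 * (x[1] + 1)])
print(dfp(f, g, [0.0, 0.0]))               # converges to (2, -1)
```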

Gradient optimization methods

Optimization problems with non-linear or hard-to-compute relations that determine the optimization criterion and constraints are the subject of non-linear programming. As a rule, solutions to non-linear programming problems can only be found by numerical methods using computer technology. Among them, the most commonly used are gradient methods (methods of relaxation, gradient, steepest descent and ascent), non-gradient deterministic search methods (scanning methods, simplex, etc.), and random search methods. All these methods are used in the numerical determination of optima and are widely covered in the specialized literature.

In the general case, the value of the optimization criterion R can be seen as a function R(x1, x2, ..., xn) defined in n-dimensional space. Since there is no visual graphical representation of an n-dimensional space, we will use the case of a two-dimensional space.

If R(x1, x2) is continuous in the region D, then around the optimal point M°(x1°, x2°) it is possible to draw a closed line in this plane along which R = const. Many such lines, called lines of equal level, can be drawn around the optimal point (depending on the step chosen for R).

Among the methods used to solve problems of nonlinear programming, a significant place is occupied by methods based on the analysis of the directional derivative of the function being optimized. If at each point in space a scalar function of several variables takes well-defined values, then we are dealing with a scalar field (a temperature field, pressure field, density field, etc.). A vector field (a field of forces, velocities, etc.) is defined in a similar way. Isotherms, isobars, isochores, etc. are all lines (surfaces) of equal level, of equal values of a function (temperature, pressure, volume, etc.). Since the value of the function changes from point to point in space, it becomes necessary to determine the rate of change of the function in space, that is, the directional derivative.

The concept of a gradient is widely used in engineering calculations to find the extrema of non-linear functions. Gradient methods are numerical methods of the search type. They are universal and especially effective in searching for extrema of nonlinear functions under constraints, as well as when the analytical form of the function is unknown. The essence of these methods is to determine the values of the variables that provide the extremum of the goal function by moving along the gradient (when searching for a maximum) or in the opposite direction (for a minimum). Various gradient methods differ from one another in the way the movement towards the optimum is determined. The point is that if the lines of equal level R(x1, x2) characterize the dependence R(x1, x2) graphically, then the search for the optimal point can be carried out in different ways. For example, one can draw a grid on the plane x1, x2 with the values of R indicated at the grid nodes (Fig. 2.13).

Then the extremum can be selected among the nodal values. This path is not rational: it requires a large number of calculations, and the accuracy is low, since it depends on the step, and the optimum can be located between the nodes.

Numerical Methods

Mathematical models contain relationships compiled on the basis of a theoretical analysis of the processes under study or obtained as a result of processing experiments (tables of data, graphs). In any case, the mathematical model describes the real process only approximately. Therefore, the question of the accuracy and adequacy of the model is the most important one. The need for approximations also arises in the very solution of equations. Until recently, models containing non-linear or partial differential equations could not be solved analytically. The same applies to numerous classes of integrals that cannot be evaluated in closed form. However, the development of methods of numerical analysis has made it possible to vastly expand the boundaries of what can be analyzed in mathematical models, especially with the use of computers.

Numerical methods are used to approximate functions, to solve differential equations and their systems, to integrate and differentiate, to calculate numerical expressions.

A function can be defined analytically, by a table, or by a graph. When performing research, a common problem is the approximation of a function by an analytic expression that satisfies the stated conditions. This involves four tasks:

Selection of nodal points, i.e., conducting experiments at certain values (levels) of the independent variables (if the step of changing a factor is chosen incorrectly, we will either "skip" a characteristic feature of the process under study, or we will lengthen the procedure and increase the complexity of finding the pattern);

The choice of approximating functions in the form of polynomials, empirical formulas, depending on the content of a particular problem (one should strive for the maximum simplification of approximating functions);

Selection and use of goodness-of-fit criteria, on the basis of which the parameters of the approximating functions are found;

Fulfillment of the requirements of a given accuracy to the choice of an approximating function.

In problems of approximating functions by polynomials, three classes are used:

Linear combinations of power functions (Taylor series; Lagrange and Newton polynomials, etc.);

Combinations of the functions cos nx, sin nx (Fourier series);

Polynomials formed from the exponential functions exp(−αix).

When finding the approximating function, various criteria of agreement with the experimental data are used.

When optimizing by the gradient method, the optimum of the object under study is sought in the direction of the fastest increase (decrease) of the output variable, i.e. in the direction of the gradient. But before you take a step in the direction of the gradient, you need to calculate it. The gradient can be calculated either from the available model

grad y(X) = (∂y/∂x1)·i + (∂y/∂x2)·j + (∂y/∂x3)·k + ...,

where ∂y/∂xi is the partial derivative with respect to the i-th factor and i, j, k are unit vectors in the direction of the coordinate axes of the factor space, or according to the results of n trial movements in the direction of the coordinate axes.

If the mathematical model of the statistical process has the form of a linear polynomial whose regression coefficients bi are the partial derivatives of the expansion of the function y = f(X) in a Taylor series in powers of xi, then the optimum is sought in the direction of the gradient with a certain step hi:

grad y(X) = b1·e1 + b2·e2 + ... + bn·en.

The direction is corrected after each step.
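When no model is available, the gradient components can be estimated from trial movements along the coordinate axes, as mentioned above. A sketch with an assumed response function y (NumPy assumed):

```python
import numpy as np

def trial_gradient(y, x, dx=1e-2):
    """Estimate grad y from n trial movements along the coordinate axes."""
    y0 = y(x)
    grad = np.empty(x.size)
    for i in range(x.size):
        e = np.zeros(x.size)
        e[i] = dx                           # trial increment of the i-th factor
        grad[i] = (y(x + e) - y0) / dx      # increment of y ~ dy/dx_i
    return grad

# Illustrative response surface with a maximum at (1, 2).
y = lambda x: -(x[0] - 1) ** 2 - (x[1] - 2) ** 2
x = np.array([0.0, 0.0])
for _ in range(50):
    x = x + 0.1 * trial_gradient(y, x)      # working step along the gradient
print(x)                                    # approaches (1, 2)
```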

The gradient method, together with its numerous modifications, is a common and effective method for finding the optimum of the objects under study. Consider one of the modifications of the gradient method - the steep ascent method.

The steep ascent method, or otherwise the Box-Wilson method, combines the advantages of three methods - the Gauss-Seidel method, the gradient method and the method of full (or fractional) factorial experiments, as a means of obtaining a linear mathematical model. The task of the steep ascent method is to carry out stepping in the direction of the fastest increase (or decrease) of the output variable, that is, along grad y (X). Unlike the gradient method, the direction is corrected not after each next step, but when a partial extremum of the objective function is reached at some point in a given direction, as is done in the Gauss-Seidel method. At the point of a partial extremum, a new factorial experiment is set up, a mathematical model is determined, and a steep ascent is again carried out. In the process of moving towards the optimum by this method, a statistical analysis of intermediate search results is regularly carried out. The search is terminated when quadratic effects in the regression equation become significant. This means that the optimum region has been reached.

Let us describe the principle of using gradient methods using the example of a function of two variables

subject to two additional conditions:

This principle (without change) can be applied to any number of variables, as well as to additional conditions. Consider the plane x1, x2 (Fig. 1). According to formula (8), each point corresponds to a certain value of F. In Fig. 1, the lines F = const belonging to this plane are represented by closed curves surrounding the point M*, where F is minimal. Let at the initial moment the values x1 and x2 correspond to the point M0. The calculation cycle begins with a series of trial steps. First, x1 is given a small increment; at this time, the value of x2 is unchanged. Then the resulting increment in the value of F is determined, which can be considered proportional to the value of the partial derivative ∂F/∂x1 (provided the trial increment of x1 is always the same).

The determination of the partial derivatives (10) and (11) means that a vector with coordinates ∂F/∂x1 and ∂F/∂x2 is found, which is called the gradient of F and is denoted as follows:

grad F = (∂F/∂x1, ∂F/∂x2).

It is known that the direction of this vector coincides with the direction of the steepest increase in the value of F. The opposite direction to it is the “steepest descent”, in other words, the steepest decrease in the value of F.

After the components of the gradient have been found, the trial movements stop and the working steps are carried out in the direction opposite to the direction of the gradient, the step size being the greater, the greater the absolute value of the vector grad F. These conditions are realized if the working steps Δx1 and Δx2 are proportional to the previously obtained values of the partial derivatives:

Δx1 = −b·∂F/∂x1,   Δx2 = −b·∂F/∂x2,

where b is a positive constant.

After each working step, the increment of F is estimated. If it turns out to be negative, the movement is in the right direction and one should continue moving in the same direction M0M1. If at the point M1 the measurement shows that the increment is no longer negative, the working movements stop and a new series of trial movements begins. In this case, the gradient grad F is determined at the new point M1, and the working movement then continues along the newly found direction of steepest descent, i.e., along the line M1M2, and so on. This method is called the steepest descent (steepest ascent) method.

When the system is near a minimum, which is indicated by a small value of the quantity |grad F|, there is a switch to a more "cautious" search method, the so-called gradient method. It differs from the steepest descent method in that after the gradient grad F is determined, only one working step is made, and then a new series of trial movements begins at the new point. This search method establishes the minimum more accurately than the steepest descent method, while the latter allows the minimum to be approached quickly. If during the search the point M reaches the boundary of the admissible region and at least one of the quantities defining the constraints changes sign, the method changes and the point M moves along the boundary of the region.

The effectiveness of the steep ascent method depends on the choice of the scale of the variables and the form of the response surface. A surface with spherical contours ensures fast contraction to the optimum.

The disadvantages of the steep ascent method include:

1. Limited extrapolation. Moving along the gradient, we rely on extrapolating the partial derivatives of the objective function with respect to the corresponding variables. However, the shape of the response surface may change, making it necessary to change the direction of the search. In other words, the movement on the plane cannot be continuous.

2. Difficulty in finding the global optimum. The method is applicable to finding only local optima.

The gradient vector is directed towards the fastest increase of the function at a given point. The vector opposite to the gradient, −grad f(x), is called the antigradient and is directed towards the fastest decrease of the function. At the minimum point, the gradient of the function is zero. First-order methods, also called gradient methods, are based on the properties of the gradient. If there is no additional information, then from the starting point x(0) it is better to go to the point x(1) lying in the direction of the antigradient, the direction of fastest decrease of the function. Choosing the antigradient −grad f(x(k)) at the point x(k), we obtain an iterative process of the form

x(k+1) = x(k) − αk grad f(x(k)),   αk > 0,   k = 0, 1, 2, ...   (10.4)

In coordinate form, this process is written as follows:

xi(k+1) = xi(k) − αk ∂f(x(k))/∂xi,   i = 1, ..., n.

As a criterion for stopping the iterative process, one can use either condition (10.2) or the fulfillment of the condition of smallness of the gradient

‖grad f(x(k+1))‖ ≤ ε.

A combined criterion is also possible, consisting in the simultaneous fulfillment of the indicated conditions.

Gradient methods differ from each other in the way the step size α is chosen. In the constant-step method, some constant step value is chosen for all iterations. A sufficiently small step αk ensures that the function decreases, i.e., the fulfillment of the inequality

f(x(k) − αk grad f(x(k))) < f(x(k)).

However, this may lead to the need to carry out a sufficiently large number of iterations to reach the minimum point. On the other hand, too large a step can cause the function to grow or lead to fluctuations around the minimum point. Additional information is required to select the step size, so methods with a constant step are rarely used in practice.

More reliable and economical (in terms of the number of iterations) are gradient methods with a variable step, in which the step size changes in some way depending on the approximation obtained. As an example of such a method, consider the steepest descent method. In this method, at each iteration the step value αk is selected from the condition of the minimum of the function f(x) in the direction of descent, i.e.

f(x(k) − αk grad f(x(k))) = min α≥0 f(x(k) − α grad f(x(k))).

This condition means that the movement along the antigradient occurs as long as the value of the function f(x) decreases. Therefore, at each iteration it is necessary to solve a problem of one-dimensional minimization with respect to α of the function φ(α) = f(x(k) − α grad f(x(k))). The algorithm of the steepest descent method is as follows.

  • 1. Specify the coordinates of the initial point x(0) and the accuracy of the approximate solution ε. Set k = 0.
  • 2. At the point x(k), calculate the value of the gradient grad f(x(k)).
  • 3. Determine the step size αk by one-dimensional minimization with respect to α of the function φ(α).
  • 4. Determine a new approximation to the minimum point x(k+1) according to formula (10.4).
  • 5. Check the conditions for stopping the iterative process. If they are satisfied, the calculations stop. Otherwise, set k = k + 1 and go to step 2.
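A sketch of this five-step procedure in Python (f, g, and the test problem are assumptions; SciPy's minimize_scalar performs step 3):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def steepest_descent(f, g, x0, eps=1e-6, max_iter=500):
    """Sketch of the five steps above; f, g are assumed callables."""
    x = np.asarray(x0, dtype=float)                        # step 1
    for k in range(max_iter):
        grad = g(x)                                        # step 2
        if np.linalg.norm(grad) < eps:                     # step 5: stop test
            break
        a = minimize_scalar(lambda t: f(x - t * grad)).x   # step 3: find alpha_k
        x = x - a * grad                                   # step 4: formula (10.4)
    return x

f = lambda x: (x[0] - 1) ** 2 + 4 * (x[1] + 2) ** 2        # illustrative function
g = lambda x: np.array([2 * (x[0] - 1), 8 * (x[1] + 2)])
print(steepest_descent(f, g, [0.0, 0.0]))                  # approx (1, -2)
```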

In the steepest descent method, the direction of movement from the point x(k) touches the level line at the point x(k+1). The descent trajectory is zigzag, and adjacent zigzag links are orthogonal to each other. Indeed, the step αk is chosen by minimizing the function φ(α). The necessary condition for the minimum of this function is dφ(α)/dα = 0. Calculating the derivative of the composite function, we obtain the orthogonality condition for the descent direction vectors at neighboring points:

(grad f(x(k+1)), grad f(x(k))) = 0.

The problem of minimizing the function φ(α) can be reduced to the problem of calculating the root of the function of one variable g(α) = dφ(α)/dα.

Gradient methods converge to a minimum at the rate of a geometric progression for smooth convex functions whose largest and smallest eigenvalues of the matrix of second derivatives (the Hessian matrix H(x)) differ little from each other, i.e., whose matrix H(x) is well conditioned. However, in practice the functions being minimized often have ill-conditioned matrices of second derivatives. The values of such functions change much faster along some directions than along others. The rate of convergence of gradient methods also depends significantly on the accuracy of the gradient calculations. The loss of precision, which usually occurs in the vicinity of minimum points, can in general break the convergence of the gradient descent process. Therefore, gradient methods are often used at the initial stage of solving a problem, in combination with other, more efficient methods. In this case, the point x(0) is far from the minimum point, and steps in the direction of the antigradient make it possible to achieve a significant decrease in the function.

There are no restrictions in the unconstrained optimization problem.

Recall that the gradient of a multidimensional function is a vector analytically expressed by the geometric sum of the partial derivatives:

grad F(X) = (∂F/∂x1)·e1 + (∂F/∂x2)·e2 + ... + (∂F/∂xn)·en.

The gradient of a scalar function F(X) at some point is directed towards the fastest increase of the function and is orthogonal to the level line (the surface of constant value of F(X) passing through the point Xk). The vector opposite to the gradient, the antigradient, is directed towards the fastest decrease of the function F(X). At the extreme point, grad F(X) = 0.

In gradient methods, the movement of a point when searching for the minimum of the objective function is described by the iterative formula

Xk+1 = Xk − λk grad F(Xk),

where λk is the step parameter at the k-th iteration along the antigradient. For ascent methods (search for the maximum), one moves along the gradient.

Different variants of gradient methods differ from each other in the way of choosing the step parameter, as well as taking into account the direction of movement in the previous step. Consider the following options for gradient methods: with a constant step, with a variable step parameter (step splitting), the steepest descent method, and the conjugate gradient method.

Method with a constant step parameter. In this method, the step parameter is constant at every iteration. The question arises: how does one choose the value of the step parameter in practice? A step parameter that is too small leads to an unacceptably large number of iterations required to reach the minimum point. On the other hand, a step parameter that is too large can lead to overshooting the minimum point and to an oscillatory computational process around this point. These circumstances are disadvantages of the method. Since it is impossible to guess the acceptable value of the step parameter λk in advance, it becomes necessary to use the gradient method with a variable step parameter.

As the search approaches the optimum, the gradient vector decreases in magnitude, tending to zero; therefore, with λk = const the step length gradually decreases. Near the optimum, the length of the gradient vector tends to zero. The length, or norm, of a vector in n-dimensional Euclidean space is determined by the formula

‖grad F(X)‖ = sqrt((∂F/∂x1)² + ... + (∂F/∂xn)²),

where n is the number of variables.

Options for stopping the search for the optimum:

1) the change of the objective function is small: |F(Xk+1) − F(Xk)| ≤ ε1;
2) the norm of the gradient is small: ‖grad F(Xk+1)‖ ≤ ε2;
3) the change of the design parameters is small: ‖Xk+1 − Xk‖ ≤ ε3.

From a practical point of view, it is more convenient to use the 3rd stopping criterion (since the values of the design parameters are of interest); however, to determine the proximity of the extremum point one needs to focus on the 2nd criterion. Several criteria can be used to stop the computational process.

Consider an example. Find the minimum of the objective function F(X) = (x1 − 2)² + (x2 − 4)². The exact solution of the problem is X* = (2.0; 4.0). The expressions for the partial derivatives are

∂F/∂x1 = 2(x1 − 2),
∂F/∂x2 = 2(x2 − 4).

Choose the step λk = 0.1 and search from the starting point X1 = ... . The solution is presented in the form of a table.
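The table itself is not reproduced here; the following sketch regenerates the iterations, assuming the starting point X1 = (0, 0), which is elided in the text (NumPy assumed):

```python
import numpy as np

# Iterations of the example above with constant step 0.1; the starting
# point (0, 0) is an assumption, since the source elides it.
grad = lambda x: np.array([2 * (x[0] - 2), 2 * (x[1] - 4)])

x = np.array([0.0, 0.0])
for k in range(1, 11):
    x = x - 0.1 * grad(x)                   # Xk+1 = Xk - 0.1 * grad F(Xk)
    print(k, x.round(4), round(float(np.linalg.norm(grad(x))), 4))
```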

Gradient method with step parameter splitting. In this case, during the optimization process the step parameter λk is decreased if, after the next step, the objective function increases (when searching for a minimum). In this case, the step length is often split (divided) in half and the step is repeated from the previous point. This provides a more accurate approach to the extremum point.

The steepest descent method. Variable-step methods are more economical in terms of the number of iterations. If the optimal step length λk along the direction of the antigradient is the solution of a one-dimensional minimization problem, the method is called the steepest descent method. In this method, at each iteration the one-dimensional minimization problem

F(Xk+1) = F(Xk − λk Sk) = min λ>0 F(Xk − λ Sk),   Sk = grad F(Xk),

is solved.

In this method, movement in the direction of the antigradient continues until the minimum of the objective function is reached (as long as the value of the objective function decreases). Using an example, let us consider how the objective function can be written analytically at each step as a function of the unknown step parameter λ.

Example. min F(x1, x2) = 2x1² + 4x2³ − 3. Then grad F(X) = [4x1; 12x2²]. Let the point Xk = (2; 1); consequently, grad F(Xk) = [8; 12] and

F(Xk − λ Sk) = 2(2 − 8λ)² + 4(1 − 12λ)³ − 3.

It is necessary to find the λ that delivers the minimum of this function.
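Forming φ(λ) and differentiating it can be done symbolically, for instance with SymPy (a sketch, assuming SymPy is available). Note that for this particular example the derivative has no real root: the cubic term makes φ(λ) strictly decreasing, so in practice the one-dimensional search would be confined to a bounded interval.

```python
import sympy as sp

lam = sp.symbols('lam', real=True)
phi = 2 * (2 - 8 * lam) ** 2 + 4 * (1 - 12 * lam) ** 3 - 3
dphi = sp.expand(sp.diff(phi, lam))
print(dphi)                 # -20736*lam**2 + 3712*lam - 208
print(sp.solve(dphi, lam))  # [] : no real stationary point for this phi
```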

Steepest descent algorithm (for finding the minimum)

Initial step. Let ε be the stopping constant. Select the starting point X1, set k = 1, and go to the main step.

Main step. If ‖grad F(Xk)‖ < ε, end the search; otherwise determine Sk = grad F(Xk) and find λk, the optimal solution of the minimization problem F(Xk − λ Sk) for λ ≥ 0. Set Xk+1 = Xk − λk Sk, assign k = k + 1, and repeat the main step.

To find the minimum of a function of one variable in the steepest descent method, one can use unimodal optimization methods. From this large group of methods, consider the dichotomy (bisection) method and the golden section method. The essence of unimodal optimization methods is to narrow the interval of uncertainty in the location of the extremum.

Dichotomy method (bisection). Initial step. Choose the distinguishability constant δ and the final length of the uncertainty interval l. The value of δ should be as small as possible while still allowing the values of the function F(λ) and F(μ) to be distinguished. Let [a1, b1] be the initial uncertainty interval. Set k = 1.

The main stage consists of a finite number of iterations of the same type.

k-th iteration.

Step 1. If bk − ak ≤ l, the computation ends with the solution x* = (ak + bk)/2. Otherwise set

λk = (ak + bk)/2 − δ,
μk = (ak + bk)/2 + δ.

Step 2. If F(λk) < F(μk), set ak+1 = ak and bk+1 = μk. Otherwise set ak+1 = λk and bk+1 = bk. Assign k = k + 1 and go to step 1.
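A direct transcription of these two steps into Python (a sketch; the test function is an assumption):

```python
def dichotomy(F, a, b, delta=1e-5, l=1e-3):
    """Sketch of the two steps above for a unimodal F on [a, b]."""
    while b - a > l:                    # step 1: interval still too long
        lam = (a + b) / 2 - delta
        mu = (a + b) / 2 + delta
        if F(lam) < F(mu):              # step 2: keep [a, mu]
            b = mu
        else:                           # otherwise keep [lam, b]
            a = lam
    return (a + b) / 2                  # solution x* = (a_k + b_k) / 2

print(dichotomy(lambda x: (x - 1.5) ** 2, 0.0, 3.0))   # approx 1.5
```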

Golden section method. A more efficient method than the dichotomy method: it obtains a given uncertainty-interval length in fewer iterations and requires fewer calculations of the objective function. In this method, only one new division point of the uncertainty interval is calculated per iteration. The new point is placed at the distance γ = 0.618034 of the interval length from one end of the interval.

Golden section algorithm

Initial step. Choose an acceptable final length of the uncertainty interval l > 0. Let [a1, b1] be the initial uncertainty interval. Set λ1 = a1 + (1 − γ)(b1 − a1) and μ1 = a1 + γ(b1 − a1), where γ = 0.618. Calculate F(λ1) and F(μ1), set k = 1, and go to the main step.

Step 1. If bk − ak ≤ l, the calculations end with x* = (ak + bk)/2. Otherwise, if F(λk) > F(μk), go to step 2; if F(λk) ≤ F(μk), go to step 3.

Step 2. Set ak+1 = λk, bk+1 = bk, λk+1 = μk, μk+1 = ak+1 + γ(bk+1 − ak+1). Calculate F(μk+1) and go to step 4.

Step 3. Set ak+1 = ak, bk+1 = μk, μk+1 = λk, λk+1 = ak+1 + (1 − γ)(bk+1 − ak+1). Calculate F(λk+1).

Step 4. Assign k = k + 1 and go to step 1.

At the first iteration two evaluations of the function are required; at all subsequent iterations, only one.
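A sketch of this algorithm with the bookkeeping of steps 2 and 3 folded into one loop (the test function is an assumption):

```python
def golden_section(F, a, b, l=1e-4):
    """Sketch of the algorithm above; gamma is the golden-ratio constant."""
    gamma = 0.618034
    lam = a + (1 - gamma) * (b - a)
    mu = a + gamma * (b - a)
    Fl, Fm = F(lam), F(mu)              # two evaluations at the first iteration
    while b - a > l:                    # step 1
        if Fl > Fm:                     # step 2: minimum lies in [lam, b]
            a, lam, Fl = lam, mu, Fm
            mu = a + gamma * (b - a)
            Fm = F(mu)                  # single new evaluation
        else:                           # step 3: minimum lies in [a, mu]
            b, mu, Fm = mu, lam, Fl
            lam = a + (1 - gamma) * (b - a)
            Fl = F(lam)
    return (a + b) / 2

print(golden_section(lambda x: (x - 1.5) ** 2, 0.0, 3.0))   # approx 1.5
```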

Conjugate gradient method (Fletcher-Reeves). In this method, the choice of the direction of movement at step k+1 takes into account the change of direction at step k. The descent direction vector is a linear combination of the antigradient direction and the previous search direction. When minimizing ravine functions (with narrow long troughs), the search therefore goes not perpendicular to the ravine but along it, which allows the minimum to be reached much faster. When searching for an extremum using the conjugate gradient method, the point coordinates are calculated by the expression Xk+1 = Xk + Vk+1, where Vk+1 is a vector calculated by the following expression:

Vk+1 = λk+1 Sk+1,   Sk+1 = −grad F(Xk+1) + βk Sk,   βk = ‖grad F(Xk+1)‖² / ‖grad F(Xk)‖².

At the first iteration, V = 0 is usually taken and a search along the antigradient is performed, as in the steepest descent method. Then the direction of motion deviates from the direction of the antigradient the more, the more significantly the length of the gradient vector changed at the last iteration. After n steps, to correct the operation of the algorithm, the usual step along the antigradient is taken.

Algorithm of the conjugate gradient method

Step 1. Enter the starting point X0, the accuracy ε, and the dimension n.

Step 2. Set k = 1.

Step 3. Set the vector Vk = 0.

Step 4. Calculate grad F(Xk).

Step 5. Calculate the vector Vk+1.

Step 6. Perform a one-dimensional search along the vector Vk+1.

Step 7. If k < n, set k = k + 1 and go to step 4; otherwise go to step 8.

Step 8. If the length of the vector V is less than ε, end the search; otherwise go to step 2.

The conjugate direction method is one of the most effective for solving minimization problems. The method, in conjunction with one-dimensional search, is often used in practice in CAD. However, it should be noted that it is sensitive to errors that accumulate during the computation process.

Disadvantages of Gradient Methods

1. In problems with a large number of variables, it is difficult or impossible to obtain the derivatives in the form of analytic functions.

2. When calculating derivatives using difference schemes, the resulting error, especially in the vicinity of an extremum, limits the possibilities of such an approximation.
