Review of gradient methods in mathematical optimization problems

Gradient methods

Gradient methods for unconstrained optimization use only the first derivatives of the objective function and are linear approximation methods: at each step the objective function is replaced by the tangent hyperplane to its graph at the current point.

At the kth stage of gradient methods, the transition from point Xk to point Xk+1 is described by the relation:

Xk+1 = Xk + λk·Sk, (1.2)

where λk is the step size and Sk is the vector in the direction Xk+1 − Xk.

Steepest Descent Methods

This method was first considered and applied by A. Cauchy in the 19th century. Its idea is simple: the gradient of the objective function f(X) at any point is a vector in the direction of the greatest increase in the value of the function. Consequently, the antigradient is directed towards the greatest decrease in the function and is the direction of steepest descent. The antigradient (and gradient) is orthogonal to the level surface of f(X) at the point X. If in (1.2) we introduce the direction

Sk = −∇f(Xk),

then this will be the direction of steepest descent at point Xk.

We obtain the formula for transition from Xk to Xk+1:

Xk+1 = Xk − λk·∇f(Xk).

The antigradient gives only the direction of descent, but not the magnitude of the step. In general, one step does not give a minimum point, so the descent procedure must be applied several times. At the minimum point, all gradient components are equal to zero.

All gradient methods use this idea and differ from each other in technical details: derivatives may be calculated from an analytical formula or by finite-difference approximation; the step size may be constant, may change according to some rule, or may be chosen by applying one-dimensional optimization methods in the antigradient direction, and so on.
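As an illustration, here is a minimal sketch of a constant-step gradient descent in Python; the test function, its gradient, the step size, and the stopping tolerance are all assumptions chosen for the example:

```python
import numpy as np

def gradient_descent(grad_f, x0, step=0.1, tol=1e-6, max_iter=10_000):
    """Constant-step descent: X_{k+1} = X_k - step * grad f(X_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:      # at the minimum all gradient components vanish
            break
        x = x - step * g                 # move along the antigradient
    return x

# Assumed test function f(x) = x1^2 + 10*x2^2, an elongated "ravine"
grad_f = lambda x: np.array([2.0 * x[0], 20.0 * x[1]])
print(gradient_descent(grad_f, [5.0, 5.0]))   # converges near (0, 0)
```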

We will not go into further detail, because the steepest descent method is not generally recommended as a serious optimization procedure.

One of the disadvantages of this method is that it converges to any stationary point, including a saddle point, which cannot be a solution.

But the most important point is the very slow convergence of steepest descent in the general case. The descent is "fastest" only in a local sense. If the level surfaces of the function are strongly elongated (a "ravine"), then the antigradient is directed almost orthogonally to the bottom of the ravine, that is, to the best direction toward the minimum. In this sense, a direct translation of the English term "steepest descent", i.e. descent along the steepest slope, reflects the state of affairs better than the term "fastest" adopted in Russian-language specialized literature. One way out of this situation is to use the information provided by second partial derivatives. Another is to change the scales of the variables.


Fletcher-Reeves conjugate gradient method

In the conjugate gradient method, a sequence of search directions is constructed as linear combinations of the current steepest descent direction and the previous search directions, i.e.

Sk = −∇f(Xk) + βk·Sk−1.

The coefficients βk are chosen so as to make the search directions conjugate. It has been proven that

βk = ‖∇f(Xk)‖² / ‖∇f(Xk−1)‖²,

and this is a very valuable result that makes it possible to build a fast and effective optimization algorithm.

Fletcher-Reeves algorithm

1. At X0 the gradient is calculated and S0 = −∇f(X0) is set.

2. At the kth step, a one-dimensional search in the direction Sk finds the minimum of f(X), which determines the point Xk+1.

3. f(Xk+1) and ∇f(Xk+1) are calculated.

4. The direction Sk+1 is determined from the relationship:

Sk+1 = −∇f(Xk+1) + βk+1·Sk, βk+1 = ‖∇f(Xk+1)‖² / ‖∇f(Xk)‖².

5. After the (n+1)th iteration (i.e. when k = n), a restart is made: X0 = Xn+1 is assumed and the transition to step 1 is carried out.

6. The algorithm stops when

‖∇f(Xk)‖ < ε,

where ε is a small positive constant.
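A sketch of this algorithm in Python follows; the one-dimensional search is delegated to scipy.optimize.minimize_scalar, and the test function, starting point, and tolerance are assumptions:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fletcher_reeves(f, grad_f, x0, eps=1e-6, max_restarts=50):
    n = len(x0)
    x = np.asarray(x0, dtype=float)
    for _ in range(max_restarts):
        g = grad_f(x)
        s = -g                                       # restart: steepest descent direction
        for _ in range(n + 1):
            if np.linalg.norm(g) < eps:              # step 6: stopping test
                return x
            lam = minimize_scalar(lambda t: f(x + t * s)).x   # step 2: 1-D search
            x = x + lam * s
            g_new = grad_f(x)                        # step 3
            beta = (g_new @ g_new) / (g @ g)         # step 4: Fletcher-Reeves coefficient
            s = -g_new + beta * s
            g = g_new
    return x

# Assumed quadratic test problem
f = lambda x: (x[0] - 2) ** 2 + 10 * (x[1] + 1) ** 2
grad_f = lambda x: np.array([2 * (x[0] - 2), 20 * (x[1] + 1)])
print(fletcher_reeves(f, grad_f, [0.0, 0.0]))        # converges near (2, -1)
```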

The advantage of the Fletcher-Reeves algorithm is that it does not require matrix inversion and saves computer memory, since it does not need the matrices used in Newtonian methods; at the same time it is almost as efficient as quasi-Newtonian algorithms. Because the search directions are mutually conjugate, a quadratic function is minimized in no more than n steps. In the general case a restart is used, which allows the result to be obtained.

The Fletcher-Reeves algorithm is sensitive to the accuracy of the one-dimensional search, so it must be performed precisely enough to eliminate the effect of any rounding errors that may occur. Additionally, the algorithm may fail in situations where the Hessian becomes ill-conditioned. There is no guarantee of convergence always and everywhere, although practice shows that the algorithm almost always produces a result.

Newtonian methods

The search direction corresponding to steepest descent is associated with a linear approximation of the objective function. Methods using second derivatives arose from a quadratic approximation of the objective function, i.e., when expanding the function in a Taylor series, terms of the third and higher orders are discarded:

f(X) ≈ f(Xk) + ∇f(Xk)ᵀ(X − Xk) + ½(X − Xk)ᵀH(Xk)(X − Xk),

where H(Xk) is the Hessian matrix.

The minimum of the right-hand side (if it exists) is achieved at the same place as the minimum of the quadratic form. Let us write down the formula that determines the search direction:

Sk = −H⁻¹(Xk)·∇f(Xk).

The minimum is reached at

Xk+1 = Xk − H⁻¹(Xk)·∇f(Xk).

An optimization algorithm in which the search direction is determined from this relation is called Newton's method, and the direction itself is called the Newtonian direction.

In problems of finding the minimum of an arbitrary quadratic function with a positive definite matrix of second derivatives, Newton's method gives a solution in one iteration, regardless of the choice of the starting point.
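A quick numerical check of this property on an assumed quadratic f(X) = ½XᵀAX − bᵀX with a positive definite A; the Newtonian direction is obtained by solving H·S = −∇f rather than explicitly inverting H:

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])    # assumed positive definite Hessian
b = np.array([1.0, 2.0])
grad = lambda x: A @ x - b                # gradient of the quadratic
x0 = np.array([10.0, -7.0])               # arbitrary starting point

s = np.linalg.solve(A, -grad(x0))         # Newtonian direction
x1 = x0 + s                               # one full Newton step

print(x1)                                  # the exact minimizer A^{-1} b
print(np.linalg.norm(grad(x1)))            # gradient is zero after one iteration
```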

Classification of Newtonian methods

Newton's method itself consists of applying the Newtonian direction once to optimize a quadratic function. If the function is not quadratic, then the following theorem is true.

Theorem 1.4. If the Hessian matrix of a nonlinear function f of general form at the minimum point X* is positive definite, the starting point is chosen sufficiently close to X* and the step lengths are chosen correctly, then Newton’s method converges to X* with a quadratic rate.

Newton's method is considered a reference method; all developed optimization procedures are compared with it. However, Newton's method is efficient only for a positive definite and well-conditioned Hessian matrix (its determinant must be significantly greater than zero, or more precisely, the ratio of the largest and smallest eigenvalues ​​must be close to one). To overcome this shortcoming, modified Newtonian methods are used, using Newtonian directions whenever possible and deviating from them only when necessary.

The general principle of modifications of Newton's method is as follows: at each iteration, a certain positive definite matrix Ĥk "related" to H(Xk) is first constructed, and then the search direction is calculated using the formula

Sk = −Ĥk⁻¹·∇f(Xk).

Since Ĥk is positive definite, −Ĥk⁻¹·∇f(Xk) will necessarily be a direction of descent. The construction procedure is organized so that Ĥk coincides with the Hessian matrix when the latter is positive definite. These procedures are based on certain matrix factorizations.

Another group of methods, practically not inferior in speed to Newton's method, is based on approximating the Hessian matrix using finite differences, since exact values of the derivatives are not necessary for optimization. These methods are useful when the analytical calculation of derivatives is difficult or simply impossible. Such methods are called discrete Newton methods.
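A sketch of the kind of finite-difference Hessian approximation that discrete Newton methods rely on, built from central differences of the gradient; the increment h and the test gradient are assumptions:

```python
import numpy as np

def fd_hessian(grad_f, x, h=1e-5):
    """Approximate the Hessian column by column from differences of the gradient."""
    n = len(x)
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = h
        H[:, i] = (grad_f(x + e) - grad_f(x - e)) / (2 * h)
    return 0.5 * (H + H.T)        # symmetrize to damp rounding noise

# Assumed gradient of f(x) = x1^2 + x1*x2 + 2*x2^3
grad_f = lambda x: np.array([2 * x[0] + x[1], x[0] + 6 * x[1] ** 2])
print(fd_hessian(grad_f, np.array([1.0, 2.0])))   # close to [[2, 1], [1, 24]]
```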

The key to the effectiveness of Newton-type methods is taking into account information about the curvature of the minimized function, contained in the Hessian matrix and allowing the construction of locally accurate quadratic models of the objective function. But it is possible to collect and accumulate information about the curvature of a function based on observing the change in gradient during descent iterations.

The corresponding methods, based on the possibility of approximating the curvature of a nonlinear function without explicitly forming its Hessian matrix, are called quasi-Newtonian methods.

Note that when constructing an optimization procedure of the Newtonian type (including quasi-Newtonian ones), it is necessary to take into account the possible appearance of a saddle point. In this case, the vector of the "best" search direction will be directed towards the saddle point instead of moving away from it in the downward direction.

Newton-Raphson method

This method consists of repeatedly using the Newtonian direction when optimizing functions that are not quadratic.

The basic iterative formula of multidimensional optimization

Xk+1 = Xk + λk·Sk

is used in this method with the search direction chosen from the relation

Sk = −H⁻¹(Xk)·∇f(Xk), λk = 1.

The real step length is hidden in the non-normalized Newtonian direction.

Since this method does not require the value of the objective function at the current point, it is sometimes called an indirect or analytical optimization method. Its ability to determine the minimum of a quadratic function in a single calculation looks extremely attractive at first glance. However, this "single calculation" requires significant costs. First of all, it is necessary to calculate n partial derivatives of the first order and n(n+1)/2 of the second. In addition, the Hessian matrix must be inverted, which requires about n³ computational operations. At the same cost, conjugate direction methods or conjugate gradient methods can take about n steps, i.e. achieve almost the same result. Thus, an iteration of the Newton-Raphson method does not provide advantages in the case of a quadratic function.

If the function is not quadratic, then

  • the initial direction, generally speaking, no longer points to the actual minimum point, which means that the iterations must be repeated several times;
  • a step of unit length can lead to a point with a worse value of the objective function, and the search can give the wrong direction if, for example, the Hessian is not positive definite;
  • the Hessian may become ill-conditioned, making it impossible to invert it, i.e. to determine the direction for the next iteration.

The strategy itself does not distinguish which stationary point (minimum, maximum, or saddle point) the search is approaching, and no calculations of the objective function values are made that could be used to track whether the function is increasing. This means that everything depends on the attraction zone of which stationary point the starting point of the search falls into. The Newton-Raphson strategy is rarely used on its own, without modification of one kind or another.

Pearson methods

Pearson proposed several methods that approximate the inverse Hessian without explicitly calculating second derivatives, i.e. by observing changes in the direction of the antigradient. In this case, conjugate directions are obtained. These algorithms differ only in details. Let us present those that are most widely used in applied areas.

Pearson Algorithm No. 2.

In this algorithm, the inverse Hessian is approximated by the matrix Hk, calculated at each step using the formula

Hk+1 = Hk + (ΔXk − Hk·ΔGk)·ΔXkᵀ / (ΔXkᵀ·ΔGk),

where ΔXk = Xk+1 − Xk and ΔGk = ∇f(Xk+1) − ∇f(Xk).

An arbitrary positive definite symmetric matrix is ​​chosen as the initial matrix H0.

This Pearson algorithm often leads to situations where the matrix Hk becomes ill-conditioned; namely, it begins to oscillate between positive definite and non-positive definite, while the determinant of the matrix is close to zero. To avoid this situation, the matrix must be redefined every n steps by resetting it to H0.

Pearson Algorithm No. 3.

In this algorithm, the matrix Hk+1 is determined from the formula

Hk+1 = Hk + (ΔXk − Hk·ΔGk)·(Hk·ΔGk)ᵀ / (ΔGkᵀ·Hk·ΔGk).

The descent trajectory generated by the algorithm is similar to the behavior of the Davidon-Fletcher-Powell algorithm, but the steps are slightly shorter. Pearson also proposed a variation of this algorithm with cyclic matrix resetting.

Projective Newton-Raphson algorithm

Pearson also proposed the idea of an algorithm in which the matrix Hk is calculated from a recurrence relation with

H0 = R0, where the matrix R0 is the same as the initial matrices in the previous algorithms.

When k is a multiple of the number of independent variables n, the matrix Hk is replaced by the matrix Rk+1, calculated as a sum of correction terms.

The quantity Hk(∇f(Xk+1) − ∇f(Xk)) is the projection of the gradient increment vector ∇f(Xk+1) − ∇f(Xk), orthogonal to all gradient increment vectors of the previous steps. After every n steps, Rk is an approximation of the inverse Hessian H⁻¹(Xk), so in effect an (approximate) Newton search is performed.

Davidon-Fletcher-Powell method

This method has other names, the variable metric method and the quasi-Newton method, because it uses both of these approaches.

The Davidon-Fletcher-Powell (DFP) method is based on the use of Newtonian directions, but does not require the calculation of the inverse Hessian at each step.

The search direction at step k is the direction

Sk = −Hk·∇f(Xk),

where Hk is a positive definite symmetric matrix that is updated at each step and in the limit becomes equal to the inverse Hessian. The identity matrix is usually chosen as the initial matrix H0. The iterative DFP procedure can be represented as follows:

  • 1. At step k there is a point Xk and a positive definite matrix Hk.
  • 2. Select as the new search direction

Sk = −Hk·∇f(Xk).

3. A one-dimensional search (usually cubic interpolation) along the direction Sk determines λk, which minimizes the function.

4. Set Vk = λk·Sk.

5. Set Xk+1 = Xk + Vk.

6. f(Xk+1) and ∇f(Xk+1) are determined. If Vk or ∇f(Xk+1) are small enough, the procedure ends.

  • 7. Set Uk = ∇f(Xk+1) − ∇f(Xk).
  • 8. The matrix Hk is updated according to the formula

Hk+1 = Hk + Ak + Bk, where Ak = Vk·Vkᵀ / (Vkᵀ·Uk), Bk = −Hk·Uk·Ukᵀ·Hk / (Ukᵀ·Hk·Uk).

9. Increase k by one and return to step 2.

The method is effective in practice if the error in gradient calculations is small and the matrix Hk does not become ill-conditioned.

The matrix Ak ensures the convergence of Hk to G⁻¹, while the matrix Bk ensures the positive definiteness of Hk+1 at all stages and excludes H0 in the limit.

In the case of a quadratic function with Hessian G, the generated directions satisfy

Viᵀ·G·Vj = 0 for i ≠ j,

i.e. the DFP algorithm uses conjugate directions.

Thus, the DFP method uses both the ideas of the Newtonian approach and the properties of conjugate directions, and when minimizing a quadratic function it converges in no more than n iterations. If the optimized function has a form close to quadratic, then the DFP method is effective due to its good approximation of G⁻¹ (Newton's method). If the objective function has a general form, then the DFP method is effective due to the use of conjugate directions.
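A sketch of the DFP iteration in the notation above (Vk = Xk+1 − Xk, Uk = ∇f(Xk+1) − ∇f(Xk)); the line search, test function, and tolerance are assumptions:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def dfp(f, grad_f, x0, eps=1e-6, max_iter=100):
    x = np.asarray(x0, dtype=float)
    H = np.eye(len(x))                     # initial matrix H0: identity
    g = grad_f(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < eps:        # step 6: stopping test
            break
        s = -H @ g                         # step 2: search direction
        lam = minimize_scalar(lambda t: f(x + t * s)).x   # step 3: 1-D search
        v = lam * s                        # step 4: V_k
        x = x + v                          # step 5
        g_new = grad_f(x)
        u = g_new - g                      # step 7: U_k
        A = np.outer(v, v) / (v @ u)                       # drives H toward G^{-1}
        B = -np.outer(H @ u, H @ u) / (u @ H @ u)          # keeps H positive definite
        H = H + A + B                      # step 8: DFP update
        g = g_new
    return x

f = lambda x: (x[0] - 1) ** 2 + 5 * (x[1] - 2) ** 2        # assumed test function
grad_f = lambda x: np.array([2 * (x[0] - 1), 10 * (x[1] - 2)])
print(dfp(f, grad_f, [0.0, 0.0]))                          # converges near (1, 2)
```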

Gradient optimization methods

Optimization problems with nonlinear or difficult-to-compute relationships that define optimization criteria and constraints are the subject of nonlinear programming. As a rule, solutions to nonlinear programming problems can be found only by numerical methods using computer technology. Among them, the most frequently used are gradient methods (methods of relaxation, gradient, steepest descent and ascent), gradient-free methods of deterministic search (scanning methods, simplex, etc.), and random search methods. All these methods are used in the numerical determination of optima and are widely covered in the specialized literature.

In general, the value of the optimization criterion R can be considered as a function R(x1, x2, ..., xn) defined in n-dimensional space. Since there is no visual graphic representation of n-dimensional space, we will use the case of two-dimensional space.

If R(x1, x2) is continuous in the region D, then around the optimal point M°(x1°, x2°) it is possible to draw a closed line in the given plane along which R = const. Many such lines, called lines of equal level, can be drawn around the optimal point (depending on the step of variation of R).

Among the methods used to solve nonlinear programming problems, a significant place is occupied by methods that seek solutions by analyzing the directional derivative of the function being optimized. If at each point in space a scalar function of several variables takes well-defined values, then we are dealing with a scalar field (a temperature field, pressure field, density field, etc.). A vector field (a field of forces, velocities, etc.) is defined similarly. Isotherms, isobars, isochores, etc. are all lines (surfaces) of equal level, of equal values of a function (temperature, pressure, volume, etc.). Since the value of a function changes from point to point in space, it becomes necessary to determine the rate of change of the function in space, that is, the derivative in a direction.

The concept of the gradient is widely used in engineering calculations when finding extrema of nonlinear functions. Gradient methods are numerical search methods. They are universal and especially effective in searching for extrema of nonlinear functions with constraints, as well as when the analytical form of the function is unknown. The essence of these methods is to determine the values of the variables that provide the extremum of the goal function by moving along the gradient (when searching for a maximum) or in the opposite direction (for a minimum). Various gradient methods differ from each other in the way they determine the movement towards the optimum. The point is that if lines of equal level graphically characterize the dependence R(x1, x2), then the search for the optimal point can be carried out in different ways. For example, one can draw a mesh on the plane x1, x2 and indicate the values of R at the grid nodes (Fig. 2.13).

Then the extreme value can be selected from the node values. This path is not rational: it involves a large number of calculations, and the accuracy is low, since it depends on the grid step, and the optimum may lie between nodes.

Numerical methods

Mathematical models contain relationships compiled on the basis of a theoretical analysis of the processes being studied or obtained as a result of processing experiments (data tables, graphs). In any case, the mathematical model only approximately describes the real process, so the question of the accuracy and adequacy of the model is of primary importance. The need for approximations also arises when solving the equations themselves. Until recently, models containing nonlinear differential equations or partial differential equations could not be solved by analytical methods. The same applies to numerous classes of integrals that cannot be evaluated analytically. However, the development of methods of numerical analysis has made it possible to greatly expand the boundaries of what can be analyzed in mathematical models, especially with the use of computers.

Numerical methods are used to approximate functions, to solve differential equations and their systems, to integrate and differentiate, and to calculate numerical expressions.

The function can be specified analytically, as a table, or as a graph. When performing research, a common task is to approximate a function by an analytical expression that satisfies the stated conditions. This involves solving four problems:

Selecting nodal points, conducting experiments at certain values ​​(levels) of independent variables (if the step of changing a factor is incorrectly chosen, we will either “miss” a characteristic feature of the process being studied, or we will lengthen the procedure and increase the complexity of searching for a pattern);

The choice of approximating functions in the form of polynomials, empirical formulas, depending on the content of a specific problem (one should strive to simplify the approximating functions as much as possible);

Selection and use of agreement criteria on the basis of which the parameters of approximating functions are found;

Meeting the requirements of a given accuracy for the selection of an approximation function.

In problems of approximating functions by polynomials, three classes of functions are used:

a linear combination of power functions (Taylor series, Lagrange and Newton polynomials, etc.);

a combination of the functions cos nx, sin nx (Fourier series);

a polynomial formed by the functions exp(−αᵢx).

When finding the approximating function, various criteria for agreement with experimental data are used.

When optimizing by the gradient method, the optimum of the object under study is sought in the direction of the fastest increase (decrease) of the output variable, i.e. in the direction of the gradient. But before taking a step in the direction of the gradient, the gradient must be calculated. It can be calculated either from an existing model

grad y(X) = (∂y/∂x1)·i + (∂y/∂x2)·j + (∂y/∂x3)·k,

where ∂y/∂xᵢ is the partial derivative with respect to the i-th factor, and i, j, k are unit vectors in the direction of the coordinate axes of the factor space, or from the results of n trial movements in the direction of the coordinate axes.

If the mathematical model of a statistical process has the form of a linear polynomial whose regression coefficients bᵢ are the partial derivatives of the expansion of the function y = f(X) into a Taylor series in powers of xᵢ, then the optimum is sought in the direction of the gradient with a certain step hᵢ:

grad y(X) = b1·i1 + b2·i2 + … + bn·in.

The direction is adjusted after each step.

The gradient method, together with its numerous modifications, is a common and effective method for searching for the optimum of the objects under study. Let's consider one of the modifications of the gradient method - the steep ascent method.

The steep ascent method, also known as the Box-Wilson method, combines the advantages of three methods: the Gauss-Seidel method, the gradient method, and the method of full (or fractional) factorial experiments as a means of obtaining a linear mathematical model. The task of the steep ascent method is to carry out stepwise movement in the direction of the fastest increase (or decrease) of the output variable, that is, along grad y(X). Unlike the gradient method, the direction is adjusted not after each step, but when a particular extremum of the objective function is reached at some point in a given direction, as is done in the Gauss-Seidel method. At the point of a particular extremum, a new factorial experiment is carried out, a mathematical model is determined, and a steep ascent is carried out again. In the process of moving towards the optimum using this method, statistical analysis of intermediate search results is regularly carried out. The search stops when the quadratic effects in the regression equation become significant. This means that the optimum region has been reached.

Let us describe the principle of using gradient methods using the example of a function of two variables F(x1, x2), subject to two additional conditions (constraints on x1 and x2).

This principle (without modification) can be applied to any number of variables, as well as to any number of additional conditions. Consider the plane x1, x2 (Fig. 1). According to formula (8), each point of the plane corresponds to a certain value of F. In Fig. 1, the lines F = const belonging to this plane are represented by closed curves surrounding the point M* at which F is minimal. Let at the initial moment the values x1 and x2 correspond to the point M0. The calculation cycle begins with a series of trial steps. First, the value of x1 is given a small increment; at this time the value of x2 is unchanged. Then the resulting increment in the value of F is determined, which can be considered proportional to the value of the partial derivative ∂F/∂x1 (if the increment of x1 is always the same).

Determining the partial derivatives (10) and (11) means that a vector with coordinates ∂F/∂x1 and ∂F/∂x2 has been found, which is called the gradient of the quantity F and is denoted as follows:

grad F = (∂F/∂x1)·i + (∂F/∂x2)·j.

It is known that the direction of this vector coincides with the direction of the steepest increase in the value of F. The opposite direction is that of "steepest descent", in other words, the steepest decrease in the value of F.

After the components of the gradient have been found, the trial movements stop and working steps are carried out in the direction opposite to the gradient; the larger the absolute value of the vector grad F, the larger the step size. These conditions are met if the working step values are proportional to the previously obtained values of the partial derivatives:

Δx1 = −b·∂F/∂x1, Δx2 = −b·∂F/∂x2,

where b is a positive constant.

After each working step, the increment in the value of F is estimated. If it turns out to be negative, then the movement is in the right direction, and one should move further in the same direction M0M1. If at the point M1 the measurement shows that the value of F has stopped decreasing, then the working movements stop and a new series of trial movements begins. In this case, the gradient grad F is determined at the new point M1, and the working movement then continues along the newly found direction of steepest descent, i.e. along the line M1M2, and so on. This method is called the steepest descent (steepest ascent) method.

When the system is near the minimum, which is indicated by a small value of |grad F|, a switch is made to a more "cautious" search method, the so-called gradient method. It differs from the steepest descent method in that after determining the gradient grad F, only one working step is taken, and then a new series of trial movements begins at the new point. This search method determines the minimum more accurately than the steepest descent method, while the latter allows the minimum to be approached quickly. If during the search the point M reaches the boundary of the admissible region and at least one of the constraint quantities changes sign, the method changes and the point M begins to move along the boundary of the region.

The effectiveness of the steep ascent method depends on the choice of the scale of the variables and the type of response surface. The surface with spherical contours ensures fast contraction to the optimum.

The disadvantages of the steep ascent method include:

1. Limitations of extrapolation. Moving along the gradient, we rely on extrapolating the partial derivatives of the objective function with respect to the corresponding variables. However, the shape of the response surface may change, requiring a change in the direction of the search. In other words, movement along one straight line in the plane cannot continue for long.

2. Difficulty in finding a global optimum. The method is applicable to finding only local optima.

The gradient vector is directed in the direction of the fastest increase in the function at a given point. The vector opposite to the gradient, −grad f(x), is called the antigradient and is directed in the direction of the fastest decrease of the function. At the minimum point, the gradient of the function is zero. First-order methods, also called gradient methods, are based on the properties of gradients. If there is no additional information, then from the initial point x^(0) it is better to go to the point x^(1) lying in the direction of the antigradient, the direction of fastest decrease of the function. Choosing the antigradient −grad f(x^(k)) at the point x^(k) as the direction of descent, we obtain an iterative process of the form

x^(k+1) = x^(k) − a_k·grad f(x^(k)), a_k > 0. (10.4)

In coordinate form, this process is written as follows:

x_i^(k+1) = x_i^(k) − a_k·∂f(x^(k))/∂x_i, i = 1, …, n.

As a criterion for stopping the iterative process, one can use either condition (10.2) or the condition that the gradient is small:

‖grad f(x^(k+1))‖ ≤ ε.

A combined criterion is also possible, consisting in the simultaneous fulfillment of the specified conditions.

Gradient methods differ from each other in the way they choose the step size a_k. In the method with a constant step, a certain constant step value is chosen for all iterations. A sufficiently small step a_k ensures that the function decreases, i.e. the fulfillment of the inequality

f(x^(k+1)) < f(x^(k)).

However, this may lead to the need to carry out a fairly large number of iterations to reach the minimum point. On the other hand, too large a step can cause the function to grow or lead to fluctuations around the minimum point. Additional information is required to select the step size, so methods with constant steps are rarely used in practice.

Gradient methods with a variable step are more reliable and economical (in terms of the number of iterations); in them the step size changes in some way depending on the approximation obtained. As an example of such a method, consider the steepest descent method. In this method, at each iteration the step size a_k is selected from the condition of the minimum of the function f(x) in the direction of descent, i.e.

This condition means that movement along the antigradient occurs as long as the value of the function f(x) decreases. Therefore, at each iteration it is necessary to solve a problem of one-dimensional minimization with respect to a of the function φ(a) = f(x^(k) − a·grad f(x^(k))). The algorithm of the steepest descent method is as follows.

  • 1. Set the coordinates of the initial point x^(0) and the accuracy of the approximate solution ε. Set k = 0.
  • 2. At the point x^(k), calculate the value of the gradient grad f(x^(k)).
  • 3. Determine the step size a_k by one-dimensional minimization of the function φ(a) with respect to a.
  • 4. Determine a new approximation to the minimum point x^(k+1) using formula (10.4).
  • 5. Check the conditions for stopping the iterative process. If they are fulfilled, the calculations stop. Otherwise set k = k + 1 and go to step 2.
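A sketch of this algorithm in Python; the one-dimensional minimization of φ(a) is delegated to scipy.optimize.minimize_scalar, and the test function and tolerance are assumptions:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def steepest_descent(f, grad_f, x0, eps=1e-6, max_iter=500):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)                     # step 2: gradient at x^(k)
        if np.linalg.norm(g) < eps:       # step 5: stopping condition
            break
        phi = lambda a: f(x - a * g)      # phi(a) = f(x^(k) - a*grad f(x^(k)))
        a = minimize_scalar(phi).x        # step 3: one-dimensional minimization
        x = x - a * g                     # step 4: new approximation
    return x

f = lambda x: x[0] ** 2 + 4 * x[1] ** 2                   # assumed test function
grad_f = lambda x: np.array([2 * x[0], 8 * x[1]])
print(steepest_descent(f, grad_f, [3.0, 1.0]))            # converges near (0, 0)
```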

In the steepest descent method, the direction of movement from the point x^(k) touches the level line at the point x^(k+1). The descent path is a zigzag, and adjacent zigzag links are orthogonal to each other. Indeed, the step a_k is selected by minimizing the function φ(a) with respect to a. A necessary condition for the minimum of the function is dφ(a)/da = 0. Calculating the derivative of the composite function, we obtain the condition for the orthogonality of the descent direction vectors at neighboring points:

(grad f(x^(k+1)), grad f(x^(k))) = 0.

The problem of minimizing the function φ(a) can be reduced to the problem of calculating the root of the function of one variable g(a) = dφ(a)/da.

Gradient methods converge to a minimum at the rate of a geometric progression for smooth convex functions. For such functions, the largest and smallest eigenvalues of the matrix of second derivatives (the Hessian matrix H(x)) differ little from each other, i.e. the matrix H(x) is well conditioned. However, in practice the functions being minimized often have ill-conditioned matrices of second derivatives. The values of such functions change much faster along some directions than along others. The convergence rate of gradient methods also depends significantly on the accuracy of the gradient calculations. The loss of precision, which usually occurs in the vicinity of minimum points, can in general disrupt the convergence of the gradient descent process. Therefore, gradient methods are often used at the initial stage of solving a problem in combination with other, more effective methods. In this case, the point x^(0) is far from the minimum point, and steps in the direction of the antigradient make it possible to achieve a significant decrease in the function.

There are no restrictions in an unconstrained optimization problem.

Recall that the gradient of a multidimensional function is a vector analytically expressed as the geometric sum of the partial derivatives:

grad F(X) = (∂F/∂x1)·e1 + (∂F/∂x2)·e2 + … + (∂F/∂xn)·en.

The gradient of the scalar function F(X) at some point is directed in the direction of the fastest increase of the function and is orthogonal to the level line (the surface of constant value of F(X) passing through the point Xk). The vector opposite to the gradient, the antigradient, is directed towards the fastest decrease of the function F(X). At the extremum point grad F(X) = 0.

In gradient methods, the movement of a point when searching for the minimum of the objective function is described by the iterative formula

Xk+1 = Xk − λk·grad F(Xk),

where λk is the step parameter of the kth iteration along the antigradient. For ascending methods (searching for the maximum), movement is along the gradient.

Various variants of gradient methods differ from each other in the way they choose the step parameter, as well as taking into account the direction of movement in the previous step. Let's consider the following options for gradient methods: with a constant step, with a variable step parameter (step division), the steepest descent method and the conjugate gradient method.

Method with a constant step parameter. In this method, the step parameter is constant at each iteration. The question arises: how to choose the value of the step parameter in practice? A sufficiently small step parameter may result in an unacceptably large number of iterations required to reach the minimum point. On the other hand, too large a step parameter can lead to overshooting the minimum point and to an oscillatory computational process around it. These circumstances are disadvantages of the method. Since it is impossible to guess the acceptable value of the step parameter λk in advance, there arises a need to use the gradient method with a variable step parameter.

As we approach the optimum, the gradient vector decreases in magnitude, tending to zero, so with λk = const the step length gradually decreases. Near the optimum, the length of the gradient vector tends to zero. The length, or norm, of a vector in n-dimensional Euclidean space is determined by the formula

‖grad F(X)‖ = sqrt((∂F/∂x1)² + (∂F/∂x2)² + … + (∂F/∂xn)²),

where n is the number of variables.

Options for stopping the optimal search process:

1) the change in the objective function is small: |F(Xk+1) − F(Xk)| ≤ ε;

2) the gradient norm is small: ‖grad F(Xk+1)‖ ≤ ε;

3) the change in the design parameters is small: ‖Xk+1 − Xk‖ ≤ ε.

From a practical point of view, it is more convenient to use the 3rd stopping criterion (since the values of the design parameters are of interest); however, to determine the proximity of the extremum point, one should focus on the 2nd criterion. Several criteria can be used to stop the computational process.

Let us look at an example. Find the minimum of the objective function F(X) = (x1 − 2)² + (x2 − 4)². The exact solution of the problem is X* = (2.0; 4.0). Expressions for the partial derivatives:

∂F/∂x1 = 2·(x1 − 2),
∂F/∂x2 = 2·(x2 − 4).

Choose the step λk = 0.1 and search from a starting point X1. Let us present the solution in the form of a table.
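A short script that generates such an iteration table; the starting point X1 = (0, 0) is an assumed choice for the illustration:

```python
import numpy as np

grad_F = lambda x: np.array([2 * (x[0] - 2), 2 * (x[1] - 4)])   # partial derivatives
lam = 0.1                          # constant step parameter
x = np.array([0.0, 0.0])           # assumed starting point X1

print("k    x1       x2       F(X)")
for k in range(1, 11):
    F = (x[0] - 2) ** 2 + (x[1] - 4) ** 2
    print(f"{k:<4} {x[0]:<8.4f} {x[1]:<8.4f} {F:.5f}")
    x = x - lam * grad_F(x)        # X_{k+1} = X_k - lam * grad F(X_k)
```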

Gradient method with splitting of the step parameter. In this case, during the optimization process the step parameter λk is decreased if, after the next step, the objective function increases (when searching for a minimum). In this case, the step length is often split (divided) in half and the step is repeated from the previous point. This provides a more accurate approach to the extremum point.

Method of steepest descent. Variable step methods are more economical in terms of the number of iterations. If the optimal step length λk along the antigradient direction is found as the solution of a one-dimensional minimization problem, the method is called the steepest descent method. In this method, at each iteration the following problem of one-dimensional minimization is solved:

F(Xk+1) = F(Xk − λk·Sk) = min F(λk) for λk > 0, where Sk = grad F(Xk).

In this method, movement in the direction of the antigradient continues until the minimum of the objective function is reached (while the value of the objective function decreases). Using an example, let us consider how the objective function at each step can be written analytically as a function of the unknown parameter λ.

Example. min F(x1, x2) = 2x1² + 4x2³ − 3. Then grad F(X) = [4x1; 12x2²]. Let the point Xk = [2; 1], hence grad F(Xk) = [8; 12], and

F(Xk − λSk) = 2(2 − 8λ)² + 4(1 − 12λ)³ − 3.

It is necessary to find the λ that provides the minimum of this function.

Algorithm for the steepest descent method (to find the minimum)

Initial step. Let ε > 0 be a stopping constant. Select a starting point X1, put k = 1 and go to the main step.

Basic step. If ‖grad F(Xk)‖ < ε, end the search; otherwise determine Sk = grad F(Xk) and find λk, the optimal solution of the minimization problem F(Xk − λk·Sk) for λk ≥ 0. Put Xk+1 = Xk − λk·Sk, assign k = k + 1 and repeat the main step.

To find the minimum of a function of one variable in the steepest descent method, you can use unimodal optimization methods. From a large group of methods, we will consider the method of dichotomy (bisection) and the golden section. The essence of unimodal optimization methods is to narrow the range of uncertainty in the location of the extremum.

Dichotomy method (bisection). Initial step. Select the distinguishability constant δ and the permissible final length of the uncertainty interval l. The value δ should be as small as possible, but still allow one to distinguish between the values of the function F(λ) and F(μ). Let [a1, b1] be the initial uncertainty interval. Put k = 1.

The main stage consists of a finite number of iterations of the same type.

kth iteration.

Step 1. If bk − ak ≤ l, then the calculations end, with the solution x* = (ak + bk)/2. Otherwise

λk = (ak + bk)/2 − δ,
μk = (ak + bk)/2 + δ.

Step 2. If F(λk) < F(μk), put ak+1 = ak and bk+1 = μk. Otherwise put ak+1 = λk and bk+1 = bk. Assign k = k + 1 and go to step 1.
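A sketch of the dichotomy method; the test function and the constants l and δ are assumptions:

```python
def dichotomy(F, a, b, l=1e-4, delta=1e-5):
    """Shrink the uncertainty interval [a, b] around the minimum of a unimodal F."""
    while b - a > l:
        mid = (a + b) / 2
        lam, mu = mid - delta, mid + delta   # two probe points around the midpoint
        if F(lam) < F(mu):
            b = mu                           # the minimum lies in [a, mu]
        else:
            a = lam                          # the minimum lies in [lam, b]
    return (a + b) / 2

print(dichotomy(lambda x: (x - 1.7) ** 2, 0.0, 3.0))   # close to 1.7
```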

Golden section method. A more effective method than the dichotomy method. It allows a given length of the uncertainty interval to be obtained in fewer iterations and requires fewer calculations of the objective function. In this method, only one new division point of the uncertainty interval is calculated per iteration. The new point is placed at a distance of

τ = 0.618034 times the interval length from one of its ends.

Algorithm of the golden section method

Initial step. Select the permissible final length of the uncertainty interval l > 0. Let [a1, b1] be the initial uncertainty interval. Put λ1 = a1 + (1 − τ)(b1 − a1) and μ1 = a1 + τ(b1 − a1), where τ = 0.618. Calculate F(λ1) and F(μ1), put k = 1 and go to the main stage.

Step 1. If bk − ak ≤ l, then the calculations end with x* = (ak + bk)/2. Otherwise, if F(λk) > F(μk), go to step 2; if F(λk) ≤ F(μk), go to step 3.

Step 2. Put ak+1 = λk, bk+1 = bk, λk+1 = μk, μk+1 = ak+1 + τ(bk+1 − ak+1). Calculate F(μk+1) and go to step 4.

Step 3. Put ak+1 = ak, bk+1 = μk, μk+1 = λk, λk+1 = ak+1 + (1 − τ)(bk+1 − ak+1). Calculate F(λk+1).

Step 4. Assign k = k + 1, go to step 1.

At the first iteration, two function calculations are required, at all subsequent iterations only one.
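A sketch of the golden section method; note how each iteration after the first reuses one of the two previous function values, so only one new evaluation is needed (the test function and l are assumptions):

```python
def golden_section(F, a, b, l=1e-4):
    tau = 0.618034                          # golden section constant
    lam = a + (1 - tau) * (b - a)
    mu = a + tau * (b - a)
    F_lam, F_mu = F(lam), F(mu)             # two evaluations at the first iteration
    while b - a > l:
        if F_lam > F_mu:                    # step 2: the minimum lies in [lam, b]
            a, lam, F_lam = lam, mu, F_mu
            mu = a + tau * (b - a)
            F_mu = F(mu)                    # the only new evaluation
        else:                               # step 3: the minimum lies in [a, mu]
            b, mu, F_mu = mu, lam, F_lam
            lam = a + (1 - tau) * (b - a)
            F_lam = F(lam)
    return (a + b) / 2

print(golden_section(lambda x: (x - 1.7) ** 2, 0.0, 3.0))   # close to 1.7
```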

Conjugate gradient method (Fletcher-Reeves). In this method, the choice of the direction of movement at step k + 1 takes into account the change of direction at step k. The descent direction vector is a linear combination of the antigradient direction and the previous search direction. In this case, when minimizing ravine functions (with narrow elongated depressions), the search proceeds not perpendicular to the ravine but along it, which allows the minimum to be reached more quickly. When searching for an extremum using the conjugate gradient method, the coordinates of a point are calculated using the expression Xk+1 = Xk − λk+1·Vk+1, where Vk+1 is a vector calculated using the following expression:

Vk+1 = grad F(Xk) + (‖grad F(Xk)‖² / ‖grad F(Xk−1)‖²)·Vk.

At the first iteration, V = 0 is usually taken and the search is performed along the antigradient, as in the steepest descent method. Then the direction of movement deviates from the direction of the antigradient the more, the more significantly the length of the gradient vector changed at the last iteration. After n steps, the operation of the algorithm is corrected by taking an ordinary antigradient step.

Algorithm of the conjugate gradient method

Step 1. Enter the starting point X0, the accuracy ε, and the dimension n.

Step 2. Put k = 1.

Step 3. Put the vector Vk = 0.

Step 4. Calculate grad F(Xk).

Step 5. Calculate the vector Vk+1.

Step 6. Perform a one-dimensional search along the vector Vk+1.

Step 7. If k < n, put k = k + 1 and go to step 4; otherwise go to step 8.

Step 8. If the length of the vector V is less than ε, end the search; otherwise go to step 2.

The conjugate directions method is one of the most effective for solving minimization problems. In combination with one-dimensional search, it is often used in practice in CAD. However, it should be noted that it is sensitive to errors that arise during the computation.

Disadvantages of gradient methods

1. In problems with a large number of variables, it is difficult or impossible to obtain the derivatives in the form of analytical functions.

2. When calculating derivatives using difference schemes, the resulting error, especially in the vicinity of the extremum, limits the possibilities of such an approximation.
