Least squares method for determining coefficients. Approximation of experimental data

3. Approximation of functions using the least squares method

The least squares method is used when processing experimental results to approximate the experimental data with an analytical formula. The specific type of formula is chosen, as a rule, for physical reasons. Such formulas could be:

and others.

The essence of the least squares method is as follows. Let the measurement results be presented in the table:

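Table 4

x	x_1	x_2	…	x_n
y	y_1	y_2	…	y_n

and let the approximating dependence be sought in the form

$$y = f(x;\, a_0, a_1, \ldots, a_m), \tag{3.1}$$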

where $f$ is a known function and $a_0, a_1, \ldots, a_m$ are unknown constant parameters whose values must be found. In the least squares method, the approximation of the experimental dependence by function (3.1) is considered best if the following condition is satisfied:

$$Q = \sum_{i=1}^{n} \bigl(f(x_i;\, a_0, a_1, \ldots, a_m) - y_i\bigr)^2 \to \min, \tag{3.2}$$

that is, the sum of the squared deviations of the desired analytical function from the experimental dependence should be minimal.

Note that the function $Q$ is called the residual.

Since the residual is non-negative, $Q \ge 0$, it is bounded below and has a minimum. A necessary condition for the minimum of a function of several variables is that all partial derivatives of this function with respect to the parameters are equal to zero. Thus, finding the best values of the parameters of the approximating function (3.1), that is, the values at which $Q = Q(a_0, a_1, \ldots, a_m)$ is minimal, reduces to solving the system of equations:

$$\frac{\partial Q}{\partial a_j} = 0, \qquad j = 0, 1, \ldots, m. \tag{3.3}$$

The least squares method can be given the following geometric interpretation: among an infinite family of lines of a given type, one line is found for which the sum of the squared differences of the ordinates of the experimental points and the corresponding ordinates of the points found by the equation of this line will be the smallest.

Finding the parameters of a linear function

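Let the experimental data be approximated by a linear function:

$$y = ax + b.$$

It is required to select the values of $a$ and $b$ for which the function

$$Q(a, b) = \sum_{i=1}^{n} (a x_i + b - y_i)^2 \tag{3.4}$$

is minimal. The necessary conditions for the minimum of function (3.4) reduce to the system of equations:

$$\frac{\partial Q}{\partial a} = 2\sum_{i=1}^{n} (a x_i + b - y_i)\,x_i = 0, \qquad \frac{\partial Q}{\partial b} = 2\sum_{i=1}^{n} (a x_i + b - y_i) = 0.$$

After transformations, we obtain a system of two linear equations in two unknowns:

$$\begin{cases} a\sum x_i^2 + b\sum x_i = \sum x_i y_i,\\ a\sum x_i + b\,n = \sum y_i, \end{cases} \tag{3.5}$$

solving which, we find the required values of the parameters $a$ and $b$.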
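As a small illustration, system (3.5) can be solved in closed form by Cramer's rule. The sketch below is not part of the original text: the function name `fit_line` and the sample points are hypothetical, and the two normal equations are restated in the comments so the block can be read on its own.

```python
def fit_line(xs, ys):
    """Least squares line y = a*x + b from the two normal equations:

        a*sum(x^2) + b*sum(x) = sum(x*y)
        a*sum(x)   + b*n      = sum(y)
    """
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Determinant of the 2x2 system; it is zero when all x values coincide.
    det = sxx * n - sx * sx
    if det == 0:
        raise ValueError("all x values are equal; the line is not determined")
    a = (sxy * n - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return a, b


# Hypothetical points lying exactly on y = 2x + 1.
a, b = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(a, b)
```

For noisy data the same function returns the slope and intercept that minimize the sum of squared deviations.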

Finding the Parameters of a Quadratic Function

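If the approximating function is a quadratic dependence

$$y = ax^2 + bx + c,$$

then its parameters $a$, $b$, $c$ are found from the condition that the following function is minimal:

$$Q(a, b, c) = \sum_{i=1}^{n} (a x_i^2 + b x_i + c - y_i)^2. \tag{3.6}$$

The conditions for the minimum of function (3.6) reduce to the system of equations:

$$\frac{\partial Q}{\partial a} = 2\sum (a x_i^2 + b x_i + c - y_i)\,x_i^2 = 0, \quad \frac{\partial Q}{\partial b} = 2\sum (a x_i^2 + b x_i + c - y_i)\,x_i = 0, \quad \frac{\partial Q}{\partial c} = 2\sum (a x_i^2 + b x_i + c - y_i) = 0.$$

After transformations, we obtain a system of three linear equations in three unknowns:

$$\begin{cases} a\sum x_i^4 + b\sum x_i^3 + c\sum x_i^2 = \sum x_i^2 y_i,\\ a\sum x_i^3 + b\sum x_i^2 + c\sum x_i = \sum x_i y_i,\\ a\sum x_i^2 + b\sum x_i + c\,n = \sum y_i, \end{cases} \tag{3.7}$$

solving which, we find the required values of the parameters $a$, $b$ and $c$.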

Example . Let the experiment result in the following table of values x and y:

Table 5

y i	0,705	0,495	0,426	0,357	0,368	0,406	0,549	0,768

It is required to approximate the experimental data with linear and quadratic functions.

Solution. Finding the parameters of the approximating functions reduces to solving the systems of linear equations (3.5) and (3.7). To solve the problem, we will use the Excel spreadsheet.

1. First, group Sheets 1 and 2. Enter the experimental values x i and y i into columns A and B, starting from the second row (the column headings go in the first row). Then calculate the sums for these columns and place them in the tenth row.

In columns C–G, calculate the remaining quantities needed for systems (3.5) and (3.7), namely $x_i^2$, $x_i y_i$, $x_i^3$, $x_i^4$ and $x_i^2 y_i$, together with their sums.

2. Ungroup the sheets. Further calculations are carried out in the same way for the linear dependence on Sheet 1 and for the quadratic dependence on Sheet 2.

3. Under the resulting table, form the matrix of coefficients $A$ and the column vector of free terms $B$. Solve the system of linear equations by multiplying the inverse matrix by the vector of free terms: $X = A^{-1} B$.

To calculate the inverse matrix and to multiply matrices, we use the Function Wizard and the functions MINVERSE and MMULT.

4. In the block of cells H2:H9, using the obtained coefficients, we calculate the values of the approximating polynomial y i calc.; in the block I2:I9, the deviations Δy i = y i exp. − y i calc.; and in column J, the residual $Q = \sum (\Delta y_i)^2$.

The resulting tables and the graphs built using the Chart Wizard are shown in Figures 6, 7, 8.

Fig. 6. Table for calculating the coefficients of a linear function approximating the experimental data.

Fig. 7. Table for calculating the coefficients of a quadratic function approximating the experimental data.

Fig. 8. Graphical representation of the approximation of the experimental data by linear and quadratic functions.

Answer. The experimental data were approximated by the linear dependence y = 0,07881x + 0,442262 with residual Q = 0,165167 and by the quadratic dependence y = 3,115476x^2 – 5,2175x + 2,529631 with residual Q = 0,002103.
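The spreadsheet procedure can be cross-checked with a short script. This is only a sketch under a stated assumption: the x row of Table 5 is missing from the text, so the values x = 0.5, 0.6, …, 1.2 used below are assumed (they were chosen because they reproduce the coefficients quoted in the answer). Gaussian elimination stands in for the inverse-matrix step, and all function names are hypothetical.

```python
def normal_equations(xs, ys, degree):
    """Build the least squares system for a polynomial of the given degree.

    Unknowns are ordered from the highest power down to the constant term.
    """
    m = degree
    powers = [sum(x ** k for x in xs) for k in range(2 * m + 1)]
    moments = [sum((x ** k) * y for x, y in zip(xs, ys)) for k in range(m + 1)]
    # Row r is the condition dQ/d(coefficient of x^(m-r)) = 0.
    matrix = [[powers[(m - r) + (m - c)] for c in range(m + 1)]
              for r in range(m + 1)]
    rhs = [moments[m - r] for r in range(m + 1)]
    return matrix, rhs


def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution


def residual(coeffs, xs, ys):
    """Residual Q: sum of squared deviations (coefficients highest power first)."""
    total = 0.0
    for x, y in zip(xs, ys):
        value = 0.0
        for c in coeffs:  # Horner's scheme
            value = value * x + c
        total += (value - y) ** 2
    return total


ys = [0.705, 0.495, 0.426, 0.357, 0.368, 0.406, 0.549, 0.768]  # Table 5
xs = [(5 + i) / 10 for i in range(8)]  # ASSUMED: x row is missing in the text

linear = solve(*normal_equations(xs, ys, 1))      # [a, b]
quadratic = solve(*normal_equations(xs, ys, 2))   # [a, b, c]
q_linear = residual(linear, xs, ys)
q_quadratic = residual(quadratic, xs, ys)
print(linear, q_linear)
print(quadratic, q_quadratic)
```

Under the assumed x values the script agrees with the answer above to the printed precision, which is the main reason for trusting that assumption.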

Tasks. Approximate the function given by the table with linear and quadratic functions.

Table 6

x i	0,1	0,2	0,3	0,4	0,5	0,6	0,7	0,8
№ 0	3,030	3,142	3,358	3,463	3,772	3,251	3,170	3,665
№ 1	3,314	3,278	3,262	3,292	3,332	3,397	3,487	3,563
№ 2	1,045	1,162	1,264	1,172	1,070	0,898	0,656	0,344
№ 3	6,715	6,735	6,750	6,741	6,645	6,639	6,647	6,612
№ 4	2,325	2,515	2,638	2,700	2,696	2,626	2,491	2,291
№ 5	1,752	1,762	1,777	1,797	1,821	1,850	1,884	1,944
№ 6	1,924	1,710	1,525	1,370	1,264	1,190	1,148	1,127
№ 7	1,025	1,144	1,336	1,419	1,479	1,530	1,568	1,248
№ 8	5,785	5,685	5,605	5,545	5,505	5,480	5,495	5,510
№ 9	4,052	4,092	4,152	4,234	4,338	4,468	4,599

Example.

Experimental data on the values of the variables x and y are given in the table.

As a result of smoothing them, a function was obtained.

Using the least squares method, approximate these data by a linear dependence y = ax + b (find the parameters a and b). Find out which of the two curves better (in the sense of the least squares method) fits the experimental data. Make a drawing.

The essence of the least squares method (LSM).

The task is to find the coefficients of the linear dependence at which the function of two variables a and b, $F(a, b) = \sum_{i=1}^{n} (y_i - (a x_i + b))^2$, takes its smallest value. That is, for these a and b the sum of squared deviations of the experimental data from the found straight line is the smallest. This is the whole point of the least squares method.

Thus, solving the example comes down to finding the extremum of a function of two variables.

Deriving formulas for finding coefficients.

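A system of two equations in two unknowns is compiled and solved. We find the partial derivatives of the sum of squared deviations $F(a, b) = \sum_{i=1}^{n} (y_i - (a x_i + b))^2$ with respect to the variables a and b and equate these derivatives to zero:

$$\begin{cases} \dfrac{\partial F}{\partial a} = -2\sum_{i=1}^{n} (y_i - (a x_i + b))\,x_i = 0,\\ \dfrac{\partial F}{\partial b} = -2\sum_{i=1}^{n} (y_i - (a x_i + b)) = 0 \end{cases} \iff \begin{cases} a\sum x_i^2 + b\sum x_i = \sum x_i y_i,\\ a\sum x_i + n b = \sum y_i. \end{cases}$$

We solve the resulting system of equations by any method (for example, by substitution or by Cramer's method) and obtain the formulas for the coefficients of the least squares method (LSM):

$$a = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}, \qquad b = \frac{\sum y_i - a\sum x_i}{n}.$$

For these a and b the function takes its smallest value. The proof of this fact is given below in the text at the end of the page.

That is the whole method of least squares. The formula for the parameter a contains the sums $\sum x_i$, $\sum y_i$, $\sum x_i y_i$, $\sum x_i^2$ and the parameter n, the number of experimental data points. We recommend calculating these sums separately. The coefficient b is found after a has been calculated.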

It's time to remember the original example.

Solution.

In our example n=5. We fill out the table for the convenience of calculating the amounts that are included in the formulas of the required coefficients.

The values in the fourth row of the table are obtained by multiplying the values of the 2nd row by the values of the 3rd row for each number i.

The values in the fifth row of the table are obtained by squaring the values in the 2nd row for each number i.

The values in the last column of the table are the sums of the values across the rows.

We use the formulas of the least squares method to find the coefficients a and b, substituting the corresponding values from the last column of the table into them.

Hence, y = 0.165x + 2.184 is the desired approximating straight line.

It remains to find out which of the two, the line y = 0.165x + 2.184 or the function from the problem statement, better approximates the original data, that is, to make an estimate using the least squares method.

Error estimation of the least squares method.

To do this, you need to calculate the sum of squared deviations of the original data from each of the two curves; the smaller value corresponds to the curve that better approximates the original data in the sense of the least squares method.

Since the sum for the straight line is the smaller one, the line y = 0.165x + 2.184 better approximates the original data.
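The comparison described above can be sketched in a few lines. Several things here are assumptions rather than content of the original: the table of the example is missing from the text, so the five data points are assumed (they were picked because a least squares fit to them gives approximately y = 0.165x + 2.184), and the competing function `alternative` is purely hypothetical, standing in for the second curve whose formula is not preserved.

```python
def sum_squared_deviations(func, xs, ys):
    """Sum of squared deviations of the data from the curve y = func(x)."""
    return sum((y - func(x)) ** 2 for x, y in zip(xs, ys))


# ASSUMED data: the original table did not survive in the text.
xs = [0.0, 1.0, 2.0, 4.0, 5.0]
ys = [2.1, 2.4, 2.6, 2.8, 3.0]


def fitted_line(x):
    # The line quoted in the text.
    return 0.165 * x + 2.184


def alternative(x):
    # HYPOTHETICAL competitor, used only to show the comparison.
    return 0.2 * x + 2.1


sigma_line = sum_squared_deviations(fitted_line, xs, ys)
sigma_alt = sum_squared_deviations(alternative, xs, ys)
better = "fitted line" if sigma_line < sigma_alt else "alternative"
print(sigma_line, sigma_alt, better)
```

Whichever curve yields the smaller sum is the better approximation in the least squares sense; no other criterion is involved.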

Graphic illustration of the least squares (LS) method.

Everything is clearly visible on the graph. The red line is the found straight line y = 0.165x + 2.184, the blue line is the function from the problem statement, and the pink dots are the original data.

In practice, when modeling various processes - in particular, economic, physical, technical, social - one or another method of calculating approximate values of functions from their known values at certain fixed points is widely used.

This kind of function approximation problem often arises:

when constructing approximate formulas for calculating the values of characteristic quantities of the process under study using tabular data obtained as a result of the experiment;

in numerical integration, differentiation, solving differential equations, etc.;

when it is necessary to calculate the values of functions at intermediate points of the considered interval;

when determining the values of characteristic quantities of a process outside the considered interval, in particular when forecasting.

If, to model a certain process specified by a table, we construct a function that approximately describes this process based on the least squares method, it will be called an approximating function (regression), and the problem of constructing approximating functions itself will be called an approximation problem.

This article discusses the capabilities of the MS Excel package for solving this type of problem. In addition, it provides methods and techniques for constructing regressions for tabulated functions, which is the basis of regression analysis.

Excel has two options for building regressions.

Adding selected regressions (trendlines) to a diagram built on the basis of a data table for the process characteristic under study (available only if a diagram has been constructed);

Using the built-in statistical functions of the Excel worksheet, allowing you to obtain regressions (trend lines) directly from the source data table.

Adding trend lines to a chart

For a table of data that describes a process and is represented by a diagram, Excel has an effective regression analysis tool that allows you to:

build on the basis of the least squares method and add five types of regressions to the diagram, which model the process under study with varying degrees of accuracy;

add the constructed regression equation to the diagram;

determine the degree of correspondence of the selected regression to the data displayed on the chart.

Based on chart data, Excel allows you to obtain linear, polynomial, logarithmic, power, exponential types of regressions, which are specified by the equation:

y = y(x)

where x is an independent variable that often takes the values of a sequence of natural numbers (1; 2; 3; ...) and serves, for example, as a count of time for the process (characteristic) under study.

1 . Linear regression is good for modeling characteristics whose values increase or decrease at a constant rate. This is the simplest model to construct for the process under study. It is constructed in accordance with the equation:

y = mx + b

where m is the slope of the regression line (the tangent of the angle it makes with the x-axis) and b is the coordinate of the point where the regression line intersects the ordinate axis.

2 . A polynomial trend line is useful for describing characteristics that have several distinct extremes (maxima and minima). The choice of polynomial degree is determined by the number of extrema of the characteristic under study. Thus, a second-degree polynomial can well describe a process that has only one maximum or minimum; polynomial of the third degree - no more than two extrema; polynomial of the fourth degree - no more than three extrema, etc.

In this case, the trend line is constructed in accordance with the equation:

$$y = c_0 + c_1 x + c_2 x^2 + c_3 x^3 + c_4 x^4 + c_5 x^5 + c_6 x^6$$

where coefficients c0, c1, c2,... c6 are constants whose values are determined during construction.

3 . The logarithmic trend line is successfully used when modeling characteristics whose values initially change rapidly and then gradually stabilize.

y = c ln(x) + b

4 . A power-law trend line gives good results if the values of the relationship under study are characterized by a constant change in the growth rate. An example of such a dependence is the graph of uniformly accelerated motion of a car. If there are zero or negative values in the data, you cannot use a power trend line.

Constructed in accordance with the equation:

$$y = c\,x^{b}$$

where coefficients b, c are constants.

5 . An exponential trend line should be used when the rate of change in the data is continuously increasing. For data containing zero or negative values, this type of approximation is also not applicable.

Constructed in accordance with the equation:

$$y = c\,e^{bx}$$

where coefficients b, c are constants.
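The logarithmic and power types above can be reduced to a straight-line fit by taking logarithms, which also shows why zero or negative values rule out the power type. The sketch below illustrates that linearization; it is an assumption that a spreadsheet fits these trend lines in exactly this way, and the function names and sample data are hypothetical. Note that fitting in logarithmic coordinates minimizes squared deviations of ln y, not of y itself.

```python
import math


def fit_line(us, vs):
    """Least squares slope and intercept of v = slope*u + intercept."""
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    suu = sum((u - mean_u) ** 2 for u in us)
    suv = sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs))
    slope = suv / suu
    return slope, mean_v - slope * mean_u


def fit_logarithmic(xs, ys):
    """Fit y = c*ln(x) + b; every x must be positive. Returns (c, b)."""
    return fit_line([math.log(x) for x in xs], ys)


def fit_power(xs, ys):
    """Fit y = c*x**b via ln(y) = ln(c) + b*ln(x). Returns (c, b).

    math.log raises ValueError for zero or negative input, which mirrors
    the restriction on the data stated in the text.
    """
    b, ln_c = fit_line([math.log(x) for x in xs],
                       [math.log(y) for y in ys])
    return math.exp(ln_c), b


# Hypothetical noise-free data, so the known parameters are recovered.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
log_c, log_b = fit_logarithmic(xs, [3.0 * math.log(x) + 1.0 for x in xs])
pow_c, pow_b = fit_power(xs, [2.0 * x ** 1.5 for x in xs])
print(log_c, log_b, pow_c, pow_b)
```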

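When selecting a trend line, Excel automatically calculates the value of $R^2$, which characterizes the reliability of the approximation: the closer the $R^2$ value is to one, the more reliably the trend line approximates the process under study. If necessary, the $R^2$ value can be displayed on the chart.

It is determined by the formula:

$$R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum y_i^2 - \frac{1}{n}\left(\sum y_i\right)^2},$$

where $y_i$ are the data values, $\hat{y}_i$ are the corresponding values on the trend line, and $n$ is the number of points.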
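As an illustration, the approximation reliability value can be computed directly from the data and the trend line values. The sketch assumes the usual definition, one minus the ratio of the residual sum of squares to the total sum of squares about the mean; the function name and the sample numbers are hypothetical.

```python
def r_squared(ys, predicted):
    """Coefficient of determination: 1 - SSE/SST."""
    n = len(ys)
    mean_y = sum(ys) / n
    sse = sum((y - p) ** 2 for y, p in zip(ys, predicted))  # residual sum
    sst = sum((y - mean_y) ** 2 for y in ys)                # total sum
    return 1.0 - sse / sst


ys = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(ys, [1.0, 2.0, 3.0, 4.0])    # the model hits every point
constant = r_squared(ys, [2.5, 2.5, 2.5, 2.5])   # the model is just the mean
print(perfect, constant)
```

A model that reproduces every point gives 1, while a model no better than the mean of the data gives 0.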

To add a trend line to a data series:

activate a chart based on a series of data, i.e. click within the chart area. The Diagram item will appear in the main menu;

after clicking on this item, a menu will appear on the screen in which you should select the Add trend line command.

The same actions can be performed by moving the mouse pointer over the graph of one of the data series and right-clicking; in the context menu that appears, select the Add trend line command. The Trendline dialog box will appear on the screen with the Type tab opened (Fig. 1).

After this you need:

Select the required trend line type on the Type tab (the Linear type is selected by default). For the Polynomial type, in the Degree field, specify the degree of the selected polynomial.

The Built on series field lists all data series in the chart in question. To add a trend line to a specific data series, select its name in the Built on series field.

If necessary, by going to the Parameters tab (Fig. 2), you can set the following parameters for the trend line:

change the name of the trend line in the Name of the approximating (smoothed) curve field.

set the number of periods (forward or backward) for the forecast in the Forecast field;

display the equation of the trend line in the diagram area, for which you should enable the Show equation on the diagram checkbox;

display the approximation reliability value R^2 in the diagram area, for which you should enable the Place the approximation reliability value (R^2) on the diagram checkbox;

set the intersection point of the trend line with the Y axis, for which you should enable the Intersection of the curve with the Y axis at a point checkbox.

Click the OK button to close the dialog box.

In order to start editing an already drawn trend line, there are three ways:

use the Selected trend line command from the Format menu, having previously selected the trend line;

select the Format trend line command from the context menu, which is called up by right-clicking on the trend line;

double click on the trend line.

The Trend Line Format dialog box will appear on the screen (Fig. 3), containing three tabs: View, Type, Parameters, and the contents of the last two completely coincide with the similar tabs of the Trend Line dialog box (Fig. 1-2). On the View tab, you can set the line type, its color and thickness.

To delete a trend line that has already been drawn, select the trend line to be deleted and press the Delete key.

The advantages of the considered regression analysis tool are:

the relative ease of constructing a trend line on charts without creating a data table for it;

a fairly wide list of types of proposed trend lines, and this list includes the most commonly used types of regression;

the ability to predict the behavior of the process under study by an arbitrary (within the limits of common sense) number of steps forward and also backward;

the ability to obtain the trend line equation in analytical form;

the possibility, if necessary, of obtaining an assessment of the reliability of the approximation.

The disadvantages include the following:

the construction of a trend line is carried out only if there is a diagram built on a series of data;

the process of generating data series for the characteristic under study from the trend line equations obtained for it is somewhat cumbersome: the regression equations are updated with each change in the values of the original data series, but only within the chart area, while a data series built from the old trend line equation remains unchanged;

in PivotChart reports, changing the view of a chart or of the associated PivotTable report does not preserve existing trend lines, so before you draw trend lines or otherwise format a PivotChart report, you should make sure that the report layout meets your requirements.

Trend lines can be added to data series presented on line charts, column charts, 2-D unstacked area charts, bar charts, scatter charts, bubble charts, and stock charts.

You cannot add trend lines to data series in 3-D, stacked, radar, pie, and doughnut charts.

Using Excel's built-in functions

Excel also has a regression analysis tool for plotting trend lines outside the chart area. There are a number of statistical worksheet functions you can use for this purpose, but all of them only allow you to build linear or exponential regressions.

Excel has several functions for constructing linear regression, in particular:

TREND;

LINEST;

SLOPE and INTERCEPT.

As well as several functions for constructing an exponential trend line, in particular:

GROWTH;

LOGEST.

It should be noted that the techniques for constructing regressions using the TREND and GROWTH functions are almost the same. The same can be said about the pair of functions LINEST and LOGEST. For these four functions, creating a table of values relies on Excel array formulas, which makes the process of building regressions somewhat more cumbersome. Let us also note that linear regression is, in our opinion, most easily constructed using the SLOPE and INTERCEPT functions, where the first determines the slope of the linear regression and the second determines the segment intercepted by the regression on the y-axis.

The advantages of the built-in functions tool for regression analysis are:

a fairly simple, uniform process of generating data series of the characteristic under study for all built-in statistical functions that define trend lines;

standard methodology for constructing trend lines based on generated data series;

the ability to predict the behavior of the process under study by the required number of steps forward or backward.

The disadvantages include the fact that Excel does not have built-in functions for creating other (except linear and exponential) types of trend lines. This circumstance often does not allow choosing a sufficiently accurate model of the process under study, as well as obtaining forecasts that are close to reality. In addition, when using the TREND and GROWTH functions, the equations of the trend lines are not known.

It should be noted that the authors did not set out to present the course of regression analysis with any degree of completeness. Their main aim is to show, using specific examples, the capabilities of the Excel package for solving approximation problems; to demonstrate what effective tools Excel has for building regressions and forecasting; and to illustrate how such problems can be solved relatively easily even by a user who does not have extensive knowledge of regression analysis.

Examples of solving specific problems

Let's look at solving specific problems using the listed Excel tools.

Problem 1

Given a table of data on the profit of a motor transport enterprise for 1995-2002, you need to do the following:

Build a diagram.

Add linear and polynomial (quadratic and cubic) trend lines to the chart.

Using the trend line equations, obtain tabular data on enterprise profits for each trend line for 1995-2004.

Make a forecast for the enterprise's profit for 2003 and 2004.

The solution of the problem

In the range of cells A4:C11 of the Excel worksheet, enter the table shown in Fig. 4.

Having selected the range of cells B4:C11, we build a diagram.

We activate the constructed diagram and, following the method described above, select the type of trend line in the Trend Line dialog box (see Fig. 1), adding linear, quadratic and cubic trend lines to the diagram in turn. In the same dialog box, open the Parameters tab (see Fig. 2); in the Name of the approximating (smoothed) curve field, enter the name of the trend being added, and in the Forecast forward for: periods field, set the value 2, since a profit forecast is planned for two years ahead. To display the regression equation and the approximation reliability value R^2 in the diagram area, enable the checkboxes Show equation on the diagram and Place the approximation reliability value (R^2) on the diagram. For better visual perception, we change the type, color and thickness of the constructed trend lines using the View tab of the Trend Line Format dialog box (see Fig. 3). The resulting diagram with added trend lines is shown in Fig. 5.

To obtain tabular data on enterprise profits for each trend line for 1995-2004, we use the trend line equations presented in Fig. 5. To do this, in the cells of the range D3:F3, enter text describing the type of the selected trend line: Linear trend, Quadratic trend, Cubic trend. Next, enter the linear regression formula in cell D4 and, using the fill marker, copy this formula with relative references to the cell range D5:D13. Note that each cell with a linear regression formula in the range D4:D13 takes as its argument the corresponding cell from the range A4:A13. Similarly, fill the range E4:E13 for the quadratic regression and the range F4:F13 for the cubic regression. Thus, a forecast of the enterprise's profit for 2003 and 2004 has been compiled using three trends. The resulting table of values is shown in Fig. 6.

Problem 2

Build a diagram.

Add logarithmic, power and exponential trend lines to the chart.

Derive the equations of the obtained trend lines, as well as the reliability values of the approximation R2 for each of them.

Using the trend line equations, obtain tabular data on the enterprise's profit for each trend line for 1995-2002.

Make a forecast of the company's profit for 2003 and 2004 using these trend lines.

The solution of the problem

Following the methodology given in the solution of Problem 1, we obtain a diagram with logarithmic, power and exponential trend lines added to it (Fig. 7). Next, using the obtained trend line equations, we fill out a table of values for the enterprise's profit, including the predicted values for 2003 and 2004 (Fig. 8).

Figs. 5 and 7 show that the model with the logarithmic trend has the lowest approximation reliability value, R^2 = 0.8659.

The highest values of R^2 correspond to the models with a polynomial trend: quadratic (R^2 = 0.9263) and cubic (R^2 = 0.933).

Problem 3

With the table of data on the profit of a motor transport enterprise for 1995-2002, given in task 1, you must perform the following steps.

Obtain data series for linear and exponential trend lines using the TREND and GROWTH functions.

Using the TREND and GROWTH functions, make a forecast of the enterprise’s profit for 2003 and 2004.

Construct a diagram for the original data and the resulting data series.

The solution of the problem

Let's use the worksheet for Problem 1 (see Fig. 4). Let's start with the TREND function:

select the range of cells D4:D11, which should be filled with the values of the TREND function corresponding to the known data on the profit of the enterprise;

Call the Function command from the Insert menu. In the Function Wizard dialog box that appears, select the TREND function from the Statistical category, and then click the OK button. The same operation can be performed by clicking the (Insert Function) button on the standard toolbar.

In the Function Arguments dialog box that appears, enter the range of cells C4:C11 in the Known_values_y field; in the Known_values_x field - the range of cells B4:B11;

To make the entered formula an array formula, use the key combination Ctrl + Shift + Enter.

The formula we entered will appear in the formula bar as: {=TREND(C4:C11,B4:B11)}.

As a result, the range of cells D4:D11 is filled with the corresponding values of the TREND function (Fig. 9).

To make a forecast of the enterprise's profit for 2003 and 2004, it is necessary to:

select the range of cells D12:D13 where the values predicted by the TREND function will be entered.

call the TREND function and in the Function Arguments dialog box that appears, enter in the Known_values_y field - the range of cells C4:C11; in the Known_values_x field - the range of cells B4:B11; and in the New_values_x field - the range of cells B12:B13.

turn this formula into an array formula using the key combination Ctrl + Shift + Enter.

The entered formula will look like: {=TREND(C4:C11,B4:B11,B12:B13)}, and the range of cells D12:D13 will be filled with the predicted values of the TREND function (see Fig. 9).
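The two uses of TREND described above (values at the known points and values at new points) can be mimicked with a short sketch. This is an illustration of the idea, not Excel's implementation: the function `trend` is a hypothetical stand-in with the same argument order, and the profit figures are invented because the table from Fig. 4 is not reproduced in the text.

```python
def trend(known_y, known_x, new_x=None):
    """Fit y = m*x + b by least squares and evaluate it.

    Without new_x the fitted values at known_x are returned,
    otherwise the values predicted at new_x.
    """
    n = len(known_x)
    mean_x = sum(known_x) / n
    mean_y = sum(known_y) / n
    sxx = sum((x - mean_x) ** 2 for x in known_x)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(known_x, known_y))
    m = sxy / sxx
    b = mean_y - m * mean_x
    targets = known_x if new_x is None else new_x
    return [m * x + b for x in targets]


years = list(range(1995, 2003))                       # 1995..2002
profit = [100.0 + 10.0 * (y - 1995) for y in years]   # HYPOTHETICAL figures

fitted = trend(profit, years)                 # like {=TREND(C4:C11,B4:B11)}
forecast = trend(profit, years, [2003, 2004]) # like the three-argument call
print(fitted)
print(forecast)
```

Because the sketch returns only the fitted values, the coefficients m and b stay hidden inside the function, just as the trend line equation is not visible when TREND is used on a worksheet.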

The data series is similarly filled in using the GROWTH function, which is used in the analysis of nonlinear dependencies and works in exactly the same way as its linear counterpart TREND.

Figure 10 shows the table in formula display mode.

For the initial data and the obtained data series, the diagram shown in Fig. 11 is constructed.

Problem 4

With the table of data on the receipt of applications for services by the dispatch service of a motor transport enterprise for the period from the 1st to the 11th of the current month, you must perform the following actions.

Get data series for linear regression: using the SLOPE and INTERCEPT functions; using the LINEST function.

Obtain a series of data for exponential regression using the LOGEST function.

Using the above functions, make a forecast about the receipt of applications to the dispatch service for the period from the 12th to the 14th of the current month.

Create a diagram for the original and received data series.

The solution of the problem

Note that, unlike the TREND and GROWTH functions, none of the functions listed above (SLOPE, INTERCEPT, LINEST, LOGEST) is itself a regression. These functions play only a supporting role, determining the necessary regression parameters.

For linear and exponential regressions built using the functions SLOPE, INTERCEPT, LINEST and LOGEST, the form of their equations is always known, in contrast to the linear and exponential regressions corresponding to the TREND and GROWTH functions.

1 . Let's build a linear regression with the equation:

y = mx+b

using the SLOPE and INTERCEPT functions, with the regression slope m determined by the SLOPE function, and the free term b by the INTERCEPT function.

To do this, we carry out the following actions:

enter the original table into the cell range A4:B14;

the value of parameter m will be determined in cell C19. Select the SLOPE function from the Statistical category; enter the range of cells B4:B14 in the known_values_y field and the range of cells A4:A14 in the known_values_x field. The formula entered in cell C19 is: =SLOPE(B4:B14,A4:A14);

using a similar technique, the value of parameter b is determined in cell D19. Its contents will look like: =INTERCEPT(B4:B14,A4:A14). Thus, the values of the parameters m and b required for constructing a linear regression will be stored in cells C19 and D19, respectively;

next, enter the linear regression formula in cell C4 in the form: =$C$19*A4+$D$19. In this formula, cells C19 and D19 are written with absolute references (the cell address must not change when the formula is copied). The absolute reference sign $ can be typed either from the keyboard or using the F4 key, after placing the cursor on the cell address. Using the fill handle, copy this formula into the range of cells C4:C17. We obtain the required data series (Fig. 12). Since the number of requests is an integer, set the number format with 0 decimal places on the Number tab of the Cell Format window.

2 . Now let's build a linear regression given by the equation:

y = mx+b

using the LINEST function.

For this:

enter the LINEST function as an array formula in the cell range C20:D20: {=LINEST(B4:B14,A4:A14)}. As a result, we obtain the value of parameter m in cell C20 and the value of parameter b in cell D20;

enter the formula in cell D4: =$C$20*A4+$D$20;

copy this formula using the fill marker into the cell range D4:D17 and get the desired data series.

3. An exponential regression with the equation

$$y = b\,m^{x}$$

is built using the LOGEST function in a similar way:

in the cell range C21:D21, enter the LOGEST function as an array formula: {=LOGEST(B4:B14,A4:A14)}. In this case, the value of parameter m will be determined in cell C21, and the value of parameter b in cell D21;

enter the formula in cell E4: =$D$21*$C$21^A4;

using the fill marker, copy this formula to the range of cells E4:E17, where the data series for exponential regression will be located (see Fig. 12).
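The exponential regression step can be sketched as a straight-line fit to the logarithms of the data, which yields the two parameters of y = b·m^x. This is a minimal illustration under assumptions: the function `logest` is a hypothetical stand-in, the fit on ln y is assumed to match what the worksheet function does, and the request counts are invented because the original table is only shown in a figure.

```python
import math


def logest(known_y, known_x):
    """Return (m, b) for y = b * m**x, fitted by least squares on ln(y).

    All y values must be positive, otherwise math.log raises ValueError.
    """
    n = len(known_x)
    logs = [math.log(y) for y in known_y]
    mean_x = sum(known_x) / n
    mean_l = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in known_x)
    sxl = sum((x - mean_x) * (l - mean_l)
              for x, l in zip(known_x, logs))
    slope = sxl / sxx                     # ln(m)
    intercept = mean_l - slope * mean_x   # ln(b)
    return math.exp(slope), math.exp(intercept)


days = list(range(1, 12))                  # the 1st to the 11th
requests = [3.0 * 1.2 ** d for d in days]  # HYPOTHETICAL counts

m, b = logest(requests, days)
# Forecast for the 12th to the 14th, the analogue of =$D$21*$C$21^A4.
forecast = [b * m ** d for d in (12, 13, 14)]
print(m, b, forecast)
```

Unlike a function that returns only fitted values, this form exposes m and b, so the regression equation can be written out explicitly.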

Fig. 13 shows a table where you can see the functions we use with the required cell ranges, as well as the formulas.

The quantity R^2 is called the coefficient of determination.

The task of constructing a regression dependence is to find the vector of coefficients m of model (1) at which the coefficient R takes on the maximum value.

To assess the significance of R, Fisher's F test is used, calculated using the formula

where n is the sample size (number of experiments);

k is the number of model coefficients.

If F exceeds a certain critical value for the given n and k and the accepted confidence probability, then the value of R is considered significant. Tables of critical values of F are given in reference books on mathematical statistics.

Thus, the significance of R is determined not only by its value, but also by the ratio between the number of experiments and the number of coefficients (parameters) of the model. Indeed, the correlation ratio for n = 2 for a simple linear model is equal to 1 (a single straight line can always be drawn through 2 points on a plane). However, if the experimental data are random variables, such a value of R should be trusted with great caution. Usually, to obtain a significant R and a reliable regression, one tries to ensure that the number of experiments significantly exceeds the number of model coefficients (n ≫ k).

To build a linear regression model you need:

1) prepare a list of n rows and m columns containing the experimental data (the column containing the output value Y must be either first or last in the list). For example, take the data from the previous task, add a column called "Period No." and number the periods from 1 to 12 (these will be the X values);

2) go to the menu Data/Data Analysis/Regression.

If the "Data Analysis" item in the "Tools" menu is missing, go to the "Add-Ins" item in the same menu and check the "Analysis ToolPak" checkbox;

3) in the "Regression" dialog box, set:

· input interval Y;

· input interval X;

· output interval - the upper left cell of the interval in which the calculation results will be placed (it is recommended to place them on a new worksheet);

4) click "Ok" and analyze the results.

The essence of the method is that the quality criterion for a candidate solution is the sum of squared errors, which one seeks to minimize. To apply it, one needs as many measurements as possible of the unknown random variable (the more measurements, the more accurate the solution) and a set of candidate solutions from which the best one is selected. If the set of solutions is parameterized, the task is to find the optimal values of the parameters.

Why are squared errors minimized and not the errors themselves? The fact is that in most cases, errors go both ways: the estimate can be more than the measurement or less than it. If we add up errors with different signs, they will cancel each other out, and as a result, the sum will give us an incorrect idea of the quality of the assessment. Often, in order for the final estimate to have the same dimension as the measured values, the square root of the sum of squared errors is taken.
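The cancellation effect can be seen on a small example. The sketch below uses hypothetical measurements and a hypothetical estimate; the function name is made up for illustration.

```python
import math


def error_summary(measurements, estimate):
    """Return (sum of errors, sum of squared errors, root of the sum of squares)."""
    errors = [m - estimate for m in measurements]
    sum_errors = sum(errors)                    # signed errors can cancel out
    sum_squares = sum(e * e for e in errors)    # squares never cancel
    return sum_errors, sum_squares, math.sqrt(sum_squares)


# Hypothetical measurements scattered around the estimate 10.0.
total, squares, root = error_summary([9.8, 10.3, 9.9, 10.4, 9.6], 10.0)
print(total)    # close to 0, although no single measurement equals the estimate
print(squares)  # about 0.46
print(root)     # same units as the measurements
```

The plain sum is practically zero here, which would wrongly suggest a perfect estimate, while the sum of squares reflects the actual scatter.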


LSM is used in mathematics, in particular in probability theory and mathematical statistics. This method is most widely used in filtering problems, when it is necessary to separate the useful signal from the noise superimposed on it.

It is also used in mathematical analysis to approximate the representation of a given function by simpler functions. Another area of application of least squares is the solution of systems of equations with a number of unknowns less than the number of equations.

I came up with several more rather unexpected applications of least squares, which I would like to discuss in this article.

OLS and typos

The scourge of automatic translators and search engines are typos and spelling errors. Indeed, if a word differs by only 1 letter, the program treats it as another word and translates/searches for it incorrectly or does not translate/does not find it at all.

I had a similar problem: I had two databases with addresses of Moscow houses, and I needed to merge them into one. But the addresses were written in different styles. One database used the KLADR standard (All-Russian Address Classifier), for example: “BABUSHKINA LETCHIKA STREET, D10K3.” The other used a postal style, for example: “St. Pilot Babushkina, building 10, building 3.” Neither version contains an error, but automating the matching is incredibly difficult (each database has 40 thousand records!), and there were also plenty of typos. How do you make the computer understand that the two addresses above belong to the same house? This is where least squares came in handy.

What did I do? For each letter of the first address, I looked for the same letter in the second address. If both were in the same position, I set the error for that letter to 0. If they were in adjacent positions, the error was 1. If there was a shift of 2 positions, the error was 2, and so on. If the letter did not occur in the other address at all, the error was taken to be n+1, where n is the number of letters in the first address. I then calculated the sum of squared errors and merged the records for which this sum was minimal.
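The procedure described above might look roughly like the following sketch. Several details are assumptions, since the text does not specify them: when a letter occurs more than once in the second address, the nearest occurrence is used, and no normalization (case, abbreviations, separate handling of house numbers) is done. The function names are hypothetical.

```python
def squared_position_error(first, second):
    """Sum of squared positional errors of the letters of `first` within `second`."""
    n = len(first)
    total = 0
    for i, letter in enumerate(first):
        # Positions where the same letter occurs in the second address.
        positions = [j for j, other in enumerate(second) if other == letter]
        if positions:
            # Assumption: take the nearest occurrence of the letter.
            error = min(abs(i - j) for j in positions)
        else:
            # Letter is missing entirely: penalty n + 1, as in the text.
            error = n + 1
        total += error ** 2
    return total


def best_match(address, candidates):
    """Return the candidate with the smallest sum of squared errors."""
    return min(candidates, key=lambda c: squared_position_error(address, c))


print(squared_position_error("abc", "abc"))  # 0: identical strings
print(squared_position_error("abc", "bac"))  # 2: two letters shifted by one position
print(squared_position_error("abc", "abd"))  # 16: "c" is missing, penalty (3 + 1) squared
```

Comparing every record of one database with every record of the other is quadratic in the number of records, so for 40 thousand addresses some pre-filtering (for example by house number) would probably be needed; that part is not shown.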

Of course, house and building numbers were processed separately. I don’t know whether I reinvented the wheel or came up with something new, but the problem was solved quickly and efficiently. I wonder whether this method is used in search engines. Perhaps it is, because every self-respecting search engine, on encountering an unfamiliar word, suggests a replacement from known words (“perhaps you meant ...”). However, they may do this analysis in some other way.

OLS and search by pictures, faces and maps

This method can also be used to search using pictures, drawings, maps, and even people’s faces.


Now all search engines, instead of searching by pictures, essentially use search by captions to pictures. This is undoubtedly a useful and convenient service, but I propose to supplement it with a real image search.

A sample picture is entered and a rating is compiled for all images based on the sum of squared deviations of characteristic points. Determining these most characteristic points is in itself a non-trivial task. However, it is completely solvable: for example, for faces these are the corners of the eyes, lips, tip of the nose, nostrils, edges and centers of the eyebrows, pupils, etc.
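A minimal sketch of such a rating is given below. It assumes that the characteristic points have already been extracted, listed in the same order for every image, and brought to a common scale; how to do that is exactly the non-trivial part and is not shown. The function names and coordinates are hypothetical.

```python
def landmark_distance(sample, candidate):
    """Sum of squared deviations between corresponding (x, y) characteristic points."""
    return sum(
        (sx - cx) ** 2 + (sy - cy) ** 2
        for (sx, sy), (cx, cy) in zip(sample, candidate)
    )


def rank_images(sample, gallery):
    """Return image names ordered from most to least similar to the sample."""
    return sorted(gallery, key=lambda name: landmark_distance(sample, gallery[name]))


# Hypothetical landmark coordinates (e.g. two eye corners and the tip of the nose).
sample = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]
gallery = {
    "face_a": [(0.0, 0.0), (1.0, 0.0), (0.5, 1.1)],
    "face_b": [(0.2, 0.0), (1.3, 0.0), (0.5, 0.7)],
}
print(rank_images(sample, gallery))  # face_a comes first: its points deviate less
```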

By comparing these parameters, you can find the face that is most similar to the sample. I've already seen sites where this service works, and you can find the celebrity most similar to the photo you suggest, and even create an animation that turns you into a celebrity and back again. Surely the same method works in the Ministry of Internal Affairs databases containing identikit images of criminals.


Fingerprints can be searched by the same method. Search on maps relies on the natural irregularities of geographical objects: bends of rivers, mountain ranges, outlines of shorelines, forests and fields.

This is such a wonderful and universal method of least squares. I am sure that you, dear readers, will be able to find many unusual and unexpected areas of application of this method yourself.

The least squares method (LSM) has many applications, since it allows a given function to be represented approximately by other, simpler ones. It can be extremely useful in processing observations, and it is widely used to estimate some quantities from measurements of others that contain random errors. In this article, you will learn how to implement least squares calculations in Excel.

Statement of the problem using a specific example

Suppose there are two indicators X and Y. Moreover, Y depends on X. Since OLS interests us from the point of view of regression analysis (in Excel its methods are implemented using built-in functions), we should immediately move on to considering a specific problem.

So, let X be the retail space of a grocery store, measured in square meters, and Y be the annual turnover, measured in millions of rubles.

It is required to make a forecast of what turnover (Y) the store will have if it has this or that retail space. Obviously, the function Y = f (X) is increasing, since the hypermarket sells more goods than the stall.

A few words about the correctness of the initial data used for prediction

Let's say we have a table built using data for n stores.

According to mathematical statistics, the results will be more or less reliable if data on at least 5-6 objects are examined. In addition, “anomalous” results should not be used. In particular, a small elite boutique can have a turnover several times greater than that of large “mass-market” retail outlets.

The essence of the method

The table data can be depicted on a Cartesian plane in the form of points $M_1(x_1, y_1), \dots, M_n(x_n, y_n)$. Now the solution to the problem will be reduced to the selection of an approximating function y = f(x), which has a graph passing as close as possible to the points $M_1, M_2, \dots, M_n$.

Of course, you can use a high-degree polynomial, but this option is not only difficult to implement, but also simply incorrect, since it will not reflect the main trend that needs to be detected. The most reasonable solution is to search for the straight line y = ax + b, which best approximates the experimental data, or more precisely, the coefficients a and b.

Accuracy assessment

With any approximation, assessing its accuracy is of particular importance. Let us denote by $e_i$ the difference (deviation) between the experimental and functional values at the point $x_i$, i.e. $e_i = y_i - f(x_i)$.

At first glance, the accuracy of the approximation could be assessed by the sum of the deviations: when choosing a straight line to represent the dependence of Y on X, one would prefer the line with the smallest sum of $e_i$ over all points under consideration. However, not everything is so simple, since along with positive deviations there will also be negative ones.

The issue can be solved by using the absolute values of the deviations or their squares. The latter approach is the most widely used. It is applied in many areas, including regression analysis (implemented in Excel using two built-in functions), and has long proven its effectiveness.

Least squares method

Excel, as you know, has a built-in AutoSum function that calculates the sum of all values in the selected range. Thus, nothing prevents us from calculating the value of the expression $e_1^2 + e_2^2 + e_3^2 + \dots + e_n^2$.

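In mathematical notation this looks like:

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(y_i - f(x_i)\right)^2$$

Since the decision was initially made to approximate using a straight line, we have:

$$\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(y_i - (a x_i + b)\right)^2$$

Thus, the task of finding the straight line that best describes the specific dependence of the quantities X and Y comes down to finding the minimum of a function of two variables:

$$Q(a, b) = \sum_{i=1}^{n} \left(y_i - (a x_i + b)\right)^2 \to \min$$

To do this, you need to set the partial derivatives with respect to the variables a and b to zero and solve a simple system of two equations in 2 unknowns:

$$\begin{cases} -2\sum_{i=1}^{n} \left(y_i - (a x_i + b)\right) x_i = 0 \\ -2\sum_{i=1}^{n} \left(y_i - (a x_i + b)\right) = 0 \end{cases}$$

After some simple transformations, including division by 2 and manipulation of sums, we get:

$$\begin{cases} a\sum_{i=1}^{n} x_i^2 + b\sum_{i=1}^{n} x_i = \sum_{i=1}^{n} x_i y_i \\ a\sum_{i=1}^{n} x_i + n b = \sum_{i=1}^{n} y_i \end{cases}$$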

Solving it, for example, by Cramer’s rule, we obtain a stationary point with coefficients $a^*$ and $b^*$. This is the minimum, i.e. to predict what turnover a store will have for a given area, the straight line $y = a^* x + b^*$ is suitable; it is the regression model for the example in question. Of course, it will not give an exact result, but it will help you judge whether buying a store with a particular area on credit will pay off.
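For readers who want to check the arithmetic outside Excel, here is a minimal sketch that solves the two-equation system for a straight line, a·Σx² + b·Σx = Σxy and a·Σx + b·n = Σy, by Cramer's rule. The function name and the store data are hypothetical and chosen only so that the result is easy to verify by hand.

```python
def fit_line(xs, ys):
    """Return (a, b) of the least squares line y = a*x + b via Cramer's rule."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))

    # Determinant of the system matrix [[sum_xx, sum_x], [sum_x, n]].
    det = sum_xx * n - sum_x * sum_x
    if det == 0:
        raise ValueError("all x values are equal; the line is not determined")

    a = (sum_xy * n - sum_x * sum_y) / det        # right-hand side replaces column 1
    b = (sum_xx * sum_y - sum_x * sum_xy) / det   # right-hand side replaces column 2
    return a, b


# Hypothetical data: retail space in square meters, turnover in millions of rubles.
areas = [50, 100, 150, 200]
turnover = [6, 11, 16, 21]
a, b = fit_line(areas, turnover)
print(a, b)          # 0.1 and 1.0 for this exactly linear data
print(a * 120 + b)   # forecast for a 120 square meter store
```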

How to Implement Least Squares in Excel

Excel has a function for calculating values using least squares. It has the following form: “TREND” (known Y values; known X values; new X values; constant). Let's apply the formula for calculating OLS in Excel to our table.

To do this, enter the “=” sign in the cell in which the result of the calculation using the least squares method in Excel should be displayed and select the “TREND” function. In the window that opens, fill in the appropriate fields, highlighting:

the range of known Y values (in this case, the turnover data);
the range $x_1, \dots, x_n$, i.e. the sizes of the retail space;
the new values of x for which the turnover is to be predicted (for their location on the worksheet, see below).

In addition, the formula contains the logical variable “Const”. If you enter 1 (TRUE) in the corresponding field or leave it empty, the constant b is calculated in the usual way; if you enter 0 (FALSE), the calculation is carried out assuming that b = 0.

If you need the forecast for more than one x value, then after entering the formula do not press “Enter”; instead press the combination “Ctrl” + “Shift” + “Enter” on the keyboard.

Some features

Regression analysis can be accessible even to dummies. The Excel formula for predicting the value of an array of unknown variables—TREND—can be used even by those who have never heard of least squares. It is enough just to know some of the features of its work. In particular:

If the range of known y values lies in a single column, each column of known x values is treated as a separate variable; if it lies in a single row, each row of known x values is treated as a separate variable.
If the range of known x values is not specified in the TREND window, the program treats it as the array of integers 1, 2, 3, …, with as many elements as the range of given y values.
To output an array of “predicted” values, the trend expression must be entered as an array formula.
If new x values are not specified, the TREND function takes them to be equal to the known x values. If the known x values are not specified either, the array 1; 2; 3; 4; … of the same size as the range of given y values is used for both.
The range containing the new x values must have a column (or row) for each independent variable, just like the range of known x values. In other words, it must match the layout of the independent variables.
An array of known x values can contain several variables. If there is only one variable, the ranges of given x and y values can have any shape as long as their dimensions are equal. With several variables, the range of given y values must fit in one column or one row.
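The default-argument behavior listed above can be mimicked with a rough Python analogue. This is only a sketch for a single independent variable, not Excel's implementation; the function name `trend` and its parameter names are chosen here to mirror the worksheet function. Following Excel's documented behavior, the constant is calculated normally unless `const` is false, in which case the line is forced through the origin.

```python
def trend(known_y, known_x=None, new_x=None, const=True):
    """Single-variable analogue of TREND: predict y for new_x from a fitted line."""
    n = len(known_y)
    if known_x is None:
        known_x = list(range(1, n + 1))   # default x values: 1, 2, 3, ...
    if new_x is None:
        new_x = known_x                   # default: predict at the known x values

    if const:
        x_mean = sum(known_x) / n
        y_mean = sum(known_y) / n
        sxx = sum((x - x_mean) ** 2 for x in known_x)
        sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(known_x, known_y))
        slope = sxy / sxx
        intercept = y_mean - slope * x_mean
    else:
        # Line forced through the origin: y = slope * x.
        slope = sum(x * y for x, y in zip(known_x, known_y)) / sum(x * x for x in known_x)
        intercept = 0.0

    return [slope * x + intercept for x in new_x]


print(trend([3, 5, 7]))                                 # x defaults to 1, 2, 3
print(trend([3, 5, 7], new_x=[4]))                      # forecast for the next point
print(trend([2, 4, 6], [1, 2, 3], [4], const=False))    # intercept forced to 0
```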

FORECAST function

Regression analysis in Excel is implemented using several functions. One of them is called “FORECAST”. It is similar to “TREND”, i.e. it returns the result of a least squares calculation, but only for a single X whose Y value is unknown.

Now you know the Excel formulas that let even a beginner predict the future value of an indicator from a linear trend.

The method of least squares is a mathematical procedure for constructing the linear equation that best fits a set of ordered pairs by finding the values of a and b, the coefficients in the equation of the line. The goal of least squares is to minimize the total squared error between the values of y and ŷ. If for each point we define the error as y − ŷ, the least squares method minimizes:

$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

where n is the number of ordered pairs around the line, so that the line fits the data as closely as possible.

This concept is illustrated in the figure

Based on the figure, the line that best fits the data, the regression line, minimizes the total squared error of the four points on the graph. I'll show you how to determine this using least squares with the following example.

Imagine a young couple who have recently moved in together and share a vanity table in the bathroom. The young man began to notice that half of his table was inexorably shrinking, losing ground to hair mousses and soy complexes. Over the past few months, the guy had been closely monitoring the rate at which the number of objects on her side of the table was increasing. The table below shows the number of items the girl has accumulated on her bathroom vanity over the past few months.

Since our goal is to find out whether the number of items increases over time, “Month” will be the independent variable, and “Number of items” will be the dependent variable.

Using the least squares method, we determine the equation that best fits the data by calculating the values of a, the y-intercept, and b, the slope of the line:

$$b = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}$$

$$a = \bar{y} - b\bar{x}$$

where $\bar{x}$ is the average value of x, the independent variable, and $\bar{y}$ is the average value of y, the dependent variable.

The table below summarizes the calculations required for these equations.

The regression line for our bathroom example is given by the following equation:

ŷ = 5.13 + 0.976x

Since our equation has a positive slope of 0.976, the guy has evidence that the number of items on the table increases over time at an average rate of about 1 item per month. The graph shows the regression line with the ordered pairs.

The expected number of items six months from now (month 16) is calculated as follows:

ŷ = 5.13 + 0.976x = 5.13 + 0.976(16) ≈ 20.7 ≈ 21 items
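The same calculation can be sketched in Python using the sums from a calculation table (Σx, Σy, Σxy, Σx²). The monthly data from the article's table are not reproduced here, so the fitting function is demonstrated on hypothetical numbers; only the coefficients 5.13 and 0.976 come from the text. The function names are made up for illustration.

```python
def regression_line(xs, ys):
    """Return (a, b): intercept and slope of the least squares line y = a + b*x."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # slope
    a = sum_y / n - b * sum_x / n                                  # intercept
    return a, b


def forecast(a, b, x):
    """Value of the line y = a + b*x at the point x."""
    return a + b * x


# Hypothetical data lying exactly on y = 1 + 2x, to check the formulas.
print(regression_line([1, 2, 3, 4], [3, 5, 7, 9]))

# Forecast for month 16 with the coefficients quoted in the text.
items = forecast(5.13, 0.976, 16)
print(items, round(items))   # about 20.7, which rounds to 21
```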

So, it's time for our hero to take some action.

TREND function in Excel

As you have probably guessed, Excel has a function for calculating values by the least squares method. This function is called TREND. Its syntax is as follows:

TREND (known Y values; known X values; new X values; constant)

known Y values – an array of dependent variables, in our case, the number of objects on the table

known values X – an array of independent variables, in our case this is the month

new X values – new X values (months) for which TREND function returns the expected value of the dependent variables (number of items)

const - optional. A Boolean value that specifies whether the constant term (the intercept) is forced to be 0: if TRUE or omitted, it is calculated normally; if FALSE, it is set to 0.

For example, the figure shows the TREND function used to determine the expected number of items on a bathroom vanity for the 16th month.