Determination of the multiple correlation coefficient in MS Excel.

Initially, we include all the principal components in the model for y (the calculated values of the t-statistics are indicated in parentheses):

The quality of the model is characterized by the following: the multiple coefficient of determination R² = 0.517, the average relative error of approximation δ = 10.4%, the residual variance s² = 1.79, and F_obs = 121. Since F_obs > F_cr = 2.85 at α = 0.05, ν1 = 6, ν2 = 14, the regression equation is significant, and at least one of the regression coefficients β1, β2, β3, β4 is not equal to zero.

The significance of the regression equation (the hypothesis H0: β1 = β2 = β3 = β4 = 0) was checked at α = 0.05; the significance of the individual regression coefficients, i.e. the hypotheses H0: βj = 0 (j = 1, 2, 3, 4), should be tested at a higher significance level, for example α = 0.1. At α = 0.1, ν = 14, the value t_cr = 1.76, and, as follows from equation (53.41), the significant regression coefficients are β1, β2 and β3.
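For reference, the critical values used here can be reproduced without statistical tables. Below is a minimal Python sketch (an illustration, not part of the original calculation) that obtains F_cr and t_cr with scipy:

# Sketch: reproducing the critical values cited above with scipy.stats.
from scipy import stats

f_cr = stats.f.ppf(1 - 0.05, dfn=6, dfd=14)   # upper 5% point of F(6, 14)
print(round(f_cr, 2))                          # ~2.85

t_cr = stats.t.ppf(1 - 0.10 / 2, df=14)        # two-sided 10% point of t(14)
print(round(t_cr, 2))                          # ~1.76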

Since the principal components are uncorrelated with one another, we can eliminate all the insignificant coefficients from the equation at once, and the equation takes the form

(53.42)

Comparing equations (53.41) and (53.42), we see that the exclusion of the insignificant principal components f4 and f5 did not affect the values of the coefficients of the equation b0 = 9.52, b1 = 0.93, b2 = 0.66, or the corresponding tj (j = 0, 1, 2, 3).

This is explained by the uncorrelatedness of the principal components. Interesting here is the parallel between the regression equations for the initial indicators, (53.22) and (53.23), and for the principal components, (53.41) and (53.42).

Equation (53.42) is significant because F_obs = 194 > F_cr = 3.01, found at α = 0.05, ν1 = 4, ν2 = 16. The coefficients of the equation are also significant, since tj > t_cr = 1.746, corresponding to α = 0.1, ν = 16, for j = 0, 1, 2, 3. The coefficient of determination R² = 0.486 indicates that 48.6% of the variation in y is due to the influence of the first three principal components.

Equation (53.42) is characterized by an average relative error of approximation δ = 9.99% and a residual variance s² = 1.91.

The regression equation on the principal components (53.42) has slightly better approximating properties than the regression model (53.23) on the initial indicators: R² = 0.486 > 0.469; δ(f) = 9.99% < δ(x) = 10.5%; and s²(f) = 1.91 < s²(x) = 1.97. In addition, in equation (53.42) the principal components are linear functions of all the initial indicators, whereas equation (53.23) includes only two variables (x1 and x4). On the other hand, model (53.42) is difficult to interpret, since it includes the third principal component f3, which we have not interpreted and whose contribution to the total variance of the initial indicators (x1, ..., x5) is only 8.6%. However, excluding f3 from equation (53.42) significantly worsens the approximating properties of the model: R² = 0.349, δ = 12.4% and s²(f) = 2.41. In that case, it is advisable to choose equation (53.23) as the regression model of yield.
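For illustration, a regression on principal components of the kind discussed above can be sketched in Python. The data below are synthetic placeholders, not the yield data of the example, and the component count (three) simply mirrors equation (53.42):

# Sketch: principal-component regression (PCR); synthetic placeholder data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))   # 20 objects, 5 initial indicators (illustrative)
y = X @ np.array([0.9, 0.7, 0.3, 0.0, 0.0]) + rng.normal(scale=1.0, size=20)

F = PCA().fit_transform(StandardScaler().fit_transform(X))  # scores f1..f5

# Because the components are uncorrelated, dropping the insignificant f4, f5
# leaves the coefficients of the remaining components unchanged.
model = LinearRegression().fit(F[:, :3], y)
print(model.intercept_, model.coef_)
print("R^2 =", model.score(F[:, :3], y))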

Cluster analysis

In statistical research, the grouping of primary data is the principal technique for solving classification problems, and thereby the basis for all further work with the collected information.

Traditionally, this problem is solved as follows. From the many features describing an object, one is selected that is the most informative from the researcher's point of view, and the data are grouped according to the values of this feature. If the classification has to be carried out on several features, ranked among themselves by degree of importance, then the classification is first performed on the first feature, each of the resulting classes is then divided into subclasses on the second feature, and so on. Most combinational statistical groupings are constructed in a similar way.

In cases where it is impossible to rank the classification features, the simplest method of multidimensional grouping is used: the creation of an integral indicator (index), functionally dependent on the initial features, followed by classification according to this indicator.

A development of this approach is a classification option based on several general indicators (principal components) obtained using factor or component analysis methods.

If there are several features (initial or generalized), the classification problem can be solved by cluster analysis methods, which differ from other multidimensional classification methods by the absence of training samples, i.e. a priori information about the distribution of the population.

The differences between schemes for solving a classification problem are largely determined by what is meant by the concepts of “similarity” and “degree of similarity.”

Once the goal of the work has been formulated, it is natural to try to determine quality criteria, an objective function, the values ​​of which will allow one to compare different classification schemes.

In economic studies, the objective function, as a rule, should minimize some parameter defined on a set of objects (for example, the purpose of classifying equipment may be a grouping that minimizes the total cost of time and money for repair work).

In cases where it is not possible to formalize the goal of the task, the criterion for the quality of classification can be the possibility of meaningful interpretation of the found groups.

Let's consider the following problem. Let a population of n objects be studied, each of which is characterized by k measured features. It is required to partition this population into groups (classes) that are homogeneous in a certain sense; at the same time, there is practically no a priori information about the nature of the distribution of the k-dimensional vector X inside the classes.

The groups obtained as a result of the partitioning are usually called clusters* (taxa**, images), and the methods of finding them are called cluster analysis (respectively, numerical taxonomy or pattern recognition with self-learning).

* Cluster (English): a group of elements characterized by some common property.

** Taxon (English): a systematic group of any category.

It is necessary from the very beginning to clearly understand which of the two classification problems is to be solved. If the usual typification problem is being solved, then the set of observations is divided into a relatively small number of grouping areas (for example, an interval variation series in the case of one-dimensional observations) so that the elements of one such area are as close as possible to each other.

The solution to another problem is to determine the natural stratification of observational results into clearly defined clusters lying at a certain distance from each other.

If the first typification problem always has a solution, then in the second case it may turn out that the set of observations does not reveal a natural stratification into clusters, i.e. forms one cluster.

Although many cluster analysis methods are quite elementary, most of the work in which they were proposed dates back to the last decade. This is explained by the fact that the effective solution of cluster search problems, which requires performing a large number of arithmetic and logical operations, became possible only with the emergence and development of computer technology.

The usual form of representing the initial data in cluster analysis problems is the matrix

X = (x_ij), i = 1, ..., n; j = 1, ..., k,

each row of which contains the results of measuring the k considered features on one of the n examined objects. In specific situations, both the grouping of objects and the grouping of features may be of interest. In cases where the difference between these two tasks is not essential, for example when describing some algorithms, we will use only the term "object", including in this concept the term "feature" as well.

The matrix X is not the only way of presenting the data in cluster analysis problems. Sometimes the initial information is given in the form of a square matrix R = (r_ij), whose element r_ij determines the degree of proximity of the i-th object to the j-th.

Most cluster analysis algorithms are based entirely on the matrix of distances (or proximities) or require the calculation of its individual elements, so if the data are presented in the form of X, the first stage of solving the cluster-search problem is the choice of a method for calculating the distance, or proximity, between objects or between features.

The question of determining the proximity between features is somewhat easier to solve. As a rule, cluster analysis of features pursues the same goals as factor analysis: the identification of groups of related features that reflect a certain aspect of the objects being studied. In this case the measures of proximity are various statistical coefficients of association.
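To make the first stage concrete, here is a minimal Python sketch (synthetic data) that passes from the data matrix X to a matrix of pairwise distances and then to clusters:

# Sketch: from the data matrix X to a distance matrix and clusters.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(10, 3)),   # n = 20 objects, k = 3 features
               rng.normal(3, 0.5, size=(10, 3))])

d = pdist(X, metric="euclidean")   # pairwise distances between objects
D = squareform(d)                  # full n-by-n distance matrix (analogue of R)

Z = linkage(d, method="average")   # agglomerative hierarchical clustering
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)                      # the two natural clusters are recovered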




Test No. 2

Option No. 5

Task 1. Using computer technology, conduct a correlation and regression analysis of the economic indicators under study and build a regression model

1.1 Construction of the correlation field
1.2 Construction of a matrix of pair correlation coefficients
1.3 Construction and analysis of single-factor regression models of linear and exponential form using the built-in functions of MS Excel
1.4 Construction of a linear one-factor regression model
1.5 Conclusions

Task 2. Using computer technology, solve linear programming problems

a) Optimal production planning problem

1. Mathematical formulation of the problem
2. Placement of the source data on the MS Excel worksheet; calculation of the constraint values and of the objective function values
3. Formulation of the mathematical model of the problem in terms of the cells of the MS Excel worksheet
4. Search for the optimal solution of the problem using the "Search for Solution" (Solver) add-in
5. Analysis of results

b) Transportation plan optimization problem (transport problem)

1. Mathematical formulation of the problem
2. Placement of the data on the MS Excel worksheet
3. Statement of the problem in terms of an Excel worksheet for using the "Search for Solution" (Solver) utility
4. Analysis of results

List of references

Task 1. Using computer technology, conduct a correlation and regression analysis of the economic indicators under study and build a regression model.

Use the following as research tools:

the "Analysis Package" (Analysis ToolPak) add-in of MS Excel;

the built-in functions of the Stats (Statistics) library of the Maple computer mathematics system.

Conditions for task 1:

Using sample data, investigate the influence of factors X1, X2 and X3 on the effective trait Y.

Construct a correlation field and make an assumption about the presence and type of connection between the factors under study;

Having assessed the closeness of the relationship between the factors under study, construct a multifactor linear regression model of the form Y = f(X1, X2, X3) or a one-factor model of the form Y = f(X).

Estimate:

The adequacy of the regression equation, by the value of the coefficient of determination R²;

The significance of the coefficients of the regression equation according to Student's t-test at a given significance level p = 0.05;

The degree of randomness of the relationship between each factor X and trait Y (Fisher criterion);

The relationship between the indicators of fixed assets X1, X2, X3 and the volume of gross output Y of an enterprise in one of the industries is characterized by the following data:

Option 5

X1:  1.5   2.6   3.5   4.8   5.9   6.3   7.2   8.9   9.5   11.1  15.0
X2:  10.2  15.3  18.4  20.5  24.7  25.6  27.3  28.3  29.6  30.1  31.0
X3:  1.1   2.3   3.5   4.1   5.7   6.6   7.3   8.5   9.8   10.1  12.0
Y:   ?

Solution to task 1.

The solution to task 1 involves the following steps.

1. Construction of a correlation field.

2. Construction of a matrix of pair correlation coefficients.

3. Construction and analysis of single-factor regression models of linear and exponential form using the built-in functions of TP MS Excel.

4. Construction of linear one-factor regression models using the “Analysis Package” add-in.

5. Conclusions.

Construction of a correlation field.

Let's place the table with the source data in cells A3:D15 of the Excel worksheet.

Appendix 1.1

Y     X1     X2     X3
?     1.5    10.2   1.1
?     2.6    15.3   2.3
?     3.5    18.4   3.5
?     4.8    20.5   4.1
?     5.9    24.7   5.7
?     6.3    25.6   6.6
?     7.2    27.3   7.3
?     8.9    28.3   8.5
?     9.5    29.6   9.8
?     11.1   30.1   10.1

Using the chart wizard of MS Excel, we construct a correlation field, that is, we graphically represent the relationship between the resulting attribute Y and each of the factors X. The charts show that between Y and each of the factors X there is a directly proportional relationship close to linear.
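The same correlation field can also be drawn outside Excel. Below is a Python sketch using the X1–X3 columns of the task table; the Y values are hypothetical placeholders, since the variant leaves them unspecified:

# Sketch: correlation field (scatter of Y against each factor X).
import matplotlib.pyplot as plt

x1 = [1.5, 2.6, 3.5, 4.8, 5.9, 6.3, 7.2, 8.9, 9.5, 11.1, 15.0]
x2 = [10.2, 15.3, 18.4, 20.5, 24.7, 25.6, 27.3, 28.3, 29.6, 30.1, 31.0]
x3 = [1.1, 2.3, 3.5, 4.1, 5.7, 6.6, 7.3, 8.5, 9.8, 10.1, 12.0]
y = [2.0, 3.1, 4.0, 5.2, 6.1, 6.7, 7.8, 9.1, 10.0, 11.2, 14.9]  # hypothetical Y

fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, x, name in zip(axes, [x1, x2, x3], ["X1", "X2", "X3"]):
    ax.scatter(x, y)           # one panel of the correlation field
    ax.set_xlabel(name)
    ax.set_ylabel("Y")
plt.tight_layout()
plt.show()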


We explore the closeness and nature of the connection between factors.

Construction of a matrix of pair correlation coefficients.

Using the "Analysis Package" add-in of MS Excel (Tools – Data Analysis – Correlation), we build the matrix of pair correlation coefficients. The "Correlation" tool window is shown in Fig. 1, and the resulting matrix of pair correlation coefficients in Fig. 2.

Fig. 1. The "Correlation" tool window.

Fig. 2. The matrix of pair correlation coefficients.

From this matrix it is clear that all the factors X1–X3 under consideration are closely related to the resultant attribute Y. In addition, the factors X are multicollinear with one another; therefore, constructing a multifactor model of the form Y = f(X1, X2, X3) is impossible.

The correlation coefficient reflects the degree of relationship between two indicators. It always takes a value from −1 to 1. If the coefficient is close to 0, there is no connection between the variables.

If the value is close to one (0.9 and above, for example), there is a strong direct relationship between the observed variables. If the coefficient is close to the other end of the range (−1), there is a strong inverse relationship. When the value lies somewhere between 0 and 1 or between 0 and −1, we are talking about a weak connection (direct or inverse, respectively). Such a relationship is usually not taken into account: it is considered non-existent.

Calculation of correlation coefficient in Excel

Let's look at an example of methods for calculating the correlation coefficient, features of direct and inverse relationships between variables.

Values of the indicators x and y:

y is the independent variable, x is the dependent variable. It is necessary to find the strength (strong/weak) and the direction (direct/inverse) of the connection between them. The correlation coefficient formula looks like this:

r_xy = Σ(x_i − x̄)(y_i − ȳ) / √( Σ(x_i − x̄)² · Σ(y_i − ȳ)² )
To make it easier to understand, the calculation can be broken down into several simple elements (deviations from the means, their products and squares). Carrying out these intermediate calculations for the sample data shows that a strong direct relationship holds between the variables.

The built-in CORREL function avoids these cumbersome calculations. Let's use it to calculate the pair correlation coefficient in Excel: call the function wizard, find CORREL, and pass as the function's arguments the array of y values and the array of x values.
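The same computation is easy to verify outside Excel. A minimal Python sketch with illustrative data, checking the textbook formula against numpy's built-in:

# Sketch: the pair correlation coefficient, as Excel's CORREL computes it.
import numpy as np

x = np.array([3.0, 4.0, 5.0, 6.0, 8.0, 9.0, 11.0])      # illustrative data
y = np.array([10.0, 14.0, 16.0, 20.0, 27.0, 30.0, 37.0])

dx, dy = x - x.mean(), y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx**2).sum() * (dy**2).sum())
r_numpy = np.corrcoef(x, y)[0, 1]   # same value via numpy
print(r_manual, r_numpy)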

Let's show the values ​​of the variables on the graph:


A strong connection between y and x is visible, because the lines run almost parallel to each other. The relationship is direct: as y grows, x grows; as y decreases, x decreases.



Pair correlation coefficient matrix in Excel

The correlation matrix is a table in which the correlation coefficients between the corresponding values are located at the intersections of rows and columns. It makes sense to build it for several variables.

The matrix of correlation coefficients in Excel is constructed using the “Correlation” tool from the “Data Analysis” package.


A strong direct relationship was found between the values of y and x1. There is a strong inverse relationship between x1 and x2. There is practically no connection with the values in column x3.
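Outside the Analysis ToolPak, the same matrix can be produced, for example, with pandas; the columns below are synthetic stand-ins chosen to mimic the relationships just described:

# Sketch: a matrix of pair correlation coefficients via pandas (synthetic data).
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
x1 = rng.normal(size=30)
df = pd.DataFrame({
    "y":  2.0 * x1 + rng.normal(scale=0.3, size=30),   # strongly tied to x1
    "x1": x1,
    "x2": -x1 + rng.normal(scale=0.4, size=30),        # strong inverse tie to x1
    "x3": rng.normal(size=30),                         # essentially unrelated
})
print(df.corr().round(3))   # mirrors the output of the "Correlation" tool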

1. Calculate the matrix of pair correlation coefficients; analyze the closeness and direction of the connection of the resulting characteristic Y with each factor X; evaluate the statistical significance of correlation coefficients r(Y,X i); choose the most informative factor.

2. Construct a paired regression model with the most informative factor; give an economic interpretation of the regression coefficient.

3. Assess the quality of the model using the average relative error of approximation, coefficient of determination and Fisher’s F test (accept significance level α=0.05).

4. With a confidence probability of γ = 80%, predict the average value of the indicator Y (the forecast values of the factors are given in Appendix 6). Present graphically the actual and model values of Y and the prediction results.

5. Using the inclusion method, build two-factor models, keeping the most informative factor in them; build a three-factor model with a complete list of factors.

6. Select the best of the constructed multiple models. Give an economic interpretation of its coefficients.

7. Check the significance of multiple regression coefficients using t–Student's test (accept significance level α=0.05). Has the quality of the multiple model improved compared to the paired model?

8. Assess the influence of factors on the result using elasticity coefficients, beta and delta coefficients.
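For item 8, the following Python sketch shows the usual textbook definitions of the elasticity, beta and delta coefficients on synthetic data. The formulas E_j = b_j · x̄_j / ȳ, β_j = b_j · S_xj / S_y and Δ_j = r_yxj · β_j / R² are assumed here as the standard definitions; check them against your course materials:

# Sketch: elasticity, beta and delta coefficients of a multiple regression.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(loc=5.0, scale=2.0, size=(40, 3))            # synthetic factors
y = 1.0 + X @ np.array([0.8, 0.3, -0.2]) + rng.normal(scale=0.5, size=40)

A = np.column_stack([np.ones(len(y)), X])
b = np.linalg.lstsq(A, y, rcond=None)[0]                    # b0, b1, b2, b3
r2 = 1 - ((y - A @ b)**2).sum() / ((y - y.mean())**2).sum()

elasticity = b[1:] * X.mean(axis=0) / y.mean()              # E_j = b_j * x̄_j / ȳ
beta = b[1:] * X.std(axis=0, ddof=1) / y.std(ddof=1)        # β_j = b_j * S_xj / S_y
r_yx = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(3)])
delta = r_yx * beta / r2                                    # Δ_j = r_yxj * β_j / R²
print(elasticity, beta, delta, sep="\n")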

Task 2. Modeling a univariate time series

Appendix 7 presents the time series Y(t) of socio-economic indicators for the Altai Territory for the period from 2000 to 2011. It is required to study the dynamics of the indicator corresponding to the task variant.

Option Designation, name, unit of measurement of the indicator
Y1 Average consumer spending per capita (per month), rub.
Y2 Emissions of pollutants into the atmospheric air, thousand tons
Y3 Average prices on the secondary housing market (at the end of the year, per square meter of total area), rubles
Y4 Volume of paid services per capita, rub
Y5 Average annual number of people employed in the economy, thousand people
Y6 Number of own passenger cars per 1000 population (at the end of the year), units
Y7 Average per capita cash income (per month), rub.
Y8 Consumer price index (December compared to December of the previous year), %
Y9 Investments in fixed assets (in actual prices), million rubles
Y10 Retail trade turnover per capita (in actual prices), rubles


Work order

1. Construct a linear time series model, the parameters of which can be estimated by least squares. Explain the meaning of the regression coefficient.

2. Assess the adequacy of the constructed model using the properties of randomness, independence and compliance of the residual component with the normal distribution law.

3. Assess the accuracy of the model based on the use of the average relative error of approximation.

4. Forecast the indicator under consideration for a year in advance (calculate the forecast interval with a confidence probability of 70%).

5. Present graphically the actual values ​​of the indicator, the results of modeling and forecasting.

6. Calculate the parameters of logarithmic, polynomial (2nd degree polynomial), power, exponential and hyperbolic trends. Based on the graphical image and the value of the determination index, select the most suitable type of trend.

7. Using the best nonlinear model, make a point forecast of the indicator in question for the year ahead. Compare the result obtained with the confidence forecast interval constructed using a linear model.
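For items 1 and 6 above, here is a Python sketch of the trend comparison on a synthetic 12-year series (not an actual Altai Territory indicator). Each trend is fitted by least squares after a suitable transformation and ranked by the determination index computed on the original scale:

# Sketch: comparing linear and nonlinear trends by R^2 (synthetic series).
import numpy as np

t = np.arange(1, 13)                              # 12 years, e.g. 2000-2011
y = 50 * np.exp(0.12 * t) + np.random.default_rng(4).normal(scale=5, size=12)

def r2(y, y_hat):
    return 1 - ((y - y_hat)**2).sum() / ((y - y.mean())**2).sum()

fits = {}
b = np.polyfit(t, y, 1);         fits["linear"] = r2(y, np.polyval(b, t))
b = np.polyfit(t, y, 2);         fits["polynomial-2"] = r2(y, np.polyval(b, t))
b = np.polyfit(np.log(t), y, 1); fits["logarithmic"] = r2(y, np.polyval(b, np.log(t)))
b = np.polyfit(np.log(t), np.log(y), 1)           # power trend: y = a * t^b
fits["power"] = r2(y, np.exp(np.polyval(b, np.log(t))))
b = np.polyfit(t, np.log(y), 1)                   # exponential trend: y = a * e^(bt)
fits["exponential"] = r2(y, np.exp(np.polyval(b, t)))
b = np.polyfit(1 / t, y, 1);     fits["hyperbolic"] = r2(y, np.polyval(b, 1 / t))

for name, score in sorted(fits.items(), key=lambda kv: -kv[1]):
    print(f"{name:13s} R^2 = {score:.3f}")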

EXAMPLE

Carrying out the test

Problem 1

The company sells used cars. The names of indicators and initial data for econometric modeling are presented in the table:

Sales price, thousand c.u. (Y); price of a new car, thousand c.u. (X1); service life, years (X2); left-hand drive = 1, right-hand drive = 0 (X3):

Y       X1      X2     X3
8.33    13.99   3.8
10.40   19.05   2.4
10.60   17.36   4.5
16.58   25.00   3.5
20.94   25.45   3.0
19.13   31.81   3.5
13.88   22.53   3.0
8.80    16.24   5.0
13.89   16.54   2.0
11.03   19.04   4.5
14.88   22.61   4.6
20.43   27.56   4.0
14.80   22.51   3.3
26.05   31.75   2.3

Required:

1. Calculate the matrix of pair correlation coefficients; analyze the closeness and direction of the connection between the resulting characteristic Y and each of the factors X; evaluate the statistical significance of the correlation coefficients r(Y, X i); choose the most informative factor.

We use Excel (Data / Data Analysis / CORRELATION):

We obtain a matrix of pairwise correlation coefficients between all available variables:

      Y          X1         X2         X3
Y     1
X1    0.910987   1
X2    -0.4156    -0.2603    1
X3    0.190785   0.221927   -0.30308   1

Let's analyze the correlation coefficients between the resulting characteristic Y and each of the factors X j:

r(Y, X1) = 0.911 > 0; therefore, there is a direct correlation between the variables Y and X1: the higher the price of a new car, the higher the selling price. Since r(Y, X1) > 0.7, this dependence is close.

r(Y, X2) = −0.416 < 0, which means that an inverse correlation is observed between the variables Y and X2: the selling price is lower for automobiles with a long service life. This dependence is moderate, closer to weak.

r(Y, X3) = 0.191 > 0, which means that there is a direct correlation between the variables Y and X3: the selling price is higher for left-hand-drive cars. Since r(Y, X3) < 0.4, this dependence is weak.

To check the significance of the found correlation coefficients, we use the Student's test.

For each correlation coefficient, let's calculate the t-statistic by the formula t = |r|·√(n − 2) / √(1 − r²) and enter the calculation results in an additional column of the correlation table:

      Y          X1         X2         X3    t-statistic
Y     1
X1    0.910987   1                           7.651524603
X2    -0.4156    -0.2603    1                1.582847988
X3    0.190785   0.221927   -0.30308   1     0.673265587

From the table of critical points of Student's distribution, at the significance level α = 0.05 and the number of degrees of freedom ν = n − 2 = 12, we determine the critical value t_cr ≈ 2.18 (Appendix 1, or Excel's TINV function).

t1 = 7.65 > t_cr, so the correlation coefficient r(Y, X1) is significant: the dependence between the selling price Y and the price of a new car X1 is reliable.

t2 = 1.58 < t_cr, so the coefficient is not significant: based on the sample data, there is no reason to assert that the dependence between the selling price Y and the service life X2 is reliable.

t3 = 0.67 < t_cr, so this coefficient is not significant either: based on the sample data, there is no reason to assert that the dependence between the selling price Y and the steering-wheel position X3 is reliable.

Thus, the closest and most significant relationship is observed between the selling price Y and the price of a new car X1; the factor X1 is the most informative.
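The whole significance check can be reproduced with a short Python sketch, using the coefficients from the table above and n = 14 observations:

# Sketch: significance of pair correlation coefficients r(Y, Xj), n = 14.
import numpy as np
from scipy import stats

n = 14
r = {"X1": 0.911, "X2": -0.416, "X3": 0.191}   # values from the matrix above

t_cr = stats.t.ppf(1 - 0.05 / 2, df=n - 2)     # two-sided, alpha = 0.05
print(f"t_cr = {t_cr:.2f}")                    # ~2.18

for name, rj in r.items():
    t = abs(rj) * np.sqrt(n - 2) / np.sqrt(1 - rj**2)
    print(f"r(Y, {name}) = {rj:+.3f}: t = {t:.2f}",
          "significant" if t > t_cr else "not significant")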

Analysis of the matrix of pair correlation coefficients shows that the effective indicator is most closely related to the indicator x(4), the amount of fertilizer consumed per hectare (r(y, x(4)) = 0.58).

At the same time, the relationships between the attribute arguments are quite close. Thus, there is a practically functional relationship between the number of wheeled tractors (x(1)) and the number of surface tillage implements (x(3)). The presence of multicollinearity is also indicated by the correlation coefficients between the other pairs of these indicators. Considering the close relationship between the indicators x(1), x(2) and x(3), only one of them can be included in the yield regression model.

To demonstrate the negative impact of multicollinearity, let's consider a regression model of yield that includes all the input indicators:


F_obs = 121.

The values of the corrected estimates of the standard deviations of the coefficient estimates are indicated in parentheses under the equation.

The following adequacy characteristics accompany the regression equation: the multiple coefficient of determination R² = 0.517; the corrected estimate of the residual variance s² = 1.79; the average relative error of approximation δ = 10.4%; and the calculated value of the criterion F_obs = 121.

The regression equation is significant because F_obs = 121 > F_cr = 2.85, found from the F-distribution table at α = 0.05, ν1 = 6 and ν2 = 14.

It follows that β ≠ 0, i.e., at least one of the coefficients of the equation βj (j = 0, 1, 2, ..., 5) is not zero.

To test the hypothesis of the significance of the individual regression coefficients, H0: βj = 0, where j = 1, 2, 3, 4, 5, we compare the critical value t_cr = 2.14, found from the t-distribution table at significance level α = 2Q = 0.05 and ν = 14 degrees of freedom, with the calculated values tj. It follows from the equation that only the regression coefficient at x(4) is statistically significant, since t4 = 2.90 > t_cr = 2.14.

The negative signs of the regression coefficients at x(1) and x(5) do not lend themselves to economic interpretation: it would follow from them that an increase in the saturation of agriculture with wheeled tractors (x(1)) and plant protection products (x(5)) has a negative effect on yield. Therefore, the resulting regression equation is unacceptable.

To obtain a regression equation with significant coefficients, we use a stepwise regression analysis algorithm. We first apply the stepwise algorithm with elimination of variables.

Let's exclude from the model the variable x(1), which corresponds to the minimum absolute value t1 = 0.01. For the remaining variables, we again construct the regression equation:

The resulting equation is significant, because F_obs = 155 > F_cr = 2.90, found at significance level α = 0.05 and degrees of freedom ν1 = 5 and ν2 = 15 from the F-distribution table; i.e., the vector β ≠ 0. However, only the regression coefficient at x(4) is significant. The calculated values of tj for the other coefficients are less than t_cr = 2.131, found from the t-distribution table at α = 2Q = 0.05 and ν = 15.

Next we exclude from the model the variable x(3), which corresponds to the minimum value t3 = 0.35, and obtain the regression equation:

(2.9)

In the resulting equation, the coefficient at x(5) is insignificant. Excluding x(5), we obtain the regression equation:

(2.10)

We obtained a significant regression equation with significant and interpretable coefficients.

However, the resulting equation is not the only “good” and not the “best” yield model in our example.
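The elimination procedure just carried out by hand can be sketched in Python with statsmodels; the data here are synthetic placeholders, and the critical value is held fixed for simplicity (in the worked example it changes slightly as the degrees of freedom grow):

# Sketch: backward elimination by the minimum |t|-statistic.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = pd.DataFrame(rng.normal(size=(20, 5)), columns=[f"x{j}" for j in range(1, 6)])
y = 9.9 + 0.3 * x["x1"] + 3.6 * x["x4"] + rng.normal(scale=1.0, size=20)

features = list(x.columns)
t_cr = 2.13                                    # illustrative critical value
while features:
    model = sm.OLS(y, sm.add_constant(x[features])).fit()
    t_abs = model.tvalues.drop("const").abs()
    if t_abs.min() >= t_cr:
        break                                  # all remaining coefficients significant
    features.remove(t_abs.idxmin())            # drop the least significant variable
print(model.summary().tables[1])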

Let's show that under multicollinearity a stepwise algorithm with inclusion of variables is more efficient. At the first step, the variable x(4) was included in the yield model, since it has the highest correlation coefficient with the explained variable: r(y, x(4)) = 0.58. At the second step, including in the equation, along with x(4), the variable x(1) or x(3), we obtain models that, for economic reasons and by their statistical characteristics, surpass (2.10):

(2.11)

(2.12)

Including any of the three remaining variables in the equation worsens its properties; see, for example, equation (2.9).

Thus, we have three “good” yield models, from which we need to choose one for economic and statistical reasons.

According to the statistical criteria, model (2.11) is the most adequate: it has the minimum values of the residual variance s² = 2.26 and of the average relative error of approximation, and the largest values of the coefficient of determination and F_obs = 273.

Model (2.12) has slightly worse adequacy indicators, followed by model (2.10).

We will now choose the better of models (2.11) and (2.12). These models differ from each other in the variables x(1) and x(3). However, in yield models the variable x(1) (the number of wheeled tractors per 100 ha) is preferable to the variable x(3) (the number of surface tillage implements per 100 ha), which is to some extent secondary to (or derived from) x(1).

In this regard, for economic reasons, preference should be given to model (2.12). Thus, after implementing the stepwise regression analysis algorithm with inclusion of variables, and taking into account that only one of the three related variables (x(1), x(2) or x(3)) should enter the equation, we choose the final regression equation:

The equation is significant at α = 0.05, because F_obs = 266 > F_cr = 3.20, found from the F-distribution table at α = 2Q = 0.05, ν1 = 3 and ν2 = 17. All the regression coefficients in the equation are also significant: tj > t_cr(α = 2Q = 0.05; ν = 17) = 2.11. The regression coefficient β1 should be considered significant (β1 ≠ 0) for economic reasons, even though t1 = 2.09 is only slightly less than t_cr = 2.11.

From the regression equation it follows that an increase of one in the number of tractors per 100 ha of arable land (at a fixed value of x(4)) leads to an increase in grain yield of, on average, 0.345 c/ha.

An approximate calculation of the elasticity coefficients, e1 ≈ 0.068 and e2 ≈ 0.161, shows that when the indicators x(1) and x(4) increase by 1%, the grain yield increases on average by 0.068% and 0.161%, respectively.

The multiple coefficient of determination R² = 0.469 indicates that only 46.9% of the variation in yield is explained by the indicators included in the model (x(1) and x(4)), that is, by the saturation of crop production with tractors and fertilizers. The rest of the variation is due to the action of unaccounted factors (x(2), x(3), x(5), weather conditions, etc.). The adequacy of the model is also characterized by the average relative error of approximation δ = 10.5% and the residual variance s² = 1.97.

When interpreting the regression equation, the values of the relative deviations δi = ((yi − ŷi) / ŷi) · 100% are of interest. Recall that ŷi is the model value of the effective indicator: it characterizes the average yield for the totality of the regions under consideration, provided that the values of the explanatory variables x(1) and x(4) are fixed at the given level, namely x(1) = xi(1) and x(4) = xi(4). Then, by the values of δi, the regions can be compared by yield: regions with δi > 0 have an above-average yield, and those with δi < 0 a below-average one.

In our example, in terms of yield, crop production is most efficient in the region for which δ7 = 28%, where the yield is 28% above the model average, and least efficient in the region with δ20 = −27.3%.
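A small Python sketch of this comparison, with hypothetical actual and model yield values (the sign convention δi = (yi − ŷi)/ŷi · 100% is assumed, matching the definition above):

# Sketch: per-region relative deviations from the model yield values.
import numpy as np

y = np.array([20.1, 18.4, 25.3, 22.0, 17.6])       # actual yields, c/ha (hypothetical)
y_hat = np.array([19.0, 19.5, 23.8, 23.1, 18.2])   # model values for the same regions

delta = (y - y_hat) / y_hat * 100                  # > 0: above-average yield
avg_rel_error = np.abs((y - y_hat) / y).mean() * 100
print(np.round(delta, 1), round(avg_rel_error, 1))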
