Construct a matrix of pair correlation coefficients. Check for multicollinearity

Economic data are quantitative characteristics of economic objects or processes. They are formed under the influence of many factors, not all of which are accessible to external control. Uncontrollable factors can take random values from some set and thereby make the data they determine random. One of the main tasks of economic research is the analysis of dependencies between variables.

When considering dependencies between characteristics, it is necessary first of all to distinguish two types of connection:

  • functional - characterized by complete correspondence between the change in the factor characteristic and the change in the resulting value: each value of the factor characteristic corresponds to quite definite values of the resulting characteristic. This type of relationship is expressed as a formula. A functional dependence can connect the resulting characteristic with one or several factor characteristics; thus, under time-based pay, the amount of wages depends on the number of hours worked;
  • correlational - there is no complete correspondence between the changes in the two characteristics; the impact of individual factors shows up only on average, over mass observation of actual data. Because a large number of different factors act on the studied characteristic simultaneously, one and the same value of the factor characteristic corresponds to a whole distribution of values of the resulting characteristic, since in each specific case the other factor characteristics can change the strength and direction of their impact.

It should be borne in mind that if there is a functional relationship between the characteristics, then, knowing the value of the factor characteristic, one can determine the value of the resulting characteristic exactly. In the presence of a correlation dependence, only the trend of change in the resulting characteristic can be established when the value of the factor characteristic changes.

When studying the relationships between characteristics, they are classified by direction, form, and number of factors:

  • by direction, connections are divided into direct and inverse. With a direct connection, the resulting characteristic changes in the same direction as the factor characteristic; with an inverse connection, in the opposite direction. For example, the higher a worker's qualifications, the higher the level of his labor productivity (a direct connection); the higher the labor productivity, the lower the unit cost of production (an inverse connection);
  • by form (type of function), connections are divided into linear (straight-line) and nonlinear (curvilinear). A linear relationship is represented by a straight line, a nonlinear one by a curve (parabola, hyperbola, etc.). In a linear relationship, an increase in the value of the factor characteristic is accompanied by a uniform increase (decrease) in the value of the resulting characteristic;
  • by the number of factors acting on the resulting characteristic, connections are divided into single-factor (paired) and multifactor.

The study of the dependence of the variation of a characteristic on the surrounding conditions constitutes the content of correlation theory.

When conducting correlation analysis, the entire set of data is considered as a set of variables (factors), each of which contains n observations.

When studying the relationship between two factors, they are usually denoted X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn).

Covariance is a statistical measure of the interaction of two variables. For example, a positive covariance between the returns of two securities indicates that the returns of these securities tend to move in the same direction.

The covariance between two variables X and Y is calculated as follows:

cov(X, Y) = (1/(n − 1)) Σ (xi − x̄)(yi − ȳ),   (3.2.1)

where xi and yi are the actual values of the variables X and Y, and x̄ and ȳ are their means.

If the random variables X and Y are independent, the theoretical covariance is zero.

Covariance depends on the units in which the variables X and Y are measured; it is a non-standardized quantity. Therefore, to measure the strength of the connection between two variables, another statistic, called the correlation coefficient, is used.

For two variables X and Y, the pair correlation coefficient is defined as follows:

r = cov(X, Y) / (Sx · Sy),   (3.2.2)

where Sx² and Sy² are estimates of the variances of X and Y. These estimates characterize the degree of scatter of the values x1, x2, ..., xn (y1, y2, ..., yn) around their mean x̄ (ȳ, respectively), that is, the variability of these variables over the set of observations.

The variance (estimate of the variance) is determined by the formula

Sx² = (1/(n − 1)) Σ (xi − x̄)².

In general, to obtain an unbiased estimate of the variance, the sum of squares should be divided by the number of degrees of freedom of the estimate (n − p), where n is the sample size and p is the number of constraints imposed on the sample. Since the sample has already been used once to determine the mean x̄, the number of imposed constraints in this case equals one (p = 1), and the number of degrees of freedom of the estimate (i.e., the number of independent sample elements) equals (n − 1).

It is more natural to measure the degree of scatter of a variable's values in the same units in which the variable itself is measured. This is done by an indicator called the standard deviation (standard error) of the variable X (variable Y), determined by the relation Sx = √(Sx²) (Sy = √(Sy²)).

The terms in the numerator of formula (3.2.1) express the interaction of the two variables and determine the sign of the correlation (positive or negative). If, for example, there is a strong positive relationship between the variables (an increase in one variable as the other increases), each term will be a positive number. Likewise, if there is a strong negative relationship between the variables, all the terms in the numerator will be negative, resulting in a negative correlation value.

The denominator of the expression for the pairwise correlation coefficient [see formula (3.2.2)] simply normalizes the numerator so that the correlation coefficient is an easily interpretable dimensionless number taking values from -1 to +1.

The numerator of the expression for the correlation coefficient, which is difficult to interpret because of its unusual units of measurement, is the covariance of X and Y. Although covariance is sometimes used as an independent characteristic (for example, in finance theory to describe the joint movement of stock prices on two exchanges), it is more convenient to use the correlation coefficient. Correlation and covariance carry essentially the same information, but correlation presents it in a more useful form.
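To make these definitions concrete, here is a minimal Python sketch (hypothetical data; numpy assumed) that computes the covariance and the pair correlation coefficient directly from formulas (3.2.1)-(3.2.2):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # hypothetical factor values
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # hypothetical resulting values

n = len(x)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)   # covariance (3.2.1)
r_xy = cov_xy / (x.std(ddof=1) * y.std(ddof=1))              # correlation (3.2.2)

print(cov_xy, r_xy)               # r_xy always lies in [-1, 1]
print(np.corrcoef(x, y)[0, 1])    # numpy's built-in check gives the same r_xy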

To qualitatively assess the correlation coefficient, various scales are used, most often the Chaddock scale. Depending on the value of the correlation coefficient, the relationship can have one of the following ratings:

  • 0.1-0.3 - weak;
  • 0.3-0.5 - noticeable;
  • 0.5-0.7 - moderate;
  • 0.7-0.9 - high;
  • 0.9-1.0 - very high.

Assessment of the degree of closeness of a connection using the correlation coefficient is carried out, as a rule, on the basis of more or less limited information about the phenomenon being studied. In this regard, there is a need to assess the significance of the linear correlation coefficient, which makes it possible to extend conclusions based on the sample results to the general population.

For small sample sizes, the significance of the correlation coefficient is assessed using Student's t-test. The actual (observed) value of the criterion is determined by the formula

t_obs = r √(n − 2) / √(1 − r²).

The value t_obs calculated by this formula is compared with the critical value of the t-criterion, taken from the table of Student's t-test values (see Appendix 2) for the given significance level α and (n − 2) degrees of freedom.

If t_obs > t_table, the resulting value of the correlation coefficient is considered significant (i.e., the null hypothesis that the correlation coefficient equals zero is rejected), and it is concluded that there is a close statistical relationship between the variables under study.
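The same check can be scripted; a sketch in Python (scipy assumed; r = 0.816 and n = 16 are the values that appear in Example 3.2.1 below, with α = 0.1):

import math
from scipy import stats

r, n, alpha = 0.816, 16, 0.1
t_obs = r * math.sqrt(n - 2) / math.sqrt(1 - r**2)   # observed t-statistic
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)        # two-sided critical value

print(t_obs, t_crit)   # about 5.28 > 1.7613, so H0 (zero correlation) is rejected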

If the value r_yx is close to zero, the relationship between the variables is weak. If the correlation between the random variables is:

  • positive, then as one random variable increases, the other tends on average to increase;
  • negative, then as one random variable increases, the other tends on average to decrease.

A convenient graphical tool for analyzing paired data is the scatter plot, which represents each observation as a point in a two-dimensional space corresponding to the two factors. A scatter plot depicting the set of values of two characteristics is also called a correlation field. Each point of this diagram has coordinates (xi, yi). As the strength of the linear relationship increases, the points on the graph lie closer to a straight line and the magnitude of r is closer to unity.

Pairwise correlation coefficients are used to measure the strength of the linear relationship between different pairs of features from a set. For many features, one obtains a matrix of pair correlation coefficients.

Let the entire set of data consist of a variable Y = (y1, y2, ..., yn) and m variables (factors) X, each of which contains n observations. The values of Y and X contained in the observed population are recorded in a table (Table 3.2.1).

Table 3.2.1. Observation table: rows are the observation numbers 1, ..., n; columns are the variables X1, ..., Xm (with values xij) and Y.

Based on the data contained in this table, the matrix of pair correlation coefficients R is calculated; it is symmetric about the main diagonal:

R = | 1    r12  ...  r1m |
    | r21  1    ...  r2m |
    | ...  ...  ...  ... |
    | rm1  rm2  ...  1   |   (3.2.6)
Analysis of the matrix of pair correlation coefficients is used when constructing multiple regression models.
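For illustration, such a matrix can be computed directly in Python (hypothetical random data; numpy assumed):

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(17, 4))       # n = 17 observations of m = 4 variables
R = np.corrcoef(data, rowvar=False)   # m x m matrix of pair correlations

# R is symmetric about the main diagonal, with ones on the diagonal
print(np.allclose(R, R.T), np.allclose(np.diag(R), 1.0))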

One correlation matrix cannot fully describe the dependencies between the quantities. For this reason, multivariate correlation analysis considers two further problems:

  • 1. Determination of the closeness of the relationship of one random variable with the totality of the other variables included in the analysis.
  • 2. Determination of the closeness of the connection between two quantities while fixing or excluding the influence of the other quantities.

These problems are solved using multiple and partial correlation coefficients, respectively.

The first problem (determining the closeness of the relationship of one random variable with the totality of the other variables included in the analysis) is solved using the sample multiple correlation coefficient, computed by the formula

R_j = √(1 − |R| / R_jj),

where |R| is the determinant of the matrix R [see formula (3.2.6)] and R_jj is the algebraic complement (cofactor) of the corresponding diagonal element of the same matrix R.

The squared multiple correlation coefficient R_j², usually called the sample multiple coefficient of determination, shows what proportion of the variation (random scatter) of the quantity under study Xj is explained by the variation of the remaining random variables X1, X2, ..., Xm.

The multiple correlation and determination coefficients are positive quantities taking values in the range from 0 to 1. As R² approaches unity, one can conclude that the relationship between the random variables is close, but not judge its direction. The multiple correlation coefficient can only increase (or stay the same) when additional variables are included in the model; excluding existing characteristics cannot increase it.
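A small Python sketch of the sample multiple correlation coefficient computed through determinants, R_j = √(1 − |R| / R_jj) (an illustration with hypothetical data; numpy assumed):

import numpy as np

def multiple_correlation(R: np.ndarray, j: int) -> float:
    # R_jj is the cofactor of the j-th diagonal element: the determinant of R
    # with the j-th row and column deleted (the sign is + on the diagonal)
    minor = np.delete(np.delete(R, j, axis=0), j, axis=1)
    return float(np.sqrt(1.0 - np.linalg.det(R) / np.linalg.det(minor)))

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
X[:, 0] += 0.8 * X[:, 1]              # make variable 0 depend on variable 1
R = np.corrcoef(X, rowvar=False)
print(multiple_correlation(R, 0))     # closeness of X0 to (X1, X2) jointly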

The significance of the coefficient of determination is checked by comparing the calculated value of Fisher's F-criterion

F_obs = (R² / (1 − R²)) · ((n − m − 1) / m)

with the tabular value F_table. The tabular value of the criterion (see Appendix 1) is determined by the given significance level α and the degrees of freedom ν1 = m and ν2 = n − m − 1. The coefficient R² is significantly different from zero if the inequality F_obs > F_table holds.

If the random variables under consideration are correlated with one another, the value of the pair correlation coefficient is partially affected by the influence of the other quantities. For this reason, there is a need to study the partial correlation between quantities with the influence of the other random variables (one or several) excluded.

The sample partial correlation coefficient is determined by the formula

r_jk = −R_jk / √(R_jj · R_kk),   (3.2.9)

where R_jk, R_jj, R_kk are the algebraic complements (cofactors) of the corresponding elements of the matrix R [see formula (3.2.6)].

The partial correlation coefficient, like the pair correlation coefficient, varies from -1 to +1.

For m = 3, expression (3.2.9) takes the form

r_12(3) = (r12 − r13 · r23) / √((1 − r13²)(1 − r23²)).

The coefficient r_12(3) is called the correlation coefficient between x1 and x2 for fixed x3. It is symmetric with respect to the primary indices 1, 2; its secondary index 3 refers to the fixed variable.
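Equivalently, partial correlations can be obtained from the inverse correlation matrix C = R⁻¹ as r_jk = −c_jk / √(c_jj · c_kk), which matches the cofactor form (3.2.9). A Python sketch (hypothetical matrix; numpy assumed):

import numpy as np

def partial_correlation(R: np.ndarray, j: int, k: int) -> float:
    C = np.linalg.inv(R)                        # inverse correlation matrix
    return float(-C[j, k] / np.sqrt(C[j, j] * C[k, k]))

R = np.array([[1.0, 0.7, 0.5],
              [0.7, 1.0, 0.6],
              [0.5, 0.6, 1.0]])
print(partial_correlation(R, 0, 1))   # r_12(3) ≈ 0.577: x3 held fixed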

Example 3.2.1. Calculation of pair, multiple, and partial correlation coefficients.

Table 3.2.2 provides information on the sales volumes and advertising costs of one company, as well as the consumer spending index, for a number of recent years.

  • 1. Construct a scatter diagram (correlation field) for the variables “sales volume” and “consumer spending index”.
  • 2. Determine the degree of influence of the consumer spending index on sales volume (calculate the pair correlation coefficient).
  • 3. Assess the significance of the calculated pair correlation coefficient.
  • 4. Construct a matrix of pairwise correlation coefficients for three variables.
  • 5. Find an estimate of the multiple correlation coefficient.
  • 6. Find estimates of partial correlation coefficients.

1. In our example, the scatter diagram has the form shown in Fig. 3.2.1. The elongation of the cloud of points along an inclined line allows us to assume an objective tendency toward a direct linear relationship between the values of the variables X2 (consumer spending index) and Y (sales volume).

Fig. 3.2.1.

2. Intermediate calculations for the correlation coefficient between the variables X2 (consumer spending index) and Y (sales volume) are given in Table 3.2.3.

The mean values of the random variables X2 and Y, the simplest indicators characterizing the sequences x1, x2, ..., x16 and y1, y2, ..., y16, are calculated using the formulas

x̄ = (1/16) Σ xi,   ȳ = (1/16) Σ yi.
Table 3.2.2. Sales volume Y, thousand rubles; consumer spending index X2 (source data; the values are not reproduced here).

Table 3.2.3. Intermediate calculations, with columns (xi − x̄), (yi − ȳ)(xi − x̄), (xi − x̄)², (yi − ȳ)².

The variance characterizes the degree of scatter of the values x1, x2, ..., x16:

Sx² = (1/(16 − 1)) Σ (xi − x̄)².

Let us now consider the solution to example 3.2.1 in Excel.

To calculate correlation in Excel, you can use the function =CORREL(), specifying the addresses of the two columns of numbers, as shown in Fig. 3.2.2. The answer is placed in D8 and equals 0.816.

Fig. 3.2.2.

(Note: the arguments of CORREL must be numbers, or names, arrays, or references containing numbers. If an argument that is an array or reference contains text, logical values, or empty cells, those values are ignored; cells containing zero values, however, are counted.

If array1 and array2 have different numbers of data points, CORREL returns the error value #N/A.

If array1 or array2 is empty, or if the standard deviation σ of its values is zero, CORREL returns the error value #DIV/0!.)

The critical value of Student's t-statistic can also be obtained using the STUDRASPOBR¹ function of the Excel package. As function arguments, you must specify the number of degrees of freedom, equal to n − 2 (in our example 16 − 2 = 14), and the significance level α (in our example α = 0.1) (Fig. 3.2.3). If the actual value of the t-statistic taken in absolute value is greater than the critical value, then with probability (1 − α) the correlation coefficient is significantly different from zero.


Fig. 3.2.3. The critical value of the t-statistic is 1.7613

Excel includes a set of data analysis tools (the so-called Analysis ToolPak) designed for solving various statistical problems. To calculate the matrix of pair correlation coefficients R, you should use the Correlation tool (Fig. 3.2.4) and set the analysis parameters in the corresponding dialog box. The answer will be placed on a new worksheet (Fig. 3.2.5).

¹ In Excel 2010, the name of the function STUDRASPOBR changed to STUDENT.OBR.2X (the Russian localizations of TINV and T.INV.2T).

Fig. 3.2.4.


Fig. 3.2.5.

  • The founders of correlation theory are considered to be the English statisticians F. Galton (1822-1911) and K. Pearson (1857-1936). The term "correlation" was borrowed from natural science and means "a relation, a correspondence." The idea of correlation as interdependence between random variables underlies the mathematical-statistical theory of correlation.

Task 2

1. Construct a matrix of pair correlation coefficients. Check for multicollinearity. Justify the selection of factors in the model.

2. Construct a multiple regression equation in linear form with selected factors.

3. Assess the statistical significance of the regression equation and its parameters using the Fisher and Student tests.

  • 4. Construct a regression equation with statistically significant factors. Assess the quality of the regression equation using the coefficient of determination R². Evaluate the accuracy of the constructed model.

5. Evaluate the forecast of production volume if the forecast values ​​of the factors are 75% of their maximum values.

Problem conditions (Option 21)

According to the data presented in Table 1 (n = 17), the dependence of the production volume Y (million rubles) on the following factors (variables) is studied:

X 1 – number of industrial production personnel, people.

X 2 – average annual cost of fixed assets, million rubles.

X 3 – depreciation of fixed assets, %

X 4 – power supply, kWh.

X 5 – technical equipment of one worker, million rubles.

X 6 – production of marketable products per worker, rub.

Table 1. Product release data

Y X 1 X 2 X 3 X 4 X 5 X 6
39,5 4,9 3,2
46,4 60,5 20,4
43,7 24,9 9,5
35,7 50,4 34,7
41,8 5,1 17,9
49,8 35,9 12,1
44,1 48,1 18,9
48,1 69,5 12,2
47,6 31,9 8,1
58,6 139,4 29,7
70,4 16,9 5,3
37,5 17,8 5,6
62,0 27,6 12,3
34,4 13,9 3,2
35,4 37,3 19,0
40,8 55,3 19,3
48,1 35,1 12,4


Construct a matrix of pair correlation coefficients. Check for multicollinearity. Justify the selection of factors in the model

Table 2 shows the matrix of pair correlation coefficients for all the variables involved. The matrix was obtained using the Correlation tool from the Data Analysis package in Excel.

Table 2. Matrix of pair correlation coefficients

Y X1 X2 X3 X4 X5 X6
Y
X1 0,995634
X2 0,996949 0,994947
X3 -0,25446 -0,27074 -0,26264
X4 0,12291 0,07251 0,107572 0,248622
X5 0,222946 0,166919 0,219914 -0,07573 0,671386
X6 0,067685 -0,00273 0,041955 -0,28755 0,366382 0,600899

Visual analysis of the matrix allows us to establish:

1) Y has fairly high pairwise correlations with the variables X1, X2 (>0.5) and low ones with the variables X3, X4, X5, X6 (<0.5);

2) the variables X1, X2 demonstrate fairly high pairwise correlations, which makes it necessary to check the factors for multicollinearity. Moreover, one of the conditions of the classical regression model is the assumption that the explanatory variables are independent.

To detect multicollinearity of the factors, we perform the Farrar-Glauber test on the factors X1, X2, X3, X4, X5, X6.

The Farrar-Glauber multicollinearity check includes several stages.

1) Checking the entire array of variables for multicollinearity.

One of the conditions of the classical regression model is the assumption of independence of the explanatory variables. To identify multicollinearity between factors, the matrix of interfactor correlations R is calculated using the Data Analysis Package (Table 3).

Table 3. Matrix of interfactor correlations R

X1 X2 X3 X4 X5 X6
X1 0,994947 -0,27074 0,07251 0,166919 -0,00273
X2 0,994947 -0,26264 0,107572 0,219914 0,041955
X3 -0,27074 -0,26264 0,248622 -0,07573 -0,28755
X4 0,07251 0,107572 0,248622 0,671386 0,366382
X5 0,166919 0,219914 -0,07573 0,671386 0,600899
X6 -0,00273 0,041955 -0,28755 0,366382 0,600899

There is a strong dependence (>0.5) between factors X1 and X2, X5 and X4, X6 and X5.

The determinant det(R) = 0.001488 is calculated using the MOPRED function (MDETERM in the English version of Excel). The determinant of the matrix R tends to zero, which allows us to assume general multicollinearity of the factors.

2) Checking for multicollinearity of each variable with other variables:

· Let's calculate the inverse matrix R⁻¹ using the Excel function MOBR (MINVERSE in the English version) (Table 4):

Table 4. Inverse matrix R⁻¹

X1 X2 X3 X4 X5 X6
X1 150,1209 -149,95 3,415228 -1,70527 6,775768 4,236465
X2 -149,95 150,9583 -3,00988 1,591549 -7,10952 -3,91954
X3 3,415228 -3,00988 1,541199 -0,76909 0,325241 0,665121
X4 -1,70527 1,591549 -0,76909 2,218969 -1,4854 -0,213
X5 6,775768 -7,10952 0,325241 -1,4854 2,943718 -0,81434
X6 4,236465 -3,91954 0,665121 -0,213 -0,81434 1,934647

· Calculation of the F-criteria from the diagonal elements of the inverse matrix R⁻¹, with n = 17 and k = 6 (Table 5).

Table 5. F-test values

F1 (X1) F2 (X2) F3 (X3) F4 (X4) F5 (X5) F6 (X6)
89,29396 89,79536 0,324071 0,729921 1,163903 0,559669

· The actual F-test values are compared with the tabular value F_table = 3.21 (FINV(0.05;6;10)) with ν1 = 6 and ν2 = n − k − 1 = 17 − 6 − 1 = 10 degrees of freedom at significance level α = 0.05, where k is the number of factors.

· The F-criterion values for the factors X1 and X2 are greater than the tabular value, which indicates the presence of multicollinearity between these factors. The factor X3 has the least effect on the overall multicollinearity of the factors.

3) Checking for multicollinearity of each pair of variables

· Let us calculate the partial correlation coefficients using the formula r_jk = −c_jk / √(c_jj · c_kk), where c_jk, c_jj, c_kk are the elements of the inverse matrix R⁻¹ (Table 6).

Table 6. Matrix of partial correlation coefficients

X1 X2 X3 X4 X5 X6
X1
X2 0,996086
X3 -0,22453 0,197329
X4 0,093432 -0,08696 0,415882
X5 -0,32232 0,337259 -0,1527 0,581191
X6 -0,24859 0,229354 -0,38519 0,102801 0,341239

· Calculation of the t-criteria by the formula t_jk = r_jk · √(n − k − 1) / √(1 − r_jk²) (Table 7), where

n is the number of observations = 17;

k is the number of factors = 6.

Table 7.t-tests for partial correlation coefficients

X1 X2 X3 X4 X5 X6
X1
X2 35,6355
X3 -0,72862 0,636526
X4 0,296756 -0,27604 1,446126
X5 -1,07674 1,13288 -0,4886 2,258495
X6 -0,81158 0,745143 -1,31991 0,326817 1,147999

t_table = STUDRASPOBR(0.05;10) = 2.23 (TINV in the English version of Excel).

The actual values ​​of the t-tests are compared with the table value with degrees of freedom n-k-1 = 17-6-1=10 and significance level α=0.05;

t21 > ttable

t54 > ttable

From tables 6 and 7 it is clear that two pairs of factors X1 and X2, X4 and X5 have a high statistically significant partial correlation, that is, they are multicollinear. In order to get rid of multicollinearity, you can exclude one of the variables of the collinear pair. In the pair X1 and X2 we leave X2, in the pair X4 and X5 we leave X5.

Thus, as a result of the Farrar-Glauber check, the following factors remain: X2, X3, X5, X6.
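For reference, all three stages of the check can be scripted outside Excel. The sketch below follows the standard Farrar-Glauber formulation (textbooks differ in the exact constants and degrees of freedom, so the values need not reproduce Tables 5-7 exactly); numpy and scipy are assumed:

import numpy as np
from scipy import stats

def farrar_glauber(X: np.ndarray, alpha: float = 0.05):
    n, k = X.shape
    R = np.corrcoef(X, rowvar=False)
    # 1) overall multicollinearity: chi-square test on det(R)
    chi2 = -(n - 1 - (2 * k + 5) / 6) * np.log(np.linalg.det(R))
    chi2_crit = stats.chi2.ppf(1 - alpha, df=k * (k - 1) / 2)
    # 2) each variable against all the others: F-tests from diag(R^-1)
    C = np.linalg.inv(R)
    F = (np.diag(C) - 1) * (n - k) / (k - 1)
    F_crit = stats.f.ppf(1 - alpha, dfn=k - 1, dfd=n - k)
    # 3) each pair: partial correlations and their t-tests
    d = np.sqrt(np.diag(C))
    P = -C / np.outer(d, d)           # matrix of partial correlations
    np.fill_diagonal(P, 0.0)          # the diagonal is not tested
    t = P * np.sqrt(n - k - 1) / np.sqrt(1 - P**2)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - k - 1)
    return (chi2, chi2_crit), (F, F_crit), (t, t_crit)

rng = np.random.default_rng(2)
X = rng.normal(size=(17, 6))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=17)    # a near-collinear pair
(chi2, chi2_c), (F, F_c), (t, t_c) = farrar_glauber(X)
print(F > F_c)   # flags the variables involved in multicollinearity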

When completing the correlation analysis procedures, it is advisable to look at the partial correlations of the selected factors with the result Y.

Let's build a matrix of paired correlation coefficients based on the data in Table 8.

Table 8. Product output data with selected factors X2, X3, X5, X6.

Observation No. Y X 2 X 3 X 5 X 6
39,5 3,2
46,4 20,4
43,7 9,5
35,7 34,7
41,8 17,9
49,8 12,1
44,1 18,9
48,1 12,2
47,6 8,1
58,6 29,7
70,4 5,3
37,5 5,6
12,3
34,4 3,2
35,4
40,8 19,3
48,1 12,4

The last column of Table 9 presents the t-test values ​​for the Y column.

Table 9. Matrix of partial correlation coefficients with the result Y

Y   X2   X3   X5   X6   t-criterion (t_table(0.05;11) = 2.200985)
Y 0,996949 -0,25446 0,222946 0,067685
X2 0,996949 -0,26264 0,219914 0,041955 44,31676
X3 -0,25446 -0,26264 -0,07573 -0,28755 0,916144
X5 0,222946 0,219914 -0,07573 0,600899 -0,88721
X6 0,067685 0,041955 -0,28755 0,600899 1,645749

From Table 9 it is clear that the variable Y has a high and at the same time statistically significant partial correlation with factor X2.


Y X 1 X 2 X 3 X 4 X 5 X 6
Y
X 1 0,519
X 2 -0,273 0,030
X 3 0,610 0,813 -0,116
X 4 -0,572 -0,013 -0,022 -0,091
X 5 0,297 0,043 -0,461 0,120 -0,359
X 6 0,118 -0,366 -0,061 -0,329 -0,100 -0,290

Analysis of the interfactor (between the "X's"!) correlation coefficients shows that only the correlation coefficient between the pair of factors X1-X3 exceeds 0.8 in absolute value (shown in bold). The factors X1 and X3 are thus recognized as collinear.

2. As shown in paragraph 1, the factors X1 and X3 are collinear, meaning that they effectively duplicate each other, and including them in the model simultaneously would lead to incorrect interpretation of the corresponding regression coefficients. The factor X3 has a larger correlation coefficient with the result Y in absolute value than the factor X1: r_y,x1 = 0.519; r_y,x3 = 0.610 (see the matrix above). This indicates a stronger influence of the factor X3 on the change in Y. The factor X1 is therefore excluded from consideration.

To construct the regression equation, the values of the variables used (Y, X2, X3, X4, X5, X6) are copied to a blank worksheet (Appendix 3). We build the regression equation using the "Data Analysis... Regression" add-in (menu "Tools" → "Data Analysis…" → "Regression"). The regression analysis panel with the fields filled in is shown in Fig. 2.

The results of the regression analysis are given in Appendix 4 and transferred to Table 2. The regression equation has the form (see "Coefficients" in Table 2):

Y = 487.5 − 0.0456·X2 + 0.1043·X3 − 0.0965·X4 + 2.528·X5 + 248.2·X6.

The regression equation is considered statistically significant, since the probability of its random formation in the form obtained is 8.80 × 10⁻⁶ (see "Significance F" in Table 2), which is far below the accepted significance level α = 0.05.

The probability of random formation of the coefficients for the factors X3, X4, X6 is below the accepted significance level α = 0.05 (see "P-Value" in Table 2), which indicates the statistical significance of these coefficients and the significant influence of these factors on the change in annual profit Y.

The probability of random formation of the coefficients for the factors X2 and X5 exceeds the accepted significance level α = 0.05 (see "P-Value" in Table 2), and these coefficients are not considered statistically significant.

Fig. 2. Regression analysis panel for the model Y(X2, X3, X4, X5, X6)

Table 2. Results of regression analysis of the model Y(X2, X3, X4, X5, X6)

Regression statistics
Multiple R	0,868
R-square	0,753
Adjusted R-square	0,694
Standard error	242,3
Observations	27

Analysis of variance
	df	SS	MS	F	Significance F
Regression	5	3749838,2	749967,6	12,78	8.80E-06
Residual	21	1232466,8	58688,9
Total	26	4982305,0

Regression equation
	Coefficients	Standard error	t-statistic	P-Value
Intercept	487,5	641,4	0,760	0,456
X2	-0,0456	0,0373	-1,224	0,235
X3	0,1043	0,0194	5,375	0,00002
X4	-0,0965	0,0263	-3,674	0,001
X5	2,528	6,323	0,400	0,693
X6	248,2	113,0	2,197	0,039
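An equivalent regression table can be produced outside Excel; a Python sketch with statsmodels on hypothetical data (the column names merely mirror the factors above):

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(3)
X = pd.DataFrame(rng.normal(size=(27, 5)), columns=["X2", "X3", "X4", "X5", "X6"])
y = 500 + 0.1 * X["X3"] - 0.1 * X["X4"] + rng.normal(size=27)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())   # R-squared, the F-statistic with its p-value
                         # ("Significance F"), coefficients, t-statistics,
                         # and p-values, as in Tables 2 and 3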

3. Based on the results of checking the statistical significance of the regression equation coefficients carried out in the previous paragraph, we build a new regression model containing only the informative factors, which include:

· factors whose coefficients are statistically significant;

· factors whose coefficient t-statistics exceed one in absolute value (in other words, the absolute value of the coefficient is greater than its standard error).

The first group includes the factors X3, X4, X6; the second, the factor X2. The factor X5 is excluded from consideration as uninformative, and the final regression model will contain the factors X2, X3, X4, X6.

To build the regression equation, we copy the values of the variables used to a blank worksheet (Appendix 5) and carry out the regression analysis (Fig. 3). Its results are given in Appendix 6 and transferred to Table 3. The regression equation is (see "Coefficients" in Table 3):

Y = 712.2 − 0.0541·X2 + 0.1032·X3 − 0.1017·X4 + 227.5·X6.

Fig. 3. Regression analysis panel for the model Y(X2, X3, X4, X6)

Table 3. Results of regression analysis of the model Y(X2, X3, X4, X6)

Regression statistics
Multiple R	0,866
R-square	0,751
Adjusted R-square	0,705
Standard error	237,6
Observations	27

Analysis of variance
	df	SS	MS	F	Significance F
Regression	4	3740456,2	935114,1	16,57	2.14E-06
Residual	22	1241848,7	56447,7
Total	26	4982305,0

Regression equation
	Coefficients	Standard error	t-statistic	P-Value
Intercept	712,2	303,0	2,351	0,028
X2	-0,0541	0,0300	-1,806	0,085
X3	0,1032	0,0188	5,476	0,00002
X4	-0,1017	0,0223	-4,560	0,00015
X6	227,5	98,5	2,310	0,031

The regression equation is statistically significant: the probability of its random formation is below the accepted significance level α = 0.05 (see "Significance F" in Table 3).

The coefficients for the factors X3, X4, X6 are also considered statistically significant: the probability of their random formation is below the accepted significance level α = 0.05 (see "P-Value" in Table 3). This indicates a significant impact of the annual insurance premiums X3, the annual amount of insurance payments X4, and the form of ownership X6 on the change in annual profit Y.

The coefficient for the factor X2 (annual size of insurance reserves) is not statistically significant. However, this factor can still be considered informative, since the t-statistic of its coefficient exceeds one in absolute value, although further conclusions regarding the factor X2 should be treated with some caution.

4. Let us evaluate the quality and accuracy of the last regression equation using some of the statistical characteristics obtained in the regression analysis (see "Regression statistics" in Table 3):

the multiple coefficient of determination R² = 0.751 shows that the regression model explains 75.1% of the variation in annual profit Y, this variation being due to changes in the factors included in the regression model, X2, X3, X4, and X6;

the standard error of regression S = 237.6 thousand rubles shows that the values of annual profit Y predicted by the regression equation differ from the actual values by 237.6 thousand rubles on average.

The average relative approximation error is determined by the approximate formula E_rel ≈ S / ȳ · 100%, where ȳ (thousand rubles) is the average annual profit (determined using the built-in AVERAGE function; Appendix 1).

E_rel shows that the values of annual profit Y predicted by the regression equation differ from the actual values by 26.7% on average. The model has unsatisfactory accuracy (accuracy is rated as high, good, satisfactory, or unsatisfactory according to the magnitude of E_rel).

5. For the economic interpretation of the coefficients of the regression equation, we tabulate the average values and standard deviations of the variables in the source data (Table 4). The average values were determined using the built-in AVERAGE function, the standard deviations using the built-in STDEV function (see Appendix 1).

Data for 2011 are provided for the territories of the Southern Federal District of the Russian Federation

Territories of the Federal District

Gross regional product, billion rubles, Y

Investments in fixed assets, billion rubles, X1

1. Rep. Adygea

2. Rep. Dagestan

3. Rep. Ingushetia

4. Kabardino-Balkarian Republic

5. Rep. Kalmykia

6. Karachay-Cherkess Republic

7. Rep. North Ossetia Alania

8. Krasnodar region)

9. Stavropol region

10. Astrakhan region.

11. Volgograd region.

12. Rostov region.

  • 1. Calculate the matrix of pair correlation coefficients; evaluate the statistical significance of the correlation coefficients.
  • 2. Construct a field of correlation between the effective characteristic and the factor most closely related to it.
  • 3. Calculate the parameters of linear pair regression for each factor X.
  • 4. Assess the quality of each model through the coefficient of determination, average error of approximation and Fisher's F test. Choose the best model.

  • 5. Predict the average value of the indicator for the chosen model if the forecast value of the factor will be 80% of its maximum value. Present graphically: actual and model values, forecast points.

  • 6. Using step-by-step multiple regression (exclusion method or inclusion method), build a model of apartment price formation due to significant factors. Give an economic interpretation of the regression model coefficients.
  • 7. Evaluate the quality of the constructed model. Has the quality of the model improved compared to the single-factor model? Assess the influence of significant factors on the result using the elasticity, beta, and delta coefficients.

When solving this problem, we will carry out the calculations and construct the graphs and diagrams using the Excel Data Analysis add-in.

1. Calculate the matrix of pair correlation coefficients and evaluate the statistical significance of the correlation coefficients

In the Correlation dialog box, in the Input Range field, enter the range of cells containing the source data. Since we have also selected the column headings, we check the Labels in First Row checkbox.

We got the following results:

Table 1.1 Matrix of pair correlation coefficients

Analysis of the matrix of pairwise correlation coefficients shows that the dependent variable Y, i.e., gross regional product, has the closest relationship with X1 (investment in fixed capital): the correlation coefficient is 0.936, indicating a very strong direct linear relationship (the corresponding coefficient of determination, 0.936² ≈ 0.876, means that about 87.6% of the variation of Y is explained by X1).

We will determine the statistical significance of the correlation coefficients using Student's t-test, comparing the calculated values with the tabular value.

Let's calculate the tabular value using the STUDRASPOBR (TINV) function.

t_table = 0.129 with a confidence level of 0.9 and (n − 2) degrees of freedom.

Factor X1 is statistically significant.

2. Let’s construct a field of correlation between the effective attribute (gross regional product) and the factor most closely related to it (investment in fixed capital)

To do this, we will use the Excel scatter plot tool.

As a result, we obtain the correlation field of the gross regional product, billion rubles, and investments in fixed assets, billion rubles (Figure 1.1).

Figure 1.1
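A correlation field like Figure 1.1 can be drawn, for example, with matplotlib (hypothetical data, since the source table values are not reproduced here):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
x1 = rng.uniform(1, 53, size=12)               # investments, billion rubles
y = 2.1 * x1 + rng.normal(scale=8, size=12)    # GRP with a linear trend

plt.scatter(x1, y)
plt.xlabel("Investments in fixed assets, billion rubles (X1)")
plt.ylabel("Gross regional product, billion rubles (Y)")
plt.title("Correlation field")
plt.show()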

3. Calculate the parameters of linear pair regression for each factor X

To calculate the parameters of linear pairwise regression, we will use the Regression tool included in the Data Analysis setting.

In the Regression dialog box, in the Input Y Range field, enter the address of the range of cells representing the dependent variable. In the Input X Range field, enter the address of the range containing the values of the independent variable. Let us calculate the parameters of paired regression for the factor X1.

For X1 we received the following data presented in Table 1.2:

Table 1.2

The regression equation of the dependence of the gross regional product on investment in fixed capital has the form:

4. Let's evaluate the quality of each model through the coefficient of determination, the average error of approximation, and Fisher's F-test, and determine which model is best.

We obtained the coefficient of determination, the average error of approximation, as a result of the calculations carried out in paragraph 3. The data obtained are presented in the following tables:

X1 data:

Table 1.3a

Table 1.4b

A) The coefficient of determination determines what proportion of the variation of trait Y is taken into account in the model and is due to the influence of factor X on it. The greater the value of the coefficient of determination, the closer the connection between the traits in the constructed mathematical model.

In Excel output, the coefficient of determination is labeled R-square.

Based on this criterion, the most adequate model is the regression equation of the dependence of the gross regional product on investment in fixed capital (X1).

B) We calculate the average approximation error; the numerator of the formula is the sum of the squared deviations of the calculated values from the actual ones, which in the output tables is located in the SS column, Residual row.

We calculate the average value of Y (gross regional product) in Excel using the AVERAGE function: ȳ = 24.18182 billion rubles.

When carrying out economic calculations, a model is considered sufficiently accurate if the average error of approximation is less than 5%; the model is considered acceptable if the average error of approximation is less than 15%.

According to this criterion as well, the most adequate is the mathematical model of the regression equation of the dependence of the gross regional product on investment in fixed capital (X1).

C) The F-test is used to test the significance of the regression model. To do this, the calculated value is compared with the critical (tabular) value of Fisher's F-test.

The calculated values are given in Table 1.4b (indicated by the letter F).

We calculate the tabular value of Fisher's F-test in Excel using the FINV function, taking the probability equal to 0.05: F_table = 4.75.

The calculated value of Fisher's F-test for each factor is compared with the tabular value:

71.02 > 4.75, so the model is adequate by this criterion.

Having analyzed the data according to all three criteria, we can conclude that the best mathematical model is the one relating the gross regional product to investment in fixed capital; it is described by the linear equation above.

5. For the selected model of the dependence of the gross regional product on investment in fixed capital, we will predict the average value of the indicator at the given significance level if the forecast value of the factor is 80% of its maximum value, and present graphically the actual and model values and the forecast points.

Let's calculate the predicted value of X; according to the condition, it will be 80% of the maximum value.

Let's calculate X max in Excel using the MAX function.

0,8 *52,8 = 42,24

To obtain predictive estimates of the dependent variable, we substitute the obtained value of the independent variable into the linear equation:

5.07+2.14*42.24 = 304.55 billion rubles.

Let us determine the confidence interval of the forecast.

To calculate the confidence interval for the forecast value, we calculate the deviation from the regression line. For a paired regression model, the deviation value is the standard error of the regression from Table 1.5a (the denominator of the estimate is n − 2, since one degree of freedom is used for the mean).

To calculate the t-coefficient, we use the Excel function STUDRASPOBR (TINV), taking the probability equal to 0.1 and the number of degrees of freedom 38.

We calculate the value using Excel and get 12294.


Let's determine the upper and lower boundaries of the interval.

  • 304,55+27,472= 332,022
  • 304,55-27,472= 277,078

Thus, the forecast value of 304.55 billion rubles will lie between the lower limit of 277.078 and the upper limit of 332.022 billion rubles.
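The forecast and its interval can also be computed programmatically. A sketch for paired regression under the usual formula ŷ0 ± t_crit · S · √(1 + 1/n + (x0 − x̄)² / Σ(xi − x̄)²), with hypothetical data (numpy and scipy assumed):

import numpy as np
from scipy import stats

x = np.array([10.0, 15.0, 20.0, 30.0, 40.0, 52.8])   # hypothetical factor
y = np.array([25.0, 40.0, 50.0, 70.0, 90.0, 118.0])  # hypothetical result

n = len(x)
b, a = np.polyfit(x, y, 1)                    # slope and intercept
resid = y - (a + b * x)
S = np.sqrt(np.sum(resid**2) / (n - 2))       # standard error of regression

x0 = 0.8 * x.max()                            # forecast point: 80% of maximum
y0 = a + b * x0
Sxx = np.sum((x - x.mean())**2)
half = stats.t.ppf(0.95, n - 2) * S * np.sqrt(1 + 1/n + (x0 - x.mean())**2 / Sxx)
print(y0 - half, y0, y0 + half)               # lower bound, forecast, upper bound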

Actual and model values, forecast points are presented graphically in Figure 1.2.


Figure 1.2

6. Using step-by-step multiple regression (elimination method), we will build a model of the formation of the gross regional product from significant factors.

To build the multiple regression, we use the Regression function of Excel, including all the factors. As a result, we obtain the output tables, from which we need the Student's t-test values.

Table 1.8a

Table 1.8b

Table 1.8c.

We obtain a model of the form:

Since F_table < F_obs (4.75 < 71.024), the regression equation should be recognized as adequate.

Let's choose the smallest absolute value of Student's t-test; it is equal to 8.427. We compare it with the tabular value, which we calculate in Excel at significance level 0.10 with n − m − 1 = 12 − 4 = 8 degrees of freedom: t_table = 1.8595.

Since 8.427 > 1.8595, the model should be considered adequate.

7. To assess the significant factor of the resulting mathematical model, we calculate the elasticity coefficient, the beta coefficient, and the delta coefficient.

The elasticity coefficient shows by what percentage the effective attribute changes when the factor attribute changes by 1%:

E = b · x̄ / ȳ = 2.137 · (10.69 / 24.182) = 0.94%.

That is, with a 1% increase in investment in fixed capital, the gross regional product increases on average by 0.94%.

The beta coefficient shows by what part of its standard deviation the average value of the dependent variable changes when the independent variable changes by one of its standard deviations:

β = b · Sx / Sy = 2.137 · (14.736 / 33.632) = 0.936.

Standard deviation data is taken from tables obtained using the Descriptive Statistics tool.

Table 1.11 Descriptive statistics (Y)

Table 1.12 Descriptive statistics (X4)

The delta coefficient determines the share of the factor's influence in the total influence of all factors:

Δ = r_yx · β / R².

To calculate the pair correlation coefficients, we compute the matrix of pair correlation coefficients in Excel using the Correlation tool in the Data Analysis add-in.

Table 1.14

Δ = (0.93633 · 0.93626) / 0.87 = 1.00.

Conclusion: from the calculations obtained, we can conclude that the effective attribute Y (gross regional product) depends heavily on the factor X1 (investment in fixed capital): its share of the explained influence is 100%.
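A short Python check of the three coefficients using the numbers quoted above (the formulas are the ones reconstructed in this section: E = b·x̄/ȳ, β = b·Sx/Sy, Δ = r_yx·β/R²):

b      = 2.137     # regression coefficient of the factor
x_mean = 10.69     # mean of the factor
y_mean = 24.182    # mean of Y
s_x    = 14.736    # standard deviation of the factor
s_y    = 33.632    # standard deviation of Y
r_yx   = 0.93633   # pair correlation of the factor with Y
R2     = 0.87      # coefficient of determination of the model

E     = b * x_mean / y_mean   # % change of Y per 1% change of the factor
beta  = b * s_x / s_y         # shift of Y in standard deviations
delta = r_yx * beta / R2      # factor's share in the explained influence
print(E, beta, delta)         # about 0.94, 0.936, 1.00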

y x (1) x (2) x (3) x (4) x (5)
y 1.00 0.43 0.37 0.40 0.58 0.33
x (1) 0.43 1.00 0.85 0.98 0.11 0.34
x (2) 0.37 0.85 1.00 0.88 0.03 0.46
x (3) 0.40 0.98 0.88 1.00 0.03 0.28
x (4) 0.58 0.11 0.03 0.03 1.00 0.57
x (5) 0.33 0.34 0.46 0.28 0.57 1.00

Analysis of the matrix of paired correlation coefficients shows that the effective indicator is most closely related to the indicator x(4), the amount of fertilizer consumed per 1 hectare (r(y, x(4)) = 0.58).

At the same time, the connection between the argument attributes is quite close: there is a practically functional relationship between the number of wheeled tractors (x(1)) and the number of surface tillage tools (x(3)): r = 0.98.

The presence of multicollinearity is also indicated by the correlation coefficients r(x(1), x(2)) = 0.85 and r(x(2), x(3)) = 0.88. Considering the close relationship between the indicators x(1), x(2), and x(3), only one of them can be included in the yield regression model.

To demonstrate the negative impact of multicollinearity, consider a regression model of yield that includes all the input indicators. The corrected estimates of the standard deviations of the coefficient estimates are indicated in parentheses under the regression equation, together with its adequacy measures: the multiple coefficient of determination, the corrected estimate of the residual variance, the average relative error of approximation, and the calculated value of the criterion F_obs = 121.

The regression equation is significant, since F_obs = 121 > F_cr = 2.85, found from the F-distribution table at α = 0.05, ν1 = 6, and ν2 = 14.

It follows from this that θ ≠ 0, i.e., at least one of the coefficients of the equation θj (j = 0, 1, 2, ..., 5) is nonzero.

To test the hypothesis of the significance of the individual regression coefficients, H0: θj = 0, where j = 1, 2, 3, 4, 5, the critical value t_cr = 2.14, found from the t-distribution table at significance level α = 0.05 and ν = 14 degrees of freedom, is compared with the calculated values. It follows from the equation that the regression coefficient is statistically significant only for x(4), since |t4| = 2.90 > t_cr = 2.14.



The negative signs of the regression coefficients for x(1) and x(5) do not lend themselves to economic interpretation: from the negative values of the coefficients it would follow that increasing the saturation of agriculture with wheeled tractors (x(1)) and plant protection products (x(5)) has a negative effect on yield. Therefore, the resulting regression equation is unacceptable.

To obtain a regression equation with significant coefficients, we use a step-by-step regression analysis algorithm. Initially, we use a step-by-step algorithm with the elimination of variables.

Let's exclude from the model the variable x(1), which corresponds to the minimum absolute value |t1| = 0.01. For the remaining variables, we again construct the regression equation:

The resulting equation is significant, since F_obs = 155 > F_cr = 2.90, found at significance level α = 0.05 with ν1 = 5 and ν2 = 15 degrees of freedom from the F-distribution table, i.e., the vector θ ≠ 0. However, only the regression coefficient for x(4) is significant. The calculated values |tj| for the other coefficients are less than t_cr = 2.131, found from the t-distribution table at α = 0.05 and ν = 15.

Excluding from the model the variable x(3), which corresponds to the minimum value t3 = 0.35, we obtain the regression equation:

(2.9)

In the resulting equation, the coefficient for x(5) is not significant. Excluding x(5), we obtain the regression equation:

(2.10)

We obtained a significant regression equation with significant and interpretable coefficients.
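The elimination algorithm used above can be sketched in Python with statsmodels (hypothetical data; in practice t_crit is taken from the t-table, as in the text):

import numpy as np
import pandas as pd
import statsmodels.api as sm

def backward_elimination(y, X: pd.DataFrame, t_crit: float) -> pd.DataFrame:
    cols = list(X.columns)
    while cols:
        fit = sm.OLS(y, sm.add_constant(X[cols])).fit()
        t = fit.tvalues.drop("const").abs()
        if t.min() >= t_crit:
            break                      # every remaining factor is significant
        cols.remove(t.idxmin())        # drop the least significant factor
    return X[cols]

rng = np.random.default_rng(5)
X = pd.DataFrame(rng.normal(size=(20, 5)), columns=[f"x{j}" for j in range(1, 6)])
y = 3 + 2.0 * X["x4"] + 0.5 * X["x1"] + rng.normal(size=20)
print(backward_elimination(y, X, t_crit=2.13).columns.tolist())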

However, the resulting equation is not the only “good” and not the “best” yield model in our example.

Let's show that in the multicollinearity condition, a stepwise algorithm with the inclusion of variables is more efficient. The first step in the yield model y variable included x(4) , which has the highest correlation coefficient with y, explained by the variable - r(y,x(4))=0.58. In the second step, including the equation along with x(4) variables x(1) or x(3), we will obtain models that, for economic reasons and statistical characteristics, exceed (2.10):

(2.11)

(2.12)

Including any of the three remaining variables in the equation worsens its properties. See, for example, equation (2.9).

Thus, we have three “good” yield models, from which we need to choose one for economic and statistical reasons.

By statistical criteria, model (2.11) is the most adequate: it has the minimum residual variance estimate (2.26) and average relative approximation error, and the largest values of the coefficient of determination and F_obs = 273.

Model (2.12) has slightly worse adequacy indicators, followed by model (2.10).

We will now choose the best of models (2.11) and (2.12). These models differ from each other in terms of variables x(1) and x(3) . However, in yield models the variable x(1) (number of wheeled tractors per 100 ha) is more preferable than variable x(3) (number of surface tillage implements per 100 ha), which is to some extent secondary (or derived from x (1)).

In this regard, for economic reasons, preference should be given to model (2.12). Thus, after applying the stepwise regression algorithm with inclusion of variables, and taking into account that only one of the three related variables (x(1), x(2), or x(3)) should enter the equation, we choose the final regression equation:

The equation is significant at α = 0.05, since F_obs = 266 > F_cr = 3.20, found from the F-distribution table at α = 0.05, ν1 = 3, and ν2 = 17. All the regression coefficients in the equation are also significant, with |tj| > t_cr(α = 0.05; ν = 17) = 2.11. The regression coefficient θ1 should be considered significant (θ1 ≠ 0) for economic reasons, even though t1 = 2.09 is only slightly less than t_cr = 2.11.

From the regression equation it follows that an increase of one in the number of tractors per 100 hectares of arable land (with x(4) fixed) leads to an increase in grain yield of 0.345 c/ha on average.

An approximate calculation of the elasticity coefficients, e1 ≈ 0.068 and e2 ≈ 0.161, shows that when the indicators x(1) and x(4) increase by 1%, grain yield increases on average by 0.068% and 0.161%, respectively.

The multiple coefficient of determination indicates that only 46.9% of the yield variation is explained by the indicators included in the model (x(1) and x(4)), that is, by the saturation of crop production with tractors and fertilizers. The rest of the variation is due to the action of unaccounted factors (x(2), x(3), x(5), weather conditions, etc.). The average relative approximation error, like the value of the residual variance, characterizes the adequacy of the model. When interpreting the regression equation, the values of the relative approximation errors δi are of interest. Recall that the model value of the effective indicator characterizes the average yield for the totality of the regions under consideration, provided that the values of the explanatory variables x(1) and x(4) are fixed at the given levels, namely x(1) = xi(1) and x(4) = xi(4). Then, by the values of δi, one can compare regions by yield: regions with δi > 0 have above-average yields, and those with δi < 0, below-average.

In our example, in terms of yield, crop production is most effective in the region corresponding to δ7 = 28%, where the yield is 28% above the model average, and least effective in the region with δ20 = -27.3%.


Tasks and exercises

2.1. From the general population (y, x(1), ..., x(p)), where y has a normal distribution with conditional mathematical expectation and variance σ², a random sample of size n is taken, and let (yi, xi(1), ..., xi(p)) be the result of the i-th observation (i = 1, 2, ..., n). Determine: a) the mathematical expectation of the least squares estimate of the vector θ; b) the covariance matrix of the least squares estimate of the vector θ; c) the mathematical expectation of the estimate.

2.2. Under the conditions of problem 2.1, find the mathematical expectation of the sum of squared deviations due to regression, i.e., E(Q_R), where Q_R = Σ (ŷi − ȳ)².

2.3. Under the conditions of problem 2.1, determine the mathematical expectation of the sum of squared deviations due to residual variation about the regression line, i.e., E(Q_res), where Q_res = Σ (yi − ŷi)².

2.4. Prove that when the hypothesis H0: θ = 0 holds, the statistic F = [Q_R / (p + 1)] / [Q_res / (n − p − 1)] has an F-distribution with ν1 = p + 1 and ν2 = n − p − 1 degrees of freedom.

2.5. Prove that when the hypothesis H0: θj = 0 holds, the statistic tj = θ̂j / s(θ̂j) has a t-distribution with ν = n − p − 1 degrees of freedom.

2.6. Based on the data (Table 2.3) on the dependence of the shrinkage of fodder bread (y) on storage duration (x), find a point estimate of the conditional expectation under the assumption that the general regression equation is linear.

Table 2.3.

Required: a) find estimates of the residual variance σ² under the assumed form of the general regression equation; b) check at α = 0.05 the significance of the regression equation, i.e., the hypothesis H0: θ = 0; c) with reliability γ = 0.9, determine interval estimates of the parameters θ0, θ1; d) with reliability γ = 0.95, determine the interval estimate of the conditional mathematical expectation at x0 = 6; e) determine at γ = 0.95 the confidence interval of the prediction at the point x = 12.

2.7. Based on the data on the dynamics of the growth rate of stock prices over 5 months, given in Table 2.4,

Table 2.4.

months ( x)
y (%)

and the assumption about the form of the general regression equation, it is required: a) to determine estimates of the parameters of the regression equation and of the residual variance σ²; b) to check at α = 0.01 the significance of the regression coefficient, i.e., the hypothesis H0: θ1 = 0; c) with reliability γ = 0.95, to find interval estimates of the parameters θ0 and θ1; d) with reliability γ = 0.9, to establish an interval estimate of the conditional mathematical expectation at x0 = 4; e) to determine at γ = 0.9 the confidence interval of the prediction at the point x = 5.

2.8. The results of the study of the dynamics of weight gain of young animals are given in Table 2.5.

Table 2.5.

Assuming that the general regression equation is linear, it is required: a) to determine estimates of the parameters of the regression equation and of the residual variance σ²; b) to check at α = 0.05 the significance of the regression equation, i.e., the hypothesis H0: θ = 0; c) with reliability γ = 0.8, to find interval estimates of the parameters θ0 and θ1; d) with reliability γ = 0.98, to determine and compare interval estimates of the conditional mathematical expectation at x0 = 3 and x1 = 6; e) to determine at γ = 0.98 the confidence interval of the prediction at the point x = 8.

2.9. The cost (y) of one copy of a book, depending on the print run (x) (thousand copies), is characterized by data collected by the publisher (Table 2.6). Determine the least squares estimates of the parameters of a hyperbolic regression equation; with reliability γ = 0.9, construct confidence intervals for the parameters θ0 and θ1, as well as for the conditional expectation at x = 10.

Table 2.6.

2.10. Determine the estimates of the parameters of the regression equation of the given form, test the hypothesis H0: θ1 = 0 at α = 0.05, and construct confidence intervals with reliability γ = 0.9 for the parameters θ0 and θ1 and for the conditional mathematical expectation at x = 20.

2.11. Table 2.8 presents data on the growth rates (%) of the following macroeconomic indicators for n = 10 developed countries of the world for 1992: GNP, x(1); industrial production, x(2); price index, x(3).

Table 2.8.

It is required: a) to determine estimates of the parameters of the regression equation and the estimate of the residual variance; b) to check at α = 0.05 the significance of the regression coefficient, i.e., H0: θ1 = 0; c) with reliability γ = 0.9, to find interval estimates of θ0 and θ1; d) to find at γ = 0.95 the confidence interval at the point x0 = xi, where i = 5; e) to compare the statistical characteristics of regression equations 1, 2, and 3.

2.12. Solve problem 2.11 taking x(1) as the dependent variable (y) and x(3) as the explanatory variable (x).



APPENDICES


Appendix 1. Options for tasks for independent computer research.
