Build a matrix of paired correlation coefficients. Check for multicollinearity

Economic data are quantitative characteristics of economic objects or processes. They are formed under the influence of many factors, not all of which are amenable to external control. Uncontrollable factors can take random values from some set and thereby make the data they determine random. One of the main tasks of economic research is the analysis of dependencies between variables.

When considering dependencies between features, two types of relationship must be distinguished first of all:

  • functional relationships are characterized by a complete correspondence between the change in the factor attribute and the change in the resulting value: each value of the factor attribute corresponds to well-defined values of the resulting attribute. Such a relationship is expressed by a formula, and a functional dependence can link the resulting attribute to one or several factor attributes. For example, under time-based pay the amount of wages depends on the number of hours worked;
  • correlation relationships lack a complete correspondence between the changes of the two attributes; the influence of individual factors appears only on average, in mass observation of actual data. The simultaneous impact of a large number of diverse factors on the studied attribute means that one and the same value of the factor attribute corresponds to a whole distribution of values of the resulting attribute, since in each specific case the other factor attributes may change the strength and direction of their impact.

It should be borne in mind that with a functional relationship between the attributes it is possible, knowing the value of the factor attribute, to determine the value of the result exactly. With a correlation dependence, one can establish only the tendency of the resulting attribute to change as the value of the factor attribute changes.

When studying relationships between attributes, they are classified by direction, form, and number of factors:

  • by direction, relationships are divided into direct and inverse. With a direct relationship, the resulting attribute changes in the same direction as the factor attribute; with an inverse relationship, it changes in the opposite direction. For example, the higher the qualification of a worker, the higher the productivity of his labor (a direct relationship); the higher the productivity of labor, the lower the unit cost of production (an inverse relationship);
  • by form (type of function), relationships are divided into linear (rectilinear) and non-linear (curvilinear). A linear relationship is displayed as a straight line, a non-linear one as a curve (parabola, hyperbola, etc.). With a linear relationship, an increase in the value of the factor attribute produces a uniform increase (or decrease) in the value of the resulting attribute;
  • by the number of factors acting on the resulting attribute, relationships are divided into single-factor (paired) and multifactor.

The study of how the variation of an attribute depends on the surrounding conditions is the subject of correlation theory.

When conducting a correlation analysis, the entire dataset is treated as a set of variables (factors), each of which contains n observations.

When studying the relationship between two factors, they are usually denoted X = (x₁, x₂, ..., xₙ) and Y = (y₁, y₂, ..., yₙ).

Covariance is a statistical measure of the interaction of two variables. For example, a positive covariance of the returns on two securities indicates that the returns tend to move in the same direction.

The covariance between two variables X and Y is calculated as follows:

$$\operatorname{cov}(X,Y)=\frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}), \qquad (3.2.1)$$

where $x_i$, $y_i$ are the actual values of the variables X and Y, and $\bar{x}$, $\bar{y}$ are their means.

If the random variables X and Y are independent, the theoretical covariance is zero.

Covariance depends on the units in which X and Y are measured; it is an unnormalized quantity. Therefore, to measure the strength of the relationship between two variables, another statistic is used, called the correlation coefficient.

For two variables X and Y the pair correlation coefficient is defined as follows:

$$r_{xy}=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\,\sum_{i=1}^{n}(y_i-\bar{y})^2}}=\frac{\operatorname{cov}(X,Y)}{s_x s_y}, \qquad (3.2.2)$$

where $s_x^2$, $s_y^2$ are the variance estimates of X and Y. These estimates characterize the degree of dispersion of the values $x_1, x_2, \ldots, x_n$ ($y_1, y_2, \ldots, y_n$) around their mean $\bar{x}$ ($\bar{y}$, respectively), that is, the variability of these variables over the set of observations.

The variance (variance estimate) is determined by the formula

$$s_x^2=\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2.$$

In the general case, to obtain an unbiased estimate of the variance, the sum of squares should be divided by the number of degrees of freedom of the estimate (n − p), where n is the sample size and p is the number of constraints imposed on the sample. Since the sample has already been used once to determine the mean $\bar{x}$, the number of imposed constraints is one (p = 1), and the number of degrees of freedom of the estimate (i.e., the number of independent sample elements) equals (n − 1).

It is more natural to measure the degree of spread of a variable's values in the same units in which the variable itself is measured. This is done by an indicator called the standard deviation (standard error) of the variable X (variable Y), determined by the relation

$$s_x=\sqrt{s_x^2}.$$

The terms in the numerator of formula (3.2.1) express the interaction of the two variables and determine the sign of the correlation (positive or negative). If, for example, there is a strong positive relationship between the variables (an increase in one when the other increases), each term will be a positive number. Similarly, if there is a strong negative relationship, all terms in the numerator will be negative, resulting in a negative correlation value.

The denominator of the expression for the pair correlation coefficient [see formula (3.2.2)] simply normalizes the numerator so that the correlation coefficient is an easily interpreted, dimensionless number taking values from −1 to +1.

The numerator of the expression for the correlation coefficient, which is difficult to interpret because of its unusual units, is the covariance of X and Y. Although covariance is sometimes used as a characteristic in its own right (for example, in finance to describe the joint movement of stock prices on two exchanges), the correlation coefficient is more convenient. Correlation and covariance carry essentially the same information, but correlation presents it in a more usable form.
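As a compact illustration (a sketch not taken from the original text; the data are made up), the following Python code computes the covariance (3.2.1) and the pair correlation coefficient (3.2.2) directly from their definitions:

```python
# Minimal sketch: sample covariance and pair correlation for two series.
# The example numbers are hypothetical.
import math

x = [2.0, 3.0, 5.0, 7.0, 9.0]
y = [1.0, 2.0, 4.0, 8.0, 9.0]
n = len(x)

x_mean = sum(x) / n
y_mean = sum(y) / n

# cov(X, Y) = (1/n) * sum of (x_i - x_mean) * (y_i - y_mean)
cov_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / n

# r = cov(X, Y) / (s_x * s_y); the same 1/n normalization is used in the
# numerator and denominator, so the normalizing factors cancel.
s_x = math.sqrt(sum((xi - x_mean) ** 2 for xi in x) / n)
s_y = math.sqrt(sum((yi - y_mean) ** 2 for yi in y) / n)
r_xy = cov_xy / (s_x * s_y)

print(f"cov = {cov_xy:.4f}, r = {r_xy:.4f}")  # r is dimensionless, in [-1, 1]
```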

For a qualitative assessment of the correlation coefficient, various scales are used, most often the Chaddock scale. Depending on the value of the correlation coefficient, the relationship can have one of the following estimates:

  • 0.1-0.3 - weak;
  • 0.3-0.5 - noticeable;
  • 0.5-0.7 - moderate;
  • 0.7-0.9 - high;
  • 0.9-1.0 - very high.

The assessment of the degree of closeness of the connection using the correlation coefficient is carried out, as a rule, on the basis of more or less limited information about the phenomenon under study. In this regard, there is a need to assess the significance of the linear correlation coefficient, which makes it possible to extend the conclusions based on the results of the sample to the general population.

For small sample sizes, the significance of the correlation coefficient is assessed using Student's t-test. The actual (observed) value of the criterion is determined by the formula

$$t_{\text{obs}}=\frac{r\sqrt{n-2}}{\sqrt{1-r^2}}.$$

The value t_obs calculated by this formula is compared with the critical value of the t-criterion, taken from the table of Student's t-distribution (see Appendix 2) for the given significance level α and (n − 2) degrees of freedom.

If t_obs > t_tab, the obtained value of the correlation coefficient is recognized as significant (i.e., the null hypothesis that the correlation coefficient equals zero is rejected), and it is concluded that there is a close statistical relationship between the studied variables.
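A sketch of this significance check, assuming SciPy is available (r and n here are hypothetical values, chosen to match the scale of the example below):

```python
# Significance test for a sample correlation coefficient.
from scipy import stats

r, n = 0.816, 16                              # sample correlation, sample size
t_obs = r * (n - 2) ** 0.5 / (1 - r ** 2) ** 0.5
t_tab = stats.t.ppf(1 - 0.1 / 2, n - 2)       # two-sided, alpha = 0.1, df = n - 2

print(f"t_obs = {t_obs:.3f}, t_tab = {t_tab:.4f}")
# If t_obs > t_tab, reject H0: rho = 0 (the correlation is significant).
```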

If the value of r_xy is close to zero, the relationship between the variables is weak. If the correlation between the random variables:

  • is positive, then as one random variable increases, the other tends on average to increase;
  • is negative, then as one random variable increases, the other tends on average to decrease.

A convenient graphical tool for analyzing paired data is the scatterplot, which represents each observation in a two-dimensional space corresponding to the two factors. A scatterplot depicting the set of values of two features is also called a correlation field. Each point of this diagram has coordinates (xᵢ, yᵢ). As the strength of the linear relationship increases, the points lie closer to a straight line and the magnitude of r is closer to unity.

Pair correlation coefficients are used to measure the strength of the linear relationships between different pairs of features from a set. For a set of features, this yields a matrix of pair correlation coefficients.

Let the entire dataset consist of a variable Y = (y₁, y₂, ..., yₙ) and m variables (factors) X, each of which contains n observations. The values of Y and X contained in the observed population are recorded in a table (Table 3.2.1).

Table 3.2.1. Observed values of the variable Y and the factors X₁, ..., Xₘ: for each observation number 1, ..., n the table lists yᵢ and xᵢ₁, ..., xᵢₘ (the data themselves are not reproduced here).

Based on the data contained in this table, the matrix of pair correlation coefficients R is calculated; it is symmetric about the main diagonal:

$$R=\begin{pmatrix}1 & r_{12} & \cdots & r_{1m}\\ r_{21} & 1 & \cdots & r_{2m}\\ \vdots & \vdots & \ddots & \vdots\\ r_{m1} & r_{m2} & \cdots & 1\end{pmatrix}. \qquad (3.2.6)$$


The analysis of the matrix of pair correlation coefficients is used in the construction of multiple regression models.

A correlation matrix alone cannot fully describe the dependencies between the quantities. Therefore, two problems are considered in multivariate correlation analysis:

  • 1. Determining the tightness of the relationship of one random variable with the totality of other variables included in the analysis.
  • 2. Determining the tightness of the relationship between two quantities while fixing or excluding the influence of other quantities.

These problems are solved, respectively, with the help of multiple and partial correlation coefficients.

The first problem (determining the closeness of the relationship of one random variable with the set of the other variables included in the analysis) is solved using the sample multiple correlation coefficient, computed by the formula

$$R_j=\sqrt{1-\frac{|R|}{R_{jj}}},$$

where |R| is the determinant of the matrix R [see formula (3.2.6)] and $R_{jj}$ is the algebraic complement (cofactor) of the element $r_{jj}$ of the same matrix R.

The square of the multiple correlation coefficient, $R^2_{j\cdot 1,2,\ldots,j-1,j+1,\ldots,m}$, is called the sample multiple coefficient of determination; it shows what proportion of the variation (random scatter) of the variable $X_j$ is explained by the variation of the other random variables $X_1, X_2, \ldots, X_m$.

The coefficients of multiple correlation and determination are positive quantities taking values between 0 and 1. As R² approaches unity, one may conclude that the relationship of the random variables is close, but nothing follows about its direction. The multiple correlation coefficient can only increase when additional variables are included in the model, and it will not increase when available features are excluded.

The significance of the coefficient of determination is checked by comparing the calculated value of Fisher's F-criterion

$$F_{\text{obs}}=\frac{R^2}{1-R^2}\cdot\frac{n-m-1}{m}$$

with the tabular value F_tab. The tabular value of the criterion (see Appendix 1) is determined by the given significance level α and the degrees of freedom ν₁ = m and ν₂ = n − m − 1. The coefficient R² is significantly different from zero if F_obs > F_tab.
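A sketch of the determinant-and-cofactor computation of the multiple correlation coefficient, assuming NumPy; the matrix below is hypothetical, with the explained variable in row and column 0:

```python
# Multiple correlation of variable 0 with the others:
# R_0 = sqrt(1 - det(R) / cofactor(r_00)).
import numpy as np

R = np.array([[1.0, 0.7, 0.5],    # hypothetical correlation matrix
              [0.7, 1.0, 0.3],
              [0.5, 0.3, 1.0]])

det_R = np.linalg.det(R)
# Cofactor of element (0, 0): determinant of R with row 0 and column 0 removed
R_00 = np.linalg.det(np.delete(np.delete(R, 0, axis=0), 0, axis=1))

R_mult = np.sqrt(1 - det_R / R_00)
print(f"R = {R_mult:.4f}, R^2 = {R_mult ** 2:.4f}")  # R^2: coeff. of determination
```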

If the random variables under consideration correlate with each other, then the value of a pair correlation coefficient is partially affected by the influence of the other quantities. This creates the need to study the partial correlation between variables while excluding the influence of other random variables (one or several).

The sample partial correlation coefficient is determined by the formula

$$r_{jk\cdot}=\frac{-R_{jk}}{\sqrt{R_{jj}R_{kk}}}, \qquad (3.2.9)$$

where $R_{jk}$, $R_{jj}$, $R_{kk}$ are the algebraic complements of the corresponding elements of the matrix R [see formula (3.2.6)].

The partial correlation coefficient, like the pair correlation coefficient, varies from −1 to +1.

For m = 3, expression (3.2.9) takes the form

$$r_{12(3)}=\frac{r_{12}-r_{13}r_{23}}{\sqrt{(1-r_{13}^2)(1-r_{23}^2)}}.$$

The coefficient r₁₂₍₃₎ is called the correlation coefficient between x₁ and x₂ for fixed x₃. It is symmetric with respect to its primary indices 1, 2; the secondary index 3 refers to the fixed variable.
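A minimal sketch of this three-variable formula (the pairwise correlations below are made-up values):

```python
# Partial correlation r12.3 from the three pairwise coefficients.
import math

r12, r13, r23 = 0.8, 0.6, 0.5     # hypothetical pairwise correlations

r12_3 = (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))
print(f"r12.3 = {r12_3:.4f}")     # correlation of x1 and x2 with x3 held fixed
```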

Example 3.2.1. Calculation of pair, multiple, and partial correlation coefficients.

Table 3.2.2 provides information on the sales volumes and advertising costs of a company, as well as the consumer spending index, for a number of recent years.

  • 1. Construct a scatterplot (correlation field) for the variables "sales volume" and "consumer spending index".
  • 2. Determine the degree of influence of the consumer spending index on sales volume (calculate the pair correlation coefficient).
  • 3. Assess the significance of the calculated pair correlation coefficient.
  • 4. Build a matrix of pair correlation coefficients for three variables.
  • 5. Find an estimate of the multiple correlation coefficient.
  • 6. Find estimates of the partial correlation coefficients.

1. In our example, the scatter diagram has the form shown in Fig. 3.2.1. The elongation of the point cloud along an inclined straight line suggests an objective tendency toward a direct linear relationship between the values of the variables X₂ (consumer spending index) and Y (sales volume).

Fig. 3.2.1.

2. Intermediate calculations for the correlation coefficient between the variables X₂ (consumer spending index) and Y (sales volume) are given in Table 3.2.3.

The means of the random variables X₂ and Y, the simplest indicators characterizing the sequences x₁, x₂, ..., x₁₆ and y₁, y₂, ..., y₁₆, are calculated by the formulas

$$\bar{x}=\frac{1}{n}\sum_{i=1}^{n}x_i,\qquad \bar{y}=\frac{1}{n}\sum_{i=1}^{n}y_i.$$


Table 3.2.2. Sales volume Y (thousand rubles) and consumer spending index (the data themselves are not reproduced here).

Table 3.2.3. Intermediate calculations: columns $(x_i-\bar{x})$, $(y_i-\bar{y})(x_i-\bar{x})$, $(x_i-\bar{x})^2$ and $(y_i-\bar{y})^2$ (the data themselves are not reproduced here).

The variance characterizes the degree of spread of the values x₁, x₂, ..., xₙ around their mean.

Consider now the solution to Example 3.2.1 in Excel.

To calculate the correlation in Excel, you can use the function =CORREL(), specifying the addresses of two columns of numbers, as shown in Fig. 3.2.2. The answer is placed in D8 and equals 0.816.

Fig. 3.2.2.

(Note: the arguments of CORREL must be numbers, or names, arrays, or references containing numbers. If an array or reference argument contains text, logical values, or empty cells, those values are ignored; cells containing zero values, however, are counted.

If array1 and array2 have a different number of data points, CORREL returns the #N/A error value.

If array1 or array2 is empty, or if the standard deviation σ of its values is zero, CORREL returns the #DIV/0! error value.)

The critical value of Student's t-statistic can also be obtained using the STUDRASPOBR function¹ of the Excel package (TINV in the English version). As the function's arguments, specify the number of degrees of freedom, equal to n − 2 (in our example 16 − 2 = 14), and the significance level α (in our example α = 0.1) (Fig. 3.2.3). If the actual value of the t-statistic, taken in absolute value, is greater than the critical one, then with probability (1 − α) the correlation coefficient is significantly different from zero.


Fig. 3.2.3. The critical value of the t-statistic is 1.7613

Excel includes a set of data analysis tools (the Analysis ToolPak) designed for various statistical problems. To calculate the matrix of pair correlation coefficients R, use the Correlation tool (Fig. 3.2.4) and set the analysis parameters in the corresponding dialog box. The answer is placed on a new worksheet (Fig. 3.2.5).

¹ In Excel 2010, the function STUDRASPOBR was renamed STUDENT.OBR.2X (T.INV.2T in the English version).

Fig. 3.2.4.


Fig. 3.2.5.
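For readers working outside Excel, a rough Python equivalent of this workflow might look as follows; the file name and column names are assumptions, not part of the example:

```python
# Sketch: CORREL, the Correlation tool, and the critical t-value in pandas/SciPy.
import pandas as pd
from scipy import stats

df = pd.read_csv("sales.csv")             # hypothetical file with columns Y, X1, X2

r = df["Y"].corr(df["X2"])                # analogue of =CORREL(...)
R = df.corr()                             # analogue of the Correlation tool

n = len(df)
t_crit = stats.t.ppf(1 - 0.1 / 2, n - 2)  # analogue of STUDRASPOBR(0.1; n - 2)
print(r, t_crit)
print(R)                                  # matrix of pair correlation coefficients
```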

  • The English statisticians F. Galton (1822-1911) and K. Pearson (1857-1936) are considered the founders of the theory of correlation. The term "correlation" was borrowed from natural science and means "correspondence, interrelation". The concept of correlation as interdependence between random variables underlies the mathematical-statistical theory of correlation.

Task 2

1. Build a matrix of paired correlation coefficients. Check for multicollinearity. Justify the selection of factors in the model.

2. Build a multiple regression equation in linear form with the selected factors.

3. Evaluate the statistical significance of the regression equation and its parameters using the Fisher and Student's tests.

4. Build a regression equation with statistically significant factors. Assess the quality of the regression equation using the coefficient of determination R². Assess the accuracy of the constructed model.

5. Estimate the forecast for the volume of output if the forecast values of the factors are 75% of their maximum values.

Task conditions (Option 21)

According to the data presented in Table 1 (n = 17), we study the dependence of the volume of output Y (million rubles) on the following factors (variables):

X 1 - the number of industrial and production personnel, people.

X 2 - the average annual cost of fixed assets, million rubles.

X 3 - depreciation of fixed assets,%

X 4 - electric power, kWh.

X 5 - technical equipment of one worker, million rubles.

X 6 - production of marketable products per worker, rub.

Table 1. Production data

Y x1 x2 x3 x4 x5 x6
39,5 4,9 3,2
46,4 60,5 20,4
43,7 24,9 9,5
35,7 50,4 34,7
41,8 5,1 17,9
49,8 35,9 12,1
44,1 48,1 18,9
48,1 69,5 12,2
47,6 31,9 8,1
58,6 139,4 29,7
70,4 16,9 5,3
37,5 17,8 5,6
62,0 27,6 12,3
34,4 13,9 3,2
35,4 37,3 19,0
40,8 55,3 19,3
48,1 35,1 12,4


Build a matrix of paired correlation coefficients. Check for multicollinearity. Justify the selection of factors in the model

Table 2 presents the matrix of pair correlation coefficients for all variables under consideration. The matrix was obtained using the Correlation tool of the Data Analysis package in Excel.

Table 2. Matrix of pair correlation coefficients

Y X1 X2 X3 X4 X5 X6
Y
X1 0,995634
X2 0,996949 0,994947
X3 -0,25446 -0,27074 -0,26264
X4 0,12291 0,07251 0,107572 0,248622
X5 0,222946 0,166919 0,219914 -0,07573 0,671386
X6 0,067685 -0,00273 0,041955 -0,28755 0,366382 0,600899

Visual analysis of the matrix allows us to establish that:

1) Y has rather high pairwise correlations with the variables X1, X2 (>0.5) and low ones with the variables X3, X4, X5, X6 (<0.5);

2) the variables X1, X2 demonstrate rather high pairwise correlations, which necessitates checking the factors for multicollinearity. Moreover, one of the conditions of the classical regression model is the assumption that the explanatory variables are independent.

To identify multicollinearity of the factors, we perform the Farrar-Glauber test on the factors X1, X2, X3, X4, X5, X6.

The Farrar-Glauber test for multicollinearity includes several stages.

1) Checking the entire array of variables for multicollinearity.

One of the conditions of the classical regression model is the assumption that the explanatory variables are independent. To identify multicollinearity between factors, the matrix of interfactorial correlations R is calculated using the Data Analysis Package (Table 3).

Table 3. Interfactor correlation matrix R

X1 X2 X3 X4 X5 X6
X1 0,994947 -0,27074 0,07251 0,166919 -0,00273
X2 0,994947 -0,26264 0,107572 0,219914 0,041955
X3 -0,27074 -0,26264 0,248622 -0,07573 -0,28755
X4 0,07251 0,107572 0,248622 0,671386 0,366382
X5 0,166919 0,219914 -0,07573 0,671386 0,600899
X6 -0,00273 0,041955 -0,28755 0,366382 0,600899

There is a strong relationship between factors X1 and X2, X5 and X4, X6 and X5 (>0.5).

The determinant det(R) = 0.001488 is calculated using the MDETERM function (МОПРЕД in the Russian version). The determinant of the matrix R tends to zero, which allows us to assume general multicollinearity of the factors.

2) Checking each variable for multicollinearity with the other variables:

Calculate the inverse matrix R⁻¹ using the Excel MINVERSE function (МОБР in the Russian version) (Table 4):

Table 4. Inverse matrix R -1

X1 X2 X3 X4 X5 X6
X1 150,1209 -149,95 3,415228 -1,70527 6,775768 4,236465
X2 -149,95 150,9583 -3,00988 1,591549 -7,10952 -3,91954
X3 3,415228 -3,00988 1,541199 -0,76909 0,325241 0,665121
X4 -1,70527 1,591549 -0,76909 2,218969 -1,4854 -0,213
X5 6,775768 -7,10952 0,325241 -1,4854 2,943718 -0,81434
X6 4,236465 -3,91954 0,665121 -0,213 -0,81434 1,934647

· Calculation of the F-criteria F_j, where c_jj are the diagonal elements of the matrix R⁻¹, n = 17, k = 6 (Table 5).

Table 5. F-criteria values

F1 (Х1) F2 (Х2) F3 (X3) F4 (X4) F5 (X5) F6 (X6)
89,29396 89,79536 0,324071 0,729921 1,163903 0,559669

The actual values of the F-criteria are compared with the table value F_tab = 3.21 (FINV(0.05;6;10)) for ν₁ = 6 and ν₂ = n − k − 1 = 17 − 6 − 1 = 10 degrees of freedom at significance level α = 0.05, where k is the number of factors.

· The values of the F-criteria for factors X1 and X2 exceed the table value, which indicates multicollinearity between these factors. Factor X3 has the least effect on the overall multicollinearity of the factors.

3) Checking each pair of variables for multicollinearity.

Calculate the partial correlation coefficients using the formula $r_{jk}=-c_{jk}\big/\sqrt{c_{jj}c_{kk}}$, where $c_{jk}$ are the elements of the matrix R⁻¹ (Table 6).

Table 6. Matrix of coefficients of partial correlations

X1 X2 X3 X4 X5 X6
X1
X2 0,996086
X3 -0,22453 0,197329
X4 0,093432 -0,08696 0,415882
X5 -0,32232 0,337259 -0,1527 0,581191
X6 -0,24859 0,229354 -0,38519 0,102801 0,341239

· Calculation of the t-criteria by the formula $t_{jk}=r_{jk}\sqrt{n-k-1}\big/\sqrt{1-r_{jk}^2}$ (Table 7), where

n is the number of observations = 17,

k is the number of factors = 6.

Table 7. t-tests for partial correlation coefficients

X1 X2 X3 X4 X5 X6
X1
X2 35,6355
X3 -0,72862 0,636526
X4 0,296756 -0,27604 1,446126
X5 -1,07674 1,13288 -0,4886 2,258495
X6 -0,81158 0,745143 -1,31991 0,326817 1,147999

t_tab = STUDRASPOBR(0.05; 10) = 2.23

The actual values of the t-criteria are compared with the table value at n − k − 1 = 17 − 6 − 1 = 10 degrees of freedom and significance level α = 0.05:

t₂₁ > t_tab

t₅₄ > t_tab

Tables 6 and 7 show that two pairs of factors, X1 and X2, and X4 and X5, have a high, statistically significant partial correlation, i.e., they are multicollinear. To get rid of multicollinearity, one variable of each collinear pair can be eliminated: in the pair X1 and X2 we keep X2, and in the pair X4 and X5 we keep X5.
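All three stages can be scripted; the sketch below assumes pandas/NumPy/SciPy, the function name is hypothetical, and the stage-2 F-statistic is given in one common textbook form (degrees-of-freedom conventions for this statistic vary between sources):

```python
# Sketch of the three Farrar-Glauber stages for a factor DataFrame X.
import numpy as np
import pandas as pd
from scipy import stats

def farrar_glauber(X: pd.DataFrame, alpha: float = 0.05):
    n, k = X.shape
    R = X.corr().to_numpy()

    # Stage 1: overall multicollinearity; det(R) near 0 is suspicious
    det_R = np.linalg.det(R)

    # Stage 2: each factor against all the others, from the diagonal of R^-1
    C = np.linalg.inv(R)
    F = (np.diag(C) - 1) * (n - k) / (k - 1)       # one common form
    F_crit = stats.f.ppf(1 - alpha, k - 1, n - k)

    # Stage 3: pairwise partial correlations r_jk = -c_jk / sqrt(c_jj * c_kk)
    d = np.sqrt(np.diag(C))
    partial = -C / np.outer(d, d)
    np.fill_diagonal(partial, 0.0)                 # diagonal is not meaningful
    t = partial * np.sqrt(n - k - 1) / np.sqrt(1 - partial ** 2)
    t_crit = stats.t.ppf(1 - alpha / 2, n - k - 1)
    return det_R, (F, F_crit), (partial, t, t_crit)
```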

Thus, as a result of the Farrar-Glauber test, the following factors remain: X2, X3, X5, X6.

Completing the correlation analysis, it is advisable to look at the partial correlations of the selected factors with the result Y.

Let's build a matrix of paired correlation coefficients based on the data in Table 8.

Table 8. Output data with selected factors X2, X3, X5, X6.

No. of observation Y x2 x3 x5 x6
39,5 3,2
46,4 20,4
43,7 9,5
35,7 34,7
41,8 17,9
49,8 12,1
44,1 18,9
48,1 12,2
47,6 8,1
58,6 29,7
70,4 5,3
37,5 5,6
12,3
34,4 3,2
35,4
40,8 19,3
48,1 12,4

The last column of Table 9 shows the t-test values for the Y column.

Table 9. Matrix of coefficients of partial correlation with the result Y

Y X2 X3 X5 X6 t-criterion (t_tab(0.05; 11) = 2.200985)
Y 0,996949 -0,25446 0,222946 0,067685
X2 0,996949 -0,26264 0,219914 0,041955 44,31676
X3 -0,25446 -0,26264 -0,07573 -0,28755 0,916144
X5 0,222946 0,219914 -0,07573 0,600899 -0,88721
X6 0,067685 0,041955 -0,28755 0,600899 1,645749

Table 9 shows that the variable Y has a high and statistically significant partial correlation with the factor X2.


Table 1. Matrix of paired correlation coefficients

Y X1 X2 X3 X4 X5 X6
Y
X1 0,519
X2 -0,273 0,030
X3 0,610 0,813 -0,116
X4 -0,572 -0,013 -0,022 -0,091
X5 0,297 0,043 -0,461 0,120 -0,359
X6 0,118 -0,366 -0,061 -0,329 -0,100 -0,290

Analysis of the interfactor (between the x's!) correlation coefficients shows that only the correlation coefficient between the pair of factors X1-X3 (0.813, highlighted in bold) exceeds 0.8 in absolute value. The factors X1-X3 are thus recognized as collinear.

2. As shown in paragraph 1, the factors X1-X3 are collinear, meaning that they effectively duplicate each other, and including both in the model at the same time would distort the interpretation of the respective regression coefficients. The factor X3 has a larger correlation coefficient with the result Y in absolute value than the factor X1: r(y, x1) = 0.519; r(y, x3) = 0.610 (see Table 1). This indicates a stronger influence of the factor X3 on the change of Y. The factor X1 is therefore excluded from consideration.

To construct the regression equation, the values of the variables used (Y, X2, X3, X4, X5, X6) are copied to a blank worksheet (Appendix 3). We build the regression equation using the "Data Analysis... Regression" add-in (menu "Service" → "Data Analysis..." → "Regression"). The regression analysis panel with filled fields is shown in Fig. 2.

The results of the regression analysis are given in Appendix 4 and transferred to Table 2. The regression equation has the form (see "Coefficients" in Table 2):

The regression equation is recognized as statistically significant, since the probability of its random formation in the obtained form is 8.80 × 10⁻⁶ (see "Significance F" in Table 2), which is considerably lower than the accepted significance level α = 0.05.

The probability of random formation of the coefficients at the factors X3, X4, X6 is below the accepted significance level α = 0.05 (see "P-Value" in Table 2), which indicates the statistical significance of these coefficients and the significant impact of these factors on the change in annual profit Y.

The probability of random formation of the coefficients at the factors X2 and X5 exceeds the accepted significance level α = 0.05 (see "P-Value" in Table 2), and these coefficients are not considered statistically significant.

Fig. 2. Regression analysis panel for the model Y(X2, X3, X4, X5, X6)

Table 2. Regression analysis results for the model Y(X2, X3, X4, X5, X6)

Regression statistics
Multiple R 0,868
R-square 0,753
Normalized R-square 0,694
Standard error 242,3
Observations 27

Analysis of variance
            df   SS         MS        F      Significance F
Regression   5   3749838,2  749967,6  12,78  8,80E-06
Residual    21   1232466,8  58688,9
Total       26   4982305,0

Regression equation
             Coefficients  Standard error  t-statistic  P-Value
Y-intercept  487,5         641,4           0,760        0,456
X2           -0,0456       0,0373          -1,224       0,235
X3           0,1043        0,0194          5,375        0,00002
X4           -0,0965       0,0263          -3,674       0,001
X5           2,528         6,323           0,400        0,693
X6           248,2         113,0           2,197        0,039

3. Based on the results of checking the statistical significance of the coefficients of the regression equation in the previous paragraph, we build a new regression model containing only informative factors, which include:

factors whose coefficients are statistically significant;

factors whose coefficients' t-statistics exceed one in absolute value (in other words, the absolute value of the coefficient is greater than its standard error).

The first group includes the factors X3, X4, X6; the second, the factor X2. The factor X5 is excluded from consideration as uninformative, and the final regression model will contain the factors X2, X3, X4, X6.
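A sketch of this re-estimation step using statsmodels (the file name and column labels are assumptions; the source performs the same step in Excel):

```python
# Re-fit the regression on the informative factors only.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("firms.csv")                      # hypothetical data file
X = sm.add_constant(df[["X2", "X3", "X4", "X6"]])  # factors kept in the text
model = sm.OLS(df["Y"], X).fit()

print(model.summary())   # R^2, F and its p-value, per-coefficient t and p
# Factors whose p-value exceeds alpha = 0.05 are candidates for removal;
# here X2 is kept because |t| > 1 (coefficient exceeds its std. error).
```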

To build the regression equation, we copy the values of the variables used to a blank worksheet (Appendix 5) and perform the regression analysis (Fig. 3). Its results are given in Appendix 6 and transferred to Table 3. The regression equation has the form (see "Coefficients" in Table 3):

Fig. 3. Regression analysis panel for the model Y(X2, X3, X4, X6)

Table 3. Regression analysis results for the model Y(X2, X3, X4, X6)

Regression statistics
Multiple R 0,866
R-square 0,751
Normalized R-square 0,705
Standard error 237,6
Observations 27

Analysis of variance
            df   SS         MS        F      Significance F
Regression   4   3740456,2  935114,1  16,57  2,14E-06
Residual    22   1241848,7  56447,7
Total       26   4982305,0

Regression equation
             Coefficients  Standard error  t-statistic  P-Value
Y-intercept  712,2         303,0           2,351        0,028
X2           -0,0541       0,0300          -1,806       0,085
X3           0,1032        0,0188          5,476        0,00002
X4           -0,1017       0,0223          -4,560       0,00015
X6           227,5         98,5            2,310        0,031

The regression equation is statistically significant: the probability of its random formation is below the accepted significance level α = 0.05 (see "Significance F" in Table 3).

The coefficients at the factors X3, X4, X6 are also statistically significant: the probability of their random formation is below the accepted significance level α = 0.05 (see "P-Value" in Table 3). This indicates a significant impact of the annual amount of insurance fees X3, annual insurance payments X4, and the form of ownership X6 on the change in annual profit Y.

The coefficient at the factor X2 (annual amount of insurance reserves) is not statistically significant. However, this factor can still be considered informative, since the t-statistic of its coefficient exceeds one in absolute value, although further conclusions regarding the factor X2 should be treated with some caution.

4. We evaluate the quality and accuracy of the last regression equation using statistical characteristics obtained during the regression analysis (see "Regression statistics" in Table 3):

the multiple coefficient of determination R² = 0.751 shows that the regression model explains 75.1% of the variation in annual profit Y, this variation being due to changes in the factors included in the model, X2, X3, X4, and X6;

the regression standard error S = 237.6 thousand rubles shows that the values of annual profit Y predicted by the regression equation differ from the actual values by 237.6 thousand rubles on average.

The average relative approximation error is determined by the approximate formula

$$\bar{E}_{\text{rel}}\approx\frac{S}{\bar{y}}\cdot 100\%,$$

where $\bar{y}$ (thousand rubles) is the average value of the annual profit (determined using the built-in AVERAGE function; Appendix 1).

E_rel shows that the values of annual profit Y predicted by the regression equation differ from the actual values by an average of 26.7%. The model has unsatisfactory accuracy (depending on the value of E_rel, a model's accuracy is rated as high, good, satisfactory, or unsatisfactory).

5. For the economic interpretation of the coefficients of the regression equation, we tabulate the means and standard deviations of the variables in the initial data (Table 4). The means were determined using the built-in AVERAGE function, the standard deviations using the built-in STDEV function (see Appendix 1).

For the territories of the Southern Federal District of the Russian Federation, data are given for 2011

Territories of the federal district

Gross regional product, billion rubles, Y

Investments in fixed capital, billion rubles, X1

1. Rep. Adygea

2. Rep. Dagestan

3. Rep. Ingushetia

4. Kabardino-Balkarian Republic

5. Rep. Kalmykia

6. Karachay-Cherkess Republic

7. Rep. North Ossetia Alania

8. Krasnodar Territory

9. Stavropol Territory

10. Astrakhan region

11. Volgograd region

12. Rostov region

  • 1. Calculate the matrix of paired correlation coefficients; evaluate the statistical significance of the correlation coefficients.
  • 2. Build the correlation field of the resulting feature and the most closely related factor.
  • 3. Calculate the parameters of the linear pair regression for each factor X.
  • 4. Evaluate the quality of each model through the coefficient of determination, the average approximation error and Fisher's F-test. Choose the best model.

  • 5. Forecast the average value of the indicator (at the given significance level) if the forecast value of the factor is 80% of its maximum value. Present graphically: actual and model values, forecast points.

  • 6. Using stepwise multiple regression (exclusion method or inclusion method), build a model of apartment price formation due to significant factors. Give an economic interpretation of the coefficients of the regression model.
  • 7. Evaluate the quality of the built model. Has the quality of the model improved compared to the one-factor model? Assess the influence of the significant factors on the result using elasticity coefficients, β- and Δ-coefficients.

When solving this problem, we will carry out the calculations and plot the graphs and charts using the Excel Data Analysis add-in.

1. Calculate the matrix of paired correlation coefficients and evaluate the statistical significance of the correlation coefficients

In the Correlation dialog box, in the Input interval field, enter the range of cells containing the source data. Since we also selected the column headings, we check the Labels in the first row checkbox.

We got the following results:

Table 1.1 Matrix of pairwise correlation coefficients

An analysis of the matrix of pair correlation coefficients shows that the dependent variable Y, the gross regional product, is most closely related to X1 (investment in fixed capital): the correlation coefficient is 0.936, which indicates a very close direct relationship between Y and X1.

The statistical significance of the correlation coefficients is determined using Student's t-test: the table value is compared with the calculated values.

Let's calculate the table value using the STUDRASPOBR function.

t_tab = 0.129 at confidence level 0.9 and (n − 2) degrees of freedom.

The X1 factor is statistically significant.

2. Let's construct the correlation field of the resulting feature (gross regional product) and the factor most closely related to it (investment in fixed capital).

To do this, we use the scatter plot tool in Excel.

As a result, we obtain the correlation field of the gross regional product (billion rubles) and investments in fixed capital (billion rubles) (Figure 1.1).

Figure 1.1

3. Calculate the parameters of linear pair regression for each factor X

To calculate the parameters of a linear pairwise regression, we will use the Regression tool included in the Data Analysis setting.

In the Regression dialog box, in the Input interval Y field, enter the address of the range of cells that represents the dependent variable. In field

Input interval X we enter the address of the range that contains the values ​​of the independent variables. Let's calculate the pairwise regression parameters for factor X.

For X1, the following data were obtained, presented in Table 1.2:

Table 1.2

The regression equation for the dependence of the gross regional product on investment in fixed capital has the form ŷ = 5.07 + 2.14x.

4. Let's evaluate the quality of each model through the coefficient of determination, the average approximation error and Fisher's F-criterion. Let's find out which model is the best.

The coefficient of determination, the average approximation error, we obtained as a result of the calculations carried out in paragraph 3. The data obtained are presented in the following tables:

Data for X1:

Table 1.3a

Table 1.4b

A) The coefficient of determination determines what proportion of the variation of the attribute Y is taken into account in the model and is due to the influence of the factor X on it. The greater the value of the coefficient of determination, the closer the relationship between the attributes in the constructed mathematical model.

In the Excel output, it is labeled R-square.

Based on this criterion, the model of the regression equation for the dependence of the price of the gross regional product on investment in fixed assets (X1) is the most adequate.

B) Calculate the average approximation error using the formula:

where the numerator is the sum of the squared deviations of the calculated values from the actual ones (in the output tables it is the SS column, Residual row).

We calculate the average value of the gross regional product in Excel using the AVERAGE function: ȳ = 24.18182 billion rubles.

When conducting economic calculations, the model is considered sufficiently accurate if the average approximation error is less than 5%, the model is considered acceptable if the average approximation error is less than 15%.

According to this criterion, the most adequate is the mathematical model for the regression equation of the dependence of the price of the gross regional product on investment in fixed assets (X1).

C) The F-test is used to check the significance of the regression model. For this, the calculated value is compared with the critical (tabular) value of Fisher's F-test.

The calculated values ​​are given in tables 1.4b (indicated by the letter F).

The table value of Fisher's F-test is calculated in Excel using the FINV function (ФРАСПОБР in the Russian version). Taking the probability equal to 0.05, we obtain F_tab = 4.75.

The calculated value of Fisher's F-test for each factor is compared with the table value:

71.02 > F_tab = 4.75, so the model is adequate by this criterion.

Having analyzed the data against all three criteria, we conclude that the best is the mathematical model built for the factor X1 (investment in fixed capital), described by the linear equation above.

5. For the chosen model of the dependence of the gross regional product on investment in fixed capital, we will forecast the average value of the indicator at the given significance level if the predicted value of the factor is 80% of its maximum value. We will represent graphically the actual and model values and the forecast points.

Calculate the predicted value of X, according to the condition, it will be 80% of the maximum value.

Calculate X max in Excel using the MAX function.

0,8 *52,8 = 42,24

To obtain a forecast estimate of the dependent variable, we substitute the obtained value of the independent variable into the linear equation:

5.07 + 2.14 · 42.24 = 304.55 billion rubles.

Let us determine the confidence interval of the forecast, which will have the following boundaries:

To calculate the confidence interval for the predicted value, we calculate the deviation from the regression line.

For a paired regression model, the deviation value is calculated:

i.e., the standard error value from Table 1.5a.

(Since the number of degrees of freedom is one, the denominator will be equal to n − 2.)

To calculate the coefficient, we use the Excel STUDRASPOBR function; the probability is taken equal to 0.1, the number of degrees of freedom is 38.

We calculate the value using Excel, we get 12294.


Let's define the upper and lower bounds of the interval.

  • 304,55+27,472= 332,022
  • 304,55-27,472= 277,078

Thus, the forecast value of 304.55 billion rubles will lie between the lower limit of 277.078 billion rubles and the upper limit of 332.022 billion rubles.
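The forecast point and its interval can be reproduced programmatically; below is a sketch assuming statsmodels and a hypothetical data file (the 80%-of-maximum rule follows the text, and the numbers are whatever the fitted model produces):

```python
# Sketch: pair-regression forecast with a 90% prediction interval.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("regions.csv")               # hypothetical file with Y, X1
model = sm.OLS(df["Y"], sm.add_constant(df["X1"])).fit()

x_new = 0.8 * df["X1"].max()                  # forecast point: 80% of the maximum
pred = model.get_prediction(pd.DataFrame({"const": [1.0], "X1": [x_new]}))
frame = pred.summary_frame(alpha=0.10)        # 90% interval, as in the text
print(frame[["mean", "obs_ci_lower", "obs_ci_upper"]])
```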

Actual and model values, forecast points are presented graphically in Figure 1.2.


Figure 1.2

6. Using stepwise multiple regression (the exclusion method), we will build a model of the formation of the gross regional product from significant factors.

To build the multiple regression, we use the Excel Regression tool, including all the factors. As a result, we obtain output tables, from which we need Student's t-test.

Table 1.8a

Table 1.8b

Table 1.8c.

We get a model of the form:

Since F_tab < F_obs (4.75 < 71.024), the regression equation should be recognized as adequate.

Let's take the smallest absolute value of Student's t-test, equal to 8.427, and compare it with the tabular value calculated in Excel at significance level 0.10 and n − m − 1 = 12 − 3 − 1 = 8 degrees of freedom: t_tab = 1.8595.

Since 8.427 > 1.8595, the coefficient, and with it the model, should be recognized as significant.

7. To assess the significant factor of the obtained mathematical model, we calculate the elasticity coefficient, the β-coefficient, and the Δ-coefficient.

The elasticity coefficient shows by how many percent the resulting attribute changes when the factor attribute changes by 1%:

E_X1 = 2.137 · (10.69 / 24.182) = 0.94%.

That is, with an increase in investment in fixed capital by 1%, the gross regional product increases by an average of 0.94%.

The β-coefficient shows by what part of its standard deviation the mean of the dependent variable changes when the independent variable changes by one standard deviation:

β = 2.137 · (14.736 / 33.632) = 0.936.

Standard deviation data is taken from tables obtained using the Descriptive Statistics tool.

Table 1.11 Descriptive statistics (Y)

Table 1.12 Descriptive statistics (X1)

The Δ-coefficient determines the share of the factor's influence in the total influence of all the factors:

To calculate the pair correlation coefficients, we calculate the matrix of pair correlation coefficients in Excel using the Correlation tool of the Data Analysis settings.

Table 1.14

Δ = (0.93633 · 0.93626) / 0.87 = 1.00.

Conclusion: based on the calculations, the resulting attribute Y (gross regional product) depends essentially entirely (Δ = 100%) on the factor X1 (investment in fixed capital).
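The three influence measures can be verified with a few lines of arithmetic; the values below are the ones quoted in the text (b = 2.137, the means and standard deviations from Tables 1.11-1.14):

```python
# Elasticity, beta- and delta-coefficients for the pair model Y ~ X1.
b = 2.137                        # regression coefficient at X1
x_mean, y_mean = 10.69, 24.182   # means of X1 and Y
s_x, s_y = 14.736, 33.632        # standard deviations of X1 and Y
r_yx, R2 = 0.936, 0.87           # pair correlation and coeff. of determination

E = b * x_mean / y_mean          # elasticity: % change of Y per 1% change of X
beta = b * s_x / s_y             # beta: response in units of std. deviations
delta = beta * r_yx / R2         # delta: share of the factor's influence
print(f"E = {E:.2f}%, beta = {beta:.3f}, delta = {delta:.2f}")
```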

y x (1) x (2) x (3) x (4) x (5)
y 1.00 0.43 0.37 0.40 0.58 0.33
x (1) 0.43 1.00 0.85 0.98 0.11 0.34
x (2) 0.37 0.85 1.00 0.88 0.03 0.46
x (3) 0.40 0.98 0.88 1.00 0.03 0.28
x (4) 0.58 0.11 0.03 0.03 1.00 0.57
x (5) 0.33 0.34 0.46 0.28 0.57 1.00

An analysis of the matrix of paired correlation coefficients shows that the resulting indicator is most closely related to the indicator x(4), the amount of fertilizers used per 1 ha (r(y, x(4)) = 0.58).

At the same time, the relationship between the explanatory features is quite close. Thus, there is practically a functional relationship between the number of wheeled tractors (x(1)) and the number of surface tillage implements (x(3)): r(x(1), x(3)) = 0.98.

The presence of multicollinearity is also evidenced by the correlation coefficients r(x(1), x(2)) = 0.85 and r(x(2), x(3)) = 0.88. Given the close relationship of the indicators x(1), x(2), and x(3), only one of them can enter the yield regression model.

To demonstrate the negative impact of multicollinearity, consider a yield regression model that includes all the inputs; for it, F_obs = 121.

In parentheses under the equation are the corrected estimates of the standard deviations of the coefficient estimates.

The following adequacy indicators accompany the regression equation: the multiple coefficient of determination R², the corrected estimate of the residual variance s², the average relative approximation error, and the calculated value of the F-criterion F_obs = 121.

The regression equation is significant because F_obs = 121 > F_cr = 2.85, found from the F-distribution table at α = 0.05, ν₁ = 6 and ν₂ = 14.

It follows that θ ≠ 0, i.e., at least one of the coefficients of the equation θ_j (j = 0, 1, 2, ..., 5) is not equal to zero.

To test the hypothesis of the significance of the individual regression coefficients, H₀: θ_j = 0, where j = 1, 2, 3, 4, 5, the critical value t_cr = 2.14, found from the t-distribution table at significance level α = 2Q = 0.05 and ν = 14 degrees of freedom, is compared with the calculated values. It follows from the equation that the regression coefficient is statistically significant only at x(4), since |t₄| = 2.90 > t_cr = 2.14.



Note also the negative signs of the regression coefficients at x(1) and x(5). From these negative values it would follow that an increase in the saturation of agriculture with wheeled tractors (x(1)) and plant protection products (x(5)) negatively affects the yield. Thus, the resulting regression equation is unacceptable.

To obtain a regression equation with significant coefficients, we use a stepwise regression analysis algorithm, starting with the stepwise algorithm with elimination of variables.

We exclude from the model the variable x(1), which corresponds to the minimum absolute value |t₁| = 0.01, and construct the regression equation again for the remaining variables.

The resulting equation is significant because F_obs = 155 > F_cr = 2.90, found at significance level α = 0.05 and degrees of freedom ν₁ = 5 and ν₂ = 15 from the F-distribution table, i.e., the vector θ ≠ 0. However, only the regression coefficient at x(4) is significant in the equation: the calculated values |t_j| for the other coefficients are less than t_cr = 2.131, found from the t-distribution table at α = 2Q = 0.05 and ν = 15.

We exclude from the model the variable x(3), which corresponds to the minimum value t₃ = 0.35, and obtain the regression equation:

(2.9)

In the resulting equation, the coefficient at x(5) is not statistically significant and cannot be interpreted economically. Excluding x(5), we obtain the regression equation:

(2.10)

We have obtained a significant regression equation with significant and interpretable coefficients.

However, the resulting equation is not the only “good” or “best” yield model in our example.

Let us show that under multicollinearity the stepwise algorithm with inclusion of variables is more efficient. At the first step, the variable x(4) enters the yield model y, since it has the highest correlation coefficient with the explained variable: r(y, x(4)) = 0.58. At the second step, including in the equation, along with x(4), the variable x(1) or x(3), we obtain models that surpass (2.10) both for economic reasons and by statistical characteristics:

(2.11)

(2.12)

The inclusion of any of the three remaining variables in the equation worsens its properties; see, for example, equation (2.9).

Thus, we have three “good” yield models, from which one must be chosen for economic and statistical reasons.

By statistical criteria, model (2.11) is the most adequate: it has the minimum values of the residual variance (s² = 2.26) and of the average relative approximation error, and the largest values of R² and F_obs = 273.

Model (2.12) has somewhat worse adequacy indicators, followed by model (2.10).

We now choose the better of models (2.11) and (2.12). These models differ in the variables x(1) and x(3). However, in yield models the variable x(1) (the number of wheeled tractors per 100 ha) is preferable to the variable x(3) (the number of surface tillage implements per 100 ha), which is somewhat secondary (derived from x(1)).

For this reason, on economic grounds, preference should be given to model (2.12). Thus, after implementing the stepwise regression algorithm with inclusion of variables, and taking into account that only one of the three related variables (x(1), x(2), or x(3)) should enter the equation, we choose the final regression equation:

The equation is significant at α = 0.05, because F_obs = 266 > F_cr = 3.20, found from the F-distribution table at α = Q = 0.05, ν₁ = 3 and ν₂ = 17. All regression coefficients in the equation are also significant: |t_j| > t_cr(α = 2Q = 0.05; ν = 17) = 2.11. The regression coefficient θ₁ should be recognized as significant (θ₁ ≠ 0) for economic reasons, even though t₁ = 2.09 is only slightly less than t_cr = 2.11.

It follows from the regression equation that an increase by one in the number of tractors per 100 ha of arable land (at a fixed value of x(4)) leads to an increase in grain yield by an average of 0.345 c/ha.

An approximate calculation of the elasticity coefficients, e₁ ≈ 0.068 and e₂ ≈ 0.161, shows that when the indicators x(1) and x(4) increase by 1%, the grain yield increases on average by 0.068% and 0.161%, respectively.

The multiple coefficient of determination indicates that only 46.9% of the yield variation is explained by the indicators included in the model (x(1) and x(4)), that is, by the saturation of crop production with tractors and fertilizers. The rest of the variation is due to unaccounted factors (x(2), x(3), x(5), weather conditions, etc.). The average relative approximation error, like the value of the residual variance, characterizes the adequacy of the model. When interpreting the regression equation, the values of the relative approximation errors δᵢ are of interest. Recall that the model value of the resulting indicator characterizes the average yield for the set of districts considered, provided that the explanatory variables x(1) and x(4) are fixed at the same level, namely x(1) = xᵢ(1) and x(4) = xᵢ(4). The yields can then be compared by the values of δᵢ: districts with δᵢ > 0 have an above-average yield, and those with δᵢ < 0, below average.

In our example, crop production is most efficient in the district corresponding to δ₇ = 28%, where the yield is 28% higher than the average for the region, and least efficient in the district with δ₂₀ = −27.3%.
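As an illustration of the backward-elimination procedure applied above, here is a sketch assuming statsmodels; the function name is hypothetical, and the stopping rule (drop the factor with the smallest |t| while it is below the critical value) follows the text:

```python
# Stepwise backward elimination of regressors by t-statistics.
import statsmodels.api as sm
from scipy import stats

def backward_eliminate(y, X, alpha=0.05):
    """y: Series of the response; X: DataFrame of candidate factors."""
    cols = list(X.columns)
    while True:
        fit = sm.OLS(y, sm.add_constant(X[cols])).fit()
        t = fit.tvalues.drop("const").abs()
        n, p = len(y), len(cols)
        t_crit = stats.t.ppf(1 - alpha / 2, n - p - 1)
        if t.min() >= t_crit or len(cols) == 1:
            return fit, cols                 # all remaining factors significant
        cols.remove(t.idxmin())              # drop the least significant factor
```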


Tasks and exercises

2.1. From the general population (y, x(1), ..., x(p)), where y has a normal distribution with conditional mathematical expectation and variance σ², a random sample of size n is drawn; let (yᵢ, xᵢ(1), ..., xᵢ(p)) be the result of the i-th observation (i = 1, 2, ..., n). Determine: a) the mathematical expectation of the least squares estimate of the vector θ; b) the covariance matrix of the least squares estimate of the vector θ; c) the mathematical expectation of the estimate.

2.2. Under the conditions of problem 2.1, find the mathematical expectation of the sum of squared deviations due to regression, i.e., E(Q_R), where $Q_R=\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2$.

2.3. Under the conditions of problem 2.1, determine the mathematical expectation of the sum of squared deviations due to residual variation about the regression line, i.e., E(Q_res), where $Q_{\text{res}}=\sum_{i=1}^{n}(y_i-\hat{y}_i)^2$.

2.4. Prove that under the hypothesis H₀: θ = 0 the statistic

$$F=\frac{Q_R/(p+1)}{Q_{\text{res}}/(n-p-1)}$$

has an F-distribution with ν₁ = p + 1 and ν₂ = n − p − 1 degrees of freedom.

2.5. Prove that when the hypothesis H₀: θ_j = 0 holds, the statistic $t=\hat{\theta}_j/s_{\hat{\theta}_j}$ has a t-distribution with ν = n − p − 1 degrees of freedom.

2.6. Based on the data (Table 2.3) on the dependence of the shrinkage of fodder bread (y) on the duration of storage (x), find a point estimate of the conditional mathematical expectation under the assumption that the general regression equation is linear.

Table 2.3.

It is required: a) to find the estimates and the residual variance s² under the assumption that the general regression equation is linear; b) to check at α = 0.05 the significance of the regression equation, i.e., the hypothesis H₀: θ = 0; c) with reliability γ = 0.9 to determine interval estimates of the parameters θ₀, θ₁; d) with reliability γ = 0.95 to determine an interval estimate of the conditional expectation at x₀ = 6; e) to determine at γ = 0.95 the confidence interval of prediction at the point x = 12.

2.7. Based on the data on the dynamics of the growth rate of a share price over 5 months, given in Table 2.4,

Table 2.4.

(columns: month x; growth rate y, %; the data themselves are not reproduced here)

and the assumption that the general regression equation is linear, it is required: a) to determine the estimates of the parameters of the regression equation and the residual variance s²; b) to check at α = 0.01 the significance of the regression coefficient, i.e., the hypothesis H₀: θ₁ = 0; c) with reliability γ = 0.95 to find interval estimates of the parameters θ₀ and θ₁; d) with reliability γ = 0.9 to establish an interval estimate of the conditional mathematical expectation at x₀ = 4; e) to determine at γ = 0.9 the confidence interval of prediction at the point x = 5.

2.8. The results of the study of the dynamics of weight gain in young animals are given in Table 2.5.

Table 2.5.

Assuming that the general regression equation is linear, it is required: a) to determine the estimates of the parameters of the regression equation and the residual variance s²; b) to check at α = 0.05 the significance of the regression equation, i.e., the hypothesis H₀: θ = 0; c) with reliability γ = 0.8 to find interval estimates of the parameters θ₀ and θ₁; d) with reliability γ = 0.98 to determine and compare interval estimates of the conditional mathematical expectation at x₀ = 3 and x₁ = 6; e) to determine at γ = 0.98 the confidence interval of prediction at the point x = 8.

2.9. The cost (y) of one copy of a book, depending on the print run (x) (thousand copies), is characterized by data collected by the publisher (Table 2.6). Determine the least squares estimates of the parameters of the hyperbolic regression equation; with reliability γ = 0.9 build confidence intervals for the parameters θ₀ and θ₁, as well as for the conditional mathematical expectation at x = 10.

Table 2.6.

Determine the estimates of the parameters of a regression equation of the given type at x = 20.

2.11. Table 2.8 reports the growth rates (%) of the following macroeconomic indicators of n = 10 developed countries for 1992: GNP, x(1); industrial production, x(2); price index, x(3).

Table 2.8.

It is required: a) to determine the estimates of the parameters of the regression equation and the estimate of the residual variance; b) to check at α = 0.05 the significance of the regression coefficient, i.e., H₀: θ₁ = 0; c) with reliability γ = 0.9 to find interval estimates of θ₀ and θ₁; d) to find at γ = 0.95 the confidence interval at the point x₀ = xᵢ, where i = 5; e) to compare the statistical characteristics of regression equations 1, 2, and 3.

2.12. Solve problem 2.11, taking the index x(1) as the explained variable (y) and the index x(3) as the explanatory variable (x).



APPENDICES

Appendix 1. Options for tasks for independent computer research.
