Construction of a matrix of pair correlation coefficients

[Table of initial observations with columns t, y(t), Z1(t), Z2(t); the values are not reproduced in this extract.]

The main task in selecting the factors to include in a correlation model is to bring into the analysis all the principal factors influencing the level of the phenomenon under study. However, introducing a large number of factors into the model is impractical; it is more correct to select a relatively small number of main factors that are presumed to be correlated with the chosen performance indicator.

This can be done using so-called two-stage selection. Under this approach, all pre-selected factors are first included in the model. Then, on the basis of a special quantitative assessment and additional qualitative analysis, insignificant factors are identified and gradually discarded, until only those remain for which it can be argued that the available statistical material is consistent with the hypothesis of their joint significant influence on the dependent variable under the chosen form of relationship.

Two-stage selection finds its fullest expression in the technique of so-called multi-step regression analysis, in which unimportant factors are eliminated on the basis of indicators of their significance, in particular the value of t_f, the calculated value of Student's test.

Let us calculate t_f using the found pair correlation coefficients and compare them with the critical value t_cr for a 5% significance level (two-sided) and 18 degrees of freedom (ν = n - 2):

t_f = r √(n - 2) / √(1 - r²),

where r is the value of the pair correlation coefficient;

n is the number of observations (n = 20).

Comparing t_f for each coefficient with t_cr = 2.101, we find that all the found coefficients are considered significant, because t_f > t_cr.

t_f for r_yx1 = 2.5599;

t_f for r_yx2 = 7.064206;

t_f for r_yx3 = 2.40218;

t_f for r_x1x2 = 4.338906;

t_f for r_x1x3 = 15.35065;

t_f for r_x2x3 = 4.749981.
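This check can be sketched in code; a minimal illustration, not tied to any particular dataset (the critical value 2.101 is the one quoted in the text):

```python
import math

def t_observed(r: float, n: int) -> float:
    """Observed Student's t statistic for a pair correlation
    coefficient r estimated from n observations (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Critical value for a 5% two-sided test with 18 degrees of freedom,
# as quoted in the text (t_cr = 2.101).
T_CR = 2.101

def is_significant(r: float, n: int = 20) -> bool:
    """True if the coefficient is significant at the 5% level."""
    return abs(t_observed(r, n)) > T_CR

print(is_significant(0.7327))  # a coefficient this large passes the test at n = 20
```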

When selecting factors to be included in the analysis, specific requirements are imposed on them. First of all, indicators expressing these factors must be quantitatively measurable.

The factors included in the model should not be in a functional or close relationship with each other. The presence of such relationships is characterized by multicollinearity.

Multicollinearity indicates that some factors characterize one and the same aspect of the phenomenon under study, so their simultaneous inclusion in the model is inappropriate: they duplicate each other to a certain extent. If there are no special substantive considerations in favor of one of these factors, preference should be given to the one with the larger pairwise (or partial) correlation coefficient.

It is believed that the maximum permissible value of the correlation coefficient between two factor variables is 0.8.

Multicollinearity usually leads to degeneracy of the matrix of variables and, consequently, to the main determinant decreasing in value and, in the limit, approaching zero. Estimates of the coefficients of the regression equation then become highly sensitive to the accuracy of the source data and change sharply when the number of observations changes.

Matrix of pair correlation coefficients

      Y         X1        X2        X3        X4        X5
Y     1
X1    0.732705  1
X2    0.785156  0.706287  1
X3    0.179211  -0.29849  0.208514  1
X4    0.667343  0.924333  0.70069   0.299583  1
X5    0.709204  0.940488  0.691809  0.326602  0.992945  1

The cells of the matrix contain pair correlation coefficients characterizing the closeness of the relationship between the variables. Analyzing these coefficients, we note that the greater their absolute value, the greater the influence the corresponding factor variable has on the resultant one. The analysis of the resulting matrix is carried out in two stages:

1. If the first column of the matrix contains correlation coefficients with |r| < 0.5, the corresponding variables are excluded from the model. In this case, the first column of the matrix of correlation coefficients leads to the exclusion of factor X3, the inflation level growth coefficient: this factor has less influence on the resultant variable than the remaining four.

2. Analyzing paired correlation coefficients of factor characteristics with each other, (r XiXj), characterizing the closeness of their relationship, it is necessary to assess their independence from each other, since this is a necessary condition for further regression analysis. In view of the fact that there are no absolutely independent characteristics in economics, it is necessary to highlight, if possible, the most independent ones. Factor characteristics that are closely correlated with each other are called multicollinear. The inclusion of multicollinear characteristics in the model makes it impossible to interpret the regression model economically, since a change in one factor entails a change in the factors associated with it, which can lead to a “breakdown” of the model as a whole.

The criterion for multicollinearity of factors is as follows:

|r_XiXj| > 0.8

In the resulting matrix of pair correlation coefficients, this criterion is met by the coefficients relating X1, X4 and X5 (r_X1X4 = 0.924, r_X1X5 = 0.940, r_X4X5 = 0.993). Of each such pair of factors, the one with the greater impact on the resultant variable is kept in the model. As a result, factors X4 and X5 are excluded from the model, i.e. the growth rate of the cost of goods sold and the growth rate of its sales volume.
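The two screening stages can be sketched in code. The matrix values below are taken from the table above; the 0.5 and 0.8 thresholds are the ones stated in the text:

```python
# Pair correlations of each factor with Y (first column of the matrix).
r_with_y = {"X1": 0.732705, "X2": 0.785156, "X3": 0.179211,
            "X4": 0.667343, "X5": 0.709204}

# Pair correlations between factors (upper triangle of the matrix).
r_between = {("X1", "X2"): 0.706287, ("X1", "X3"): -0.29849,
             ("X1", "X4"): 0.924333, ("X1", "X5"): 0.940488,
             ("X2", "X3"): 0.208514, ("X2", "X4"): 0.70069,
             ("X2", "X5"): 0.691809, ("X3", "X4"): 0.299583,
             ("X3", "X5"): 0.326602, ("X4", "X5"): 0.992945}

# Stage 1: drop factors weakly correlated with Y (|r| < 0.5).
kept = {f for f, r in r_with_y.items() if abs(r) >= 0.5}

# Stage 2: among the remaining factors, find multicollinear pairs
# (|r| > 0.8) and keep the member more strongly correlated with Y.
for (a, b), r in sorted(r_between.items(), key=lambda kv: -abs(kv[1])):
    if a in kept and b in kept and abs(r) > 0.8:
        kept.discard(a if abs(r_with_y[a]) < abs(r_with_y[b]) else b)

print(sorted(kept))  # ['X1', 'X2']
```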

So, we introduce factors X1 and X2 into the regression model.

Next, regression analysis is carried out (in Excel: Data Analysis, Regression). A table of initial data is again compiled, now with factors X1 and X2. Regression analysis in general studies the influence of independent variables (factors) on a dependent variable and allows the correlation between characteristics to be represented as a functional dependence called a regression equation, or correlation-regression model.

As a result of regression analysis, we obtain the results of calculating multivariate regression. Let's analyze the results obtained.

All regression coefficients are significant according to Student's t-test. The multiple correlation coefficient R is 0.925; the square of this value, the coefficient of determination, means that on average 85.5% of the variation in the resultant variable is explained by the variation in the factors included in the model. The coefficient of determination characterizes the closeness of the relationship between the set of factor variables and the performance indicator. The closer the R-squared value is to 1, the stronger the relationship. In our case, the value of 0.855 indicates a correct selection of factors and the presence of a relationship between the factors and the performance indicator.

The model under consideration is adequate, since the calculated value of Fisher's F-test significantly exceeds its tabulated value (F obs =52.401; F tab =1.53).

The general result of the correlation and regression analysis is a multiple regression equation, which has the form:

Y = 0.95*X1 + 27.9*X2 - 19.4

The resulting regression equation meets the purpose of correlation-regression analysis and is a linear model of the dependence of the enterprise's balance sheet profit on two factors: the labor productivity growth coefficient and the industrial property coefficient.

Based on the obtained model, we can conclude that with an increase in the level of labor productivity by 1% compared to the level of the previous period, the amount of balance sheet profit will increase by 0.95 percentage points; an increase in the coefficient of industrial property by 1% will lead to an increase in the effective indicator by 27.9 percentage points. Consequently, the dominant influence on the growth of balance sheet profit is exerted by the increase in the value of property for production purposes (renewal and growth of fixed assets of the enterprise).

Using the multiple regression model, a multifactor forecast of the resultant variable is performed. Let it be known that X1 = 3.0 and X2 = 0.7. Substituting these values of the factor variables into the model, we get Y = 0.95*3.0 + 27.9*0.7 - 19.4 = 2.98. Thus, with an increase in labor productivity and modernization of fixed assets at the enterprise, balance sheet profit in the first quarter of 2005, compared with the previous period (fourth quarter of 2004), will increase by 2.98%.
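The arithmetic of this forecast is a one-line sketch (the second regressor is written here as x2, the second of the two factors kept in the model; the coefficients are the ones quoted in the text):

```python
def forecast_profit(x1: float, x2: float) -> float:
    """Point forecast from the fitted two-factor model
    y = -19.4 + 0.95*x1 + 27.9*x2 (coefficients as quoted in the text)."""
    return -19.4 + 0.95 * x1 + 27.9 * x2

print(round(forecast_profit(3.0, 0.7), 2))  # 2.98
```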

Economic data represents quantitative characteristics of any economic objects or processes. They are formed under the influence of many factors, not all of which are accessible to external control. Uncontrollable factors can take on random values ​​from some set of values ​​and thereby cause the data they define to be random. One of the main tasks in economic research is analysis of dependencies between variables.

When considering the dependencies between characteristics, it is necessary to distinguish, first of all, two types of connections:

  • functional - characterized by complete correspondence between a change in the factor variable and a change in the resultant one: each value of the factor variable corresponds to quite definite values of the resultant variable. This type of relationship is expressed by a formula and can link the resultant variable with one or several factor variables. Thus, the amount of wages under time-based pay depends on the number of hours worked;
  • correlational - there is no complete correspondence between the changes in the two variables; the impact of individual factors manifests itself only on average, over mass observation of actual data. The simultaneous impact of a large number of different factors on the variable under study leads to one and the same value of the factor variable corresponding to a whole distribution of values of the resultant variable, since in each specific case the other factor variables can change the strength and direction of their impact.

It should be borne in mind that if there is a functional relationship between the variables, then, knowing the value of the factor variable, one can determine exactly the value of the resultant one. In the presence of a correlation dependence, only the tendency of change in the resultant variable can be established when the factor variable changes.

When studying the relationships between signs, they are classified according to direction, form, number of factors:

  • by direction, relationships are divided into direct and inverse. With a direct relationship, the resultant variable changes in the same direction as the factor variable; with an inverse relationship, in the opposite direction. For example, the higher the qualifications of a worker, the higher the productivity of his labor (a direct relationship); the higher the labor productivity, the lower the cost per unit of production (an inverse relationship);
  • by form (type of function), relationships are divided into linear (straight-line) and nonlinear (curvilinear). A linear relationship is represented by a straight line, a nonlinear one by a curve (parabola, hyperbola, etc.). In a linear relationship, an increase in the factor variable produces a uniform increase (or decrease) in the resultant variable;
  • by the number of factors acting on the resultant variable, relationships are divided into single-factor (paired) and multifactor.

The study of the dependence of trait variation on environmental conditions is the content of correlation theory.

When conducting correlation analysis, the entire set of data is considered as a set of variables (factors), each of which contains n observations.

When studying the relationship between two factors, they are usually denoted X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn).

Covariance is a statistical measure of the interaction of two variables. For example, a positive covariance of the returns of two securities indicates that the returns of these securities tend to move in the same direction.

The covariance between two variables X and Y is calculated as follows:

cov(X, Y) = Σ (xi - x̄)(yi - ȳ) / (n - 1),    (3.2.1)

where xi, yi are the actual values of the variables X and Y; x̄, ȳ are their sample means.

If the random variables X and Y are independent, the theoretical covariance is zero.

Covariance depends on the units in which the variables X and Y are measured; it is a non-standardized quantity. Therefore, to measure the strength of the relationship between two variables, another statistic, called the correlation coefficient, is used.

For two variables X and Y the pair correlation coefficient is defined as follows:

r_xy = cov(X, Y) / (s_x s_y) = Σ (xi - x̄)(yi - ȳ) / √( Σ (xi - x̄)² · Σ (yi - ȳ)² ),    (3.2.2)

where s_x², s_y² are the estimates of the variances of X and Y. These estimates characterize the degree of scatter of the values x1, x2, ..., xn (y1, y2, ..., yn) around their mean x̄ (ȳ, respectively), that is, the variability of these variables over the set of observations.

The dispersion (estimate of the variance) is determined by the formula

s_x² = Σ (xi - x̄)² / (n - 1).

In general, to obtain an unbiased estimate of the variance, the sum of squares should be divided by the number of degrees of freedom of the estimate, (n - p), where n is the sample size and p is the number of constraints imposed on the sample. Since the sample has already been used once to determine the mean x̄, the number of imposed constraints in this case equals one (p = 1), and the number of degrees of freedom of the estimate (i.e., the number of independent sample elements) equals (n - 1).

It is more natural to measure the degree of scatter of a variable's values in the same units in which the variable itself is measured. This is done by an indicator called the standard deviation (standard error) of the variable X (variable Y), determined by the relation

s_x = √(s_x²).

The terms in the numerator of formula (3.2.1) express the interaction of two variables and determine the sign of the correlation (positive or negative). If, for example, there is a strong positive relationship between variables (an increase in one variable while the other increases), each term will be a positive number. Likewise, if there is a strong negative relationship between variables, all terms in the numerator will be negative numbers, resulting in a negative correlation value.

The denominator of the expression for the pair correlation coefficient [see formula (3.2.2)] simply normalizes the numerator so that the correlation coefficient is an easily interpretable dimensionless number taking values from -1 to +1.

The numerator of the expression for the correlation coefficient, which is difficult to interpret because of its unusual units of measurement, is the covariance of X and Y. Although covariance is sometimes used as an independent characteristic (for example, in finance theory to describe the joint movement of stock prices on two exchanges), it is more convenient to use the correlation coefficient. Correlation and covariance carry essentially the same information, but correlation presents it in a more useful form.
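Both statistics can be computed from first principles; a minimal sketch, using the (n - 1) divisor from the variance estimate above:

```python
import math

def covariance(xs, ys):
    """Sample covariance with the (n - 1) divisor."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    return sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pair correlation coefficient: the covariance normalized by the
    standard deviations; a dimensionless number in [-1, +1]."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
print(round(correlation(xs, ys), 4))  # close to +1: a strong direct relationship
```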

To qualitatively assess the correlation coefficient, various scales are used, most often the Chaddock scale. Depending on the value of the correlation coefficient, the relationship can have one of the following ratings:

  • 0.1-0.3 - weak;
  • 0.3-0.5 - moderate;
  • 0.5-0.7 - noticeable;
  • 0.7-0.9 - high;
  • 0.9-1.0 - very high.
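A small helper mapping |r| to a verbal rating (the label wording varies slightly across translations of the scale):

```python
def chaddock_rating(r: float) -> str:
    """Verbal rating of relationship strength by |r| (Chaddock scale)."""
    a = abs(r)
    if a > 1:
        raise ValueError("a correlation coefficient lies in [-1, 1]")
    for bound, label in [(0.3, "weak"), (0.5, "moderate"),
                         (0.7, "noticeable"), (0.9, "high")]:
        if a <= bound:
            return label
    return "very high"

print(chaddock_rating(0.785))  # high
```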

Assessment of the degree of closeness of a connection using the correlation coefficient is carried out, as a rule, on the basis of more or less limited information about the phenomenon being studied. In this regard, there is a need to assess the significance of the linear correlation coefficient, which makes it possible to extend conclusions based on the sample results to the general population.

The significance of the correlation coefficient for small sample sizes is assessed using Student's t-test. The actual (observed) value of this criterion is determined by the formula

t_obs = r √(n - 2) / √(1 - r²).

The value t_obs calculated by this formula is compared with the critical value of the t-criterion, taken from the table of Student's t-test values (see Appendix 2) for the given significance level α and (n - 2) degrees of freedom.

If t_obs > t_tab, the obtained value of the correlation coefficient is considered significant (i.e., the null hypothesis that the correlation coefficient equals zero is rejected), and it is concluded that there is a close statistical relationship between the variables under study.

If the value of r_yx is close to zero, the relationship between the variables is weak. If the correlation between the random variables is:

  • positive, then as one random variable increases, the other tends on average to increase;
  • negative, then as one random variable increases, the other tends on average to decrease.

A convenient graphical tool for analyzing paired data is the scatter plot, which represents each observation as a point in the two-dimensional space of the two factors. A scatter plot depicting the set of values of two variables is also called a correlation field. Each point of this diagram has coordinates (xi, yi). As the strength of the linear relationship increases, the points lie closer to a straight line and the magnitude of r approaches unity.

Pairwise correlation coefficients are used to measure the strength of linear relationships between different pairs of features from a set of them. For many features one gets matrix of pair correlation coefficients.

Let the entire set of data consist of a variable Y = (y1, y2, ..., yn) and m variables (factors) X, each of which contains n observations. The values of the variables Y and X contained in the observed population are recorded in a table (Table 3.2.1).

Table 3.2.1

[Layout of the initial data: the variable Y and the factors X1, ..., Xm, each with its n observed values x_ij.]

Based on the data contained in this table, the matrix of pair correlation coefficients R is calculated; it is symmetric about the main diagonal:

R = ( 1       r_yx1   ...  r_yxm
      r_x1y   1       ...  r_x1xm
      ...     ...     ...  ...
      r_xmy   r_xmx1  ...  1 )    (3.2.6)


Analysis of the matrix of pair correlation coefficients is used when constructing multiple regression models.

One correlation matrix cannot completely describe the dependencies between quantities. In this regard, multivariate correlation analysis considers two problems:

  • 1. Determination of the close relationship of one random variable with the totality of other variables included in the analysis.
  • 2. Determination of the closeness of the connection between two quantities while fixing or excluding the influence of other quantities.

These problems are solved using multiple and partial correlation coefficients, respectively.

The solution of the first problem (determining the closeness of the relationship of one random variable with the set of other variables included in the analysis) is carried out using the sample multiple correlation coefficient, according to the formula

R_j = √(1 - |R| / R_jj),

where |R| is the determinant of the matrix R [see formula (3.2.6)]; R_jj is the algebraic complement (cofactor) of the element r_jj of the same matrix R.

The squared multiple correlation coefficient R²_j(1, 2, ..., j-1, j+1, ..., m), usually called the sample multiple coefficient of determination, shows what proportion of the variation (random scatter) of the variable under study X_j is explained by the variation of the remaining random variables X1, X2, ..., Xm.

The coefficients of multiple correlation and determination are positive quantities taking values in the range from 0 to 1. When the coefficient R² approaches unity, we can conclude that the relationship between the random variables is close, but not judge its direction. The multiple correlation coefficient can only increase as additional variables are included in the model, and it will not increase when any of the existing variables are excluded.
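For the three-variable case (Y, X1, X2) the cofactor formula can be sketched directly; the illustrative inputs below are the corresponding pair coefficients from the first matrix in this section:

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix (cofactor expansion along the first row)."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def multiple_correlation(r_y1: float, r_y2: float, r_12: float) -> float:
    """Sample multiple correlation R_y(1,2) = sqrt(1 - |R| / R_yy),
    where R is the correlation matrix of (Y, X1, X2) and R_yy is the
    cofactor of its top-left element."""
    R = [[1.0, r_y1, r_y2],
         [r_y1, 1.0, r_12],
         [r_y2, r_12, 1.0]]
    cofactor_yy = R[1][1] * R[2][2] - R[1][2] * R[2][1]
    return math.sqrt(1.0 - det3(R) / cofactor_yy)

# Illustrative values: r_yx1, r_yx2, r_x1x2 from the first matrix above.
print(round(multiple_correlation(0.732705, 0.785156, 0.706287), 3))
```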

The significance of the coefficient of determination is checked by comparing the calculated value of Fisher's F-criterion

F = (R² / (1 - R²)) · (n - m - 1) / m

with the tabular value F_tab. The tabular value of the criterion (see Appendix 1) is determined by the given significance level α and the degrees of freedom ν1 = m and ν2 = n - m - 1. The coefficient R² is significantly different from zero if the inequality F > F_tab holds.

If the random variables under consideration are correlated with each other, then the value of a pair correlation coefficient is partially affected by the influence of the other variables. In this regard, there arises a need to study the partial correlation between variables while excluding the influence of other random variables (one or several).

The sample partial correlation coefficient is determined by the formula

r_jk = -R_jk / √(R_jj R_kk),

where R_jk, R_jj, R_kk are the algebraic complements (cofactors) of the corresponding elements of the matrix R [see formula (3.2.6)].

The partial correlation coefficient, as well as the pair correlation coefficient, varies from -1 to +1.

For m = 3, expression (3.2.9) takes the form

r_12(3) = (r_12 - r_13 r_23) / √((1 - r_13²)(1 - r_23²)).

The coefficient r_12(3) is called the correlation coefficient between x1 and x2 for fixed x3. It is symmetric with respect to its primary indices 1, 2; its secondary index 3 refers to the fixed variable.
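The three-variable partial correlation is a one-liner; the illustrative inputs are again pair coefficients from the first matrix in this section (correlation of X1 and X2 with Y held fixed):

```python
import math

def partial_correlation(r12: float, r13: float, r23: float) -> float:
    """Partial correlation r_12(3): correlation between x1 and x2
    with the influence of x3 excluded (held fixed)."""
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

# r_x1x2 = 0.706287, r_x1y = 0.732705, r_x2y = 0.785156
print(round(partial_correlation(0.706287, 0.732705, 0.785156), 3))
```

Note how the coefficient drops from 0.706 to about 0.31 once the common influence of the third variable is excluded.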

Example 3.2.1. Calculation of pair, multiple, and partial correlation coefficients.

Table 3.2.2 provides information on the sales volumes and advertising costs of a company, as well as the consumer spending index, for a number of recent years.

  • 1. Construct a scatter diagram (correlation field) for the variables “sales volume” and “consumer spending index”.
  • 2. Determine the degree of influence of the consumer spending index on sales volume (calculate the pair correlation coefficient).
  • 3. Assess the significance of the calculated pair correlation coefficient.
  • 4. Construct a matrix of pairwise correlation coefficients for three variables.
  • 5. Find an estimate of the multiple correlation coefficient.
  • 6. Find estimates of partial correlation coefficients.

1. In our example, the scatter diagram has the form shown in Fig. 3.2.1. The elongation of the cloud of points along an inclined line allows us to assume that there is some objective tendency toward a direct linear relationship between the values of the variables X2 (consumer spending index) and Y (sales volume).

Fig. 3.2.1.

2. Intermediate calculations for the correlation coefficient between the variables X2 (consumer spending index) and Y (sales volume) are given in Table 3.2.3.

The average values of the random variables X2 and Y, the simplest indicators characterizing the sequences x1, x2, ..., x16 and y1, y2, ..., y16, are calculated using the formulas

x̄ = (1/16) Σ xi,   ȳ = (1/16) Σ yi.


Table 3.2.2
[Columns: Sales volume Y, thousand rubles; Consumer spending index. Observed values not reproduced in this extract.]

Table 3.2.3
[Columns of intermediate calculations: xi - x̄; (yi - ȳ)(xi - x̄); (xi - x̄)²; (yi - ȳ)².]

The dispersion characterizes the degree of scatter of the values x1, x2, ..., x16:

Let us now consider the solution to example 3.2.1 in Excel.

To calculate the correlation in Excel, you can use the =CORREL() function, specifying the addresses of two columns of numbers, as shown in Fig. 3.2.2. The answer is placed in cell D8 and equals 0.816.

Fig. 3.2.2.

(Note: the arguments of CORREL must be numbers, or names, arrays, or references containing numbers. If an argument that is an array or reference contains text, logical values, or empty cells, those values are ignored; cells containing zeros, however, are counted.

If array1 and array2 have different numbers of data points, CORREL returns the error value #N/A.

If array1 or array2 is empty, or if σ (the standard deviation) of their values equals zero, CORREL returns the error value #DIV/0!.)

The critical value of Student's t-statistic can also be obtained using Excel's STUDRASPROBR function¹. As the function's arguments, specify the number of degrees of freedom, n - 2 (in our example 16 - 2 = 14), and the significance level α (in our example α = 0.1) (Fig. 3.2.3). If the actual value of the t-statistic, taken in absolute value, is greater than the critical one, then with probability (1 - α) the correlation coefficient is significantly different from zero.


Fig. 3.2.3. The critical value of the t-statistic is 1.7613

Excel includes a set of data analysis tools (the so-called Analysis ToolPak) designed for solving various statistical problems. To calculate the matrix of pair correlation coefficients R, use the Correlation tool (Fig. 3.2.4) and set the analysis parameters in the corresponding dialog box. The answer will be placed on a new worksheet (Fig. 3.2.5).

¹ In Excel 2010, the function STUDRASPROBR was renamed STUDENT.OBR.2X.

Fig. 3.2.4.


Fig. 3.2.5.

  • The founders of the theory of correlation are considered to be the English statisticians F. Galton (1822-1911) and K. Pearson (1857-1936). The term “correlation” was borrowed from natural science and means “correlation, correspondence.” The idea of ​​correlation as interdependence between random variables underlies the mathematical-statistical theory of correlation.

Data for 2011 are provided for the territories of the Southern Federal District of the Russian Federation

Territories of the Federal District

Gross regional product, billion rubles, Y

Investments in fixed assets, billion rubles, X1

1. Rep. Adygea

2. Rep. Dagestan

3. Rep. Ingushetia

4. Kabardino-Balkarian Republic

5. Rep. Kalmykia

6. Karachay-Cherkess Republic

7. Rep. North Ossetia Alania

8. Krasnodar region

9. Stavropol region

10. Astrakhan region.

11. Volgograd region.

12. Rostov region.

  • 1. Calculate the matrix of pair correlation coefficients; evaluate the statistical significance of the correlation coefficients.
  • 2. Construct a field of correlation between the effective characteristic and the factor most closely related to it.
  • 3. Calculate the parameters of linear pair regression for each factor X.
  • 4. Assess the quality of each model through the coefficient of determination, average error of approximation and Fisher's F test. Choose the best model.

  • 5. For the best model, predict the average value of the indicator, given that the forecast value of the factor will be 80% of its maximum value. Present graphically: actual and model values, forecast points.

  • 6. Using step-by-step multiple regression (exclusion method or inclusion method), build a model of apartment price formation due to significant factors. Give an economic interpretation of the regression model coefficients.
  • 7. Evaluate the quality of the constructed model. Has the quality of the model improved compared to the single-factor model? Assess the influence of significant factors on the result using the coefficients of elasticity and the beta and delta coefficients.

When solving this problem, we will carry out calculations and construct graphs and diagrams using the Excel Data Analysis settings.

1. Calculate the matrix of pair correlation coefficients and evaluate the statistical significance of the correlation coefficients

In the Correlation dialog box, in the Input interval field, enter the range of cells containing the source data. Since we have also selected the column headings, we check the Labels checkbox in the first row.

We got the following results:

Table 1.1 Matrix of pair correlation coefficients

Analysis of the matrix of pair correlation coefficients shows that the dependent variable Y, i.e. gross regional product, has the closest relationship with X1 (investment in fixed capital): the correlation coefficient is 0.936, which indicates a very high degree of linear relationship between the two indicators.
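A note on interpretation: the share of variation explained is the square of the correlation coefficient, not the coefficient itself. A quick check:

```python
r = 0.936  # pair correlation between Y and X1, from the matrix above

# The share of variation in Y linearly explained by X1 is r squared.
print(round(r ** 2, 3))  # 0.876, i.e. about 87.6%
```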

We will determine the statistical significance of the correlation coefficients using Student's t-test. We compare the table value with the calculated values.

Let's calculate the tabular value using the STUDRASPROBR function.

t table = 0.129 with a confidence level of 0.9 and degrees of freedom (n-2).

Factor X1 is statistically significant.

2. Let’s construct a field of correlation between the effective attribute (gross regional product) and the factor most closely related to it (investment in fixed capital)

To do this, we will use the Excel scatter plot tool.

As a result, we obtain the correlation field of gross regional product, billion rubles, and investment in fixed assets, billion rubles (Figure 1.1).

Figure 1.1

3. Calculate the parameters of linear pair regression for each factor X

To calculate the parameters of linear pairwise regression, we will use the Regression tool included in the Data Analysis setting.

In the Regression dialog box, in the Input Range Y field, enter the address of the range of cells representing the dependent variable. In the Input Range X field, enter the address of the range containing the values of the independent variables. Let us calculate the parameters of paired regression for factor X1.

For X1 we received the following data presented in Table 1.2:

Table 1.2

The regression equation for the dependence of gross regional product on investment in fixed capital has the form:

Y = 5.07 + 2.14*X1

4. Let's evaluate the quality of each model through the coefficient of determination, average error of approximation and Fisher's F-test. Let's determine which model is the best.

We obtained the coefficient of determination, the average error of approximation, as a result of the calculations carried out in paragraph 3. The data obtained are presented in the following tables:

X1 data:

Table 1.3a

Table 1.4b

A) The coefficient of determination determines what proportion of the variation of trait Y is taken into account in the model and is due to the influence of factor X on it. The greater the value of the coefficient of determination, the closer the connection between the characteristics in the constructed mathematical model.

In the Excel output this indicator is labeled R Square.

Based on this criterion, the most adequate model is the regression equation for the dependence of gross regional product on investment in fixed capital (X1).

B) We calculate the average approximation error using the formula:

where the numerator is the sum of squares of the deviations of the calculated values from the actual ones; in the tables it is found in the SS column, Residual row.

We calculate the average value of gross regional product in Excel using the AVERAGE function: ȳ = 24.18182 billion rubles.

When carrying out economic calculations, a model is considered sufficiently accurate if the average error of approximation is less than 5%; the model is considered acceptable if the average error of approximation is less than 15%.

According to this criterion, the most adequate is the mathematical model of the regression equation for the dependence of gross regional product on investment in fixed capital (X1).

C) The F-test is used to test the significance of the regression model. To do this, a comparison is also made of the critical (tabular) values ​​of the Fisher F-test.

The calculated values ​​are given in tables 1.4b (indicated by the letter F).

We will calculate the tabular value of Fisher's F-test in Excel using the FINV function, taking the significance level equal to 0.05. We obtain: F_tab = 4.75.

The calculated values ​​of Fisher's F test for each factor are comparable to the table value:

Since 71.02 > 4.75, the model is adequate according to this criterion.

Having analyzed the data according to all three criteria, we can conclude that the best mathematical model is the one built for the factor investment in fixed capital, which is described by the linear equation

5. For the selected model of the dependence of gross regional product on investment in fixed assets, we will predict the average value of the indicator, given that the forecast value of the factor is 80% of its maximum value. We will present graphically the actual and model values and the forecast points.

Let's calculate the predicted value of X; according to the condition, it will be 80% of the maximum value.

Let's calculate X max in Excel using the MAX function.

0,8 *52,8 = 42,24

To obtain predictive estimates of the dependent variable, we substitute the obtained value of the independent variable into the linear equation:

5.07+2.14*42.24 = 304.55 billion rubles.

Let us determine the confidence interval of the forecast, which will have the following boundaries:

To calculate the confidence interval for the predicted value, we calculate the deviation from the regression line.

For a paired regression model, the deviation value is calculated:

those. standard error value from Table 1.5a.

(Since the number of degrees of freedom is equal to one, the denominator will be equal to n-2). correlation pair regression forecast

To calculate the coefficient, we will use the Excel function STUDISCOVER, take the probability equal to 0.1, and the number of degrees of freedom 38.

We calculate the value using Excel and get 12294.


Let's determine the upper and lower boundaries of the interval.

  • 304,55+27,472= 332,022
  • 304,55-27,472= 277,078

Thus, the forecast value = 304.55 thousand dollars will be between the lower limit equal to 277.078 thousand dollars. and an upper limit equal to 332.022 billion. Rub.

Actual and model values, forecast points are presented graphically in Figure 1.2.


Figure 1.2

6. Using step-by-step multiple regression (elimination method), we will build a model for the formation of the price of the gross regional product due to significant factors

To build multiple regression, we will use the Regression function of Excel, including all factors. As a result, we obtain the result tables, from which we need the Student’s t-test.

Table 1.8a

Table 1.8b

Table 1.8c.

We get a model like:

Because the< (4,75 < 71,024), уравнение регрессии следует признать адекватным.

Let's choose the smallest absolute value of the Student's t-test, it is equal to 8.427, compare it with the table value, which we calculate in Excel, take the significance level equal to 0.10, the number of degrees of freedom n-m-1=12-4=8: =1.8595

Since 8.427>1.8595 the model should be considered adequate.

7. To assess the significant factor of the resulting mathematical model, we calculate the elasticity coefficients, and - coefficients

The elasticity coefficient shows by what percentage the effective attribute will change when the factor attribute changes by 1%:

E X4 = 2.137 * (10.69/24.182) = 0.94%

That is, with an increase in investment in fixed capital of 1%, the cost on average increases by 0.94%.

The coefficient shows by what part of the standard deviation the average value of the dependent variable changes with a change in the independent variable by one standard deviation.

2,137* (14.736/33,632) = 0,936.

Standard deviation data is taken from tables obtained using the Descriptive Statistics tool.

Table 1.11 Descriptive statistics (Y)

Table 1.12 Descriptive statistics (X4)

The coefficient determines the share of the factor’s influence in the total influence of all factors:

To calculate pair correlation coefficients, we calculate the matrix of pair correlation coefficients in Excel using the Correlation tool in the Data Analysis settings.

Table 1.14

(0,93633*0,93626) / 0,87 = 1,00.

Conclusion: From the calculations obtained, we can conclude that the effective attribute Y (gross regional product) has a large dependence on factor X1 (investment in fixed capital) (by 100%).

Bibliography

  • 1. Magnus Y.R., Katyshev P.K., Peresetsky A.A. Econometrics. Beginner course. Tutorial. 2nd ed. - M.: Delo, 1998. - p. 69 - 74.
  • 2. Workshop on econometrics: Textbook / I.I. Eliseeva, S.V. Kurysheva, N.M. Gordeenko et al. 2002. - p. 49 - 105.
  • 3. Dougherty K. Introduction to econometrics: Transl. from English - M.: INFRA-M, 1999. - XIV, p. 262 - 285.
  • 4. Ayvyzyan S.A., Mikhtiryan V.S. Applied mathematics and fundamentals of econometrics. -1998., pp. 115-147.
  • 5. Kremer N.Sh., Putko B.A. Econometrics. -2007. from 175-251.
y x (1) x (2) x (3) x (4) x (5)
y 1.00 0.43 0.37 0.40 0.58 0.33
x (1) 0.43 1.00 0.85 0.98 0.11 0.34
x (2) 0.37 0.85 1.00 0.88 0.03 0.46
x (3) 0.40 0.98 0.88 1.00 0.03 0.28
x (4) 0.58 0.11 0.03 0.03 1.00 0.57
x (5) 0.33 0.34 0.46 0.28 0.57 1.00

Analysis of the matrix of paired correlation coefficients shows that the effective indicator is most closely related to the indicator x(4) - the amount of fertilizer consumed per 1 hectare ().

At the same time, the connection between the attributes-arguments is quite close. Thus, there is a practically functional relationship between the number of wheeled tractors ( x(1)) and the number of surface tillage tools .

The presence of multicollinearity is also indicated by the correlation coefficients and . Considering the close relationship between the indicators x (1) , x(2) and x(3), only one of them can be included in the yield regression model.

To demonstrate the negative impact of multicollinearity, consider a regression model of yield, including all input indicators:

F obs = 121.

The values ​​of the corrected estimates of the standard deviations of the estimates of the coefficients of the equation are indicated in parentheses .

The following adequacy parameters are presented under the regression equation: multiple coefficient of determination; corrected estimate of the residual variance, average relative error of approximation and calculated value of the criterion F obs = 121.

The regression equation is significant because F obs = 121 > F kp = 2.85 found from the table F-distributions at a=0.05; n 1 =6 and n 2 =14.

It follows from this that Q¹0, i.e. and at least one of the coefficients of the equation q j (j= 0, 1, 2, ..., 5) is not zero.

To test the hypothesis about the significance of individual regression coefficients H0: q j =0, where j=1,2,3,4,5, compare the critical value t kp = 2.14, found from the table t-distributions at significance level a=2 Q=0.05 and the number of degrees of freedom n=14, with the calculated value . It follows from the equation that the regression coefficient is statistically significant only when x(4) since ½ t 4 ½=2.90 > t kp =2.14.



Negative signs of regression coefficients do not lend themselves to economic interpretation when x(1) and x(5) . From the negative values ​​of the coefficients it follows that the increase in the saturation of agriculture with wheeled tractors ( x(1)) and plant health products ( x(5)) has a negative effect on yield. Therefore, the resulting regression equation is unacceptable.

To obtain a regression equation with significant coefficients, we use a step-by-step regression analysis algorithm. Initially, we use a step-by-step algorithm with the elimination of variables.

Let's exclude the variable from the model x(1) , which corresponds to the minimum absolute value of ½ t 1 ½=0.01. For the remaining variables, we again construct the regression equation:

The resulting equation is significant because F observed = 155 > F kp = 2.90, found at the significance level a = 0.05 and the numbers of degrees of freedom n 1 = 5 and n 2 = 15 according to the table F-distribution, i.e. vector q¹0. However, only the regression coefficient at x(4) . Estimated values ​​½ t j ½ for other coefficients is less t kr = 2.131, found from the table t-distributions at a=2 Q=0.05 and n=15.

By excluding the variable from the model x(3) , which corresponds to the minimum value t 3 =0.35 and we get the regression equation:

(2.9)

In the resulting equation, the coefficient at x(5) . By excluding x(5) we obtain the regression equation:

(2.10)

We obtained a significant regression equation with significant and interpretable coefficients.

However, the resulting equation is not the only “good” and not the “best” yield model in our example.

Let's show that in the multicollinearity condition, a stepwise algorithm with the inclusion of variables is more efficient. The first step in the yield model y variable included x(4) , which has the highest correlation coefficient with y, explained by the variable - r(y,x(4))=0.58. In the second step, including the equation along with x(4) variables x(1) or x(3), we will obtain models that, for economic reasons and statistical characteristics, exceed (2.10):

(2.11)

(2.12)

Including any of the three remaining variables in the equation worsens its properties. See, for example, equation (2.9).

Thus, we have three “good” yield models, from which we need to choose one for economic and statistical reasons.

According to statistical criteria, model (2.11) is most adequate. It corresponds to the minimum values ​​of residual variance = 2.26 and the average relative error of approximation and the largest values ​​and Fob = 273.

Model (2.12) has slightly worse adequacy indicators, followed by model (2.10).

We will now choose the best of models (2.11) and (2.12). These models differ from each other in terms of variables x(1) and x(3) . However, in yield models the variable x(1) (number of wheeled tractors per 100 ha) is more preferable than variable x(3) (number of surface tillage implements per 100 ha), which is to some extent secondary (or derived from x (1)).

In this regard, for economic reasons, preference should be given to model (2.12). Thus, after implementing the stepwise regression analysis algorithm with the inclusion of variables and taking into account the fact that only one of the three related variables should enter the equation ( x (1) , x(2) or x(3)) choose the final regression equation:

The equation is significant at a=0.05, because F obs = 266 > F kp = 3.20, found from the table F-distributions at a= Q=0.05; n 1 =3 and n 2 =17. All regression coefficients in equation ½ are also significant t j½> t kp(a=2 Q=0.05; n=17)=2.11. The regression coefficient q 1 should be considered significant (q 1 ¹0) for economic reasons, while t 1 =2.09 only slightly less t kp = 2.11.

From the regression equation it follows that an increase by one in the number of tractors per 100 hectares of arable land (at a fixed value x(4)) leads to an increase in grain yields by an average of 0.345 c/ha.

An approximate calculation of the elasticity coefficients e 1 »0.068 and e 2 »0.161 shows that with increasing indicators x(1) and x(4) by 1%, grain yield increases on average by 0.068% and 0.161%, respectively.

The multiple coefficient of determination indicates that only 46.9% of the yield variation is explained by the indicators included in the model ( x(1) and x(4)), that is, the saturation of crop production with tractors and fertilizers. The rest of the variation is due to the action of unaccounted factors ( x (2) , x (3) , x(5), weather conditions, etc.). The average relative error of approximation characterizes the adequacy of the model, as well as the value of the residual variance. When interpreting the regression equation, the values ​​of the relative errors of approximation are of interest . Let us recall that - the model value of the effective indicator characterizes the average yield value for the totality of the regions under consideration, provided that the values ​​of the explanatory variables x(1) and x(4) are fixed at the same level, namely x (1) = x i(1) and x (4) = xi(4) . Then, according to the values ​​of d i You can compare regions by yield. Areas to which d values ​​correspond i>0, have above average yield, and d i<0 - ниже среднего.

In our example, in terms of yield, crop production is most effective in the area corresponding to d 7 =28%, where the yield is 28% higher than the regional average, and the least effective is in the area with d 20 =-27,3%.


Tasks and exercises

2.1. From the general population ( y, x (1) , ..., x(p)), where y has a normal distribution law with conditional mathematical expectation and variance s 2, a random sample of n, let it go ( y i, x i (1) , ..., x i(p)) - result i th observation ( i=1, 2, ..., n). Determine: a) the mathematical expectation of the least squares estimate of the vector q; b) covariance matrix of the least squares estimate of the vector q; c) mathematical expectation of the assessment.

2.2. According to the conditions of problem 2.1, find the mathematical expectation of the sum of squared deviations due to regression, i.e. EQ R, Where

.

2.3. According to the conditions of problem 2.1, determine the mathematical expectation of the sum of squared deviations caused by the residual variation relative to the regression lines, i.e. EQ ost, where

2.4. Prove that when hypothesis H 0 is fulfilled: q=0 statistics

has an F-distribution with degrees of freedom n 1 =p+1 and n 2 =n-p-1.

2.5. Prove that when the hypothesis H 0: q j =0 is fulfilled, the statistics has a t-distribution with the number of degrees of freedom n=n-p-1.

2.6. Based on the data (Table 2.3) on the dependence of the shrinkage of fodder bread ( y) on storage duration ( x) find a point estimate of the conditional expectation under the assumption that the general regression equation is linear.

Table 2.3.

Required: a) find estimates of the residual variance s 2 under the assumption that the general regression equation has the form ; b) check at a=0.05 the significance of the regression equation, i.e. hypothesis H 0: q=0; c) with reliability g=0.9, determine interval estimates of parameters q 0, q 1; d) with reliability g=0.95, determine the interval estimate of the conditional mathematical expectation at X 0 =6; e) determine at g=0.95 the confidence interval of the prediction at the point X=12.

2.7. Based on data on the dynamics of the growth rate of stock prices for 5 months, given in table. 2.4.

Table 2.4.

months ( x)
y (%)

and the assumption that the general regression equation has the form , it is required: a) determine estimates of both the parameters of the regression equation and the residual variance s 2 ; b) check at a=0.01 the significance of the regression coefficient, i.e. hypotheses H 0: q 1 =0;

c) with reliability g=0.95, find interval estimates of the parameters q 0 and q 1; d) with reliability g=0.9, establish an interval estimate of the conditional mathematical expectation at x 0 =4; e) determine at g=0.9 the confidence interval of the prediction at the point x=5.

2.8. The results of the study of the dynamics of weight gain of young animals are given in Table 2.5.

Table 2.5.

Assuming that the general regression equation is linear, it is required: a) determine estimates of both the parameters of the regression equation and the residual variance s 2 ; b) check at a=0.05 the significance of the regression equation, i.e. hypotheses H 0: q=0;

c) with reliability g=0.8, find interval estimates of the parameters q 0 and q 1; d) with reliability g=0.98, determine and compare interval estimates of the conditional mathematical expectation at x 0 =3 and x 1 =6;

e) determine at g=0.98 the confidence interval of the prediction at the point x=8.

2.9. Cost ( y) one copy of the book depending on the circulation ( x) (thousand copies) is characterized by data collected by the publishing house (Table 2.6). Determine least squares estimates and parameters of a hyperbolic regression equation, with reliability g=0.9, construct confidence intervals for parameters q 0 and q 1, as well as the conditional expectation at x=10.

Table 2.6.

Determine the estimates and parameters of the regression equation of the form , test the hypothesis H 0 at a = 0.05: q 1 = 0 and construct confidence intervals with reliability g = 0.9 for the parameters q 0 and q 1 and the conditional mathematical expectation at x=20.

2.11. In table 2.8 presented data on growth rates (%) of the following macroeconomic indicators n=10 developed countries of the world for 1992: GNP - x(1) , industrial production - x(2) , price index - x (3) .

Table 2.8.

Countries x and parameters of the regression equation, estimation of residual variance; b) check at a=0.05 the significance of the regression coefficient, i.e. H 0: q 1 =0; c) with reliability g=0.9, find interval estimates q 0 and q 1; d) find at g=0.95 the confidence interval for at the point X 0 =x i, Where i=5; e) compare the statistical characteristics of regression equations: 1, 2 and 3.

2.12. Solve Problem 2.11 by taking ( at) index x(1) , and for the explanatory ( X) variable x (3) .

1. Ayvazyan S.A., Mkhitaryan V.S. Applied statistics and fundamentals of econometrics: Textbook. M., UNITY, 1998 (2nd edition 2001);

2. Ayvazyan S.A., Mkhitaryan V.S. Applied statistics in problems and exercises: Textbook. M. UNITY - DANA, 2001;

3. Ayvazyan S.A., Enyukov I.S., Meshalkin L.D. Applied statistics. Dependency research. M., Finance and Statistics, 1985, 487 pp.;

4. Ayvazyan S.A., Bukhstaber V.M., Enyukov I.S., Meshalkin L.D. Applied statistics. Classification and dimension reduction. M., Finance and Statistics, 1989, 607 pp.;

5. Johnston J. Econometric methods, M.: Statistics, 1980, 446 pp.;

6. Dubrov A.V., Mkhitaryan V.S., Troshin L.I. Multivariate statistical methods. M., Finance and Statistics, 2000;

7. Mkhitaryan V.S., Troshin L.I. Study of dependencies using correlation and regression methods. M., MESI, 1995, 120 pp.;

8. Mkhitaryan V.S., Dubrov A.M., Troshin L.I. Multivariate statistical methods in economics. M., MESI, 1995, 149 pp.;

9. Dubrov A.M., Mkhitaryan V.S., Troshin L.I. Mathematical statistics for businessmen and managers. M., MESI, 2000, 140 pp.;

10. Lukashin Yu.I. Regression and adaptive forecasting methods: Textbook, M., MESI, 1997.

11. Lukashin Yu.I. Adaptive methods of short-term forecasting. - M., Statistics, 1979.


APPLICATIONS


Annex 1. Options for tasks for independent computer research.

CATEGORIES

POPULAR ARTICLES

2023 “kingad.ru” - ultrasound examination of human organs