What is a regression function? Regression analysis is a statistical method for studying the dependence of a random variable y on one or more independent variables.

Regression analysis is a method of establishing an analytical expression for the stochastic dependence between the characteristics under study. The regression equation shows how y changes on average when any of the xi changes, and has the form:

ŷ = f(x1, x2, …, xn),

where y is the dependent variable (there is always only one);

xi are the independent variables (factors), of which there may be several.

If there is only one independent variable, this is simple regression analysis. If there are several of them (n ≥ 2), such an analysis is called multifactor.

Regression analysis solves two main problems:

    constructing the regression equation, i.e. finding the type of relationship between the resulting indicator y and the independent factors x1, x2, …, xn;

    assessing the significance of the resulting equation, i.e. determining to what extent the selected factor characteristics explain the variation of the trait y.

Regression analysis is used mainly for planning, as well as for developing a regulatory framework.

Unlike correlation analysis, which only answers the question of whether there is a relationship between the analyzed characteristics, regression analysis also provides a formalized expression of that relationship. In addition, while correlation analysis studies any relationship between factors, regression analysis studies one-sided dependence, i.e. a connection showing how a change in the factor characteristics affects the resulting characteristic.

Regression analysis is one of the most developed methods of mathematical statistics. Strictly speaking, implementing regression analysis requires a number of special conditions to be met (in particular, x1, x2, …, xn and y must be independent, normally distributed random variables with constant variances). In real life, strict compliance with the requirements of regression and correlation analysis is rare, but both methods are very common in economic research. Dependencies in economics can be not only direct but also inverse and nonlinear. A regression model can be built in the presence of any dependence; however, in multivariate analysis only linear models of the form are used:

y = a + b1x1 + b2x2 + … + bnxn.
The regression equation is constructed, as a rule, using the least squares method, the essence of which is to minimize the sum of squared deviations of the actual values of the resulting characteristic from its calculated values, i.e.:

S = Σ (yj - ŷj)² → min, j = 1, …, m,

where m is the number of observations;

ŷj = a + b1x1j + b2x2j + … + bnxnj is the calculated value of the resulting characteristic.

It is recommended to determine the regression coefficients using analytical software packages for a personal computer or a special financial calculator. In the simplest case, the regression coefficients of a one-factor linear regression equation of the form y = a + bx can be found using the formulas:

b = (n Σxy - Σx Σy) / (n Σx² - (Σx)²),  a = ȳ - b x̄.
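As an illustration, here is a minimal Python sketch of these two formulas; the x and y arrays are hypothetical stand-ins for real observations:

```python
import numpy as np

# Hypothetical observations of a factor x and a resulting characteristic y
x = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([5.1, 6.9, 9.2, 10.8, 13.1])

n = len(x)
# b = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - (sum(x))^2)
b = (n * (x * y).sum() - x.sum() * y.sum()) / (n * (x**2).sum() - x.sum()**2)
# a = mean(y) - b * mean(x)
a = y.mean() - b * x.mean()

print(f"y = {a:.3f} + {b:.3f}x")
```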
Cluster analysis

Cluster analysis is one of the methods of multidimensional analysis intended for grouping (clustering) a population whose elements are characterized by many features. The values of each feature serve as the coordinates of each unit of the studied population in the multidimensional feature space. Each observation, characterized by the values of several indicators, can thus be represented as a point in the space of these indicators, whose values are treated as coordinates in a multidimensional space. The distance between points r and q with k coordinates is defined as the Euclidean distance:

d(r, q) = √(Σ (xri - xqi)²), i = 1, …, k.
The main criterion for clustering is that the differences between clusters should be more significant than the differences between observations assigned to the same cluster, i.e. in the multidimensional space the following inequality must hold:

r1,2 > r1 and r1,2 > r2,

where r1,2 is the distance between clusters 1 and 2, and r1 and r2 are the distances between observations within clusters 1 and 2.

Like regression analysis procedures, the clustering procedure is quite labor-intensive, so it is advisable to perform it on a computer.
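As a small sketch of the distance measure just defined (the coordinates below are made up):

```python
import numpy as np

def distance(p, q):
    """Euclidean distance between two points with k coordinates."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(((p - q) ** 2).sum())

# Two hypothetical observations in a 3-dimensional feature space
r = [1.0, 2.0, 0.5]
q = [4.0, 0.0, 1.5]
print(distance(r, q))  # ~3.74
```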

The method was described as early as 1908 using the example of the work of an agent selling real estate. In his records, the house-sales specialist kept track of a wide range of input data for each specific building. Based on the results of the auctions, it was determined which factor had the greatest influence on the transaction price.

Analysis of a large number of transactions yielded interesting results. The final price was influenced by many factors, sometimes leading to paradoxical conclusions and even obvious “outliers” when a house with high initial potential was sold at a reduced price.

A second example of the application of such analysis is the work of a specialist entrusted with determining employee remuneration. The complexity of the task lay in the fact that it required not the distribution of a fixed amount to everyone, but its strict correspondence to the specific work performed. The emergence of many problems with practically similar solutions required a more detailed study of them at the mathematical level.

A significant place was allocated to the section “regression analysis”, which combined practical methods used to study dependencies that fall under the concept of regression. These relationships are observed between data obtained from statistical studies.

Among the many tasks to be solved, three main goals stand out: determining a general regression equation; constructing estimates of the unknown parameters entering the regression equation; and testing statistical regression hypotheses. When studying the relationship between a pair of quantities obtained from experimental observations and forming a series (set) of the type (x1, y1), …, (xn, yn), one relies on the provisions of regression theory and assumes that for one quantity Y there is a certain probability distribution, while the other, X, remains fixed.

The result Y depends on the value of the variable X; this dependence can be determined by various patterns, while the accuracy of the results obtained is influenced by the nature of the observations and the purpose of the analysis. The experimental model is based on assumptions that are simplified but plausible. The main condition is that the parameter X is a controlled quantity: its values are set before the start of the experiment.

If a pair of uncontrolled variables X and Y is used during an experiment, then regression analysis is carried out in the same way, but methods are used to interpret the results that study the relationship between the random variables under investigation. Methods of mathematical statistics are not an abstract topic; they find application in various spheres of human activity.

In the scientific literature, the term linear regression analysis is widely used to refer to the method described above. The variable X is called a regressor or predictor, and the dependent variable Y is also called a criterion variable. This terminology reflects only the mathematical dependence between the variables, not a cause-and-effect relationship.

Regression analysis is the most common method used in processing the results of a wide variety of observations. Physical and biological dependencies are studied by this method; it is applied in economics and in technology, and many other fields use regression models as well. Analysis of variance and multivariate statistical analysis are closely related to it.

Regression and correlation analysis are statistical research methods. These are the most common ways to show the dependence of a parameter on one or more independent variables.

Below, using specific practical examples, we will consider these two analyses, which are very popular among economists, and give an example of obtaining results when combining them.

Regression Analysis in Excel

Regression analysis shows the influence of some values (independent variables) on a dependent variable. For example: how does the number of the economically active population depend on the number of enterprises, wages and other parameters? Or: how do foreign investment, energy prices, etc. affect the level of GDP?

The result of the analysis allows you to identify priorities and, based on the main factors, to predict and plan the development of priority areas and make management decisions.

Regression can be:

  • linear (y = a + bx);
  • parabolic (y = a + bx + cx²);
  • exponential (y = a * exp(bx));
  • power (y = a * x^b);
  • hyperbolic (y = b/x + a);
  • logarithmic (y = b * ln(x) + a);
  • exponential with base b (y = a * b^x).

Let's look at an example of building a regression model in Excel and interpreting the results. Let's take the linear type of regression.

Task. At six enterprises, the average monthly salary and the number of employees who quit were analyzed. It is necessary to determine the dependence of the number of quitting employees on the average salary.

The linear regression model looks like this:

Y = a0 + a1x1 + … + akxk,

where a are the regression coefficients, x are the influencing variables, and k is the number of factors.

In our example, Y is the indicator of quitting employees. The influencing factor is wages (x).

Excel has built-in functions that can help you calculate the parameters of a linear regression model, but the Analysis ToolPak add-in (“Analysis Package”) will do this faster.

We activate this powerful analytical tool (File → Options → Add-ins → Manage: Excel Add-ins → check “Analysis ToolPak”).

Once activated, the add-in is available on the Data tab.

Now let's do the regression analysis itself.



First of all, we pay attention to R-squared and coefficients.

R-squared is the coefficient of determination. In our example it is 0.755, or 75.5%. This means that the calculated parameters of the model explain 75.5% of the relationship between the studied parameters. The higher the coefficient of determination, the better the model: above 0.8 is good, below 0.5 is bad (such an analysis can hardly be considered reasonable). In our example it is “not bad”.

The coefficient 64.1428 shows what Y will be if all variables in the model under consideration are equal to 0. That is, the value of the analyzed parameter is also influenced by other factors not described in the model.

The coefficient -0.16285 shows the weight of variable X on Y. That is, within this model the average monthly salary affects the number of quitters with a weight of -0.16285 (a small degree of influence). The “-” sign indicates a negative impact: the higher the salary, the fewer people quit. Which is fair.
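For comparison, the same one-factor model can be fitted outside Excel. Below is a sketch using scipy; the salary and quits arrays are hypothetical, since the source table is not reproduced here:

```python
import numpy as np
from scipy import stats

# Hypothetical data for 6 enterprises: average salary (x) and number of quits (Y)
salary = np.array([30_000, 35_000, 40_000, 45_000, 50_000, 55_000])
quits = np.array([60, 57, 58, 52, 53, 49])

res = stats.linregress(salary, quits)
print("intercept:", res.intercept)      # analogue of Excel's Y-intercept
print("slope:", res.slope)              # weight of X on Y (negative here too)
print("R-squared:", res.rvalue ** 2)    # coefficient of determination
```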



Correlation Analysis in Excel

Correlation analysis helps determine whether there is a relationship between indicators in one or two samples. For example, between the operating time of a machine and the cost of repairs, the price of equipment and the duration of operation, the height and weight of children, etc.

If there is a relationship, the question is whether an increase in one parameter leads to an increase (positive correlation) or a decrease (negative correlation) in the other. Correlation analysis helps the analyst determine whether the value of one indicator can be used to predict the possible value of another.

The correlation coefficient is denoted r and varies from -1 to +1. The classification of correlation strength differs between fields. When the coefficient is 0, there is no linear relationship between the samples.

Let's look at how to find the correlation coefficient using Excel.

To find paired coefficients, the CORREL function is used.

Objective: Determine whether there is a relationship between the operating time of a lathe and the cost of its maintenance.

Place the cursor in any cell and press the fx button.

  1. In the “Statistical” category, select the CORREL function.
  2. Argument “Array 1” is the first range of values, the machine operating time: A2:A14.
  3. Argument “Array 2” is the second range of values, the repair cost: B2:B14. Click OK.

To determine the strength of the connection, you need to look at the absolute value of the coefficient (each field of activity has its own scale).

For correlation analysis of several parameters (more than 2), it is more convenient to use “Data Analysis” (the Analysis ToolPak add-in). Select “Correlation” from the list and designate the data array. That's all.
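The same coefficients can be checked outside Excel. A sketch with numpy, where the two arrays are hypothetical stand-ins for the ranges A2:A14 and B2:B14:

```python
import numpy as np

machine_hours = np.array([10, 15, 22, 28, 33, 40, 46, 51, 57, 62, 70, 75, 80])
repair_cost = np.array([20, 22, 27, 26, 30, 33, 35, 40, 41, 45, 50, 52, 55])

# Paired coefficient - the analogue of CORREL
r = np.corrcoef(machine_hours, repair_cost)[0, 1]
print(f"r = {r:.3f}")

# For more than two parameters, the full matrix is the analogue of the
# "Correlation" tool from the Data Analysis dialog
print(np.corrcoef(np.vstack([machine_hours, repair_cost])))
```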

The resulting coefficients will be displayed in the correlation matrix. Like this:

Correlation and regression analysis

In practice, these two techniques are often used together.

Example:


Now the regression analysis data has become visible.

The purpose of regression analysis is to measure the relationship between a dependent variable and one (pairwise regression analysis) or more (multiple) independent variables. Independent variables are also called factor, explanatory, determinant, regressor and predictor variables.

The dependent variable is sometimes called the determined, explained, or “response” variable. The extremely widespread use of regression analysis in empirical research is not only due to the fact that it is a convenient tool for testing hypotheses. Regression, especially multiple regression, is an effective method for modeling and forecasting.

Let's start explaining the principles of working with regression analysis with the simpler one - the paired method.

Paired Regression Analysis

The first steps when using regression analysis will be almost identical to those we took in calculating the correlation coefficient. The three main conditions for the effectiveness of correlation analysis using the Pearson method - normal distribution of variables, interval measurement of variables, linear relationship between variables - are also relevant for multiple regression. Accordingly, at the first stage, scatterplots are constructed, a statistical and descriptive analysis of the variables is carried out, and a regression line is calculated. As in the framework of correlation analysis, regression lines are constructed using the least squares method.

To illustrate the differences between the two methods of data analysis more clearly, let us turn to the example already discussed with the variables “SPS support” and “rural population share”. The source data are identical. The difference in the scatterplots is that in regression analysis it is correct to plot the dependent variable - in our case, “SPS support” - on the Y-axis, whereas in correlation analysis this does not matter. After cleaning out outliers, the scatterplot looks like this:

The fundamental idea of regression analysis is that, having a general trend for the variables - in the form of a regression line - it is possible to predict the value of the dependent variable given the values of the independent one.

Let's imagine an ordinary mathematical linear function. Any straight line in Euclidean space can be described by the formula:

y = a + bx,

where a is a constant that specifies the displacement along the ordinate axis, and b is a coefficient that determines the angle of inclination of the line.

Knowing the slope and constant, you can calculate (predict) the value of y for any x.

This simplest function formed the basis of the regression analysis model, with the caveat that we will not predict the value of y exactly, but within a certain confidence interval, i.e. approximately.

The constant is the point of intersection of the regression line and the Y-axis (the Y-intercept, usually labeled “intercept” in statistical packages). In our example with voting for the Union of Right Forces, its rounded value is 10.55. The slope coefficient b is approximately -0.1 (as in correlation analysis, the sign shows the type of connection - direct or inverse). Thus, the resulting model has the form SPS = -0.1 × Rural share + 10.55.

For the Republic of Adygea (rural population share 47%), the model predicts SPS support of 5.63 (the calculation uses the unrounded coefficients; with the rounded values above, -0.10 × 47 + 10.55 ≈ 5.85).

The difference between the original and predicted values is called the residual (we have already encountered this term, fundamental for statistics, when analyzing contingency tables). So, for the case of the Republic of Adygea the residual is 3.92 - 5.63 = -1.71. The larger the absolute value of the residual, the less successful the prediction.
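A sketch of this single-case calculation in Python; the parameters are the rounded values from the text above, so the prediction differs slightly from the table:

```python
# Rounded model parameters from the text: SPS = -0.10 * rural_share + 10.55
a, b = 10.55, -0.10

rural_share = 47.0      # Republic of Adygea
sps_observed = 3.92

sps_predicted = a + b * rural_share          # ~5.85 with rounded coefficients
residual = sps_observed - sps_predicted      # negative: the case lies below the line
print(f"predicted = {sps_predicted:.2f}, residual = {residual:.2f}")
```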

We calculate the predicted values and residuals for all cases:

| Case | Rural pop., % | SPS (original) | SPS (predicted) | Residual |
|---|---|---|---|---|
| Republic of Adygea | 47 | 3.92 | 5.63 | -1.71 |
| Altai Republic | 76 | 5.40 | 2.59 | 2.81 |
| Republic of Bashkortostan | 36 | 6.04 | 6.78 | -0.74 |
| Republic of Buryatia | 41 | 8.36 | 6.25 | 2.11 |
| Republic of Dagestan | 59 | 1.22 | 4.37 | -3.15 |
| Republic of Ingushetia | 59 | 0.38 | 4.37 | -3.99 |

Etc.

Analysis of the relationship between the initial and predicted values serves to assess the quality of the resulting model and its predictive ability. One of the main indicators of regression statistics is the multiple correlation coefficient R - the correlation coefficient between the original and predicted values of the dependent variable. In paired regression analysis, it is equal to the usual Pearson correlation coefficient between the dependent and independent variables, in our case 0.63. To interpret multiple R meaningfully, it must be converted into a coefficient of determination. This is done in the same way as in correlation analysis - by squaring. The coefficient of determination R-squared (R²) shows the proportion of variation in the dependent variable that is explained by the independent variable(s).

In our case, R² = 0.39 (0.63²); this means that the variable “rural population share” explains approximately 40% of the variation in the variable “SPS support”. The larger the coefficient of determination, the higher the quality of the model.

Another indicator of model quality is the standard error of estimate. This is a measure of how widely the points are “scattered” around the regression line. The measure of spread for interval variables is the standard deviation; accordingly, the standard error of the estimate is the standard deviation of the distribution of residuals. The higher its value, the greater the scatter and the worse the model. In our case the standard error is 2.18: by this amount our model “errs on average” when predicting the value of the variable “SPS support”.
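A sketch of how these three quality measures relate, using only the first six table rows above (the article's statistics are computed over all 85 cases, so the numbers here differ):

```python
import numpy as np

observed = np.array([3.92, 5.40, 6.04, 8.36, 1.22, 0.38])
predicted = np.array([5.63, 2.59, 6.78, 6.25, 4.37, 4.37])

R = np.corrcoef(observed, predicted)[0, 1]   # multiple R
print("R =", round(R, 2), "R2 =", round(R ** 2, 2))

residuals = observed - predicted
# Standard error of estimate: sqrt of the residual sum of squares
# over n - 2 degrees of freedom (paired regression)
se = np.sqrt((residuals ** 2).sum() / (len(residuals) - 2))
print("std. error =", round(se, 2))
```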

Regression statistics also include analysis of variance. With its help we find out: 1) what proportion of the variation (dispersion) of the dependent variable is explained by the independent variable; 2) what proportion of the variance of the dependent variable is accounted for by the residuals (the unexplained part); 3) what the ratio of these two quantities is (the F-ratio). Variance statistics are especially important for sample studies: they show how likely it is that there is a relationship between the independent and dependent variables in the population. For studies of complete populations (as in our example), however, the results serve a different purpose: they check whether the identified statistical pattern is caused by a coincidence of random circumstances and how characteristic it is of the set of conditions in which the studied population finds itself. What is established is not whether the obtained result holds for some broader general population, but the degree of its regularity and freedom from random influences.

In our case, the ANOVA statistics are as follows:

|  | SS | df | MS | F | p-value |
|---|---|---|---|---|---|
| Regression | 258.77 | 1 | 258.77 | 54.29 | 0.000000001 |
| Residual | 395.59 | 83 | 4.77 |  |  |
| Total | 654.36 | 84 |  |  |  |

The F-ratio of 54.29 is significant at the 0.000000001 level. Accordingly, we can confidently reject the null hypothesis (that the relationship we discovered is due to chance).
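The F-ratio and its significance level can be reproduced from the sums of squares in the table; a sketch with scipy:

```python
from scipy import stats

ss_regression, df_regression = 258.77, 1
ss_residual, df_residual = 395.59, 83

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual        # ~4.77
F = ms_regression / ms_residual                # ~54.3

# p-value: upper tail of the F distribution
p = stats.f.sf(F, df_regression, df_residual)
print(f"F = {F:.2f}, p = {p:.2g}")
```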

The t criterion performs a similar function, but for the regression coefficients (the slope and the Y-intercept). Using the t criterion, we test the hypothesis that in the general population the regression coefficients are equal to zero. In our case, we can again confidently reject the null hypothesis.

Multiple regression analysis

The multiple regression model is almost identical to the paired regression model; the only difference is that several independent variables are sequentially included in the linear function:

Y = b1X1 + b2X2 + …+ bpXp + a.
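A sketch of fitting such a function by least squares with numpy, using the first rows of the turnout example discussed below as data:

```python
import numpy as np

# First rows of the turnout table below: (urban pop., Russian pop.) -> activity
X = np.array([[53, 68], [24, 60], [59, 70], [41, 9], [41, 23], [39, 37]], dtype=float)
y = np.array([64.92, 68.60, 60.75, 79.92, 75.05, 68.52])

# Append a column of ones so that the constant a is estimated too
A = np.column_stack([X, np.ones(len(X))])
(b1, b2, a), *_ = np.linalg.lstsq(A, y, rcond=None)
print(f"Y = {b1:.2f}*X1 + {b2:.2f}*X2 + {a:.2f}")
```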

If there are more than two independent variables, we cannot get a visual idea of their relationship; in this respect, multiple regression is less “visual” than paired regression. When there are two independent variables, it can be useful to display the data in a 3D scatterplot. In professional statistical software packages (for example, Statistica) there is an option to rotate a three-dimensional chart, which allows the structure of the data to be represented visually.

When working with multiple regression, as opposed to pairwise regression, it is necessary to determine the analysis algorithm. The standard algorithm includes all available predictors in the final regression model. The step-by-step algorithm involves the sequential inclusion (exclusion) of independent variables based on their explanatory “weight”. The stepwise method is good when there are many independent variables; it “cleanses” the model of frankly weak predictors, making it more compact and concise.

An additional condition for the correctness of multiple regression (along with interval, normality and linearity) is the absence of multicollinearity - the presence of strong correlations between independent variables.
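A quick sketch of such a check, correlating the two predictors from the example below:

```python
import numpy as np

# Urban and Russian population shares for the first cases of the table below
urban = np.array([53, 24, 59, 41, 41, 39, 44, 73, 74, 62], dtype=float)
russian = np.array([68, 60, 70, 9, 23, 37, 42, 73, 57, 47], dtype=float)

r = np.corrcoef(urban, russian)[0, 1]
# A strong correlation between predictors (|r| close to 1) would signal
# multicollinearity; moderate values are acceptable
print(f"r(urban, russian) = {r:.2f}")
```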

The interpretation of multiple regression statistics includes all the elements we considered for the case of pairwise regression. In addition, there are other important components to the statistics of multiple regression analysis.

We will illustrate the work with multiple regression using the example of testing hypotheses that explain differences in the level of electoral activity across Russian regions. Specific empirical studies have suggested that voter turnout levels are influenced by:

National factor (variable “Russian population”; operationalized as the share of the Russian population in the constituent entities of the Russian Federation). It is assumed that an increase in the share of the Russian population leads to a decrease in voter turnout;

Urbanization factor (the “urban population” variable; operationalized as the share of the urban population in the constituent entities of the Russian Federation; we have already worked with this factor as part of the correlation analysis). It is assumed that an increase in the share of the urban population also leads to a decrease in voter turnout.

The dependent variable, “intensity of electoral activity” (“activity”), is operationalized through average turnout data by region in federal elections from 1995 to 2003. The initial data table for the two independent variables and one dependent variable is as follows:

| Case | Activity | Urban pop., % | Russian pop., % |
|---|---|---|---|
| Republic of Adygea | 64.92 | 53 | 68 |
| Altai Republic | 68.60 | 24 | 60 |
| Republic of Buryatia | 60.75 | 59 | 70 |
| Republic of Dagestan | 79.92 | 41 | 9 |
| Republic of Ingushetia | 75.05 | 41 | 23 |
| Republic of Kalmykia | 68.52 | 39 | 37 |
| Karachay-Cherkess Republic | 66.68 | 44 | 42 |
| Republic of Karelia | 61.70 | 73 | 73 |
| Komi Republic | 59.60 | 74 | 57 |
| Republic of Mari El | 65.19 | 62 | 47 |

Etc. (after cleaning out outliers, 83 of the 88 cases remain)

Statistics describing the quality of the model:

1. Multiple R = 0.62; R-square = 0.38. Consequently, the national factor and the urbanization factor together explain about 38% of the variation in the “electoral activity” variable.

2. The average error is 3.38. This is how much the constructed model “errs on average” when predicting the level of turnout.

3. The F-ratio of explained to unexplained variation is 25.2, significant at the 0.000000003 level. The null hypothesis about the randomness of the identified relationships is rejected.

4. The t criterion for the constant and the regression coefficients of the variables “urban population” and “Russian population” is significant at the levels of 0.0000001, 0.00005 and 0.007 respectively. The null hypothesis that the coefficients are random is rejected.

Additional useful statistics for analyzing the relationship between the original and predicted values of the dependent variable are the Mahalanobis distance and Cook's distance. The first is a measure of the uniqueness of a case (it shows how much the combination of values of all the independent variables for a given case deviates from their mean values simultaneously). The second is a measure of the influence of a case. Different observations affect the slope of the regression line differently, and Cook's distance allows them to be compared on this indicator. This can be useful when cleaning up outliers (an outlier can be thought of as an overly influential case).
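A sketch of the Mahalanobis part (statistical packages differ in whether they report the distance or its square; the version below computes the squared distance of each case from the multivariate mean of the predictors):

```python
import numpy as np

# Values of the independent variables (urban pop., Russian pop.) for several cases
X = np.array([[53, 68], [24, 60], [59, 70], [41, 9], [41, 23], [39, 37]], dtype=float)

mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))  # inverse covariance of predictors

for row in X:
    d = row - mean
    md2 = float(d @ cov_inv @ d)   # squared Mahalanobis distance of the case
    print(f"{md2:.2f}")

# Cook's distance additionally requires the fitted model's residuals and
# leverage values, so it is not reproduced in this short sketch.
```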

In our example, Dagestan in particular is a unique and influential case.

| Case | Original values | Predicted values | Residuals | Mahalanobis distance | Cook's distance |
|---|---|---|---|---|---|
| Adygea | 64.92 | 66.33 | -1.40 | 0.69 | 0.00 |
| Altai Republic | 68.60 | 69.91 | -1.31 | 6.80 | 0.01 |
| Republic of Buryatia | 60.75 | 65.56 | -4.81 | 0.23 | 0.01 |
| Republic of Dagestan | 79.92 | 71.01 | 8.91 | 10.57 | 0.44 |
| Republic of Ingushetia | 75.05 | 70.21 | 4.84 | 6.73 | 0.08 |
| Republic of Kalmykia | 68.52 | 69.59 | -1.07 | 4.20 | 0.00 |

The regression model itself has the following parameters: Y-intercept (constant) = 75.99; b (urban pop.) = -0.1; b (Russian pop.) = -0.06. The final formula is:

Activity = 75.99 - 0.1 × Urban pop. - 0.06 × Russian pop.
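A sketch of a prediction with the final formula (coefficients as rounded in the text, so the result differs slightly from the table's 66.33 for Adygea):

```python
# Final model: Activity = 75.99 - 0.10 * Urban - 0.06 * Russian
a, b_urban, b_russian = 75.99, -0.10, -0.06

urban, russian = 53, 68               # Republic of Adygea
activity = a + b_urban * urban + b_russian * russian
print(round(activity, 2))             # ~66.61 vs. 66.33 with unrounded coefficients
```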

Regression analysis is one of the most popular methods of statistical research. It can be used to establish the degree of influence of independent variables on the dependent variable. Microsoft Excel has tools designed to perform this type of analysis. Let's look at what they are and how to use them.

But in order to use the function that allows you to perform regression analysis, you first need to activate the Analysis ToolPak (“Analysis Package”) add-in. Only then will the tools necessary for this procedure appear on the Excel ribbon.


Now, when we go to the “Data” tab, we will see a new button in the “Analysis” group on the ribbon: “Data Analysis”.

Types of Regression Analysis

There are several types of regression:

  • parabolic;
  • power;
  • logarithmic;
  • exponential (y = a * exp(bx));
  • exponential with base b (y = a * b^x);
  • hyperbolic;
  • linear regression.

We will talk in more detail about performing the last type of regression analysis in Excel later.

Linear Regression in Excel

Below, as an example, is a table showing the average daily air temperature outside and the number of store customers for the corresponding working day. Let's find out using regression analysis exactly how weather conditions in the form of air temperature can affect the attendance of a retail establishment.

The general linear regression equation is as follows: Y = a0 + a1x1 + … + akxk. In this formula, Y is the variable whose dependence on the factors we are trying to study; in our case, it is the number of buyers. The x values are the various factors that influence the variable. The parameters a are the regression coefficients that determine the weight of each factor, and the index k denotes the total number of factors.


Interpreting the analysis results

The results of the regression analysis are displayed in the form of a table in the place specified in the settings.

One of the main indicators is R-squared, which indicates the quality of the model. In our case, this coefficient is 0.705, or about 70.5% - an acceptable level of quality. A value below 0.5 indicates a poor model.

Another important indicator is located in the cell at the intersection of the “Y-intercept” row and the “Coefficients” column. It indicates what value Y (in our case, the number of buyers) will have when all other factors are equal to zero. In this table, that value is 58.04.

The value at the intersection of the “X Variable 1” row and the “Coefficients” column shows the level of dependence of Y on X - in our case, the dependence of the number of store customers on temperature. A coefficient of 1.31 is considered a fairly high indicator of influence.

As you can see, using Microsoft Excel it is quite easy to create a regression analysis table. But only a trained person can work with the output data and understand its essence.


