Regression dependence. Regression analysis

In statistical modeling, regression analysis is a set of statistical methods used to estimate the relationships between variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent ones. More specifically, regression analysis helps us understand how the typical value of the dependent variable changes when one of the independent variables is varied while the other independent variables are held fixed.

In all cases, the quantity being estimated is a function of the independent variables and is called the regression function. In regression analysis it is also of interest to characterize the variation of the dependent variable around the regression function, which can be described by a probability distribution.

Regression Analysis Problems

This statistical research method is widely used for forecasting, where it gives a significant advantage, but it can also produce illusory or spurious relationships, so it should be applied with care; for example, correlation does not imply causation.

A large number of methods have been developed for regression analysis, such as linear regression and ordinary least squares regression, which are parametric. Their essence is that the regression function is defined in terms of a finite number of unknown parameters that are estimated from the data. Nonparametric regression allows the regression function to lie in a specified set of functions, which can be infinite-dimensional.

As a statistical research method, regression analysis in practice depends on the form of the data-generating process and on how it relates to the regression approach being used. Since the true form of the data-generating process is usually unknown, regression analysis of the data often depends to some extent on assumptions about this process. These assumptions are sometimes testable if enough data are available. Regression models are often useful even when the assumptions are moderately violated, although they may not perform at peak efficiency.

In a narrower sense, regression may refer specifically to the estimation of continuous response variables, as opposed to the discrete response variables used in classification. The continuous output variable case is also called metric regression to distinguish it from related problems.

History

The earliest form of regression is the well-known least squares method. It was published by Legendre in 1805 and Gauss in 1809. Legendre and Gauss applied the method to the problem of determining from astronomical observations the orbits of bodies around the Sun (mainly comets, but later also newly discovered minor planets). Gauss published a further development of least squares theory in 1821, including a version of the Gauss–Markov theorem.

The term "regression" was coined by Francis Galton in the 19th century to describe a biological phenomenon. The idea was that the height of descendants from that of their ancestors tends to regress downwards towards the normal mean. For Galton, regression had only this biological meaning, but later his work was continued by Udney Yoley and Karl Pearson and brought into a more general statistical context. In the work of Yule and Pearson, the joint distribution of response and explanatory variables is assumed to be Gaussian. This assumption was rejected by Fischer in papers of 1922 and 1925. Fisher suggested that the conditional distribution of the response variable is Gaussian, but the joint distribution need not be. In this regard, Fischer's proposal is closer to Gauss's formulation of 1821. Before 1970, it sometimes took up to 24 hours to get the result of a regression analysis.

Regression analysis methods continue to be an area of ​​active research. In recent decades, new methods have been developed for robust regression; regressions involving correlated responses; regression methods that accommodate different types of missing data; nonparametric regression; Bayesian regression methods; regressions in which predictor variables are measured with error; regression with more predictors than observations, and cause-and-effect inference with regression.

Regression models

Regression analysis models include the following variables:

  • Unknown parameters, denoted β, which can be a scalar or a vector.
  • Independent variables, X.
  • Dependent variable, Y.

Different fields of science where regression analysis is used use different terms in place of dependent and independent variables, but in all cases the regression model relates Y to a function of X and β.

The approximation is usually written as E(Y | X) = f(X, β). To carry out regression analysis, the form of the function f must be specified. Sometimes it is based on knowledge about the relationship between Y and X that does not rely on the data. If such knowledge is not available, a flexible or convenient form for f is chosen.

Dependent variable Y

Let us now assume that the vector of unknown parameters β has length k. To perform regression analysis, the user must provide information about the dependent variable Y:

  • If N data points of the form (Y, X) are observed, where N < k, most classical approaches to regression analysis cannot be carried out, since the system of equations defining the regression model is underdetermined and there is not enough data to recover β.
  • If exactly N = k data points are observed and the function f is linear, then the equation Y = f(X, β) can be solved exactly rather than approximately. This amounts to solving a system of N equations with N unknowns (the elements of β), which has a unique solution as long as the X values are linearly independent. If f is nonlinear, there may be no solution, or many solutions may exist.
  • The most common situation is where N > k data points are observed. In this case, there is enough information in the data to estimate a unique value of β that best fits the data, and the regression model applied to the data can be viewed as an overdetermined system in β.

In the latter case, regression analysis provides tools for:

  • Finding a solution for the unknown parameters β that will, for example, minimize the distance between the measured and predicted values of Y.
  • Under certain statistical assumptions, using the excess information to provide statistical information about the unknown parameters β and the predicted values of the dependent variable Y (a minimal computational sketch of this follows the list).
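As a minimal sketch of how such an overdetermined system is solved in practice (the data below are invented for illustration), ordinary least squares can be computed directly with NumPy:

```python
import numpy as np

# Hypothetical data: N = 6 observations, k = 2 parameters (intercept and slope),
# so the system Y = X @ beta is overdetermined (N > k).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])

# The least squares solution minimizes ||y - X @ beta||^2.
beta, residual_ss, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print("intercept, slope:", beta)               # estimated parameters
print("sum of squared residuals:", residual_ss)
```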

Required number of independent measurements

Consider a regression model that has three unknown parameters: β0, β1 and β2. Suppose the experimenter makes 10 measurements at the same value of the independent variable vector X. In this case, regression analysis does not produce a unique set of estimates. The best one can do is estimate the mean and standard deviation of the dependent variable Y. Similarly, by measuring at two different values of X, one can obtain enough data for a regression with two unknowns, but not with three or more.

If the experimenter's measurements were made at three different values of the independent variable vector X, then regression analysis will provide a unique set of estimates for the three unknown parameters in β.

In the case of general linear regression, the above statement is equivalent to the requirement that the matrix XᵀX is invertible.
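A small sketch (with made-up numbers) of why distinct X values matter: if all measurements are taken at the same X, the matrix XᵀX is singular and the normal equations cannot be solved uniquely.

```python
import numpy as np

def design(x):
    """Design matrix for a model with intercept and slope."""
    return np.column_stack([np.ones_like(x), x])

# Ten measurements at the SAME value of x: X^T X has rank 1, so it is singular.
x_same = np.full(10, 3.0)
print(np.linalg.matrix_rank(design(x_same).T @ design(x_same)))  # -> 1

# Measurements at two distinct values of x: X^T X is invertible for 2 parameters.
x_two = np.array([3.0] * 5 + [7.0] * 5)
print(np.linalg.matrix_rank(design(x_two).T @ design(x_two)))    # -> 2
```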

Statistical Assumptions

When the number of measurements N is greater than the number of unknown parameters k and the measurements contain errors ε_i, then, as a rule, the excess information contained in the measurements is used for statistical inference about the unknown parameters. This excess of information is called the number of degrees of freedom of the regression.

Fundamental Assumptions

Classic assumptions for regression analysis include:

  • The sample is representative of the population for which the inference or prediction is made.
  • The error term is a random variable with zero mean conditional on the explanatory variables.
  • The independent variables are measured without error.
  • The independent variables (predictors) are linearly independent, that is, no predictor can be expressed as a linear combination of the others.
  • The errors are uncorrelated, that is, the error covariance matrix is diagonal and each non-zero element is the variance of the error.
  • The error variance is constant across observations (homoscedasticity). If not, weighted least squares or other methods can be used.

These are sufficient conditions for the least squares estimator to possess the required properties; in particular, these assumptions imply that the parameter estimates will be unbiased, consistent and efficient, in particular within the class of linear estimators. It is important to note that real data rarely satisfy all of these conditions. That is, the method is used even when the assumptions are not strictly correct. Deviation from the assumptions can sometimes be used as a measure of how useful the model is. Many of these assumptions can be relaxed in more advanced methods. Reports of statistical analyses typically include tests of the assumptions on the sample data and an assessment of the usefulness of the model.

Additionally, variables in some cases refer to values measured at point locations. There may be spatial trends and spatial autocorrelation in the variables that violate the statistical assumptions. Geographically weighted regression is one method that deals with such data.

A feature of linear regression is that the dependent variable Y_i is a linear combination of the parameters. For example, simple linear regression uses one independent variable, x_i, and two parameters, β0 and β1, to model n points.

In multiple linear regression, there are multiple independent variables or functions of them.

When a random sample is taken from a population, its parameters allow one to obtain a sample linear regression model.

In this setting, the most popular approach is the least squares method. It is used to obtain parameter estimates that minimize the sum of squared residuals. Minimizing this function (which is typical of linear regression) leads to a set of normal equations, a set of linear equations in the parameters, which are solved to obtain the parameter estimates.

Under the further assumption that the population error term is normally distributed, a researcher can use the estimated standard errors to create confidence intervals and conduct hypothesis tests about the population parameters.
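A minimal sketch, with invented data, of the closed-form least squares solution for simple linear regression: the normal equations reduce to b1 = cov(x, y)/var(x) and b0 = ȳ − b1·x̄.

```python
import numpy as np

# Hypothetical sample
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

x_mean, y_mean = x.mean(), y.mean()

# Normal equations for simple linear regression
b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)  # slope
b0 = y_mean - b1 * x_mean                                             # intercept

residuals = y - (b0 + b1 * x)
print(f"y = {b0:.3f} + {b1:.3f}*x, SSE = {np.sum(residuals**2):.4f}")
```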

Nonlinear regression analysis

When the function is not linear with respect to the parameters, the sum of squares must be minimized using an iterative procedure. This introduces many complications that define the differences between the linear and nonlinear least squares methods. Consequently, the results of a regression analysis using a nonlinear method are sometimes less predictable.

Calculation of power and sample size

There are generally no agreed-upon rules regarding the number of observations relative to the number of independent variables in the model. One rule of thumb was proposed by Good and Hardin and looks like N = t^n, where N is the sample size, n is the number of independent variables, and t is the number of observations needed to achieve the desired accuracy if the model had only one independent variable. For example, a researcher builds a linear regression model using a data set that contains 1000 patients (N). If the researcher decides that five observations are needed to accurately define the line (t = 5), then the maximum number of independent variables that the model can support is 4, since 5⁴ = 625 < 1000 < 5⁵ = 3125.
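A quick sketch of that rule of thumb in code (the function and variable names are my own): the maximum number of predictors is the largest n with t^n ≤ N.

```python
import math

def max_predictors(sample_size: int, obs_per_variable: int) -> int:
    """Largest n such that obs_per_variable ** n <= sample_size (Good-Hardin rule of thumb)."""
    return math.floor(math.log(sample_size) / math.log(obs_per_variable))

print(max_predictors(1000, 5))  # -> 4, matching the example above
```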

Other methods

Although regression model parameters are typically estimated using the least squares method, there are other methods that are used much less frequently. For example, these are the following methods:

  • Bayesian methods (for example, Bayesian linear regression).
  • Percentage regression, used for situations where reducing percentage errors is considered more appropriate.
  • Least absolute deviations, which is more robust in the presence of outliers and leads to quantile regression.
  • Nonparametric regression, which requires a large number of observations and calculations.
  • Distance metric learning, in which a meaningful distance metric is learned for a given input space.

Software

All major statistical software packages perform least squares regression analysis. Simple linear regression and multiple regression analysis can be performed in some spreadsheet applications as well as on some calculators. Although many statistical software packages can perform various types of nonparametric and robust regression, these methods are less standardized; different software packages implement different methods. Specialized regression software has been developed for use in areas such as survey analysis and neuroimaging.

During their studies, students very often encounter a variety of equations. One of them - the regression equation - is discussed in this article. This type of equation is used specifically to describe the characteristics of the relationship between mathematical parameters and is applied in statistics and econometrics.

Definition of regression

In mathematics, regression means a quantity that describes the dependence of the average value of one set of data on the values of another quantity. The regression equation shows the average value of one characteristic as a function of another characteristic. The regression function has the form of a simple equation y = f(x), in which y acts as the dependent variable and x as the independent variable (the factor feature); in other words, regression is expressed as y = f(x).

What are the types of relationships between variables?

In general, there are two opposing types of relationships: correlation and regression.

The first is characterized by the equal standing of the variables: it is not reliably known which variable depends on the other.

If the variables are not on an equal footing and the problem statement indicates which variable is explanatory and which is dependent, then we can speak of a relationship of the second type. In order to construct a linear regression equation, it is necessary to find out which type of relationship is observed.

Types of regressions

Today, there are 7 different types of regression: hyperbolic, linear, multiple, nonlinear, pairwise, inverse, and log-linear.

Hyperbolic, linear and logarithmic

The linear regression equation is used in statistics to clearly explain the parameters of the equation. It looks like y = c + m·x + E. A hyperbolic equation has the form of a regular hyperbola: y = c + m/x + E. A log-linear equation expresses the relationship using a logarithmic function: ln y = ln c + m·ln x + ln E.

Multiple and nonlinear

The two more complex types of regression are multiple and nonlinear. The multiple regression equation is expressed by the function y = f(x1, x2, …, xc) + E. In this situation, y acts as the dependent variable and the x's act as explanatory variables. The variable E is stochastic; it includes the influence of other factors on the equation. The nonlinear regression equation is somewhat ambiguous: on the one hand, it is not linear with respect to the indicators included in it, but on the other hand, it is linear in the role of the estimated parameters.

Inverse and paired types of regressions

An inverse regression is a type of function that needs to be converted to a linear form. In the most traditional application programs, it has the form y = 1/(c + m·x + E). A pairwise regression equation shows the relationship between the data as a function y = f(x) + E. Just as in the other equations, y depends on x, and E is a stochastic parameter.

Concept of correlation

This is an indicator demonstrating the existence of a relationship between two phenomena or processes. The strength of the relationship is expressed by the correlation coefficient. Its value lies within the interval [-1; +1]. A negative value indicates an inverse relationship, a positive value indicates a direct one. If the coefficient takes a value equal to 0, there is no relationship. The closer the value is to 1 in absolute terms, the stronger the relationship between the parameters; the closer to 0, the weaker it is.
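A minimal sketch, on made-up numbers, of computing the Pearson correlation coefficient described above:

```python
import numpy as np

# Hypothetical paired observations
x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
y = np.array([12.0, 18.0, 33.0, 41.0, 48.0])

# Pearson correlation: covariance divided by the product of standard deviations
r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2)
)
print(round(r, 3))                        # close to +1 -> strong direct relationship
print(round(np.corrcoef(x, y)[0, 1], 3))  # same value via NumPy's built-in
```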

Methods

Parametric methods of correlation analysis can assess the strength of the relationship. They are based on distribution estimates and are used to study parameters that obey the normal distribution law.

The parameters of the linear regression equation are needed to identify the type of dependence and the regression function, and to evaluate the indicators of the selected relationship formula. The correlation field is used as a method for identifying a relationship. To do this, all existing data must be depicted graphically: all known data are plotted in a rectangular two-dimensional coordinate system. This is how a correlation field is formed. The values of the explanatory factor are marked along the abscissa axis, while the values of the dependent factor are marked along the ordinate axis. If there is a functional relationship between the parameters, the points line up in the form of a line.

If the correlation coefficient of such data is less than 30%, we can speak of an almost complete absence of a relationship. If it is between 30% and 70%, this indicates a relationship of medium closeness. A value of 100% is evidence of a functional relationship.

A nonlinear regression equation, just like a linear one, must be supplemented with a correlation index (R).

Correlation for Multiple Regression

The coefficient of determination is the square of the multiple correlation coefficient. It describes how closely the presented set of indicators is related to the characteristic being studied. It can also describe the nature of the influence of the parameters on the result. The multiple regression equation is evaluated using this indicator.

In order to calculate the multiple correlation indicator, it is necessary to calculate its index.

Least squares method

This method is a way of estimating the regression coefficients. Its essence is to minimize the sum of squared deviations of the observed values of the dependent variable from the values given by the function.

A pairwise linear regression equation can be estimated using such a method. This type of equations is used when a paired linear relationship is detected between indicators.

Equation Parameters

Each parameter of the linear regression function has a specific meaning. The paired linear regression equation contains two parameters: c and m. The parameter m shows the average change in the resulting indicator y when the variable x decreases (increases) by one conventional unit. If the variable x is zero, then the function equals the parameter c. If the variable x is not zero, the factor c carries no economic meaning; the only influence on the function is the sign in front of the factor c. A minus indicates that the result changes more slowly than the factor, while a plus indicates an accelerated change of the result.

Each parameter of the regression equation can be expressed through an equation. For example, the factor c has the form c = ȳ - m·x̄.

Grouped data

There are problem settings in which all information is grouped by the attribute x, and for each group the corresponding average values of the dependent indicator are given. In this case, the average values characterize how the indicator depending on x changes. Thus, the grouped information helps to find the regression equation and is used as a way of analyzing relationships. However, this method has its drawbacks. Unfortunately, average indicators are often subject to external fluctuations. These fluctuations do not reflect the pattern of the relationship; they merely mask it with "noise." Averages show the pattern of the relationship much worse than a linear regression equation does. However, they can be used as a basis for finding an equation. By multiplying the size of an individual group by the corresponding average, one can obtain the sum of y within that group. Next, you need to add up all the sums obtained to find the total indicator y. It is a little more difficult to make calculations with the sum indicator xy. If the intervals are small, we can conditionally take the x indicator for all units (within the group) to be the same. It should be multiplied by the sum of y to find the sum of the products of x and y. Finally, all the sums are added together and the total sum xy is obtained.

Multiple pairwise regression equation: assessing the importance of a relationship

As discussed earlier, multiple regression has a function of the form y = f(x1, x2, …, xm) + E. Most often, such an equation is used to solve problems of supply and demand for a product, of interest income on repurchased shares, and to study the causes and form of the production cost function. It is also actively used in a wide variety of macroeconomic studies and calculations, but at the microeconomic level this equation is used somewhat less frequently.

The main task of multiple regression is to build a model for data containing a large amount of information in order to determine what influence each of the factors has, individually and in their totality, on the indicator to be modeled and its coefficients. The regression equation can take a wide variety of forms. In this case, two types of functions are usually used to assess the relationship: linear and nonlinear.

A linear function is depicted in the form of the following relationship: y = a0 + a1·x1 + a2·x2 + … + am·xm. Here a1, a2, …, am are considered "pure" regression coefficients. They characterize the average change in the parameter y with a change (decrease or increase) in each corresponding parameter x by one unit, under the condition that the values of the other indicators remain stable.

Nonlinear equations take, for example, the form of a power function: y = a·x1^b1·x2^b2·…·xm^bm. In this case, the exponents b1, b2, …, bm are called elasticity coefficients; they show by what percentage the result changes with an increase (decrease) of the corresponding indicator x by 1%, with the other factors held stable.
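A small sketch of how such a power (elasticity) model can be estimated in practice: taking logarithms turns y = a·x1^b1·x2^b2 into a model that is linear in ln x, which ordinary least squares can fit. The data and parameter values below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data generated from y = 2 * x1^0.7 * x2^0.3 with multiplicative noise
x1 = rng.uniform(1, 10, 200)
x2 = rng.uniform(1, 10, 200)
y = 2.0 * x1**0.7 * x2**0.3 * np.exp(rng.normal(0, 0.05, 200))

# Log-transform: ln y = ln a + b1*ln x1 + b2*ln x2
X = np.column_stack([np.ones_like(x1), np.log(x1), np.log(x2)])
coef, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)

a, b1, b2 = np.exp(coef[0]), coef[1], coef[2]
print(f"a ~ {a:.2f}, elasticities b1 ~ {b1:.2f}, b2 ~ {b2:.2f}")
```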

What factors need to be taken into account when constructing multiple regression

In order to correctly build multiple regression, it is necessary to find out which factors should be paid special attention to.

It is necessary to have some understanding of the nature of the relationships between economic factors and what is being modeled. Factors that will need to be included must meet the following criteria:

  • They must be subject to quantitative measurement. In order to use a factor that describes the quality of an object, it should in any case be given a quantitative form.
  • There should be no intercorrelation of factors, nor a functional relationship between them. Such situations most often lead to irreversible consequences: the system of normal equations becomes ill-conditioned, which entails unreliable and unstable estimates.
  • With a very high correlation indicator, there is no way to determine the isolated influence of individual factors on the final result, so the coefficients become uninterpretable.

Construction methods

There are a great many methods and techniques that explain how factors can be selected for an equation. However, all of them are based on selecting coefficients using a correlation indicator. Among them are:

  • Elimination method.
  • Switching method.
  • Stepwise regression analysis.

The first method involves filtering out coefficients from the total set. The second involves introducing additional factors one by one. The third is the elimination of factors that were previously included in the equation. Each of these methods has a right to exist. They have their pros and cons, but they all can solve the issue of discarding unnecessary indicators in their own way. As a rule, the results obtained by each individual method are quite close.

Multivariate analysis methods

Such methods of factor determination are based on considering individual combinations of interrelated characteristics. These include discriminant analysis, pattern recognition, principal component analysis, and cluster analysis. In addition, there is also factor analysis, which appeared as a development of the component method. All of them are applied in certain circumstances, subject to certain conditions and factors.

The purpose of regression analysis is to measure the relationship between a dependent variable and one (pairwise regression analysis) or more (multiple) independent variables. Independent variables are also called factor, explanatory, determinant, regressor and predictor variables.

The dependent variable is sometimes called the determined, explained, or “response” variable. The extremely widespread use of regression analysis in empirical research is not only due to the fact that it is a convenient tool for testing hypotheses. Regression, especially multiple regression, is an effective method for modeling and forecasting.

Let's start explaining the principles of working with regression analysis with a simpler one - the pair method.

Paired Regression Analysis

The first steps in regression analysis are almost identical to those we took when calculating the correlation coefficient. The three main conditions for the effectiveness of correlation analysis using the Pearson method - normal distribution of the variables, interval measurement of the variables, and a linear relationship between the variables - are also relevant for multiple regression. Accordingly, at the first stage, scatterplots are constructed, a statistical and descriptive analysis of the variables is carried out, and a regression line is calculated. As in correlation analysis, regression lines are fitted using the least squares method.

To more clearly illustrate the differences between the two methods of data analysis, let us turn to the example already discussed with the variables "SPS support" and "rural population share". The source data are identical. The difference in the scatterplots is that in regression analysis it is correct to plot the dependent variable - in our case "SPS support" - on the Y-axis, whereas in correlation analysis this does not matter. After removing outliers, the scatterplot looks like this:

The fundamental idea of ​​regression analysis is that, having a general trend for the variables - in the form of a regression line - it is possible to predict the value of the dependent variable, given the values ​​of the independent one.

Let's imagine an ordinary mathematical linear function. Any straight line in Euclidean space can be described by the formula y = a + b·x,

where a is a constant that specifies the displacement along the ordinate axis; b is a coefficient that determines the angle of inclination of the line.

Knowing the slope and constant, you can calculate (predict) the value of y for any x.

This simplest function formed the basis of the regression analysis model with the caveat that we will not predict the value of y exactly, but within a certain confidence interval, i.e. approximately.

The constant a is the point of intersection of the regression line and the Y-axis (the Y-intercept, usually labeled "intercept" in statistical packages). In our example with voting for the Union of Right Forces, its rounded value is 10.55. The slope coefficient b is approximately -0.1 (as in correlation analysis, the sign shows the type of relationship - direct or inverse). Thus, the resulting model has the form SPS = -0.1 × Rural pop. + 10.55.

For example, for the Republic of Adygea, where the share of the rural population is 47%, the predicted value is SPS = -0.10 × 47 + 10.55 = 5.63.

The difference between the original and predicted values is called the residual (we have already encountered this term, fundamental for statistics, when analyzing contingency tables). So, for the case of the "Republic of Adygea" the residual equals 3.92 - 5.63 = -1.71. The larger the absolute value of the residual, the less accurate the prediction.
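A small sketch reproducing these numbers in code; the model coefficients (-0.1 and 10.55) and the Adygea values are taken from the text above, the rest of the scaffolding is mine. Note that with the rounded coefficients the prediction comes out near 5.85; the 5.63 quoted in the text reflects the unrounded estimates.

```python
# Paired regression model from the text (rounded coefficients):
# SPS support = -0.1 * rural_share + 10.55
INTERCEPT = 10.55
SLOPE = -0.10

def predict_sps(rural_share: float) -> float:
    """Predicted SPS support (%) for a given rural population share (%)."""
    return SLOPE * rural_share + INTERCEPT

# Republic of Adygea: rural share 47%, observed SPS support 3.92%
predicted = predict_sps(47)   # ~5.85 with the rounded coefficients
residual = 3.92 - predicted
print(round(predicted, 2), round(residual, 2))
```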

We calculate the predicted values ​​and residuals for all cases:
Case                         Rural pop.   SPS (original)   SPS (predicted)   Residual
Republic of Adygea           47           3.92             5.63              -1.71
Altai Republic               76           5.40             2.59              2.81
Republic of Bashkortostan    36           6.04             6.78              -0.74
Republic of Buryatia         41           8.36             6.25              2.11
Republic of Dagestan         59           1.22             4.37              -3.15
Republic of Ingushetia       59           0.38             4.37              -3.99
Etc.

Analysis of the relationship between the original and predicted values serves to assess the quality of the resulting model and its predictive ability. One of the main indicators of regression statistics is the multiple correlation coefficient R - the correlation coefficient between the original and predicted values of the dependent variable. In paired regression analysis, it equals the usual Pearson correlation coefficient between the dependent and independent variables, in our case 0.63. To interpret multiple R meaningfully, it must be converted into the coefficient of determination. This is done in the same way as in correlation analysis - by squaring. The coefficient of determination R-squared (R²) shows the proportion of variation in the dependent variable that is explained by the independent variable(s).

In our case, R² = 0.39 (0.63²); this means that the variable "rural population share" explains approximately 40% of the variation in the variable "SPS support". The larger the coefficient of determination, the higher the quality of the model.

Another indicator of model quality is the standard error of the estimate. It is a measure of how widely the points are "scattered" around the regression line. The measure of spread for interval variables is the standard deviation; accordingly, the standard error of the estimate is the standard deviation of the distribution of the residuals. The higher its value, the greater the scatter and the worse the model. In our case, the standard error is 2.18. It is by this amount that our model "errs on average" when predicting the value of the "SPS support" variable.

Regression statistics also include analysis of variance. With its help, we find out: 1) what proportion of the variation (variance) of the dependent variable is explained by the independent variable; 2) what proportion of the variance of the dependent variable is accounted for by the residuals (the unexplained part); 3) what the ratio of these two quantities is (the F-ratio). The analysis-of-variance statistics are especially important for sample studies: they show how likely it is that the relationship between the independent and dependent variables exists in the population. However, for complete (census-type) studies, as in our example, the results of the analysis of variance are less useful. In this case, one checks whether the identified statistical pattern is caused by a coincidence of random circumstances, and how characteristic it is of the set of conditions in which the population under study finds itself; that is, what is established is not whether the result obtained holds for some broader population, but the degree of its regularity and freedom from random influences.

In our case, the ANOVA statistics are as follows:

            SS       df    MS       F       Significance F
Regression  258.77   1     258.77   54.29   0.000000001
Residual    395.59   83    4.77
Total       654.36   84

The F-ratio of 54.29 is significant at the 0.0000000001 level. Accordingly, we can confidently reject the null hypothesis (that the relationship we discovered is due to chance).
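A brief sketch of how this F-ratio follows from the sums of squares in the table above (the SS and df values are copied from the table; the p-value computation via SciPy is my addition):

```python
from scipy import stats

ss_regression, df_regression = 258.77, 1
ss_residual, df_residual = 395.59, 83

ms_regression = ss_regression / df_regression   # mean square for regression
ms_residual = ss_residual / df_residual         # mean square for residuals

f_ratio = ms_regression / ms_residual           # ~54.3
p_value = stats.f.sf(f_ratio, df_regression, df_residual)
print(round(f_ratio, 2), p_value)
```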

The t-test performs a similar function, but for the regression coefficients (the slope and the Y-intercept). Using the t-test, we test the hypothesis that in the population the regression coefficients are equal to zero. In our case, we can again confidently reject the null hypothesis.

Multiple regression analysis

The multiple regression model is almost identical to the paired regression model; the only difference is that several independent variables are sequentially included in the linear function:

Y = b1X1 + b2X2 + …+ bpXp + a.

If there are more than two independent variables, we are not able to get a visual idea of ​​their relationship; in this regard, multiple regression is less “visual” than pairwise regression. When you have two independent variables, it can be useful to display the data in a 3D scatterplot. In professional statistical software packages (for example, Statistica) there is an option to rotate a three-dimensional chart, which allows you to visually represent the structure of the data well.

When working with multiple regression, as opposed to pairwise regression, it is necessary to determine the analysis algorithm. The standard algorithm includes all available predictors in the final regression model. The step-by-step algorithm involves the sequential inclusion (exclusion) of independent variables based on their explanatory “weight”. The stepwise method is good when there are many independent variables; it “cleanses” the model of frankly weak predictors, making it more compact and concise.

An additional condition for the correctness of multiple regression (along with interval measurement, normality and linearity) is the absence of multicollinearity - the presence of strong correlations between the independent variables.
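A minimal sketch of one common way to check for multicollinearity before fitting: inspect the correlation matrix of the predictors (the data here are invented).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical predictors: x2 is built to be strongly correlated with x1
x1 = rng.normal(size=100)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=100)
x3 = rng.normal(size=100)

predictors = np.column_stack([x1, x2, x3])
corr = np.corrcoef(predictors, rowvar=False)
print(np.round(corr, 2))  # an off-diagonal value near +/-1 signals multicollinearity
```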

The interpretation of multiple regression statistics includes all the elements we considered for the case of pairwise regression. In addition, there are other important components to the statistics of multiple regression analysis.

We will illustrate the work with multiple regression using the example of testing hypotheses that explain differences in the level of electoral activity across Russian regions. Specific empirical studies have suggested that voter turnout levels are influenced by:

National factor (variable “Russian population”; operationalized as the share of the Russian population in the constituent entities of the Russian Federation). It is assumed that an increase in the share of the Russian population leads to a decrease in voter turnout;

Urbanization factor (the “urban population” variable; operationalized as the share of the urban population in the constituent entities of the Russian Federation; we have already worked with this factor as part of the correlation analysis). It is assumed that an increase in the share of the urban population also leads to a decrease in voter turnout.

The dependent variable - “intensity of electoral activity” (“active”) is operationalized through average turnout data by region in federal elections from 1995 to 2003. The initial data table for two independent and one dependent variable will be as follows:

Case                          Activity   Urban pop.   Russian pop.
Republic of Adygea            64.92      53           68
Altai Republic                68.60      24           60
Republic of Buryatia          60.75      59           70
Republic of Dagestan          79.92      41           9
Republic of Ingushetia        75.05      41           23
Republic of Kalmykia          68.52      39           37
Karachay-Cherkess Republic    66.68      44           42
Republic of Karelia           61.70      73           73
Komi Republic                 59.60      74           57
Mari El Republic              65.19      62           47

Etc. (after removing outliers, 83 of the 88 cases remain)

Statistics describing the quality of the model:

1. Multiple R = 0.62; R-squared = 0.38. Consequently, the national factor and the urbanization factor together explain about 38% of the variation in the "electoral activity" variable.

2. The standard error of the estimate is 3.38. This is exactly how much the constructed model "errs on average" when predicting the level of turnout.

3. The F-ratio of explained to unexplained variation is 25.2, significant at the 0.000000003 level. The null hypothesis that the identified relationships are random is rejected.

4. The t-test for the constant and for the regression coefficients of the variables "urban population" and "Russian population" is significant at the levels of 0.0000001, 0.00005 and 0.007, respectively. The null hypothesis that the coefficients are random is rejected.

Additional useful statistics for analyzing the relationship between the original and predicted values of the dependent variable are the Mahalanobis distance and Cook's distance. The first is a measure of the uniqueness of a case (it shows how much the combination of values of all independent variables for a given case deviates from the mean of all independent variables simultaneously). The second is a measure of the influence of a case. Different observations affect the slope of the regression line to different degrees, and Cook's distance can be used to compare them on this indicator. This can be useful when cleaning up outliers (an outlier can be thought of as an overly influential case).
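A small sketch of how Cook's distance can be computed from scratch for a linear model (the data are simulated; the leverages come from the hat matrix, and the formula D_i = e_i²·h_i / (p·MSE·(1 - h_i)²) is the standard one):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical data with two predictors
n, p = 50, 3                                   # p = number of parameters (incl. intercept)
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta
mse = residuals @ residuals / (n - p)

# Leverages are the diagonal of the hat matrix H = X (X'X)^-1 X'
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

cooks_d = residuals**2 * leverage / (p * mse * (1 - leverage) ** 2)
print("most influential case:", int(np.argmax(cooks_d)), round(cooks_d.max(), 3))
```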

In our example, unique and influential cases include Dagestan.

Case                     Original values   Predicted values   Residuals   Mahalanobis dist.   Cook's dist.
Adygea                   64.92             66.33              -1.40       0.69                0.00
Altai Republic           68.60             69.91              -1.31       6.80                0.01
Republic of Buryatia     60.75             65.56              -4.81       0.23                0.01
Republic of Dagestan     79.92             71.01              8.91        10.57               0.44
Republic of Ingushetia   75.05             70.21              4.84        6.73                0.08
Republic of Kalmykia     68.52             69.59              -1.07       4.20                0.00

The regression model itself has the following parameters: Y-intercept (constant) = 75.99; b (urban pop.) = -0.1; b (Russian pop.) = -0.06. The final formula is: Activity = 75.99 - 0.10 × Urban pop. - 0.06 × Russian pop.

In previous posts, the analysis often focused on a single numerical variable, such as mutual fund returns, Web page loading times, or soft drink consumption. In this and subsequent notes, we will look at methods for predicting the values ​​of a numeric variable depending on the values ​​of one or more other numeric variables.

The material will be illustrated with a cross-cutting example: forecasting sales volume in a clothing store. The Sunflowers chain of discount clothing stores has been constantly expanding for 25 years. However, the company currently does not have a systematic approach to selecting new outlets. The location in which the company intends to open a new store is determined based on subjective considerations. The selection criteria are favorable rental conditions or the manager's idea of the ideal store location. Imagine that you are the head of the special projects and planning department. You have been tasked with developing a strategic plan for opening new stores. This plan should include a forecast of annual sales for newly opened stores. You believe that retail space is directly related to revenue and want to factor this into your decision-making process. How do you develop a statistical model to predict annual sales based on the size of a new store?

Typically, regression analysis is used to predict the values of a variable. Its goal is to develop a statistical model that can predict the values of a dependent variable, or response, from the values of at least one independent, or explanatory, variable. In this note, we will look at simple linear regression - a statistical method that allows you to predict the values of a dependent variable Y from the values of an independent variable X. Subsequent notes will describe the multiple regression model, designed to predict the values of a dependent variable Y based on the values of several independent variables (X1, X2, …, Xk).


Types of regression models

The Durbin-Watson statistic is defined as D = Σ(e_i - e_{i-1})² / Σe_i² ≈ 2(1 - ρ1), where ρ1 is the first-order autocorrelation coefficient of the residuals. If ρ1 = 0 (no autocorrelation), D ≈ 2; if ρ1 ≈ 1 (positive autocorrelation), D ≈ 0; if ρ1 = -1 (negative autocorrelation), D ≈ 4.

In practice, the Durbin-Watson criterion is applied by comparing the value of D with the critical theoretical values d_L and d_U for a given number of observations n, number of independent variables of the model k (for simple linear regression k = 1), and significance level α. If D < d_L, the hypothesis of independence of the random deviations is rejected (hence, there is positive autocorrelation); if D > d_U, the hypothesis is not rejected (that is, there is no autocorrelation); if d_L < D < d_U, there are not sufficient grounds for a decision. When the calculated value of D exceeds 2, it is not D itself but the expression (4 - D) that is compared with d_L and d_U.

To calculate the Durbin-Watson statistic in Excel, let's turn to the bottom table in Fig. 14, Residual output. The numerator of the statistic (the sum of squared differences of successive residuals) is calculated with the function =SUMXMY2(array1;array2), and the denominator (the sum of squared residuals) with =SUMSQ(array) (Fig. 16).

Fig. 16. Formulas for calculating the Durbin-Watson statistic

In our example, D = 0.883. The main question is: what value of the Durbin-Watson statistic should be considered small enough to conclude that positive autocorrelation exists? It is necessary to compare the value of D with the critical values (d_L and d_U), which depend on the number of observations n and the significance level α (Fig. 17).

Fig. 17. Critical values of the Durbin-Watson statistic (fragment of the table)

Thus, in the problem of sales volume for a store delivering goods to homes, there is one independent variable (k = 1), 15 observations (n = 15), and a significance level α = 0.05. Hence, d_L = 1.08 and d_U = 1.36. Since D = 0.883 < d_L = 1.08, there is positive autocorrelation between the residuals, and the least squares method cannot be applied.
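A minimal sketch of the same calculation outside Excel (the residuals here are invented; only the formula matches the description above):

```python
import numpy as np

# Hypothetical residuals from a fitted regression, in time order
e = np.array([1.2, 0.9, 1.1, 0.7, 0.8, 0.4, 0.6, 0.1, -0.2, -0.5,
              -0.4, -0.8, -1.0, -1.2, -1.5])

# Durbin-Watson: sum of squared successive differences over sum of squared residuals
d = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(round(d, 3))  # values well below 2 suggest positive autocorrelation
```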

Testing Hypotheses about Slope and Correlation Coefficient

Above, regression was used solely for forecasting. To determine the regression coefficients and predict the value of the variable Y for a given value of the variable X, the least squares method was used. In addition, we examined the standard error of the estimate and the coefficient of determination. If the analysis of the residuals confirms that the conditions of applicability of the least squares method are not violated, and the simple linear regression model is adequate, then, based on the sample data, it can be argued that there is a linear relationship between the variables in the population.

Applying the t-test for the slope. By testing whether the population slope β1 is equal to zero, you can determine whether there is a statistically significant relationship between the variables X and Y. If this hypothesis is rejected, it can be argued that there is a linear relationship between X and Y. The null and alternative hypotheses are formulated as follows: H0: β1 = 0 (there is no linear relationship), H1: β1 ≠ 0 (there is a linear relationship). By definition, the t-statistic is equal to the difference between the sample slope and the hypothesized population slope, divided by the standard error of the slope estimate:

(11) t = (b1 - β1) / Sb1

where b1 is the slope of the regression line fitted to the sample data, β1 is the hypothesized slope of the population regression line, Sb1 is the standard error of the slope estimate, and the test statistic t has a t-distribution with n - 2 degrees of freedom.

Let's check whether there is a statistically significant relationship between store size and annual sales at α = 0.05. The t-test results are displayed along with other parameters when the Analysis ToolPak (Regression option) is used. The complete results of the Analysis ToolPak are shown in Fig. 4, and the fragment related to the t-statistic in Fig. 18.

Fig. 18. Results of applying the t-test

Since the number of stores is n = 14 (see Fig. 3), the critical values of the t-statistic at a significance level of α = 0.05 can be found with the formulas: t_L = T.INV(0.025;12) = -2.1788, where 0.025 is half the significance level and 12 = n - 2; t_U = T.INV(0.975;12) = +2.1788.

Since the t-statistic = 10.64 > t_U = 2.1788 (Fig. 19), the null hypothesis H0 is rejected. On the other hand, the p-value for t = 10.6411, calculated by the formula =1-T.DIST(D3;12;TRUE), is approximately equal to zero, so the hypothesis H0 is again rejected. The fact that the p-value is almost zero means that if there were no true linear relationship between store size and annual sales, it would be virtually impossible to detect it using linear regression. Therefore, there is a statistically significant linear relationship between average annual store sales and store size.

Fig. 19. Testing the hypothesis about the population slope at a significance level of 0.05 and 12 degrees of freedom
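A short sketch of the same slope t-test using the numbers quoted in this section (b1 = 1.670, Sb1 = 0.157, n = 14); the SciPy calls are my addition:

```python
from scipy import stats

b1 = 1.670       # sample slope
se_b1 = 0.157    # standard error of the slope
n = 14           # number of stores

t_stat = (b1 - 0.0) / se_b1                      # H0: beta1 = 0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)  # two-sided p-value
t_crit = stats.t.ppf(0.975, df=n - 2)            # ~2.1788

print(round(t_stat, 2), p_value, round(t_crit, 4))
```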

Applying the F-test for the slope. An alternative approach to testing hypotheses about the slope of a simple linear regression is to use the F-test. Recall that the F-test is used to compare two variances (for more details, see ). When testing the slope hypothesis, the measure of random errors is the error variance (the sum of squared errors divided by the number of degrees of freedom), so the F-test uses the ratio of the variance explained by the regression (i.e. the value SSR divided by the number of independent variables k) to the error variance (MSE = S²YX).

By definition, the F-statistic is equal to the mean square of the regression (MSR) divided by the error variance (MSE): F = MSR / MSE, where MSR = SSR / k, MSE = SSE / (n - k - 1), and k is the number of independent variables in the regression model. The test statistic F has an F-distribution with k and n - k - 1 degrees of freedom.

For a given significance level α, the decision rule is formulated as follows: if F > F_U, the null hypothesis is rejected; otherwise it is not rejected. The results, presented in the form of a summary analysis-of-variance table, are shown in Fig. 20.

Fig. 20. Analysis-of-variance table for testing the hypothesis about the statistical significance of the regression coefficient

Like the t-test, the F-test is displayed in the table when the Analysis ToolPak (Regression option) is used. The full results of the Analysis ToolPak are shown in Fig. 4, and the fragment related to the F-statistic in Fig. 21.

Fig. 21. Results of applying the F-test obtained using the Excel Analysis ToolPak

The F-statistic is 113.23, and the p-value is close to zero (the Significance F cell). If the significance level α is 0.05, the critical value of the F-distribution with one and 12 degrees of freedom can be obtained with the formula F_U = F.INV(1-0.05;1;12) = 4.7472 (Fig. 22). Since F = 113.23 > F_U = 4.7472, and the p-value close to 0 is less than 0.05, the null hypothesis H0 is rejected, i.e. the size of a store is closely related to its annual sales volume.

Fig. 22. Testing the population slope hypothesis at a significance level of 0.05 with one and 12 degrees of freedom

Confidence interval containing the slope β1. To test the hypothesis that there is a linear relationship between the variables, you can construct a confidence interval containing the slope β1 and check whether the hypothesized value β1 = 0 belongs to this interval. The center of the confidence interval containing the slope β1 is the sample slope b1, and its boundaries are the quantities b1 ± t(n-2)·Sb1.

As shown in Fig. 18, b1 = +1.670, n = 14, Sb1 = 0.157. t12 = T.INV(0.975;12) = 2.1788. Hence, b1 ± t(n-2)·Sb1 = +1.670 ± 2.1788 × 0.157 = +1.670 ± 0.342, or +1.328 ≤ β1 ≤ +2.012. Thus, with probability 0.95 the population slope lies between +1.328 and +2.012 (i.e. between $1,328,000 and $2,012,000). Since these values are greater than zero, there is a statistically significant linear relationship between annual sales and store area. If the confidence interval contained zero, there would be no relationship between the variables. In addition, the confidence interval means that each increase in store area by 1,000 sq. ft. results in an increase in average sales volume of between $1,328,000 and $2,012,000.
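The same interval in a few lines of code (the numbers are taken from the text above):

```python
from scipy import stats

b1, se_b1, n = 1.670, 0.157, 14

t_crit = stats.t.ppf(0.975, df=n - 2)   # 2.1788 for a 95% interval
half_width = t_crit * se_b1

lower, upper = b1 - half_width, b1 + half_width
print(f"95% CI for the slope: [{lower:.3f}; {upper:.3f}]")  # about [1.328; 2.012]
```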

Using the t-test for the correlation coefficient. Earlier the correlation coefficient r was introduced as a measure of the relationship between two numeric variables. It can be used to determine whether there is a statistically significant relationship between two variables. Let us denote the population correlation coefficient between both variables by the symbol ρ. The null and alternative hypotheses are formulated as follows: H0: ρ = 0 (no correlation), H1: ρ ≠ 0 (there is a correlation). Testing for the existence of a correlation:

(12) t = r·√(n - 2) / √(1 - r²), where r = +√r² if b1 > 0 and r = -√r² if b1 < 0. The test statistic t has a t-distribution with n - 2 degrees of freedom.

In the problem about the Sunflowers chain of stores, r² = 0.904 and b1 = +1.670 (see Fig. 4). Since b1 > 0, the correlation coefficient between annual sales and store size is r = +√0.904 = +0.951. Let's test the null hypothesis that there is no correlation between these variables using the t-statistic:

At a significance level of α = 0.05, the null hypothesis should be rejected because t= 10.64 > 2.1788. Thus, it can be argued that there is a statistically significant relationship between annual sales and store size.

When discussing inferences regarding population slope, confidence intervals and hypothesis tests are used interchangeably. However, calculating the confidence interval containing the correlation coefficient turns out to be more difficult, since the type of sampling distribution of the statistic r depends on the true correlation coefficient.

Estimation of mathematical expectation and prediction of individual values

This section discusses methods for estimating the mathematical expectation of the response Y and predicting individual values of Y for given values of the variable X.

Constructing a confidence interval. In example 2 (see the section Least squares method above), the regression equation made it possible to predict the value of the variable Y for a given value of the variable X. In the problem of choosing a location for a retail outlet, the average annual sales volume in a store with an area of 4,000 sq. feet was equal to 7.644 million dollars. However, this estimate of the mathematical expectation of the general population is a point estimate. To estimate the mathematical expectation of the population, the concept of a confidence interval was proposed. Similarly, one can introduce the concept of a confidence interval for the mathematical expectation of the response for a given value of the variable X:

(13) Ŷi ± t(n-2) · SYX · √(1/n + (Xi - X̄)² / SSX)

where Ŷi = b0 + b1·Xi is the predicted value of the variable Y at X = Xi, SYX is the standard error of the estimate, n is the sample size, Xi is the specified value of the variable X, µ(Y|X=Xi) is the mathematical expectation of the variable Y at X = Xi, and SSX = Σ(Xi - X̄)².

Analysis of formula (13) shows that the width of the confidence interval depends on several factors. For a given significance level, an increase in the amplitude of fluctuations around the regression line, measured by the standard error of the estimate, leads to an increase in the width of the interval. On the other hand, as one would expect, an increase in the sample size is accompanied by a narrowing of the interval. In addition, the width of the interval changes depending on the value of Xi. If the value of Y is predicted for values of X close to the mean X̄, the confidence interval turns out to be narrower than when predicting the response for values far from the mean.

Let's say that when choosing a store location, we want to construct a 95% confidence interval for the average annual sales of all stores whose area is 4,000 sq. feet:

Therefore, the average annual sales volume of all stores with an area of 4,000 sq. feet lies, with 95% probability, in the range from 6.971 to 8.317 million dollars.

Calculating the confidence interval for a predicted value. In addition to the confidence interval for the mathematical expectation of the response for a given value of the variable X, it is often necessary to know the confidence interval for a predicted value. Although the formula for such a confidence interval is very similar to formula (13), this interval contains the predicted value rather than an estimate of the parameter. The interval for the predicted response Y(X=Xi) for a specific value Xi is determined by the formula:

(14) Ŷi ± t(n-2) · SYX · √(1 + 1/n + (Xi - X̄)² / SSX)

Suppose that, when choosing a location for a retail outlet, we want to construct a 95% confidence interval for the predicted annual sales volume of a store whose area is 4,000 sq. feet:

Therefore, the predicted annual sales volume for a store with an area of 4,000 sq. feet lies, with 95% probability, in the range from 5.433 to 9.854 million dollars. As we can see, the confidence interval for the predicted response value is much wider than the confidence interval for its mathematical expectation. This is because the variability in predicting individual values is much greater than in estimating the mathematical expectation.
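A minimal sketch of both intervals from formulas (13) and (14) as reconstructed above; all the data here are invented, so the resulting numbers will not match the store example:

```python
import numpy as np
from scipy import stats

# Hypothetical sample: store area (thousands of sq. ft) and annual sales ($ millions)
x = np.array([1.7, 1.6, 2.8, 5.6, 1.3, 2.2, 1.3, 1.1, 3.2, 1.5, 5.2, 4.6, 5.8, 3.0])
y = np.array([3.7, 3.9, 6.7, 9.5, 3.4, 5.6, 3.7, 2.7, 5.5, 2.9, 10.7, 7.6, 11.8, 4.1])

n = len(x)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

resid = y - (b0 + b1 * x)
s_yx = np.sqrt(np.sum(resid ** 2) / (n - 2))        # standard error of the estimate
ssx = np.sum((x - x.mean()) ** 2)
t_crit = stats.t.ppf(0.975, df=n - 2)

x_i = 4.0                                           # store of 4,000 sq. ft
y_hat = b0 + b1 * x_i
h = 1 / n + (x_i - x.mean()) ** 2 / ssx

ci_mean = (y_hat - t_crit * s_yx * np.sqrt(h),        # CI for the mean response, formula (13)
           y_hat + t_crit * s_yx * np.sqrt(h))
pi_single = (y_hat - t_crit * s_yx * np.sqrt(1 + h),  # prediction interval, formula (14)
             y_hat + t_crit * s_yx * np.sqrt(1 + h))
print(np.round(ci_mean, 3), np.round(pi_single, 3))
```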

Pitfalls and ethical issues associated with using regression

Difficulties associated with regression analysis:

  • Ignoring the conditions of applicability of the least squares method.
  • Erroneous assessment of the conditions for the applicability of the least squares method.
  • Incorrect choice of alternative methods when the conditions of applicability of the least squares method are violated.
  • Application of regression analysis without deep knowledge of the subject of research.
  • Extrapolating a regression beyond the range of the explanatory variable.
  • Confusion between statistical and causal relationships.

The widespread use of spreadsheets and statistical software has eliminated the computational problems that had hampered the use of regression analysis. However, this led to the fact that regression analysis was used by users who did not have sufficient qualifications and knowledge. How can users know about alternative methods if many of them have no idea at all about the conditions of applicability of the least squares method and do not know how to check their implementation?

The researcher should not get carried away with crunching numbers - calculating the intercept, the slope and the coefficient of determination. He needs deeper knowledge. Let's illustrate this with a classic example taken from textbooks. Anscombe showed that all four data sets shown in Fig. 23 have the same regression parameters (Fig. 24).

Fig. 23. Four artificial data sets

Fig. 24. Regression analysis of the four artificial data sets, performed with the Analysis ToolPak

So, from the point of view of regression analysis, all these data sets are completely identical. If the analysis ended there, we would lose a lot of useful information. This is evidenced by the scatter plots (Figure 25) and residual plots (Figure 26) constructed for these data sets.

Fig. 25. Scatter plots for the four data sets

Scatter plots and residual plots indicate that these data sets differ from each other. The only set distributed along a straight line is set A. The plot of the residuals calculated from set A does not show any pattern. The same cannot be said of sets B, C and D. The scatter plot for set B shows a pronounced quadratic pattern. This conclusion is confirmed by the residual plot, which has a parabolic shape. The scatter plot and residual plot show that data set C contains an outlier. In this situation, it is necessary to exclude the outlier from the data set and repeat the analysis. The technique for detecting and eliminating outliers in observations is called influence analysis. After eliminating the outlier, the result of re-estimating the model may be completely different. The scatter plot for data set D illustrates an unusual situation in which the empirical model depends significantly on a single response (X8 = 19, Y8 = 12.5). Such regression models must be calculated especially carefully. So, scatter and residual plots are an essential tool of regression analysis and should be an integral part of it. Without them, regression analysis is not credible.

Fig. 26. Residual plots for the four data sets

How to avoid pitfalls in regression analysis:

  • Always start the analysis of a possible relationship between the variables X and Y by drawing a scatter plot.
  • Before interpreting the results of regression analysis, check the conditions for its applicability.
  • Plot the residuals versus the independent variable. This makes it possible to determine how well the empirical model matches the observations and to detect violations of constant variance.
  • Use histograms, stem-and-leaf plots, boxplots, and normal distribution plots to test the assumption of a normal error distribution.
  • If the conditions for applicability of the least squares method are not met, use alternative methods (for example, quadratic or multiple regression models).
  • If the conditions for the applicability of the least squares method are met, it is necessary to test the hypothesis about the statistical significance of the regression coefficients and construct confidence intervals containing the mathematical expectation and the predicted response value.
  • Avoid predicting values ​​of the dependent variable outside the range of the independent variable.
  • Keep in mind that statistical relationships are not always cause-and-effect. Remember that correlation between variables does not mean there is a cause-and-effect relationship between them.

Summary. As shown in the block diagram (Figure 27), the note describes the simple linear regression model, the conditions of its applicability, and ways to test these conditions. The t-test for testing the statistical significance of the regression slope was considered. A regression model was used to predict the values of the dependent variable. An example related to choosing a location for a retail outlet was considered, in which the dependence of annual sales volume on store area is examined. The information obtained allows one to more accurately select a location for a store and predict its annual sales volume. The following notes will continue the discussion of regression analysis and will also look at multiple regression models.

Fig. 27. Note structure diagram

Materials from the book Levin et al. Statistics for Managers are used. – M.: Williams, 2004. – p. 792–872

If the dependent variable is categorical, logistic regression must be used.

SUMMARY OUTPUT

Table 8.3a. Regression statistics
Regression statistics
Multiple R           0.998364
R-squared            0.99673
Adjusted R-squared   0.996321
Standard error       0.42405
Observations         10

First, let's look at the top part of the calculations, presented in table 8.3a - regression statistics.

The R-squared value, also called the measure of certainty, characterizes the quality of the resulting regression line. This quality is expressed by the degree of correspondence between the source data and the regression model (the calculated data). The measure of certainty always lies within the interval [0, 1].

In most cases, the R-squared value falls between these values, called extreme values, i.e. between zero and one.

If the R-squared value is close to one, this means that the constructed model explains almost all the variability in the relevant variables. Conversely, an R-squared value close to zero means the quality of the constructed model is poor.

In our example, the measure of certainty is 0.99673, which indicates a very good fit of the regression line to the original data.

Multiple R is the multiple correlation coefficient R; it expresses the degree of dependence between the independent variables (X) and the dependent variable (Y).

Multiple R is equal to the square root of the coefficient of determination; this quantity takes values ​​in the range from zero to one.

In simple linear regression analysis, multiple R is equal to the Pearson correlation coefficient. Indeed, the multiple R in our case is equal to the Pearson correlation coefficient from the previous example (0.998364).

Table 8.3b. Regression coefficients
              Coefficients    Standard error   t-statistic
Y-intercept   2.694545455     0.33176878       8.121757129
Variable X1   2.305454545     0.04668634       49.38177965
* A truncated version of the calculations is provided

Now consider the middle part of the calculations, presented in Table 8.3b. Here the regression coefficient b (2.305454545) and the intercept on the ordinate axis, i.e. the constant a (2.694545455), are given.

Based on the calculations, we can write the regression equation as follows:

Y= x*2.305454545+2.694545455

The direction of the relationship between the variables is determined from the sign (negative or positive) of the regression coefficient b.

If the sign of the regression coefficient is positive, the relationship between the dependent variable and the independent variable is positive. In our case, the sign of the regression coefficient is positive, therefore the relationship is also positive.

If the sign of the regression coefficient is negative, the relationship between the dependent variable and the independent variable is negative (inverse).

Table 8.3c presents the residual output. In order for these results to appear in the report, you must activate the "Residuals" checkbox when running the "Regression" tool.

RESIDUAL OUTPUT

Table 8.3c. Residuals
Observation   Predicted Y    Residuals      Standard residuals
1             9.610909091    -0.610909091   -1.528044662
2             7.305454545    -0.305454545   -0.764022331
3             11.91636364    0.083636364    0.209196591
4             14.22181818    0.778181818    1.946437843
5             16.52727273    0.472727273    1.182415512
6             18.83272727    0.167272727    0.418393181
7             21.13818182    -0.138181818   -0.34562915
8             23.44363636    -0.043636364   -0.109146047
9             25.74909091    -0.149090909   -0.372915662
10            28.05454545    -0.254545455   -0.636685276

Using this part of the report, we can see the deviation of each point from the constructed regression line. The largest absolute value of the standardized residual corresponds to observation 4.
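A small sketch of how the last column can be reproduced: in this output the standard residuals appear to be each residual divided by the sample standard deviation of the residuals (an assumption on my part, checked against row 1 of the table).

```python
import numpy as np

residuals = np.array([-0.610909091, -0.305454545, 0.083636364, 0.778181818,
                      0.472727273, 0.167272727, -0.138181818, -0.043636364,
                      -0.149090909, -0.254545455])

# Sample standard deviation of the residuals (their mean is zero, ddof = 1)
s = np.sqrt(np.sum(residuals ** 2) / (len(residuals) - 1))

standard_residuals = residuals / s
print(np.round(standard_residuals, 6))  # first value ~ -1.528045, matching the table
```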
