Regression equation. Multiple regression equation. Correlation and regression analysis in Excel: execution instructions What are the requirements in the regression analysis model

Regression and correlation analysis - statistical research methods. These are the most common ways to show the dependence of a parameter on one or more independent variables.

Below, using concrete practical examples, we will consider these two very popular analyzes among economists. We will also give an example of obtaining results when they are combined.

Regression Analysis in Excel

Shows the influence of some values ​​(independent, independent) on the dependent variable. For example, how the number of economically active population depends on the number of enterprises, wages, and other parameters. Or: how do foreign investments, energy prices, etc. affect the level of GDP.

The result of the analysis allows you to prioritize. And based on the main factors, to predict, plan the development of priority areas, make management decisions.

Regression happens:

  • linear (y = a + bx);
  • parabolic (y = a + bx + cx 2);
  • exponential (y = a * exp(bx));
  • power (y = a*x^b);
  • hyperbolic (y = b/x + a);
  • logarithmic (y = b * 1n(x) + a);
  • exponential (y = a * b^x).

Consider the example of building a regression model in Excel and interpreting the results. Let's take a linear type of regression.

Task. At 6 enterprises, the average monthly salary and the number of employees who left were analyzed. It is necessary to determine the dependence of the number of retired employees on the average salary.

The linear regression model has the following form:

Y \u003d a 0 + a 1 x 1 + ... + a k x k.

Where a are the regression coefficients, x are the influencing variables, and k is the number of factors.

In our example, Y is the indicator of quit workers. The influencing factor is wages (x).

Excel has built-in functions that can be used to calculate the parameters of a linear regression model. But the Analysis ToolPak add-in will do it faster.

Activate a powerful analytical tool:

Once activated, the add-on will be available under the Data tab.

Now we will deal directly with the regression analysis.



First of all, we pay attention to the R-square and coefficients.

R-square is the coefficient of determination. In our example, it is 0.755, or 75.5%. This means that the calculated parameters of the model explain the relationship between the studied parameters by 75.5%. The higher the coefficient of determination, the better the model. Good - above 0.8. Poor - less than 0.5 (such an analysis can hardly be considered reasonable). In our example - "not bad".

The coefficient 64.1428 shows what Y will be if all the variables in the model under consideration are equal to 0. That is, other factors that are not described in the model also affect the value of the analyzed parameter.

The coefficient -0.16285 shows the weight of the variable X on Y. That is, the average monthly salary within this model affects the number of quitters with a weight of -0.16285 (this is a small degree of influence). The “-” sign indicates a negative impact: the higher the salary, the less quit. Which is fair.



Correlation analysis in Excel

Correlation analysis helps to establish whether there is a relationship between indicators in one or two samples. For example, between the operating time of the machine and the cost of repairs, the price of equipment and the duration of operation, the height and weight of children, etc.

If there is a relationship, then whether an increase in one parameter leads to an increase (positive correlation) or a decrease (negative) in the other. Correlation analysis helps the analyst determine whether the value of one indicator can predict the possible value of another.

The correlation coefficient is denoted r. Varies from +1 to -1. The classification of correlations for different areas will be different. When the coefficient value is 0, there is no linear relationship between the samples.

Consider how to use Excel to find the correlation coefficient.

The CORREL function is used to find the paired coefficients.

Task: Determine if there is a relationship between the operating time of a lathe and the cost of its maintenance.

Put the cursor in any cell and press the fx button.

  1. In the "Statistical" category, select the CORREL function.
  2. Argument "Array 1" - the first range of values ​​- the time of the machine: A2: A14.
  3. Argument "Array 2" - the second range of values ​​- the cost of repairs: B2:B14. Click OK.

To determine the type of connection, you need to look at the absolute number of the coefficient (each field of activity has its own scale).

For correlation analysis of several parameters (more than 2), it is more convenient to use "Data Analysis" ("Analysis Package" add-on). In the list, you need to select a correlation and designate an array. All.

The resulting coefficients will be displayed in the correlation matrix. Like this one:

Correlation-regression analysis

In practice, these two techniques are often used together.

Example:


Now the regression analysis data is visible.

During their studies, students very often encounter a variety of equations. One of them - the regression equation - is considered in this article. This type of equation is used specifically to describe the characteristics of the relationship between mathematical parameters. This type equalities are used in statistics and econometrics.

Definition of regression

In mathematics, regression is understood as a certain quantity that describes the dependence of the average value of a data set on the values ​​of another quantity. The regression equation shows, as a function of a particular feature, the average value of another feature. The regression function has the form of a simple equation y \u003d x, in which y acts as a dependent variable, and x is an independent variable (feature factor). In fact, the regression is expressed as y = f (x).

What are the types of relationships between variables

In general, two opposite types of relationship are distinguished: correlation and regression.

The first is characterized by equality of conditional variables. IN this case it is not known for certain which variable depends on the other.

If there is no equality between the variables and the conditions say which variable is explanatory and which is dependent, then we can talk about the presence of a connection of the second type. In order to build a linear regression equation, it will be necessary to find out what type of relationship is observed.

Types of regressions

To date, there are 7 different types of regression: hyperbolic, linear, multiple, nonlinear, pairwise, inverse, logarithmically linear.

Hyperbolic, linear and logarithmic

The linear regression equation is used in statistics to clearly explain the parameters of the equation. It looks like y = c + m * x + E. The hyperbolic equation has the form of a regular hyperbola y \u003d c + m / x + E. The logarithmically linear equation expresses the relationship using the logarithmic function: In y \u003d In c + m * In x + In E.

Multiple and non-linear

Two more complex types of regression are multiple and non-linear. The multiple regression equation is expressed by the function y \u003d f (x 1, x 2 ... x c) + E. In this situation, y is the dependent variable and x is the explanatory variable. The variable E is stochastic and includes the influence of other factors in the equation. The non-linear regression equation is a bit inconsistent. On the one hand, with respect to the indicators taken into account, it is not linear, and on the other hand, in the role of assessing indicators, it is linear.

Inverse and Pairwise Regressions

An inverse is a kind of function that needs to be converted to a linear form. In the most traditional application programs it has the form of a function y = 1/c + m*x + E. The paired regression equation shows the relationship between the data as a function of y = f(x) + E. Just like the other equations, y depends on x and E is a stochastic parameter.

The concept of correlation

This is an indicator that demonstrates the existence of a relationship between two phenomena or processes. The strength of the relationship is expressed as a correlation coefficient. Its value fluctuates within the interval [-1;+1]. A negative indicator indicates the presence feedback, positive - about a straight line. If the coefficient takes a value equal to 0, then there is no relationship. The closer the value is to 1 - the stronger the relationship between the parameters, the closer to 0 - the weaker.

Methods

Correlation parametric methods can estimate the tightness of the relationship. They are used on the basis of distribution estimates to study parameters that obey the normal distribution law.

The parameters of the linear regression equation are necessary to identify the type of dependence, the function of the regression equation and evaluate the indicators of the chosen relationship formula. The correlation field is used as a method for identifying a relationship. To do this, all existing data must be represented graphically. In a rectangular two-dimensional coordinate system, all known data must be plotted. This is how the correlation field is formed. The value of the describing factor is marked along the abscissa, while the values ​​of the dependent factor are marked along the ordinate. If there is a functional relationship between the parameters, they line up in the form of a line.

If the correlation coefficient of such data is less than 30%, we can talk about the almost complete absence of a connection. If it is between 30% and 70%, then this indicates the presence of links of medium closeness. A 100% indicator is evidence of a functional connection.

A non-linear regression equation, just like a linear one, must be supplemented with a correlation index (R).

Correlation for Multiple Regression

The coefficient of determination is an indicator of the square of the multiple correlation. He speaks about the tightness of the relationship of the presented set of indicators with the trait under study. It can also talk about the nature of the influence of parameters on the result. The multiple regression equation is evaluated using this indicator.

In order to calculate the multiple correlation index, it is necessary to calculate its index.

Least square method

This method is a way of estimating regression factors. Its essence lies in minimizing the sum of squared deviations obtained due to the dependence of the factor on the function.

A paired linear regression equation can be estimated using such a method. This type of equations is used in case of detection between the indicators of a paired linear relationship.

Equation Options

Each parameter of the linear regression function has a specific meaning. The paired linear regression equation contains two parameters: c and m. The parameter t shows the average change in the final indicator of the function y, subject to a decrease (increase) in the variable x by one conventional unit. If the variable x is zero, then the function is equal to the parameter c. If the variable x is not zero, then the factor c does not make economic sense. The only influence on the function is the sign in front of the factor c. If there is a minus, then we can say about a slow change in the result compared to the factor. If there is a plus, then this indicates an accelerated change in the result.

Each parameter that changes the value of the regression equation can be expressed in terms of an equation. For example, the factor c has the form c = y - mx.

Grouped data

There are such conditions of the task in which all information is grouped according to the attribute x, but at the same time, for a certain group, the corresponding average values ​​of the dependent indicator are indicated. In this case, the average values ​​characterize how the indicator depends on x. Thus, the grouped information helps to find the regression equation. It is used as a relationship analysis. However, this method has its drawbacks. Unfortunately, averages are often subject to external fluctuations. These fluctuations are not a reflection of the patterns of the relationship, they just mask its "noise". Averages show patterns of relationship much worse than a linear regression equation. However, they can be used as a basis for finding an equation. By multiplying the size of a particular population by the corresponding average, you can get the sum of y within the group. Next, you need to knock out all the received amounts and find the final indicator y. It is a little more difficult to make calculations with the sum indicator xy. In the event that the intervals are small, we can conditionally take the indicator x for all units (within the group) the same. Multiply it with the sum of y to find the sum of the products of x and y. Further, all the sums are knocked together and the total sum xy is obtained.

Multiple Pair Equation Regression: Assessing the Importance of a Relationship

As discussed earlier, multiple regression has a function of the form y \u003d f (x 1, x 2, ..., x m) + E. Most often, such an equation is used to solve the problem of supply and demand for goods, interest income on repurchased shares, studying the causes and type of production cost function. It is also actively used in a wide variety of macroeconomic studies and calculations, but at the level of microeconomics, this equation is used a little less frequently.

The main task of multiple regression is to build a data model containing a huge amount of information in order to further determine what influence each of the factors individually and in their totality has on the indicator to be modeled and its coefficients. The regression equation can take on a variety of values. In this case, two types of functions are usually used to assess the relationship: linear and nonlinear.

A linear function is depicted in the form of such a relationship: y \u003d a 0 + a 1 x 1 + a 2 x 2, + ... + a m x m. In this case, a2, a m , are considered to be the coefficients of "pure" regression. They are necessary to characterize the average change in the parameter y with a change (decrease or increase) in each corresponding parameter x by one unit, with the condition of a stable value of other indicators.

Nonlinear equations have, for example, the form of a power function y=ax 1 b1 x 2 b2 ...x m bm . In this case, the indicators b 1, b 2 ..... b m - are called elasticity coefficients, they demonstrate how the result will change (by how much%) with an increase (decrease) in the corresponding indicator x by 1% and with a stable indicator of other factors.

What factors should be considered when building a multiple regression

In order to correctly construct a multiple regression, it is necessary to find out which factors should be paid special attention to.

It is necessary to have some understanding of the nature of the relationship between economic factors and the modeled. The factors to be included must meet the following criteria:

  • Must be measurable. In order to use a factor describing the quality of an object, in any case, it should be given a quantitative form.
  • There should be no factor intercorrelation, or functional relationship. Such actions most often lead to irreversible consequences - the system of ordinary equations becomes unconditioned, and this entails its unreliability and fuzzy estimates.
  • In the case of a huge correlation indicator, there is no way to find out the isolated influence of factors on the final result of the indicator, therefore, the coefficients become uninterpretable.

Construction Methods

There are a huge number of methods and ways to explain how you can choose the factors for the equation. However, all these methods are based on the selection of coefficients using the correlation index. Among them are:

  • Exclusion method.
  • Turn on method.
  • Stepwise regression analysis.

The first method involves sifting out all coefficients from the aggregate set. The second method involves the introduction of many additional factors. Well, the third is the elimination of factors that were previously applied to the equation. Each of these methods has the right to exist. They have their pros and cons, but they can solve the issue of screening out unnecessary indicators in their own way. As a rule, the results obtained by each individual method are quite close.

Methods of multivariate analysis

Such methods for determining factors are based on the consideration of individual combinations of interrelated features. These include discriminant analysis, pattern recognition, principal component analysis, and cluster analysis. In addition, there is also factor analysis, however, it appeared as a result of the development of the component method. All of them are applied in certain circumstances, under certain conditions and factors.

Regression analysis is a method of establishing an analytical expression of a stochastic relationship between the studied features. The regression equation shows how, on average, changes at when changing any of x i , and looks like:

Where y - dependent variable (it is always one);

X i - independent variables (factors) (there may be several of them).

If there is only one independent variable, this is a simple regression analysis. If there are several P 2), then such an analysis is called multivariate.

In the course of regression analysis, two main tasks are solved:

    construction of the regression equation, i.e. finding the type of relationship between the result indicator and independent factors x 1 , x 2 , …, x n .

    assessment of the significance of the resulting equation, i.e. determination of how much the selected factor features explain the variation of the feature y.

Regression analysis is used mainly for planning, as well as for the development of a regulatory framework.

Unlike correlation analysis, which only answers the question of whether there is a relationship between the analyzed features, regression analysis also gives its formalized expression. In addition, if the correlation analysis studies any relationship of factors, then the regression analysis studies one-sided dependence, i.e. a connection showing how a change in factor signs affects the resultant sign.

Regression analysis is one of the most developed methods of mathematical statistics. Strictly speaking, the implementation of regression analysis requires the fulfillment of a number of special requirements (in particular, x l ,x 2 ,...,x n ;y must be independent, normally distributed random variables with constant variances). IN real life strict compliance with the requirements of regression and correlation analysis is very rare, but both of these methods are very common in economic research. Dependencies in the economy can be not only direct, but also inverse and non-linear. A regression model can be built in the presence of any dependence, however, in multivariate analysis, only linear models of the form are used:

The construction of the regression equation is carried out, as a rule, by the least squares method, the essence of which is to minimize the sum of squared deviations of the actual values ​​of the resulting attribute from its calculated values, i.e.:

Where T - number of observations;

j =a+b 1 x 1 j +b 2 x 2 j + ... + b n X n j - calculated value of the result factor.

Regression coefficients are recommended to be determined using analytical packages for a personal computer or a special financial calculator. In the simplest case, the univariate regression coefficients linear equation type regression y = a + bx can be found using the formulas:

cluster analysis

Cluster analysis is one of the methods of multivariate analysis, designed for grouping (clustering) a population, the elements of which are characterized by many features. The values ​​of each of the features serve as the coordinates of each unit of the studied population in the multidimensional space of features. Each observation, characterized by the values ​​of several indicators, can be represented as a point in the space of these indicators, the values ​​of which are considered as coordinates in a multidimensional space. Distance between points R And q With k coordinates is defined as:

The main criterion for clustering is that the differences between clusters should be more significant than between observations assigned to the same cluster, i.e. in a multidimensional space, the inequality must be observed:

Where r 1, 2 - distance between clusters 1 and 2.

As well as the regression analysis procedures, the clustering procedure is quite laborious, it is advisable to perform it on a computer.

The main goal of regression analysis consists in determining the analytical form of the relationship, in which the change in the resultant attribute is due to the influence of one or more factor signs, and the set of all other factors that also affect the resultant attribute is taken as constant and average values.
Tasks of regression analysis:
a) Establishing the form of dependence. Regarding the nature and form of the relationship between phenomena, there are positive linear and non-linear and negative linear and non-linear regression.
b) Definition of the regression function in the form of a mathematical equation of one type or another and establishing the influence of explanatory variables on the dependent variable.
c) Estimation of unknown values ​​of the dependent variable. Using the regression function, you can reproduce the values ​​of the dependent variable within the interval of given values ​​of the explanatory variables (i.e., solve the interpolation problem) or evaluate the course of the process outside the specified interval (i.e., solve the extrapolation problem). The result is an estimate of the value of the dependent variable.

Pair regression - the equation of the relationship of two variables y and x: y=f(x), where y is the dependent variable (resultant sign); x - independent, explanatory variable (feature-factor).

There are linear and non-linear regressions.
Linear regression: y = a + bx + ε
Nonlinear regressions are divided into two classes: regressions that are non-linear with respect to the explanatory variables included in the analysis, but linear with respect to the estimated parameters, and regressions that are non-linear with respect to the estimated parameters.
Regressions that are non-linear in explanatory variables:

Regressions that are non-linear in the estimated parameters:

  • power y=a x b ε
  • exponential y=a b x ε
  • exponential y=e a+b x ε
The construction of the regression equation is reduced to estimating its parameters. To estimate the parameters of regressions that are linear in parameters, the method of least squares (LSM) is used. LSM makes it possible to obtain such estimates of parameters under which the sum of the squared deviations of the actual values ​​of the effective feature y from the theoretical values ​​y x is minimal, i.e.
.
For linear and nonlinear equations reducible to linear, the following system is solved for a and b:

You can use ready-made formulas that follow from this system:

The closeness of the connection between the studied phenomena is estimated by the linear pair correlation coefficient r xy for linear regression (-1≤r xy ≤1):

and correlation index p xy - for non-linear regression (0≤p xy ≤1):

An assessment of the quality of the constructed model will be given by the coefficient (index) of determination, as well as the average approximation error.
The average approximation error is the average deviation of the calculated values ​​from the actual ones:
.
Permissible limit of values ​​A - no more than 8-10%.
The average coefficient of elasticity E shows by what percentage on average in the aggregate the result y will change from its medium size when the factor x changes by 1% from its average value:
.

The task of analysis of variance is to analyze the variance of the dependent variable:
∑(y-y )²=∑(y x -y )²+∑(y-y x)²
where ∑(y-y)² is the total sum of squared deviations;
∑(y x -y)² - sum of squared deviations due to regression ("explained" or "factorial");
∑(y-y x)² - residual sum of squared deviations.
The share of the variance explained by regression in the total variance of the effective feature y is characterized by the coefficient (index) of determination R2:

The coefficient of determination is the square of the coefficient or correlation index.

F-test - evaluation of the quality of the regression equation - consists in testing the hypothesis But about the statistical insignificance of the regression equation and the indicator of closeness of connection. For this, a comparison of the actual F fact and the critical (tabular) F table of the values ​​of the Fisher F-criterion is performed. F fact is determined from the ratio of the values ​​of the factorial and residual variances calculated for one degree of freedom:
,
where n is the number of population units; m is the number of parameters for variables x.
F table is the maximum possible value of the criterion under the influence of random factors for given degrees of freedom and significance level a. Significance level a - the probability of rejecting the correct hypothesis, provided that it is true. Usually a is taken equal to 0.05 or 0.01.
If F table< F факт, то Н о - гипотеза о случайной природе оцениваемых характеристик отклоняется и признается их статистическая значимость и надежность. Если F табл >F is a fact, then the hypothesis H about is not rejected and the statistical insignificance, the unreliability of the regression equation is recognized.
To assess the statistical significance of the regression and correlation coefficients, Student's t-test and confidence intervals for each of the indicators are calculated. A hypothesis H about the random nature of the indicators is put forward, i.e. about their insignificant difference from zero. The assessment of the significance of the regression and correlation coefficients using the Student's t-test is carried out by comparing their values ​​with the magnitude of the random error:
; ; .
Random errors of linear regression parameters and correlation coefficient are determined by the formulas:



Comparing the actual and critical (tabular) values ​​of t-statistics - t tabl and t fact - we accept or reject the hypothesis H o.
The relationship between Fisher's F-test and Student's t-statistics is expressed by the equality

If t table< t факт то H o отклоняется, т.е. a , b и r xy не случайно отличаются от нуля и сформировались под влиянием систематически действующего фактора х. Если t табл >t the fact that the hypothesis H about is not rejected and the random nature of the formation of a, b or r xy is recognized.
To calculate the confidence interval, we determine the marginal error D for each indicator:
Δ a =t table m a , Δ b =t table m b .
The formulas for calculating confidence intervals are as follows:
γ a \u003d aΔ a; γ a \u003d a-Δ a; γ a =a+Δa
γ b = bΔ b ; γ b = b-Δ b ; γb =b+Δb
If zero falls within the boundaries of the confidence interval, i.e. If the lower limit is negative and the upper limit is positive, then the estimated parameter is assumed to be zero, since it cannot simultaneously take on both positive and negative values.
The forecast value y p is determined by substituting the corresponding (forecast) value x p into the regression equation y x =a+b·x . The average standard error of the forecast m y x is calculated:
,
Where
and the confidence interval of the forecast is built:
γ y x =y p Δ y p ; γ y x min=y p -Δ y p ; γ y x max=y p +Δ y p
where Δ y x =t table ·m y x .

Solution Example

Task number 1. For seven territories of the Ural region For 199X, the values ​​of two signs are known.
Table 1.

Required: 1. To characterize the dependence of y on x, calculate the parameters of the following functions:
a) linear;
b) power law (previously it is necessary to perform the procedure of linearization of variables by taking the logarithm of both parts);
c) demonstrative;
d) equilateral hyperbola (you also need to figure out how to pre-linearize this model).
2. Evaluate each model through the average approximation error A and Fisher's F-test.

Solution (Option #1)

To calculate the parameters a and b of the linear regression y=a+b·x (the calculation can be done using a calculator).
solve the system of normal equations with respect to A And b:
Based on the initial data, we calculate ∑y, ∑x, ∑y x, ∑x², ∑y²:
y x yx x2 y2 y xy-y xA i
l68,8 45,1 3102,88 2034,01 4733,44 61,3 7,5 10,9
2 61,2 59,0 3610,80 3481,00 3745,44 56,5 4,7 7,7
3 59,9 57,2 3426,28 3271,84 3588,01 57,1 2,8 4,7
4 56,7 61,8 3504,06 3819,24 3214,89 55,5 1,2 2,1
5 55,0 58,8 3234,00 3457,44 3025,00 56,5 -1,5 2,7
6 54,3 47,2 2562,96 2227,84 2948,49 60,5 -6,2 11,4
7 49,3 55,2 2721,36 3047,04 2430,49 57,8 -8,5 17,2
Total405,2 384,3 22162,34 21338,41 23685,76 405,2 0,0 56,7
Wed value (Total/n)57,89
y
54,90
x
3166,05
x y
3048,34
3383,68
XX8,1
s 5,74 5,86 XXXXXX
s232,92 34,34 XXXXXX


a=y -b x = 57.89+0.35 54.9 ≈ 76.88

Regression equation: y= 76,88 - 0,35X. With an increase in the average daily wage by 1 rub. the share of spending on the purchase of food products is reduced by an average of 0.35% points.
Calculate the linear coefficient of pair correlation:

Communication is moderate, reverse.
Let's determine the coefficient of determination: r² xy =(-0.35)=0.127
The 12.7% variation in the result is explained by the variation in the x factor. Substituting the actual values ​​into the regression equation X, we determine the theoretical (calculated) values ​​of y x . Let us find the value of the average approximation error A :

On average, the calculated values ​​deviate from the actual ones by 8.1%.
Let's calculate the F-criterion:

The obtained value indicates the need to accept the hypothesis H 0 about the random nature of the revealed dependence and the statistical insignificance of the parameters of the equation and the indicator of the tightness of the connection.
1b. The construction of the power model y=a x b is preceded by the procedure of linearization of variables. In the example, linearization is done by taking the logarithm of both sides of the equation:
lg y=lg a + b lg x
Y=C+b Y
where Y=lg(y), X=lg(x), C=lg(a).

For calculations, we use the data in Table. 1.3.
Table 1.3

YX YX Y2 x2 y xy-y x(y-yx)²A i
1 1,8376 1,6542 3,0398 3,3768 2,7364 61,0 7,8 60,8 11,3
2 1,7868 1,7709 3,1642 3,1927 3,1361 56,3 4,9 24,0 8,0
3 1,7774 1,7574 3,1236 3,1592 3,0885 56,8 3,1 9,6 5,2
4 1,7536 1,7910 3,1407 3,0751 3,2077 55,5 1,2 1,4 2,1
5 1,7404 1,7694 3,0795 3,0290 3,1308 56,3 -1,3 1,7 2,4
6 1,7348 1,6739 2,9039 3,0095 2,8019 60,2 -5,9 34,8 10,9
7 1,6928 1,7419 2,9487 2,8656 3,0342 57,4 -8,1 65,6 16,4
Total12,3234 12,1587 21,4003 21,7078 21,1355 403,5 1,7 197,9 56,3
Average value1,7605 1,7370 3,0572 3,1011 3,0194 XX28,27 8,0
σ 0,0425 0,0484 XXXXXXX
σ20,0018 0,0023 XXXXXXX

Calculate C and b:

C=Y -b X = 1.7605+0.298 1.7370 = 2.278126
We get a linear equation: Y=2.278-0.298 X
After potentiating it, we get: y=10 2.278 x -0.298
Substituting in this equation the actual values X, we obtain the theoretical values ​​of the result. Based on them, we calculate the indicators: the tightness of the connection - the correlation index p xy and the average approximation error A .

The characteristics of the power model indicate that it describes the relationship somewhat better than the linear function.

1c. The construction of the equation of the exponential curve y \u003d a b x is preceded by the procedure for linearizing the variables when taking the logarithm of both parts of the equation:
lg y=lg a + x lg b
Y=C+B x
For calculations, we use the table data.

Yx Yx Y2 x2y xy-y x(y-yx)²A i
1 1,8376 45,1 82,8758 3,3768 2034,01 60,7 8,1 65,61 11,8
2 1,7868 59,0 105,4212 3,1927 3481,00 56,4 4,8 23,04 7,8
3 1,7774 57,2 101,6673 3,1592 3271,84 56,9 3,0 9,00 5,0
4 1,7536 61,8 108,3725 3,0751 3819,24 55,5 1,2 1,44 2,1
5 1,7404 58,8 102,3355 3,0290 3457,44 56,4 -1,4 1,96 2,5
6 1,7348 47,2 81,8826 3,0095 2227,84 60,0 -5,7 32,49 10,5
7 1,6928 55,2 93,4426 2,8656 3047,04 57,5 -8,2 67,24 16,6
Total12,3234 384,3 675,9974 21,7078 21338,41 403,4 -1,8 200,78 56,3
Wed zn.1,7605 54,9 96,5711 3,1011 3048,34 XX28,68 8,0
σ 0,0425 5,86 XXXXXXX
σ20,0018 34,339 XXXXXXX

The values ​​of the regression parameters A and IN amounted to:

A=Y -B x = 1.7605+0.0023 54.9 = 1.887
A linear equation is obtained: Y=1.887-0.0023x. We potentiate the resulting equation and write it in the usual form:
y x =10 1.887 10 -0.0023x = 77.1 0.9947 x
We estimate the tightness of the relationship through the correlation index p xy:

3588,01 56,9 3,0 9,00 5,0 4 56,7 0,0162 0,9175 0,000262 3214,89 55,5 1,2 1,44 2,1 5 55 0,0170 0,9354 0,000289 3025,00 56,4 -1,4 1,96 2,5 6 54,3 0,0212 1,1504 0,000449 2948,49 60,8 -6,5 42,25 12,0 7 49,3 0,0181 0,8931 0,000328 2430,49 57,5 -8,2 67,24 16,6 Total405,2 0,1291 7,5064 0,002413 23685,76 405,2 0,0 194,90 56,5 Average value57,9 0,0184 1,0723 0,000345 3383,68 XX27,84 8,1 σ 5,74 0,002145 XXXXXXX σ232,9476 0,000005 XX

Regression analysis is a statistical research method that allows you to show the dependence of a parameter on one or more independent variables. In the pre-computer era, its use was quite difficult, especially when it came to large amounts of data. Today, having learned how to build a regression in Excel, you can solve complex statistical problems in just a couple of minutes. Below are concrete examples from the field of economics.

Types of regression

The concept itself was introduced into mathematics in 1886. Regression happens:

  • linear;
  • parabolic;
  • power;
  • exponential;
  • hyperbolic;
  • demonstrative;
  • logarithmic.

Example 1

Consider the problem of determining the dependence of the number of retired team members on the average salary at 6 industrial enterprises.

Task. At six enterprises, we analyzed the average monthly salary and the number of employees who left of their own free will. In tabular form we have:

The number of people who left

Salary

30000 rubles

35000 rubles

40000 rubles

45000 rubles

50000 rubles

55000 rubles

60000 rubles

For the problem of determining the dependence of the number of retired workers on the average salary at 6 enterprises, the regression model has the form of the equation Y = a 0 + a 1 x 1 +…+a k x k , where x i are the influencing variables, a i are the regression coefficients, a k is the number of factors.

For this task, Y is the indicator of employees who left, and the influencing factor is the salary, which we denote by X.

Using the capabilities of the spreadsheet "Excel"

Regression analysis in Excel must be preceded by the application of built-in functions to the available tabular data. However, for these purposes, it is better to use the very useful add-in "Analysis Toolkit". To activate it you need:

  • from the "File" tab, go to the "Options" section;
  • in the window that opens, select the line "Add-ons";
  • click on the "Go" button located at the bottom, to the right of the "Management" line;
  • check the box next to the name "Analysis Package" and confirm your actions by clicking "OK".

If everything is done correctly, the desired button will appear on the right side of the Data tab, located above the Excel worksheet.

in Excel

Now that we have at hand all the necessary virtual tools for performing econometric calculations, we can begin to solve our problem. For this:

  • click on the "Data Analysis" button;
  • in the window that opens, click on the "Regression" button;
  • in the tab that appears, enter the range of values ​​for Y (the number of employees who quit) and for X (their salaries);
  • We confirm our actions by pressing the "Ok" button.

As a result, the program will automatically populate a new sheet of the spreadsheet with regression analysis data. Note! Excel has the ability to manually set the location you prefer for this purpose. For example, it could be the same sheet where the Y and X values ​​are, or even A new book, specially designed for storing such data.

Analysis of regression results for R-square

In Excel, the data obtained during the processing of the data of the considered example looks like this:

First of all, you should pay attention to the value of the R-square. It is the coefficient of determination. In this example, R-square = 0.755 (75.5%), i.e., the calculated parameters of the model explain the relationship between the considered parameters by 75.5%. The higher the value of the coefficient of determination, the more applicable the chosen model is to specific task. It is believed that it correctly describes the real situation with an R-squared value above 0.8. If R-squared<0,5, то такой анализа регрессии в Excel нельзя считать резонным.

Ratio Analysis

The number 64.1428 shows what the value of Y will be if all the variables xi in the model we are considering are set to zero. In other words, it can be argued that the value of the analyzed parameter is also influenced by other factors that are not described in a particular model.

The next coefficient -0.16285, located in cell B18, shows the weight of the influence of variable X on Y. This means that the average monthly salary of employees within the model under consideration affects the number of quitters with a weight of -0.16285, i.e. the degree of its influence at all small. The "-" sign indicates that the coefficient has a negative value. This is obvious, since everyone knows that the higher the salary at the enterprise, the less people express a desire to terminate the employment contract or quit.

Multiple regression

This term refers to a connection equation with several independent variables of the form:

y \u003d f (x 1 + x 2 + ... x m) + ε, where y is the effective feature (dependent variable), and x 1 , x 2 , ... x m are the factor factors (independent variables).

Parameter Estimation

For multiple regression (MR) it is carried out using the method of least squares (OLS). For linear equations of the form Y = a + b 1 x 1 +…+b m x m + ε, we construct a system of normal equations (see below)

To understand the principle of the method, consider the two-factor case. Then we have a situation described by the formula

From here we get:

where σ is the variance of the corresponding feature reflected in the index.

LSM is applicable to the MP equation on a standardizable scale. In this case, we get the equation:

where t y , t x 1, … t xm are standardized variables for which the mean values ​​are 0; β i are the standardized regression coefficients, and the standard deviation is 1.

Please note that all β i in this case are set as normalized and centralized, so their comparison with each other is considered correct and admissible. In addition, it is customary to filter out factors, discarding those with the smallest values ​​of βi.

Problem using linear regression equation

Suppose there is a table of the price dynamics of a particular product N during the last 8 months. It is necessary to make a decision on the advisability of purchasing its batch at a price of 1850 rubles/t.

month number

month name

price of item N

1750 rubles per ton

1755 rubles per ton

1767 rubles per ton

1760 rubles per ton

1770 rubles per ton

1790 rubles per ton

1810 rubles per ton

1840 rubles per ton

To solve this problem in the Excel spreadsheet, you need to use the Data Analysis tool already known from the above example. Next, select the "Regression" section and set the parameters. It must be remembered that in the "Input interval Y" field, a range of values ​​for the dependent variable (in this case, the price of a product in specific months of the year) must be entered, and in the "Input interval X" - for the independent variable (month number). Confirm the action by clicking "Ok". On a new sheet (if it was indicated so), we get data for regression.

Based on them, we build a linear equation of the form y=ax+b, where the parameters a and b are the coefficients of the row with the name of the month number and the coefficients and the “Y-intersection” row from the sheet with the results of the regression analysis. Thus, the linear regression equation (LE) for problem 3 is written as:

Product price N = 11.714* month number + 1727.54.

or in algebraic notation

y = 11.714 x + 1727.54

Analysis of results

To decide whether the resulting linear regression equation is adequate, multiple correlation coefficients (MCC) and determination coefficients are used, as well as Fisher's test and Student's test. In the Excel table with regression results, they appear under the names of multiple R, R-square, F-statistic and t-statistic, respectively.

KMC R makes it possible to assess the tightness of the probabilistic relationship between the independent and dependent variables. Its high value indicates a fairly strong relationship between the variables "Number of the month" and "Price of goods N in rubles per 1 ton". However, the nature of this relationship remains unknown.

The square of the coefficient of determination R 2 (RI) is a numerical characteristic of the share of the total scatter and shows the scatter of which part of the experimental data, i.e. values ​​of the dependent variable corresponds to the linear regression equation. In the problem under consideration, this value is equal to 84.8%, i.e., the statistical data are described with a high degree of accuracy by the obtained SD.

F-statistics, also called Fisher's test, is used to assess the significance of a linear relationship, refuting or confirming the hypothesis of its existence.

(Student's criterion) helps to evaluate the significance of the coefficient with an unknown or free term of a linear relationship. If the value of the t-criterion > t cr, then the hypothesis of the insignificance of the free term of the linear equation is rejected.

In the problem under consideration for the free member, using the Excel tools, it was obtained that t = 169.20903, and p = 2.89E-12, i.e. we have a zero probability that the correct hypothesis about the insignificance of the free member will be rejected. For the coefficient at unknown t=5.79405, and p=0.001158. In other words, the probability that the correct hypothesis about the insignificance of the coefficient for the unknown will be rejected is 0.12%.

Thus, it can be argued that the resulting linear regression equation is adequate.

The problem of the expediency of buying a block of shares

Multiple regression in Excel is performed using the same Data Analysis tool. Consider a specific applied problem.

The management of NNN must make a decision on the advisability of purchasing a 20% stake in MMM SA. The cost of the package (JV) is 70 million US dollars. NNN specialists collected data on similar transactions. It was decided to evaluate the value of the block of shares according to such parameters, expressed in millions of US dollars, as:

  • accounts payable (VK);
  • annual turnover (VO);
  • accounts receivable (VD);
  • cost of fixed assets (SOF).

In addition, the parameter payroll arrears of the enterprise (V3 P) in thousands of US dollars is used.

Solution using Excel spreadsheet

First of all, you need to create a table of initial data. It looks like this:

  • call the "Data Analysis" window;
  • select the "Regression" section;
  • in the box "Input interval Y" enter the range of values ​​of dependent variables from column G;
  • click on the icon with a red arrow to the right of the "Input interval X" window and select the range of all values ​​​​from columns B, C, D, F on the sheet.

Select "New Worksheet" and click "Ok".

Get the regression analysis for the given problem.

Examination of the results and conclusions

“We collect” from the rounded data presented above on the tabular sheet Excel processor, regression equation:

SP \u003d 0.103 * SOF + 0.541 * VO - 0.031 * VK + 0.405 * VD + 0.691 * VZP - 265.844.

In a more familiar mathematical form, it can be written as:

y = 0.103*x1 + 0.541*x2 - 0.031*x3 +0.405*x4 +0.691*x5 - 265.844

Data for JSC "MMM" are presented in the table:

Substituting them into the regression equation, they get a figure of 64.72 million US dollars. This means that the shares of JSC MMM should not be purchased, since their value of 70 million US dollars is rather overstated.

As you can see, the use of the Excel spreadsheet and the regression equation made it possible to make an informed decision regarding the feasibility of a very specific transaction.

Now you know what regression is. The examples in Excel discussed above will help you decide. practical tasks from the field of econometrics.

Share with friends or save for yourself:

Loading...